Bulk RNAseq Demonstration
Introduction
Bulk RNA sequencing (RNAseq) experiments produce raw sequencing reads, typically in FASTQ format. To use this data to gain insights into gene expression patterns, identify novel transcripts, etc., it must undergo a series of processing steps – including quality control, read alignment, and quantification of gene expression – followed by further analysis depending on the aims of the experiment.
To import an example bulk RNAseq workflow, including data types, example datasets, pipelines and runs, analysis environments, and analysis notebooks, click the Import a template button on your Mantle dashboard and select Bulk-RNASeq.
Raw data
FASTQ data cannot be easily stored in typical databases. An individual FASTQ file is semi-structured, and storing FASTQ files within a database as binary large objects (BLOBs) can be challenging due to their large size (typically tens of gigabytes per FASTQ file).
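To make the "semi-structured" point concrete, here is a minimal sketch of reading FASTQ records in Python. It assumes the common layout of exactly four lines per record (header, sequence, separator, quality) with no wrapped sequences. The names FastqRecord and parse_fastq are illustrative and not part of Mantle.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ entry (assumes unwrapped records)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        # Header lines start with "@", separator lines with "+".
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        # Each base must have exactly one quality character.
        if len(sequence) != len(quality):
            raise ValueError("Sequence and quality lengths differ")
        yield FastqRecord(header[1:].strip(), sequence, quality)


# Two tiny in-memory records stand in for a real (much larger) file.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nFFFF\n")
records = list(parse_fastq(example))
```

Because records are streamed one at a time, the same loop works on files far larger than memory, which is one reason FASTQs are usually kept as files rather than loaded into database rows.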
Within the Mantle data lake, you can store data files and associated metadata together as datasets. In our bulk RNAseq demonstration workflow, FASTQs are stored as datasets of the bulk_rnaseq data type, which has the following properties:
The sample name or ID
Strandedness: auto, forward, or reverse
FASTQ file corresponding to read 1
FASTQ file corresponding to read 2 (optional)
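The properties above can be pictured as a small record type. The sketch below is only an illustration: the field names (sample, strandedness, fastq_1, fastq_2) are hypothetical, Mantle's actual property names may differ, and it assumes the auto/forward/reverse property describes library strandedness.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_STRANDEDNESS = {"auto", "forward", "reverse"}


@dataclass(frozen=True)
class BulkRnaseqDataset:
    # Field names are hypothetical; Mantle's real property names may differ.
    sample: str
    strandedness: str
    fastq_1: str
    fastq_2: Optional[str] = None  # read 2 is optional (single-end data)

    def __post_init__(self) -> None:
        if not self.sample:
            raise ValueError("sample name or ID must not be empty")
        if self.strandedness not in ALLOWED_STRANDEDNESS:
            raise ValueError(f"unexpected strandedness: {self.strandedness!r}")

    @property
    def is_paired_end(self) -> bool:
        # A dataset is paired-end only when a read 2 file is present.
        return self.fastq_2 is not None


# Hypothetical file names, used only to show both layouts.
single = BulkRnaseqDataset("sample_a", "auto", "sample_a_R1.fastq.gz")
paired = BulkRnaseqDataset(
    "sample_b", "reverse", "sample_b_R1.fastq.gz", "sample_b_R2.fastq.gz"
)
```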
In this example workflow, we’ve included a subset of the data from this manuscript:
Wu, A. C. K., Patel, H., Chia, M., Moretto, F., Frith, D., Snijders, A. P., & van Werven, F. J. (2018). Repression of divergent noncoding transcription by a sequence-specific transcription factor. Molecular Cell, 72(6), 942-954.e7. https://doi.org/10.1016/j.molcel.2018.10.018
Data processing
In our demonstration workflow, bulk RNAseq FASTQs are processed using the nf-core rnaseq pipeline.
We adapted the pipeline for compatibility with Mantle by adding a process at the beginning of the workflow that converts Mantle bulk_rnaseq datasets into an input samplesheet, and by collecting result data files into Mantle rnaseq_multiqc_report and bulk_rnaseq_all_outputs datasets.
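As a rough illustration of the samplesheet conversion, the sketch below turns dataset-like records into a CSV with the sample, fastq_1, fastq_2, and strandedness columns described in the nf-core rnaseq usage documentation. It is not the actual Mantle process: the input dictionaries, their keys, and the function name build_samplesheet are assumptions made for the example.

```python
import csv
import io
from typing import Iterable, Mapping, Optional

SAMPLESHEET_COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]


def build_samplesheet(datasets: Iterable[Mapping[str, Optional[str]]]) -> str:
    """Render dataset-like records as samplesheet CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=SAMPLESHEET_COLUMNS, lineterminator="\n"
    )
    writer.writeheader()
    for dataset in datasets:
        writer.writerow(
            {
                "sample": dataset["sample"],
                "fastq_1": dataset["fastq_1"],
                # Single-end samples leave the read 2 column empty.
                "fastq_2": dataset.get("fastq_2") or "",
                # Fall back to "auto" when no strandedness is recorded.
                "strandedness": dataset.get("strandedness") or "auto",
            }
        )
    return buffer.getvalue()


# Hypothetical records standing in for Mantle bulk_rnaseq datasets.
samplesheet = build_samplesheet(
    [
        {
            "sample": "wt_rep1",
            "fastq_1": "wt_rep1_R1.fastq.gz",
            "fastq_2": "wt_rep1_R2.fastq.gz",
            "strandedness": "reverse",
        },
        {"sample": "mut_rep1", "fastq_1": "mut_rep1_R1.fastq.gz"},
    ]
)
print(samplesheet)
```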
The rnaseq_multiqc_report data type has the following properties:
The example workflow contains the P000xxx_rap_yeast_multiqc dataset of this type.
The bulk_rnaseq_all_outputs data type has the following properties:
The samplesheet created by the Mantle process and used as input to the nf-core pipeline
The example workflow contains the P000xxx_rap_yeast_outputs dataset of this type.
Analyzing processed data
We analyzed the data contained in the P000xxx_rap_yeast_outputs dataset in the rnaseq_analysis notebook, which ran in the mantle_rnaseq analysis environment.
Within the notebook, we used the Mantle SDK to download the directory contained in the star_salmon property and mark the dataset as an input to the analysis notebook. Then, we used the Scanpy package to perform differential expression analysis.
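To give a feel for the kind of comparison involved, here is a minimal, standard-library sketch that computes log2 fold changes between two groups of samples from a gene count table. It is not the notebook's Scanpy-based method and performs no statistical testing. It assumes a tab-separated table with gene_id and gene_name columns followed by one column per sample, which is an assumption about the files in the downloaded directory; all function names and the example values are illustrative.

```python
import csv
import io
import math
from typing import Dict, List, TextIO

Counts = Dict[str, Dict[str, float]]  # {sample: {gene_id: count}}


def read_counts(handle: TextIO) -> Counts:
    """Read a tab-separated gene-by-sample count table (assumed layout)."""
    reader = csv.DictReader(handle, delimiter="\t")
    samples = [c for c in reader.fieldnames if c not in ("gene_id", "gene_name")]
    counts: Counts = {sample: {} for sample in samples}
    for row in reader:
        for sample in samples:
            counts[sample][row["gene_id"]] = float(row[sample])
    return counts


def counts_per_million(sample_counts: Dict[str, float]) -> Dict[str, float]:
    """Scale one sample's counts by its library size."""
    total = sum(sample_counts.values())
    return {gene: count / total * 1e6 for gene, count in sample_counts.items()}


def log2_fold_change(
    counts: Counts, treated: List[str], control: List[str], pseudocount: float = 1.0
) -> Dict[str, float]:
    """log2 of mean treated CPM over mean control CPM, per gene."""
    cpm = {sample: counts_per_million(c) for sample, c in counts.items()}
    result = {}
    for gene in cpm[control[0]]:
        mean_treated = sum(cpm[s][gene] for s in treated) / len(treated)
        mean_control = sum(cpm[s][gene] for s in control) / len(control)
        # The pseudocount avoids division by zero for unexpressed genes.
        result[gene] = math.log2(
            (mean_treated + pseudocount) / (mean_control + pseudocount)
        )
    return result


# Made-up counts for two genes and two samples, for illustration only.
table = io.StringIO(
    "gene_id\tgene_name\tctrl_1\ttreat_1\n"
    "G1\tgeneA\t100\t100\n"
    "G2\tgeneB\t100\t300\n"
)
lfc = log2_fold_change(read_counts(table), treated=["treat_1"], control=["ctrl_1"])
```

A real analysis would add replicate-aware statistics and multiple-testing correction on top of a comparison like this.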
Wrapping up
Mantle’s compatibility with Nextflow pipelines – especially from the nf-core community – streamlines the processing of bioinformatics data, including bulk RNAseq data. It’s also easy to further analyze the data within a custom analysis environment, where you can import the bioinformatics-specific packages you need. Feel free to try out this workflow with your own bulk RNAseq data!