Bulk RNAseq Tutorial
Introduction
Bulk RNA sequencing (RNAseq) experiments produce raw sequencing reads, typically in FASTQ format. To use this data to gain insight into gene expression patterns, identify novel transcripts, and so on, it must undergo a series of processing steps (including quality control, read alignment, and quantification of gene expression), followed by further analysis that depends on the aims of the experiment.
Mantle can be used for data management, preprocessing, and downstream analysis of bulk RNAseq data.
All data, pipelines, and analysis environments in this tutorial are available in your Mantle account.
Raw data management
FASTQ data cannot be easily stored in generic databases. An individual FASTQ file is semi-structured, and storing FASTQ files within a database as binary large objects (BLOBs) can be challenging due to their large size (typically tens of gigabytes per FASTQ file).
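To make the "semi-structured" point concrete, here is a minimal sketch of reading FASTQ records in Python. It assumes the common uncompressed layout with four lines per record (header, sequence, separator, quality) and Phred+33 quality encoding. The names FastqRecord, read_fastq, and mean_phred are illustrative and not part of Mantle, and the two reads are made up.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an uncompressed, four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of stream
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch for {header!r}")
        yield FastqRecord(header[1:].strip(), sequence, quality)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean base quality, assuming Phred+33 encoding (an assumption)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


# Two made-up reads, for illustration only.
EXAMPLE_FASTQ = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!II\n"
records = list(read_fastq(io.StringIO(EXAMPLE_FASTQ)))
```

Each record carries free-text header information alongside the sequence and per-base qualities, which is part of why FASTQ data does not map cleanly onto fixed database columns.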
In Mantle, you can store data files and associated metadata together as datasets. Here, FASTQ files are stored as datasets of the rnaseq_fastq data type.
For demonstration purposes, we’ve used a downsampled subset of the data from this manuscript:
Wu, A. C. K., Patel, H., Chia, M., Moretto, F., Frith, D., Snijders, A. P., & van Werven, F. J. (2018). Repression of divergent noncoding transcription by a sequence-specific transcription factor. Molecular Cell, 72(6), 942-954.e7. https://doi.org/10.1016/j.molcel.2018.10.018
Additionally, to perform alignment, a reference genome FASTA file and a gene annotation GTF or GFF file must be provided. Additional reference data, such as a reference transcriptome or additional FASTA files, may be required depending on your analysis.
We’ve used the rnaseq_reference data type to allow you to store all reference data together.
Data processing
Bulk RNAseq FASTQs need to be preprocessed to obtain a count matrix or other forms of data that can be analyzed to draw insights. This can be accomplished using the nf-core rnaseq pipeline, which is included in your Mantle account.
We adapted the pipeline for compatibility with Mantle by adding a process at the beginning of the workflow to convert Mantle rnaseq_fastq datasets into an input samplesheet. Additionally, we added a process at the end of the workflow to collect some of the files that the pipeline creates into Mantle datasets, and to register the MultiQC report HTML file as an output of the pipeline on Mantle, which allows it to be displayed on the pipeline run page.
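As a rough illustration of the samplesheet conversion step, the sketch below turns a list of dataset-like records into CSV text with the sample, fastq_1, fastq_2, and strandedness columns that the nf-core rnaseq pipeline reads. The record keys ("sample", "read1", "read2"), the file names, and the build_samplesheet function are hypothetical. The actual Mantle process and the real properties of rnaseq_fastq datasets are not shown in this tutorial and may differ.

```python
import csv
import io
from typing import Iterable, Mapping, Optional

# Column names expected in an nf-core rnaseq samplesheet.
SAMPLESHEET_COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]


def build_samplesheet(
    datasets: Iterable[Mapping[str, Optional[str]]],
    strandedness: str = "auto",
) -> str:
    """Return samplesheet CSV text built from dataset-like records.

    The keys "sample", "read1", and "read2" are hypothetical stand-ins
    for whatever properties the real datasets expose.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=SAMPLESHEET_COLUMNS, lineterminator="\n"
    )
    writer.writeheader()
    for dataset in datasets:
        writer.writerow(
            {
                "sample": dataset["sample"],
                "fastq_1": dataset["read1"],
                # Single-end samples leave the second FASTQ column empty.
                "fastq_2": dataset.get("read2") or "",
                "strandedness": strandedness,
            }
        )
    return buffer.getvalue()


# Made-up records, for illustration only.
example_datasets = [
    {"sample": "WT_REP1", "read1": "wt_rep1_R1.fastq.gz", "read2": "wt_rep1_R2.fastq.gz"},
    {"sample": "KO_REP1", "read1": "ko_rep1.fastq.gz", "read2": None},
]
samplesheet_text = build_samplesheet(example_datasets)
```

Writing to an in-memory buffer keeps the sketch easy to inspect. In a real workflow process the text would be written to a file that the pipeline takes as its input.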
Analyzing processed data
We analyzed data contained in the gene-level raw counts matrix produced by the pipeline (stored as a count_matrix dataset) in the bulk_rnaseq notebook, which ran in the spatial-transcriptomics-analysis environment.
Within the notebook, we used the Mantle SDK to download the count matrix CSV file from the matrix property on the dataset and mark the dataset as an input to the analysis notebook. Then, we used the Scanpy package to perform differential expression analysis.
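To give a feel for what happens to a raw counts matrix downstream, here is a deliberately simplified, standard-library-only sketch: it normalizes counts to counts per million (CPM) and computes a log2 fold change between two groups of samples. This is a stand-in for illustration, not the Scanpy workflow used in the notebook, and it performs no statistical testing. The CSV layout (a gene_id column followed by one column per sample), the sample names, and the counts are all assumptions.

```python
import csv
import io
import math
from typing import Dict, List

Counts = Dict[str, Dict[str, float]]  # gene -> sample -> value


def read_counts(csv_text: str, gene_column: str = "gene_id") -> Counts:
    """Parse a counts CSV with one row per gene and one column per sample."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row[gene_column]: {
            sample: float(value)
            for sample, value in row.items()
            if sample != gene_column
        }
        for row in reader
    }


def counts_per_million(counts: Counts) -> Counts:
    """Scale each sample so that its counts sum to one million."""
    samples = list(next(iter(counts.values())))
    totals = {s: sum(row[s] for row in counts.values()) for s in samples}
    return {
        gene: {s: row[s] * 1e6 / totals[s] for s in samples}
        for gene, row in counts.items()
    }


def log2_fold_change(
    cpm: Counts, group_a: List[str], group_b: List[str], pseudocount: float = 1.0
) -> Dict[str, float]:
    """log2(mean of group_b / mean of group_a), with a pseudocount to avoid log(0)."""
    result = {}
    for gene, row in cpm.items():
        mean_a = sum(row[s] for s in group_a) / len(group_a)
        mean_b = sum(row[s] for s in group_b) / len(group_b)
        result[gene] = math.log2((mean_b + pseudocount) / (mean_a + pseudocount))
    return result


# Made-up counts, for illustration only.
EXAMPLE_CSV = (
    "gene_id,ctrl_1,ctrl_2,treat_1,treat_2\n"
    "geneA,100,100,300,300\n"
    "geneB,900,900,700,700\n"
)
cpm = counts_per_million(read_counts(EXAMPLE_CSV))
lfc = log2_fold_change(cpm, ["ctrl_1", "ctrl_2"], ["treat_1", "treat_2"])
```

In this toy example geneA comes out with a positive fold change and geneB with a negative one. A real analysis would also model variability across replicates before calling a gene differentially expressed.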
Wrapping up
With Mantle, you can easily run powerful pipelines, like the nf-core rnaseq pipeline, to process bioinformatics data. You can then transition seamlessly to downstream analysis in Mantle’s analysis environments, which include bioinformatics-specific packages. Feel free to try out this workflow with your own bulk RNAseq data!