Introduction

Bulk RNA sequencing (RNAseq) experiments result in raw sequencing reads, typically in FASTQ format. To use this data to gain insights into gene expression patterns, identify novel transcripts, etc., it must undergo a series of processing steps, including quality control, read alignment, and quantification of gene expression, followed by further analysis depending on the aims of the experiment. Mantle can be used for data management, preprocessing, and downstream analysis of bulk RNAseq data. All data, pipelines, and analysis environments in this tutorial are available in your Mantle account.

Raw data management

FASTQ data cannot be easily stored in generic databases. An individual FASTQ file is semi-structured, and storing FASTQ files within a database as binary large objects (BLOBs) can be challenging due to their large size (typically tens of gigabytes per FASTQ file). In Mantle, you can store data files and associated metadata together as datasets. Here, FASTQ files are stored as datasets of the rnaseq_fastq data type. rnaseq_fastq datasets have the following properties:
sample (string, required): The sample name or ID
strandedness (string, required): auto, forward, or reverse
read1 (file, required): FASTQ file corresponding to read 1
read2 (file, optional): FASTQ file corresponding to read 2
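As a rough illustration of how these properties constrain a dataset, the sketch below mirrors them in a small Python dataclass and checks the required fields. The class name RnaseqFastqRecord and its is_paired_end helper are hypothetical and are not part of the Mantle SDK; file properties are represented as plain path strings for simplicity.

```python
from dataclasses import dataclass
from typing import Optional

# Values listed for the strandedness property in this tutorial.
ALLOWED_STRANDEDNESS = {"auto", "forward", "reverse"}


@dataclass
class RnaseqFastqRecord:
    """Hypothetical in-memory mirror of an rnaseq_fastq dataset's properties."""

    sample: str
    strandedness: str
    read1: str                   # path to the read 1 FASTQ file
    read2: Optional[str] = None  # path to the read 2 FASTQ file, if any

    def __post_init__(self) -> None:
        # Enforce the required properties and the allowed strandedness values.
        if not self.sample:
            raise ValueError("sample is required")
        if self.strandedness not in ALLOWED_STRANDEDNESS:
            allowed = ", ".join(sorted(ALLOWED_STRANDEDNESS))
            raise ValueError(f"strandedness must be one of: {allowed}")
        if not self.read1:
            raise ValueError("read1 is required")

    @property
    def is_paired_end(self) -> bool:
        # A record is treated as paired-end when a read 2 file is present.
        return self.read2 is not None
```

Constructing a record with only read1 would describe a single-end sample, while supplying read2 as well would describe a paired-end sample.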
For demonstration purposes, we’ve used a downsampled subset of the data from this manuscript:
Wu, A. C. K., Patel, H., Chia, M., Moretto, F., Frith, D., Snijders, A. P., & van Werven, F. J. (2018). Repression of divergent noncoding transcription by a sequence-specific transcription factor. Molecular Cell, 72(6), 942-954.e7. https://doi.org/10.1016/j.molcel.2018.10.018
Additionally, in order to perform alignment, a reference genome FASTA file and a gene annotation GTF or GFF file must be provided. Additional reference data, such as a reference transcriptome or an additional FASTA file, could be required depending on your analysis. We’ve used the rnaseq_reference data type to allow you to store all reference data together. rnaseq_reference datasets have the following properties:
source (string, required): Where the reference data came from, e.g. ENSEMBL
genome_name (string, required): The name of the genome and/or other identifying information for the reference dataset
genome_species (string, optional): Species of the genome
fasta (file, required): Genome FASTA file
gtf (file): GTF annotation file
gff (file, optional): GFF annotation file; specify if you don’t have a GTF file
transcript_fasta (file, optional): Transcriptome FASTA file
gencode (bool, optional): Specify if your GTF annotation is in GENCODE format. If your GTF file is in GENCODE format and you would like to run Salmon (i.e. --pseudo_aligner salmon), you will need to provide this parameter in order to build the Salmon index appropriately
gtf_group_features (string, optional): The attribute type used to group features in the GTF file when running Salmon
featurecounts_group_type (string, optional): The attribute type used to group feature types in the GTF file when generating the biotype plot with featureCounts
featurecounts_feature_type (string, optional): By default, the pipeline assigns reads based on the ‘exon’ attribute within the GTF file
additional_fasta (file, optional): FASTA file to concatenate to genome FASTA file, e.g. containing spike-in sequences
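The rules above (three required properties, plus an annotation supplied as either GTF or GFF) can be sketched as a small pre-flight check. This is a minimal sketch under stated assumptions: check_reference is a hypothetical helper, not a Mantle SDK function, and it assumes the properties have already been gathered into a plain dictionary keyed by property name.

```python
from typing import Any, Dict, List

# Properties marked as required for rnaseq_reference datasets.
REQUIRED_PROPERTIES = ("source", "genome_name", "fasta")


def check_reference(props: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems: List[str] = []
    for name in REQUIRED_PROPERTIES:
        if not props.get(name):
            problems.append(f"missing required property: {name}")
    # Alignment needs a gene annotation, given as either a GTF or a GFF file.
    if not props.get("gtf") and not props.get("gff"):
        problems.append("provide an annotation file via gtf or gff")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report everything that needs fixing in a single pass.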

Data processing

Bulk RNAseq FASTQs need to be preprocessed to obtain a count matrix or other forms of data that can be analyzed to draw insights. This can be accomplished using the nf-core rnaseq pipeline, which is included in your Mantle account.
We adapted the pipeline for compatibility with Mantle by adding a process at the beginning of the workflow to convert Mantle rnaseq_fastq datasets into an input samplesheet. Additionally, we added a process at the end of the workflow to collect some of the files that the pipeline creates into Mantle datasets, and to register the MultiQC report HTML file as an output of the pipeline on Mantle, which allows it to be displayed on the pipeline run page.
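To make the first adaptation concrete, here is a minimal sketch of turning rnaseq_fastq properties into a samplesheet. It is not the process used in the adapted pipeline. It assumes each dataset has already been reduced to a dictionary of its properties with file paths as strings, and that the samplesheet uses the sample, fastq_1, fastq_2, and strandedness columns described in the nf-core rnaseq usage documentation; check the documentation for the pipeline version you run. The function name build_samplesheet is hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, Optional

# Column layout assumed from the nf-core rnaseq usage documentation.
SAMPLESHEET_COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]


def build_samplesheet(records: Iterable[Dict[str, Optional[str]]]) -> str:
    """Render rnaseq_fastq-style property dictionaries as samplesheet CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(SAMPLESHEET_COLUMNS)
    for rec in records:
        writer.writerow([
            rec["sample"],
            rec["read1"],
            rec.get("read2") or "",  # left empty for single-end samples
            rec["strandedness"],
        ])
    return buffer.getvalue()
```

A single-end sample simply leaves the fastq_2 column empty, so the same function covers both layouts.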

Analyzing processed data

We analyzed data contained in the gene-level raw counts matrix produced by the pipeline (stored as a count_matrix dataset) in the bulk_rnaseq notebook, which ran in the spatial-transcriptomics-analysis environment.
Within the notebook, we used the Mantle SDK to download the count matrix CSV file from the matrix property on the dataset and mark the dataset as an input to the analysis notebook. Then, we used the Scanpy package to perform differential expression analysis.
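For readers who want a feel for what happens to the count matrix before any statistical testing, the sketch below parses a genes-by-samples CSV, normalizes each sample to counts per million (CPM), and computes a log2 fold change between two groups of samples. It is a simplified standard-library illustration, not the Scanpy analysis from the notebook and not a differential expression test. The CSV layout (gene IDs in the first column, one column per sample) and all function names are assumptions made for this example.

```python
import csv
import io
import math
from typing import Dict, List

Counts = Dict[str, Dict[str, float]]  # {sample: {gene: count}}


def read_counts(csv_text: str) -> Counts:
    """Parse a genes-by-samples CSV (first column = gene ID) into nested dicts."""
    reader = csv.reader(io.StringIO(csv_text))
    samples = next(reader)[1:]
    counts: Counts = {sample: {} for sample in samples}
    for row in reader:
        gene = row[0]
        for sample, value in zip(samples, row[1:]):
            counts[sample][gene] = float(value)
    return counts


def cpm(sample_counts: Dict[str, float]) -> Dict[str, float]:
    """Scale one sample's counts so that they sum to one million."""
    total = sum(sample_counts.values())
    if total == 0:
        raise ValueError("sample has no counts")
    return {gene: count / total * 1e6 for gene, count in sample_counts.items()}


def mean_cpm(counts: Counts, samples: List[str], gene: str) -> float:
    """Average CPM of one gene across a group of samples."""
    return sum(cpm(counts[s])[gene] for s in samples) / len(samples)


def log2_fold_change(counts: Counts, group_a: List[str], group_b: List[str],
                     gene: str, pseudocount: float = 1.0) -> float:
    """log2 of mean CPM in group_b over group_a; the pseudocount avoids log(0)."""
    a = mean_cpm(counts, group_a, gene)
    b = mean_cpm(counts, group_b, gene)
    return math.log2((b + pseudocount) / (a + pseudocount))
```

A fold change alone says nothing about significance; a real analysis would also model replicate variability, which is what dedicated differential expression tools are for.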

Wrapping up

With Mantle, you can easily run powerful pipelines, like the nf-core rnaseq pipeline, to process bioinformatics data, then transition seamlessly to downstream analysis in Mantle’s analysis environments, which include bioinformatics-specific packages. Feel free to try out this workflow with your own bulk RNAseq data!