Bulk RNAseq Tutorial

Introduction

Bulk RNA sequencing (RNAseq) experiments produce raw sequencing reads, typically in FASTQ format. To use these data to gain insights into gene expression patterns, identify novel transcripts, and so on, the reads must undergo a series of processing steps (quality control, read alignment, and quantification of gene expression), followed by further analysis that depends on the aims of the experiment.

Mantle can be used for data management, preprocessing, and downstream analysis of bulk RNAseq data.

All data, pipelines, and analysis environments in this tutorial are available in your Mantle account.

Raw data management

FASTQ data cannot be easily stored in generic databases. An individual FASTQ file is semi-structured, and storing FASTQ files within a database as binary large objects (BLOBs) can be challenging due to their large size (typically tens of gigabytes per FASTQ file).
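
To make the "semi-structured" point concrete, here is a minimal sketch of reading FASTQ records in Python. It assumes an uncompressed file with exactly four lines per record and Phred+33 quality encoding; the names FastqRecord and parse_fastq are illustrative and are not part of Mantle.

```python
import io
from dataclasses import dataclass
from typing import Iterator, List, TextIO


@dataclass
class FastqRecord:
    read_id: str          # header line without the leading "@"
    sequence: str         # called bases
    qualities: List[int]  # Phred scores decoded from the quality line


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an uncompressed, four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of stream (or a blank line) stops the iteration
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Assumption: Phred+33 encoding, so score = ord(character) - 33.
        yield FastqRecord(header[1:], sequence, [ord(c) - 33 for c in quality])


# Two tiny made-up reads, used only to exercise the parser.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n")
records = list(parse_fastq(example))
```

Each record mixes free text (the header), a sequence, and a per-base quality string, which is why FASTQ files map poorly onto fixed database columns.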

In Mantle, you can store data files and associated metadata together as datasets. Here, FASTQ files are stored as datasets of the rnaseq_fastq data type.

For demonstration purposes, we’ve used a downsampled subset of the data from this manuscript:

Wu, A. C. K., Patel, H., Chia, M., Moretto, F., Frith, D., Snijders, A. P., & van Werven, F. J. (2018). Repression of divergent noncoding transcription by a sequence-specific transcription factor. Molecular Cell, 72(6), 942-954.e7. https://doi.org/10.1016/j.molcel.2018.10.018

Additionally, to perform alignment, you must provide a reference genome FASTA file and a gene annotation file in GTF or GFF format. Further reference data, such as a reference transcriptome or an additional FASTA file, may be required depending on your analysis.

We’ve used the rnaseq_reference data type to allow you to store all reference data together.

Data processing

Bulk RNAseq FASTQ files need to be preprocessed to obtain a count matrix or other forms of data that can be analyzed to draw insights. This can be accomplished using the nf-core rnaseq pipeline, which is included in your Mantle account.

We adapted the pipeline for compatibility with Mantle by adding a process at the beginning of the workflow to convert Mantle rnaseq_fastq datasets into an input samplesheet. Additionally, we added a process at the end of the workflow to collect some of the files that the pipeline creates into Mantle datasets, and to register the MultiQC report HTML file as an output of the pipeline on Mantle, which allows it to be displayed on the pipeline run page.
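
The samplesheet conversion can be pictured with a short Python sketch. The column names (sample, fastq_1, fastq_2, strandedness) follow the nf-core rnaseq samplesheet format. The input dictionaries and their keys (sample, read1, read2) are hypothetical stand-ins for rnaseq_fastq dataset properties, and build_samplesheet is an illustrative helper, not the process used in the Mantle pipeline.

```python
import csv
import io
from typing import Iterable, Mapping

SAMPLESHEET_COLUMNS = ["sample", "fastq_1", "fastq_2", "strandedness"]


def build_samplesheet(datasets: Iterable[Mapping[str, str]]) -> str:
    """Render dataset records as nf-core rnaseq samplesheet CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=SAMPLESHEET_COLUMNS, lineterminator="\n"
    )
    writer.writeheader()
    for dataset in datasets:
        writer.writerow({
            "sample": dataset["sample"],
            "fastq_1": dataset["read1"],
            # Single-end samples leave the second FASTQ column empty.
            "fastq_2": dataset.get("read2", ""),
            # "auto" asks the pipeline to infer strandedness.
            "strandedness": dataset.get("strandedness", "auto"),
        })
    return buffer.getvalue()


# Hypothetical records; real property names on rnaseq_fastq datasets may differ.
example_datasets = [
    {"sample": "WT_REP1", "read1": "wt_rep1_R1.fastq.gz", "read2": "wt_rep1_R2.fastq.gz"},
    {"sample": "KO_REP1", "read1": "ko_rep1_R1.fastq.gz", "strandedness": "reverse"},
]
samplesheet_text = build_samplesheet(example_datasets)
```

Writing the sheet through csv.DictWriter rather than string concatenation keeps quoting correct if a file path ever contains a comma.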

Analyzing processed data

We analyzed data contained in the gene-level raw counts matrix produced by the pipeline (stored as a count_matrix dataset) in the bulk_rnaseq notebook, which ran in the spatial-transcriptomics-analysis environment.

Within the notebook, we used the Mantle SDK to download the count matrix CSV file from the matrix property on the dataset and mark the dataset as an input to the analysis notebook. Then, we used the Scanpy package to perform differential expression analysis.
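
The notebook itself relies on Scanpy. As a dependency-free illustration of what a comparison between conditions involves, the sketch below normalizes raw counts to counts per million (CPM) and computes a log2 fold change between two sample groups. This is a simplified stand-in, not Scanpy's method, and it performs no statistical testing. The CSV layout (a gene_id column followed by one column per sample), the sample names, and the function names are all assumptions made for the example.

```python
import csv
import io
import math
from typing import Dict, List, TextIO


def read_counts(handle: TextIO) -> Dict[str, Dict[str, float]]:
    """Return {sample: {gene: count}} from a genes-by-samples CSV."""
    reader = csv.DictReader(handle)
    if not reader.fieldnames or "gene_id" not in reader.fieldnames:
        raise ValueError("expected a 'gene_id' column")
    samples = [name for name in reader.fieldnames if name != "gene_id"]
    counts: Dict[str, Dict[str, float]] = {sample: {} for sample in samples}
    for row in reader:
        for sample in samples:
            counts[sample][row["gene_id"]] = float(row[sample])
    return counts


def counts_per_million(sample_counts: Dict[str, float]) -> Dict[str, float]:
    """Scale one sample's counts so that they sum to one million."""
    total = sum(sample_counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: value / total * 1e6 for gene, value in sample_counts.items()}


def log2_fold_change(
    counts: Dict[str, Dict[str, float]],
    group_a: List[str],
    group_b: List[str],
    pseudocount: float = 1.0,
) -> Dict[str, float]:
    """log2(mean CPM in group_b / mean CPM in group_a) for every gene."""
    cpm = {sample: counts_per_million(c) for sample, c in counts.items()}
    genes = list(next(iter(cpm.values())))
    result = {}
    for gene in genes:
        mean_a = sum(cpm[s][gene] for s in group_a) / len(group_a)
        mean_b = sum(cpm[s][gene] for s in group_b) / len(group_b)
        # The pseudocount avoids taking the log of, or dividing by, zero.
        result[gene] = math.log2((mean_b + pseudocount) / (mean_a + pseudocount))
    return result


# Made-up counts for two genes across four hypothetical samples.
example_csv = io.StringIO(
    "gene_id,ctrl_1,ctrl_2,treat_1,treat_2\n"
    "geneA,100,100,400,400\n"
    "geneB,900,900,600,600\n"
)
counts = read_counts(example_csv)
lfc = log2_fold_change(counts, ["ctrl_1", "ctrl_2"], ["treat_1", "treat_2"])
```

In this toy input, geneA is about fourfold higher in the treated samples (a log2 fold change close to 2), while geneB decreases. A real analysis would add replicate-aware statistics, which is what dedicated packages provide.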

Wrapping up

With Mantle, you can easily run powerful pipelines, like the nf-core rnaseq pipeline, to process bioinformatics data. You can then transition seamlessly to downstream analysis in Mantle’s analysis environments, which include bioinformatics-specific packages. Feel free to try out this workflow with your own bulk RNAseq data!
