Introduction

Bulk RNA sequencing (RNAseq) experiments result in raw sequencing reads, typically in FASTQ format. To use this data to gain insights into gene expression patterns, identify novel transcripts, etc., it must undergo a series of processing steps, including quality control, read alignment, and quantification of gene expression, followed by further analysis depending on the aims of the experiment.

To import an example bulk RNAseq workflow, including data types, example datasets, pipelines and runs, analysis environments, and analysis notebooks, click on the Import a template button on your Mantle dashboard and select Bulk-RNASeq.

Raw data

FASTQ data cannot be easily stored in typical databases. An individual FASTQ file is semi-structured, and storing FASTQ files within a database as binary large objects (BLOBs) can be challenging due to their large size (typically tens of gigabytes per FASTQ file).
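To illustrate what "semi-structured" means here, the following is a minimal sketch of a FASTQ reader, assuming the common four-line record layout (header starting with `@`, sequence, separator starting with `+`, quality string). The `FastqRecord` and `read_fastq` names and the two toy reads are hypothetical and are not part of Mantle or of the example workflow.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Toy input: two made-up reads, for illustration only.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nFFFF\n")
records = list(read_fastq(example))
```

Each record is only loosely tabular (an identifier plus two equal-length strings), which is part of why FASTQ files are usually kept as files rather than loaded into database rows.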

Within the Mantle data lake, you can store data files and associated metadata together as datasets. In our bulk RNAseq demonstration workflow, FASTQs are stored as datasets of the bulk_rnaseq data type, which has the following properties:

sample (string, required): The sample name or ID
strandedness (string, required): auto, forward, or reverse
read1 (file, required): FASTQ file corresponding to read 1
read2 (file, optional): FASTQ file corresponding to read 2

In this example workflow, we’ve included a subset of the data from this manuscript:

Wu, A. C. K., Patel, H., Chia, M., Moretto, F., Frith, D., Snijders, A. P., & van Werven, F. J. (2018). Repression of divergent noncoding transcription by a sequence-specific transcription factor. Molecular Cell, 72(6), 942-954.e7. https://doi.org/10.1016/j.molcel.2018.10.018

Data processing

In our demonstration workflow, bulk RNAseq FASTQs are processed using the nf-core rnaseq pipeline.

We adapted the pipeline for compatibility with Mantle in two ways. A process added at the beginning of the workflow converts Mantle bulk_rnaseq datasets into an input samplesheet, and the results data files are collected into Mantle rnaseq_multiqc_report and bulk_rnaseq_all_outputs datasets.
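The dataset-to-samplesheet conversion can be sketched roughly as follows. This is not the process used in the adapted pipeline: the `BulkRnaseqDataset` class is a hypothetical in-memory stand-in for a Mantle bulk_rnaseq dataset, the sample names and file paths are made up, and the column names (`sample`, `fastq_1`, `fastq_2`, `strandedness`) are assumed from the nf-core rnaseq samplesheet format.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Optional, TextIO

# Strandedness values listed for the bulk_rnaseq data type.
VALID_STRANDEDNESS = {"auto", "forward", "reverse"}


@dataclass
class BulkRnaseqDataset:
    """Hypothetical stand-in for a Mantle bulk_rnaseq dataset."""
    sample: str
    strandedness: str
    read1: str                   # path to the read 1 FASTQ
    read2: Optional[str] = None  # path to the read 2 FASTQ, if paired-end


def write_samplesheet(datasets: Iterable[BulkRnaseqDataset], out: TextIO) -> None:
    """Write one samplesheet row per dataset."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["sample", "fastq_1", "fastq_2", "strandedness"])
    for ds in datasets:
        if ds.strandedness not in VALID_STRANDEDNESS:
            raise ValueError(f"Unexpected strandedness: {ds.strandedness!r}")
        # Single-end datasets leave the fastq_2 column empty.
        writer.writerow([ds.sample, ds.read1, ds.read2 or "", ds.strandedness])


# Made-up datasets, for illustration only.
datasets = [
    BulkRnaseqDataset("WT_REP1", "auto", "wt_rep1_R1.fastq.gz", "wt_rep1_R2.fastq.gz"),
    BulkRnaseqDataset("KO_REP1", "reverse", "ko_rep1_R1.fastq.gz"),
]
buffer = io.StringIO()
write_samplesheet(datasets, buffer)
samplesheet_text = buffer.getvalue()
```

Validating strandedness before writing means a bad metadata value fails early, rather than partway through a pipeline run.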

The rnaseq_multiqc_report data type has the following properties:

multiqc_report_html (file, required)
multiqc_report_data (file, required)
multiqc_report_plots (file, required)
multiqc_star (file, required)
mqc_qualimap_genomic_origin_1 (file, required)
mqc_qualimap_gene_coverage_profile_Counts (file, required)
mqc_qualimap_gene_coverage_profile_Normalised (file, required)

The example workflow contains the P000xxx_rap_yeast_multiqc dataset of this type.

The bulk_rnaseq_all_outputs data type has the following properties:

bbsplit (file, optional)
fastqc (file, required)
pipeline_info (file, required)
salmon (file, required)
samplesheet (file, required): The samplesheet created by the Mantle process and used as input to the nf-core pipeline
star_salmon (file, required)
trimgalore (file, required)

The example workflow contains the P000xxx_rap_yeast_outputs dataset of this type.

Analyzing processed data

We analyzed data contained in the P000xxx_rap_yeast_outputs dataset in the rnaseq_analysis notebook, which ran in the mantle_rnaseq analysis environment.

Within the notebook, we used the Mantle SDK to download the directory contained in the star_salmon property and mark the dataset as an input to the analysis notebook. Then, we used the Scanpy package to perform differential expression analysis.
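As a rough illustration of a first step once the star_salmon directory is available locally, the sketch below reads a tab-separated gene count table and computes counts per million (CPM) per sample. It deliberately uses neither the Mantle SDK nor Scanpy, and it is a simple normalisation, not the differential expression analysis from the notebook. The table layout (gene_id and gene_name columns followed by one column per sample) is an assumption about the merged gene count file that nf-core rnaseq writes; the function names and toy values are hypothetical.

```python
import csv
import io
from typing import Dict, TextIO


def read_gene_counts(handle: TextIO) -> Dict[str, Dict[str, float]]:
    """Return {sample: {gene_id: count}} from a tab-separated count table."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Every column other than the two gene annotation columns is a sample.
    samples = [c for c in (reader.fieldnames or []) if c not in ("gene_id", "gene_name")]
    counts: Dict[str, Dict[str, float]] = {s: {} for s in samples}
    for row in reader:
        for s in samples:
            counts[s][row["gene_id"]] = float(row[s])
    return counts


def counts_per_million(sample_counts: Dict[str, float]) -> Dict[str, float]:
    """Scale one sample's counts so that they sum to one million."""
    total = sum(sample_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in sample_counts}
    return {gene: c * 1_000_000 / total for gene, c in sample_counts.items()}


# Toy table with made-up genes and samples, for illustration only.
toy_table = io.StringIO(
    "gene_id\tgene_name\tS1\tS2\n"
    "G1\tA\t10\t0\n"
    "G2\tB\t30\t50\n"
)
counts = read_gene_counts(toy_table)
cpm = {sample: counts_per_million(c) for sample, c in counts.items()}
```

In practice the resulting per-sample values would be handed to whichever analysis package the notebook uses.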

Wrapping up

Mantle’s compatibility with Nextflow pipelines – especially from the nf-core community – streamlines the processing of bioinformatics data, including bulk RNAseq data. It’s also easy to further analyze the data within a custom analysis environment, where you can import the bioinformatics-specific packages you need. Feel free to try out this workflow with your own bulk RNAseq data!