fastq2EZbakR for isoform analyses

Introduction

The first step towards an isoform-level analysis of your NR-seq data is to process your data with fastq2EZbakR. Crucially, fastq2EZbakR will count mutations in your sequencing reads (necessary for all NR-seq analyses) and combine this with information as to the set of transcript isoforms in your annotation with which each read is compatible (TEC assignment; necessary for isoform-level analyses). The main output of this pipeline, which will be found in the results/cB/ after completion, will be the input for the EZbakR isoform-level analysis.

Running the pipeline

Step 0: Setup

fastq2EZbakR uses conda/mamba to automatically install all necessary dependencies the first time you run the pipeline. Thus, you should install one of these before running fastq2EZbakR.

Historically, I have suggested installing mamba over conda, as Mamba was often orders of magnitude faster than conda for fastq2EZbakR dependency installation. Recently though, conda has added support for using the same underlying solver that mamba uses (libmamba). Thus, if you already have conda installed, check out this documentation for how to configure the faster solver.

Once you have mamba/conda installed, create a conda environment with Snakemake and Snakedeploy installed:

mamba create -c conda-forge -c bioconda --name deploy_snakemake 'snakemake<8.0.0' snakedeploy
# Or conda create ...

Installing a Snakemake version < 8.0.0 is not strictly necessary for using fastq2EZbakR, but it insures compatibility with some of the HPC deployment strategies documented here.

Step 1: Deploy

Active your conda environment and “deploy” fastq2EZbakR like so:

conda activate deploy_snakemake
snakedeploy deploy-workflow https://github.com/isaacvock/fastq2EZbakR.git . --branch main

This will create two new directories: config and workflow. config will include the config.yaml file that specifies what your input to the pipeline will be and sets a number of parameters that tunes how the pipeline is run. workflow contains a singular Snakefile that specifies the link at which all of the pipeline code is hosted. This is a convenient way to deploy a Snakemake pipeline, as it avoids you having to clone the entire repo and thus makes it easier to stay up-to-date with the most current pipeline version.

Step 2: Edit the config file

Assuming you are in the directory created when you clone the tutorial data repo as described in the intro, you should have a folder called data which shoud contain:

  1. hg38_chr6_subseq.fa: FASTA file to align reads to
  2. hg38_refseq_subseq.gtf: RefSeq GTF file with genomic feature coordinates, used for feature assignment of reads and transcript isoform quantification.
  3. fastqs: directory containing a number of sub-directories, each containing a pair of FASTQ files.

In practice, you will have to download a relevant genome FASTA file and annotation GTF file yourself. Furthermore, in the preprint we advocate for using your RNA-seq data to augment and filter a conservative reference (like RefSeq for human data; Ensembl annotations filtered for high support level isoforms would also be acceptable) with StringTie or similar tools. Our personal annotation assembly strategy (StringTie + post-hoc trimming/isoform flagging discussed in preprint and here) is implemented in a Snakemake pipeline called AnnotationCleaner.

You need to edit the following sections of your config file to specify input data and desired pipeline settings for isoform-level analyses:

samples

Paths to FASTQ file containing directories. Each directory should contain a single FASTQ file (for single-end data) or a pair of FASTQ files (for paired-end data).

samples:
    DMSO_8hr_1: data/fastqs/DMSO_8hr_1
    DMSO_8hr_2: data/fastqs/DMSO_8hr_2
    DMSO_8hr_3: data/fastqs/DMSO_8hr_3
    DMSO_8hr_nos4U: data/fastqs/DMSO_8hr_nos4U
    SMG1i_8hr_1: data/fastqs/SMG1i_8hr_1
    SMG1i_8hr_2: data/fastqs/SMG1i_8hr_2
    SMG1i_8hr_3: data/fastqs/SMG1i_8hr_3
    SMG1i_8hr_nos4U: data/fastqs/SMG1i_8hr_nos4U

control_samples

This specifies which samples are -s4U samples and should thus be used for SNP calling. In practice, if you have lots of -s4U samples, you only need to specify 1 or 2 of them:

control_samples: ['SMG1i_8hr_nos4U', 'DMSO_8hr_nos4U']

PE

Specify whether library is paired-end. Soon this will be automatically detected, but for now you need to tell fastq2EZbakR that this is a paired-end dataset

PE: True

genome

Specify the path to the genome FASTA file. This can be relative to the directory in which you will run Snakemake (i.e., the top of the Isoform_Analysis_Tutorial direcotry), or an absolute path to the file:

genome: data/hg38_chr6_subseq.fa

annotation

Specify the path to the genome annotation GTF file. Again, relative or absolute paths are acceptable. The example here is a relative path as in the genome example:

annotation: data/hg38_refseq_chr6subseq.gtf

indices

Relative or absolute path to, and name of, directory created to contain alignment indicies. You need to use STAR (the default fastq2EZbakR aligner) for transcript-isoform level analyses and so I have chosen to name it to reflect that aligner choice:

indices: data/star_chr6_subseq

strandedness

Strandedness of your library. This data is reverse stranded, meaning that the first read in a pair represents the reverse complement of the original RNA’s sequence:

strandedness: "reverse"

features

This is the key part. For transcript isoform level analyses, you need to turn on assignment of reads to their transcript equivalence class (TEC; tec in config). Here, I have also turned on two other non-default feature assignment strategies: assignment of reads to exonic bins (like in DEXSeq) and exon-exon junctions. We will use these two strategies in the EZbakR portion of the tutorial as they provide complementary information to the TEC analysis:

features:
  genes: True
  exons: True
  tec: True
  exonic_bins: True
  junctions: True
  eej: False
  eij: False

flat_annotation

Scroll down a bit and you will see this parameter. Since I have specified exonic_bins to True in features, a DEXSeq flattened annotation will be created. The flat_annotation parameter allows you to specify the name and location of this annotation:

flat_annotation: "data/hg38_chr6subseq_flat.gtf"

final_output

Scroll down further and you will see the final_output parameter. The default will work fine here, but to get you some useful experience, I would suggest also turning on the arrow output, as this can be helpful when working with larger datasets using EZbakR. I’ll show you how in the EZbakR portion of this tutorial:

final_output:
  cB: True
  cUP: False
  arrow: True

star_sam_tags

Keep scrolling down and you will see the star_sam_tags parameter. This specifies which additional bam file tags to include in the STAR alignment output. Since I specified the junction feature assignment (junctions: True in features), you should specify the STAR-specific junction information tags (jI and jM):

star_sam_tags: ["NH", "HI", "AS", "NM", "MD", "nM", "jI", "jM"]

Other parameters

Everything else can be left as is for this analysis, but some other parameters you may want to edit for your own analyses include:

  1. fastp_adapters and fastp_parameters: The former specifies adapter sequences for fastp to trim, and the latter specifies optional trimming parameters. I often like to do some conservative trimming, which may look like setting fastp_parameters to "--trim_poly_x --cut_tail --cut_front --trim_tail1 3 --trim_front1 3 --trim_front2 3 --trim_tail2 3", which hard trims the ends of all reads and also does some polyA/G and low quality base trimming.
  2. star_align_params: This sets optional arguments for STAR alignment. The default config file has some standard settings that I would suggest, including lenient mismatch penalization and end-to-end alignment (rather than allowing soft-clipping). The latter is important for assignment of reads to exons, as soft-clipping can mess with this strategy (as discussed in the docs).

Step 3: Run pipeline

Navigate to the top of the Isoform_Analysis_Tutorial directory and do a dry run of the pipeline with :

snakemake -n --cores all --use-conda --rerun-triggers mtime --keep-going

The -n flag makes this a “dry run”, which means that it just checks to see if the pipeline can be run as configured, throws an informative error if it finds something wrong, and tells you all of the steps that will be run if everything looks good. You can then actually run the pipeline by removing this flag:

snakemake --cores all --use-conda --rerun-triggers mtime --keep-going

With this downsampled dataset and a 4 core laptop, the full pipeline should take about 20 minutes to run with this strategy. By default, all of the steps in the pipeline will run serially (i.e., one after another). This is often not the best way to run fastq2EZbakR, or any Snakemake pipeline. If you are running the pipeline in an HPC environment, there are several ways to get Snakemake to more effectively use all of the resources available to you, and to thus run each step of the pipeline as a separate job, with several jobs able to be queued up in parallel in many cases. See here and here for some details. If you are using Snakemake version >= 8.0.0, checkout the executor plug-ins here.

Step 4 (Optional): load tracks

Follow the steps here to load the colored tracks created by fastq2EZbakR (found in the results/tracks directory). You can then zoom in on a particular locus, like SRSF3 to see the dramatic effect of SMG1 inhibition at particular loci

SRSF3 tracks

Next steps

If everything ran smoothly, you are ready to move on to the EZbakR portion of this tutorial! The cheatsheet directory of this tutorial repo contains the important expected output, if you want to compare what you got to the “right answers”. You can also use the data in cheatsheet in case you want to skip right to the EZbakR portion without waiting for fastq2EZbakR to finish running.