EstimateFractions
Introduction
The first step of almost any NR-seq analysis is estimating how much
of each mutational population is present for each feature in each
sample. For example, in a standard TimeLapse-seq/SLAM-seq/TUC-seq/etc.
experiment, you need to first estimate the fraction of reads from each
feature that have high T-to-C mutational content.
EstimateFractions() is designed for exactly this task. In this vignette, I will walk through the basics of using EstimateFractions(), while also diving into some of its unique functionality.
library(EZbakR)
# Going to show one tidyr trick to help with cB filtering
library(tidyr)
# Going to use a bit of dplyr magic in my demonstrations
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
The basics
First, let’s simulate some data to showcase how EstimateFractions() works:
simdata <- EZSimulate(nfeatures = 300, nreps = 2)
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
Technically, all you need to do is run EstimateFractions(), providing it your EZbakRData object, and all will be fine:
ezbdo <- EstimateFractions(ezbdo)
#> Estimating mutation rates
#> Summarizing data for feature(s) of interest
#> Averaging out the nucleotide counts for improved efficiency
#> Estimating fractions
#> Processing output
Despite this, it is important to realize that EstimateFractions() is automating a lot of things under the hood. In particular, it is making the following assumptions about your data:
- You want to analyze every mutational population tracked in your cB file. In this case, that is just the “TC” column of the simulated cB.
- All possible mutational populations are present in your data.
- The feature sets for which you want to estimate fractions include each unique combination of all feature columns in your cB. In this case, that is just the “feature” column of the simulated cB.
- You want to filter out rows for which all feature columns are the string “__no_feature” or “NA”.
- Rows with multiple feature assignments in a given feature column should not be split into separate rows for each feature assignment.
In the rest of this section, I will discuss how to adjust all of these behaviors.
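Before adjusting anything, it can help to see exactly which mutation-type and feature columns these defaults will act on by peeking at the cB itself:
# The simulated cB tracks a single mutation-type (TC) and a single
# feature column (feature)
head(simdata$cB)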
Tuning the mutational modeling
If you have multiple mutation-type columns in your cB, but would only like to analyze a subset of them, this subset can be specified using the mutrate_populations argument:
ezbdo <- EstimateFractions(ezbdo, mutrate_populations = "TC")
#> Estimating mutation rates
#> Summarizing data for feature(s) of interest
#> Averaging out the nucleotide counts for improved efficiency
#> Estimating fractions
#> Processing output
If you check ?EstimateFractions(), you will see that its default value is “all”, which means to use all present mutation-type columns.
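If your cB tracked more than one mutation-type, you could pass a vector instead. A quick sketch, assuming a cB with both TC and GA columns (which this simulated cB does not have):
# Hypothetical: requires both TC and GA mutation count columns in the cB
# ezbdo <- EstimateFractions(ezbdo, mutrate_populations = c("TC", "GA"))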
In addition, under the hood, EstimateFractions() creates what I have termed a fractions design matrix. This “design matrix” is really just a data frame with n + 1 columns, where n is the number of mutation-types you are analyzing. EZbakR provides some lazily loaded examples for you to check out.
“Standard” TC-only fractions design matrix:
# Observe contents of the standard fraction design table
standard_fraction_design
| TC    | present |
|-------|---------|
| TRUE  | TRUE    |
| FALSE | TRUE    |
There is one column for each mutation-type, and one column called present. Each value under a mutation-type column is either TRUE or FALSE. TRUE indicates that the row describes a population with high levels of that mutation-type. So the first row in this example describes the high T-to-C mutation content population. The value under the present column denotes whether or not that population is expected to be present in your data. The first row has a present value of TRUE because we expect there to be high T-to-C mutation content reads in our +label data (don’t worry about -label controls, they are properly handled automatically). The second row has a TC value of FALSE, indicating that it pertains to the population of reads with low T-to-C mutation content. This population is also expected to exist in a standard NR-seq dataset, so its present value is also TRUE.
For a more complicated example, consider the fractions design matrix for TILAC. TILAC is a method in which an s4U-labeled RNA population is mixed with an s6G-labeled population. In this case, you expect reads with high T-to-C content (new reads from the s4U-labeled sample), reads with high G-to-A content (new reads from the s6G-labeled sample), and reads with low T-to-C and G-to-A mutational content (old reads from either sample). However, you will NEVER expect a read with both high T-to-C and high G-to-A content, as no sample is subjected to both labels:
# Observe contents of the TILAC fraction design table
tilac_fraction_design

| TC    | GA    | present |
|-------|-------|---------|
| TRUE  | TRUE  | FALSE   |
| TRUE  | FALSE | TRUE    |
| FALSE | TRUE  | TRUE    |
| FALSE | FALSE | TRUE    |
Because of this, the present value for the TC == TRUE and GA == TRUE row is FALSE. The dually high mutation population is not present in this data. All other populations are present though, so their present values are TRUE.
You can automatically generate a fraction design table like so:
# Three populations for fun:
fd_table <- create_fraction_design(c("TC", "GA", "CG"))
fd_table
| TC    | GA    | CG    | present |
|-------|-------|-------|---------|
| TRUE  | TRUE  | TRUE  | TRUE    |
| FALSE | TRUE  | TRUE  | TRUE    |
| TRUE  | FALSE | TRUE  | TRUE    |
| FALSE | FALSE | TRUE  | TRUE    |
| TRUE  | TRUE  | FALSE | TRUE    |
| FALSE | TRUE  | FALSE | TRUE    |
| TRUE  | FALSE | FALSE | TRUE    |
| FALSE | FALSE | FALSE | TRUE    |
By default, create_fraction_design assumes all possible populations are present. You can edit this table to better reflect the true circumstances of your particular experiment. If you don’t provide a fraction design table, EstimateFractions() will use the default output of create_fraction_design() to ensure that all possible populations are modeled. This is rarely going to be a truly accurate fraction design table for multi-label experiments, but it’s a conservative default that you can easily adjust.
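As a minimal sketch, here is how you might edit the default output to recreate the TILAC design shown above; note that the name of the argument used to pass this table to EstimateFractions() is an assumption on my part, so check ?EstimateFractions() for the real name:
# Start from the default two-label table
fd_tilac <- create_fraction_design(c("TC", "GA"))

# Mark the dually high mutation population as absent,
# matching the lazily loaded tilac_fraction_design
fd_tilac$present[fd_tilac$TC & fd_tilac$GA] <- FALSE

# Hypothetical argument name:
# ezbdo <- EstimateFractions(ezbdo, fraction_design = fd_tilac)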
Tuning the feature set choice and filtering
fastq2EZbakR, the upstream pipeline I developed to conveniently
produce output compatible with EZbakR, is able to assign reads to lots
of different “features”. These include genes, exonic regions of genes,
transcript equivalence classes, exon-exon junctions, and more. To make
full effective use of this diverse set of feature assignments, its
important to understand how EstimateFractions()
will treat
the various feature columns in your cB. Often, you will be interested in
analyzing subsets of these features separately. For example, it might
make sense to estimate fractions for gene + transcript equivalence class
combos for performing isoform-level analyses (discussed later in this
vignette), gene + exon-exon junction combos, and the standard
exonic-gene feature, all with the same cB. You can specify which
features to use in a given analysis via setting the
features
argument:
ezbdo <- EstimateFractions(ezbdo, features = 'feature')
#> Estimating mutation rates
#> Summarizing data for feature(s) of interest
#> Averaging out the nucleotide counts for improved efficiency
#> Estimating fractions
#> Processing output
In this example, there is only one feature column, called feature, but in practice the features argument can be provided a vector of feature column names.
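For instance, a fastq2EZbakR cB typically includes gene (“GF”) and exonic-gene (“XF”) assignment columns (see the Arrow example at the end of this vignette), which could be analyzed jointly. A sketch, assuming a cB that actually contains those columns:
# Hypothetical: requires a cB with GF and XF feature columns
# ezbdo <- EstimateFractions(ezbdo, features = c("GF", "XF"))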
Sometimes, a read was not assignable to a given feature. In that case, there is likely a characteristic string that denotes failed assignment. In the current version of fastq2EZbakR, this is "__no_feature" (though in older versions it was NA, which is a bit harder to deal with; more on that later). The default is thus for EstimateFractions() to filter out any rows of the cB for which all analyzed features have this value. This is set by the remove_features argument, which is a vector of strings that should be considered ripe for filtering:
ezbdo <- EstimateFractions(ezbdo, remove_features = c("__no_feature", "feature1"))
#> Estimating mutation rates
#> Summarizing data for feature(s) of interest
#> Averaging out the nucleotide counts for improved efficiency
#> Estimating fractions
#> Processing output
You can also use this to filter out certain features, like “feature1” in the simulated example. As I mentioned though, all analyzed feature columns need to have one of the remove_features strings in order for a row to get filtered out. This behavior can be changed to the opposite extreme, where only a single feature needs to fail this test for the entire row to get filtered out:
ezbdo <- EstimateFractions(ezbdo, filter_condition = `|`)
#> Estimating mutation rates
#> Summarizing data for feature(s) of interest
#> Averaging out the nucleotide counts for improved efficiency
#> Estimating fractions
#> Processing output
As I mentioned, older versions of fastq2EZbakR and bam2bakR would denote failed assignment with an NA. While the default value for remove_features includes the string "NA", this does not properly handle filtering out rows whose feature columns contain an actual NA value. You can convert NA’s in your cB file to whatever string you please with tidyr::replace_na():
example_df <- data.frame(feature = c('A', NA, 'C'),
                         other_feature = c(NA, 'Y', 'Z'))

replaced_df <- replace_na(example_df,
                          list(feature = '__no_feature',
                               other_feature = 'NA'))
replaced_df
| feature      | other_feature |
|--------------|---------------|
| A            | NA            |
| __no_feature | Y             |
| C            | Z             |
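The same trick applies to a real cB before you construct your EZbakRData object. A sketch with hypothetical feature column names:
# Hypothetical feature columns; substitute your cB's actual column names
# cB_clean <- replace_na(cB, list(GF = "__no_feature", XF = "__no_feature"))
# ezbdo <- EZbakRData(cB_clean, metadf)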
Finally, some feature assignments will be ambiguous. That is, one read may be assigned to multiple instances of a given feature. For example, when assigning reads to exon-exon junctions, one read may overlap multiple exon-exon junctions. In fastq2EZbakR, such instances are handled by including the names of all features a read assigned to, separated by “+”. You can see an example of this if you inspect the cB created by EZbakR’s isoform simulator, SimulateIsoforms(). In some cases, you will want to split these rows into multiple rows, one for each feature assignment. To do that, you can specify the split_multi_features and multi_feature_cols arguments:
ezbdo <- EstimateFractions(ezbdo,
                           split_multi_features = TRUE,
                           multi_feature_cols = "feature")
#> Estimating mutation rates
#> Summarizing data for feature(s) of interest
#> Averaging out the nucleotide counts for improved efficiency
#> Estimating fractions
#> Processing output
split_multi_features = TRUE will copy the data for multi-feature assignment rows for any of the feature columns denoted in multi_feature_cols. If the separator for multi-feature assigned reads is different from “+”, this can be addressed by altering the multi_feature_sep argument.
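As an example, if your upstream pipeline had separated feature names with commas rather than “+”, a minimal sketch would be:
# Hypothetical separator; "+" is the fastq2EZbakR convention
# ezbdo <- EstimateFractions(ezbdo,
#                            split_multi_features = TRUE,
#                            multi_feature_cols = "feature",
#                            multi_feature_sep = ",")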
Improving estimates
Two things can impact the accuracy of fractions estimation:
- Dropout: this is the phenomenon by which metabolically labeled RNA, or sequencing reads derived from such RNA, are lost either during library preparation or during computational processing of the raw sequencing data. Several labs have independently identified and discussed the prevalence of dropout in NR-seq and other metabolic labeling RNA-seq experiments. Recent work suggests that there are three potential causes: 1) loss of labeled RNA on plastic surfaces during RNA extraction, 2) RT falloff due to modifications of the metabolic label introduced by some NR-seq chemistries, and 3) loss of reads with many NR-induced mutations due to poor read alignment.
- Inaccurate mutation rate estimates
In this section, we will discuss some strategies you can use to address these two challenges.
Correcting dropout
If you have -label data to compare to, EZbakR can use this data to
quantify and correct for dropout. The strategy used by EZbakR is
identical to that previously implemented in bakR. You can read more
about the mathematical details of this strategy here.
To implement dropout correction, run CorrectDropout(), specifying which metadf sample characteristic columns should be used to associate -label and +label samples:
ezbdo <- CorrectDropout(ezbdo,
                        grouping_factors = "treatment")
#> Estimated rates of dropout are:
#> sample pdo
#> 1 sample1 0.035608648
#> 2 sample2 0.000000000
#> 3 sample3 0.000000000
#> 4 sample4 0.006755018
grouping_factors is technically not a required argument, as EZbakR will try to infer it automatically. EZbakR assumes that all metadf sample characteristic columns should be used for grouping though, so it’s important to know that this default can be altered as necessary.
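For instance, if your metadf also contained a batch column that should factor into how -label and +label samples are paired, a minimal sketch would be (the batch column is hypothetical):
# Hypothetical metadf with both treatment and batch columns
# ezbdo <- CorrectDropout(ezbdo, grouping_factors = c("treatment", "batch"))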
Inferring background mutation rates from -label data
In some cases (e.g., if label incorporation rates are low, or if a
very short label time was used), it can be difficult to estimate the
high and low mutation rates in a given sample. To curtail this
challenge, you can use -label data to infer background mutation rates,
and then estimate the chemically induced mutation rate fixing the
background mutation rate to the -label rate. To do this, you just need
to set pold_from_nolabel
to TRUE
:
ezbdo <- EstimateFractions(ezbdo,
                           pold_from_nolabel = TRUE)
#> Estimating mutation rates
#> Summarizing data for feature(s) of interest
#> Averaging out the nucleotide counts for improved efficiency
#> Estimating fractions
#> Processing output
By default, a single background mutation rate (pold) will be estimated using all of the -label data. Alternatively, you can specify sample characteristics by which to stratify -label data, so that one pold is estimated per group of samples with the same value for the metadf columns specified in grouping_factors:
ezbdo <- EstimateFractions(ezbdo,
                           pold_from_nolabel = TRUE,
                           grouping_factors = 'treatment')
#> Estimating mutation rates
#> Summarizing data for feature(s) of interest
#> Averaging out the nucleotide counts for improved efficiency
#> Estimating fractions
#> Processing output
Fancy functionalities
In the first section, we discussed the basic functionality of
EstimateFractions()
and how to alter its key behaviors. In
this section, I will discuss some of the cooler, more niche analysis
strategies that EstimateFractions()
can implement.
Hierarchical mutation rate model
The two-component mixture modeling strategy implemented in tools like bakR and GRAND-SLAM is largely considered the “gold standard” for analyzing NR-seq data. Like any statistical method though, it makes assumptions, and real data will often violate these assumptions. Namely, these models typically assume that every RNA synthesized in the presence of metabolic label has an equal probability of incorporating said label. What if this is not true though?

This may seem like a pedantic statistical question, but there is evidence that some RNAs deviate strongly from this assumption. For example, it has been noted (paper from the Churchman lab here) that mitochondrial RNAs have much lower mutation rates in reads from new RNA than other RNAs.
To account for this heterogeneity, EstimateFractions() can implement what I will term a “hierarchical” two-component mixture model. “Hierarchical” means that a new read mutation rate will be estimated for each feature, but this estimate will be strongly informed by the sample-wide average new read mutation rate. In other words, if a given feature has enough coverage and strong evidence for its new read mutation rate being different from the sample-wide average, then this feature-specific mutation rate will be estimated and used in estimating the fraction of high mutation content reads for this feature. If a feature has limited coverage though, its new read mutation rate estimate will be strongly pushed towards the sample-wide average. In other words, the feature-specific mutation rates are “regularized” towards the sample-wide average. This strategy will take a bit longer to run and can be implemented like so:
ezbdo <- EstimateFractions(ezbdo,
                           strategy = 'hierarchical')
#> Estimating mutation rates
#> Summarizing data for feature(s) of interest
#> Averaging out the nucleotide counts for improved efficiency
#> Estimating fractions
#> FITTING HIERARCHICAL TWO-COMPONENT MIXTURE MODEL:
#> Estimating distribution of feature-specific pnews
#> Estimating fractions with feature-specific pnews
#> Processing output
When doing this, the feature-specific and sample-wide mutation rates will both be saved in the ezbdo$mutation_rates list. As usual, let’s check the results:
est <- EZget(ezbdo, type = 'fractions')
truth <- simdata$PerRepTruth
compare <- dplyr::inner_join(est, truth, by = c('sample', 'feature'))
plot(compare$true_fraction_highTC,
     compare$fraction_highTC)
abline(0,1)
Still good! For now, I suggest using this strategy with caution though, as it is technically still “experimental”. In particular, watch out for extreme feature-specific mutation rates, which can be IDed in the ezbdo$mutation_rates list.
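A quick way to screen for such outliers is to inspect what was saved in that list; the exact names of its elements are not something I can guarantee here, so start by listing them:
# See which mutation rate tables were saved by the hierarchical fit
names(ezbdo$mutation_rates)

# Then inspect the feature-specific estimates; element name below is hypothetical
# head(ezbdo$mutation_rates$hierarchical_pnews)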
Isoform deconvolution
For many years now, we have had tools to estimate transcript isoform abundances from RNA-seq data. These include RSEM, kallisto, Salmon, and many more. Wouldn’t it be nice if we could similarly estimate synthesis and degradation rate constants for individual transcript isoforms with NR-seq data? EZbakR now implements such an analysis strategy!

To use it, you will need to have a cB in which reads have been assigned to their transcript equivalence class, which is just a fancy way of saying “the set of isoforms with which they are completely consistent”. fastq2EZbakR is able to do this, so check out its documentation for details. We can simulate such a cB ourselves for demonstration purposes:
# Simulates a single sample worth of data
simdata_iso <- SimulateIsoforms(nfeatures = 300)
# We have to manually create the metadf in this case
metadf <- data.frame(sample = 'sampleA',
                     tl = 4,
                     condition = 'A')

ezbdo <- EZbakRData(simdata_iso$cB,
                    metadf)
If you inspect the cB simulated by SimulateIsoforms(), you will note that there is a column called transcripts, which represents transcript equivalence classes. The transcript IDs for all of the transcripts a read is consistent with are separated by “+”’s, though in this case, we don’t want to separate and copy the data for each isoform. Rather, we want to estimate fractions for each unique equivalence class:
ezbdo <- EstimateFractions(ezbdo)
#> Estimating mutation rates
#> Summarizing data for feature(s) of interest
#> Averaging out the nucleotide counts for improved efficiency
#> Estimating fractions
#> Processing output
To get isoform-specific estimates, we will need to run a new function, called EstimateIsoformFractions(). Before doing this though, we need to import transcript isoform quantification estimates from tools like RSEM, which will be used by EstimateIsoformFractions(). To do this, you can use ImportIsoformQuant(). ImportIsoformQuant() has three parameters:
- obj: The EZbakRData object to which you want to add isoform quantification information.
- files: A named vector of paths to each isoform quantification output file you want to import. The names should be the relevant sample names as they appear in your cB.
- quant_tool: The tool you used for isoform quantification.
ImportIsoformQuant() is just a convenient wrapper for tximport::tximport, so I urge you to read that tool’s documentation for more information (documentation here).
In this example, I will hack a solution, since I don’t have any isoform quantification output to use. You could technically do something similar if you are having a tough time using ImportIsoformQuant(), but I wouldn’t recommend it:
### Hack in the true, simulated isoform levels
reads <- simdata_iso$ground_truth %>%
  dplyr::select(transcript_id, true_count, true_TPM) %>%
  dplyr::mutate(sample = 'sampleA',
                effective_length = 10000) %>%
  dplyr::rename(expected_count = true_count,
                TPM = true_TPM)
# Name of table needs to have "isoform_quant" in it
ezbdo[['readcounts']][['simulated_isoform_quant']] <- reads
### Perform deconvolution
ezbdo <- EstimateIsoformFractions(ezbdo)
#> Analyzing sample sampleA...
We can then see how it did by comparing to ground truth:
est <- EZget(ezbdo,
             type = 'fractions',
             features = "transcript_id")
truth <- simdata_iso$ground_truth

compare <- truth %>%
  dplyr::inner_join(est, by = c("feature", "transcript_id"))

plot(compare$true_fn,
     compare$fraction_highTC)
abline(0,1)
This simulation is a bit extreme, as around 50% of all isoform abundance differences are assumed to be completely driven by differences in isoform stability, which leads to underestimation of some of the most extreme fraction news. Other than that though, it looks good!
Using the Apache Arrow backend
NOTE: The documentation for this section lacks good examples, as I don’t yet have an example external dataset included with EZbakR’s installation. Thus, I will describe the details of how the analysis would go without showing a real analysis. Look for this to change in the not-too-distant future.

What if you have a massive dataset with tens of samples that you would like to analyze? Loading the entire cB table for such a dataset into memory will surely crash most laptops. If you have access to some sort of HPC cluster, you could of course use that, but there is nothing like the convenience and interactivity of working with a dataset on your personal computer. Because of this, EZbakR is able to use Apache Arrow’s R frontend (i.e., the arrow package) to help with large datasets. The steps for this process are described below.
Step 1: Create an Arrow Dataset
You need to create an Arrow Dataset partitioned by the “sample” column of your cB. This will create a set of .parquet files, with one file created for each sample. This will allow EstimateFractions() to load your data in one sample at a time, so that only a single sample’s worth of the cB is held in RAM at once:
library(arrow)
### Move into the directory with your cB file
setwd("Path/to/cB/containing/directory")
### This will not load the cB into memory
# You can run `read_csv("cB.csv.gz", n_max = 1)` to check to see what
# the order of the columns are, as this order needs to match your provided
# schema.
ds <- open_dataset("cB.csv.gz",
                   format = "csv",
                   schema = schema(
                     sample = string(),
                     rname = string(),
                     GF = string(),
                     XF = string(),
                     exon_bin = string(),
                     bamfile_transcripts = string(),
                     junction_start = string(),
                     junction_end = string(),
                     TC = int64(),
                     nT = int32(),
                     sj = bool(),
                     n = int64()
                   ),
                   skip_rows = 1)
### Create Arrow dataset
setwd("Path/to/where/you/want/to/write/Arrow/Dataset")
ds %>%
  group_by(sample) %>%
  write_dataset("fulldataset/",
                format = "parquet")
See the arrow documentation for a lot more details about how to tune the dataset creation process. This will never load the entire cB into memory, though it may use a bit too much RAM if you try to add some of the custom filtering or summarization discussed in the arrow docs.
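Once written, you can sanity-check the partitioned dataset by pulling a few rows from a single sample, which only materializes that slice in memory (the sample name here is hypothetical):
# Peek at one sample's partition without loading the full dataset
# ds_check <- open_dataset("fulldataset/")
# ds_check %>%
#   filter(sample == "WT_1") %>%
#   head(5) %>%
#   collect()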
Step 2: Create EZbakRArrowData object
You can then create an EZbakRArrowData object similarly to how you would create a standard EZbakRData object:
ds <- arrow::open_dataset("Path/to/where/you/want/to/Arrow/Dataset/")
metadf <- tibble(sample = c("WT_1", "WT_2", "WT_ctl",
                            "KO_1", "KO_2", "KO_ctl"),
                 tl = c(2, 2, 0,
                        2, 2, 0),
                 genotype = rep(c('WT', 'KO'), each = 3))
ezbado <- EZbakRArrowData(ds, metadf)
Step 3: Run EstimateFractions like normal
Finally, just run EstimateFractions() with all of the settings you would normally use if you were working with an EZbakRData object:
ezbado <- EstimateFractions(ezbado,
                            features = c("GF", "XF",
                                         "junction_start", "junction_end"),
                            filter_cols = c("XF", "junction_start",
                                            "junction_end"),
                            filter_condition = `|`,
                            split_multi_features = TRUE,
                            multi_feature_cols = c("junction_start",
                                                   "junction_end"))