Estimate fractions of each RNA population
EstimateFractions.Rd
Estimate fractions of each RNA population
Usage
EstimateFractions(
obj,
features = "all",
mutrate_populations = "all",
fraction_design = NULL,
Poisson = TRUE,
strategy = c("standard", "hierarchical"),
filter_cols = "all",
filter_condition = `&`,
remove_features = c("NA", "__no_feature"),
split_multi_features = FALSE,
multi_feature_cols = NULL,
multi_feature_sep = "+",
pnew_prior_mean = -2.94,
pnew_prior_sd = 0.3,
pold_prior_mean = -6.5,
pold_prior_sd = 0.5,
hier_readcutoff = 300,
init_pnew_prior_sd = 0.8,
pnew_prior_sd_min = 0.01,
pold_est = NULL,
pold_from_nolabel = FALSE,
grouping_factors = NULL,
character_limit = 20,
overwrite = TRUE
)
Arguments
- obj
EZbakRData
orEZbakRArrowData
object- features
Character vector of the set of features you want to stratify reads by and estimate proportions of each RNA population. The default of "all" will use all feature columns in the
obj
's cB file.- mutrate_populations
Character vector of the set of mutational populations that you want to infer the rates of mutations for. By default, all mutation rates are estimated for all populations present in cB.
- fraction_design
"Design matrix" specifying which RNA populations exist in your samples. By default, this will be created automatically and will assume that all combinations of the
mutrate_populations
you have requested to analyze are present in your data. If this is not the case for your data, then you will have to create one manually.If you call the function
create_fraction_design(...)
, providing a vector of mutational population names as input, it will create afraction_design
table for you, with the assumption that every single possible combination of mutational populations is present in your data. You can then edit thepresent
column as necessary to get an appropriatefraction_design
for your use case. See below for details on the required contents offraction_design
and its interpretation.fraction_design
must have one column per element ofmutrate_populations
, with these columns sharing the name of themutrate_populations
. It must also have one additional column namedpresent
. All elements of fraction_design should be booleans (TRUE
orFALSE
). It should include all possible combinations ofTRUE
andFALSE
for themutrate_populations
columns. ATRUE
in one of these columns represents a population of RNA that is expected to have above background mutation rates of that type.present
will denote whether or not that population of RNA is expected to exist in your data.For example, assume you are doing a typical TimeLapse-seq/SLAM-seq/TUC-seq/etc. experiment where you have fed cells with s^4U and recoded any incorporated s^4U to a nucleotide that reverse transcriptase will read as a cytosine. That means that
mutrate_populations
will be "TC", since you want to estimate the fraction of RNA that was s^4U labeled, i.e., the fraction with high T-to-C mutation content.fraction_design
will thus have two columns:TC
andpresent
. It will also have two rows. One of these rows must have a value ofTRUE
forTC
, and the other must have a value ofFALSE
. The row with a value ofTRUE
forTC
represents the population of reads with high T-to-C mutation content, i.e., the reads from RNA that were synthesized while s^4U was present. The row with a value ofFALSE
forTC
reprsents the population of reads with low T-to-C mutation content, i.e., the reads from RNA that existed prior to s^4U labeling. Both of these populations exist in your data, so the value of thepresent
column should beTRUE
for both of these. See the lazily loadedstandard_fraction_design
object for an example of what this tibble could look like. ("lazily loadedstandard_fraction_design
object" means that if you runprint(standard_fraction_design)
after loadingEZbakR
withlibrary(EZbakR)
, then you can see its contents. More specifically, lazily loaded means that this table is not loaded into memory until you ask for it, via something like aprint()
call.)As another example, consider TILAC, a NR-seq extension developed by the Simon lab. TILAC was originally described in Courvan et al., 2022. In this method, two populations of RNA, one from s^4U fed cells and one from s^6G fed cells, are pooled and prepped for sequencing together. This allows for internally controlled comparisons of RNA abundance without spike-ins. s^4U is recoded to a cytosine analog by TimeLapse chemistry (or similar chemistry) and s^6G is recoded to an adenine analog. Thus,
fraction_design
includes columns calledTC
andGA
. A unique aspect of the TILACfraction_design
table is that one of the possible populations,TC
andGA
bothTRUE
, is denoted as not present (present
=FALSE
). This is because there is no RNA that was exposed to both s^4U and s^6G, thus a population of reads with both high T-to-C and G-to-A mutational content should not exist. To see an example of what a TILACfraction_design
table could look like, see the lazily loadedtilac_fraction_design
object.- Poisson
If
TRUE
, use U-content adjusted Poisson mixture modeling strategy. Often provides significant speed gain without sacrificing accuracy.- strategy
String denoting which new read mutation rate estimation strategy to use. Options include:
standard: Estimate a single new read and old read mutation rate for each sample. This is done via a binomial mixture model aggregating over
hierarchical: Estimate feature-specific new read mutation rate, regularizing the feature-specific estimate with a sample-wide prior. Currently only compatible with single mutation type mixture modeling.
- filter_cols
Which feature columns should be used to filter out feature-less reads. The default value of "all" checks all feature columns for whether or not a read failed to get assigned to said feature.
- filter_condition
Only two possible values for this make sense:
`&`
and`|`
. If set to`&`
, then all features infilter_cols
must have a "null" value (i.e., a value included inremove_features
) for the row to get filtered out. If set to`|`
, then only a single feature infilter_cols
needs to have one of these "null" values to get filtered out.- remove_features
All of the feature names that could indicate failed assignment of a read to a given feature. the fastq2EZbakR pipeline uses a value of '__no_feature'.
- split_multi_features
If a set of reads maps ambiguously to multiple features, should data for such reads be copied for each feature in the ambiguous set? If this is
TRUE
, thenmulti_feature_cols
also must be set. Examples where this should be set toTRUE
includes when analyzing exonic bins (concept defined in original DEXSeq paper), exon-exon junctions, etc.- multi_feature_cols
Character vector of columns that have the potential to include assignment to multiple features. Only these columns will have their features split if
split_multi_features
isTRUE
.- multi_feature_sep
String representing how ambiguous feature assignments are distinguished in the feature names. For example, the default value of "+" denotes that if a read maps to multiple features (call them featureA and featureB, for example), then the feature column will have a value of "featureA+featureB" for that read.
- pnew_prior_mean
Mean for logit(pnew) prior.
- pnew_prior_sd
Standard deviation for logit(pnew) prior.
- pold_prior_mean
Mean for logit(pold) prior.
- pold_prior_sd
Standard deviation for logit(pold) prior.
- hier_readcutoff
If
strategy
==hierarchical
, only features with this many reads are used to infer the distribution of feature-specific labeled read mutation rates.- init_pnew_prior_sd
If
strategy
==hierarchical
, this is the initial logit(pnew) prior standard deviation to regularize feature-specific labeled read mutation rate estimates.- pnew_prior_sd_min
The minimum logit(pnew) prior standard deviation when
strategy
is set to "hierarchcial". EZbakR will try to estimate this empirically as the standard deviation of initial feature-specific logit(pnew) estimates using high coverage features, minus the average uncertainty in the logit(pnew) estimates. As this difference can sometimes be negative, a value ofpnew_prior_sd_min
will be imputed in that case.- pold_est
Background mutation rate estimates if you have them. Can either be a single number applied to all samples or a named vector of values, where the names should be sample names.
- pold_from_nolabel
Fix background mutation rate estimate to mutation rates seen in -label samples. By default, a single background rate is used for all samples, inferred from the average mutation rate across all -label samples. The
grouping_factors
argument can be specified to use certain -label samples to infer background mutation rates for certain sets of +label samples.- grouping_factors
If
pold_from_nolabel
is TRUE, thengrouping_factors
will specify the sample-detail columns in the metadf that should be used to group -label samples by. Average mutation rates in each group of -label samples will be used as the background mutation rate estimate in +label samples with the same values for the relevant metadf columns.- character_limit
Maximum number of characters for naming out fractions output. EZbakR will try to name this as a "_" separated character vector of all of the features analyzed. If this name is greater than
character_limit
, then it will default to "fraction#", where "#" represents a simple numerical ID for the table.- overwrite
If TRUE and a fractions estimate output already exists that would possess the same metadata (features analyzed, populations analyzed, and fraction_design), then it will get overwritten with the new output. Else, it will be saved as a separate output with the same name + "_#" where "#" is a numerical ID to distinguish the similar outputs.