Estimate isoform-specific fraction news (or more generally "fractions").
EstimateIsoformFractions.Rd
Combines the output of EstimateFractions
with transcript isoform
quantification performed by an outside tool (e.g., RSEM, kallisto, salmon, etc.)
to infer transcript isoform-specific fraction news (or more generally fraction
of reads coming from a particular mutation population).
Usage
EstimateIsoformFractions(
obj,
features = NULL,
populations = NULL,
fraction_design = NULL,
repeatID = NULL,
exactMatch = TRUE,
fraction_name = NULL,
quant_name = NULL,
gene_to_transcript = NULL,
overwrite = TRUE,
TPM_min = 1,
count_min = 10
)
Arguments
- obj
An EZbakRData object.
- features
Character vector of the set of features you want to stratify reads by and estimate proportions of each RNA population. The default of "all" will use all feature columns in the obj's cB.
- populations
Mutational populations that were analyzed to generate the fractions table to use. For example, this would be "TC" for a standard s4U-based nucleotide recoding experiment.
- fraction_design
"Design matrix" specifying which RNA populations exist in your samples. By default, this will be created automatically and will assume that all combinations of the mutrate_populations you have requested to analyze are present in your data. If this is not the case for your data, then you will have to create one manually. See the docs for EstimateFractions (run ?EstimateFractions()) for more details.
- repeatID
If multiple fractions tables exist with the same metadata, then this is the numerical index by which they are distinguished.
- exactMatch
If TRUE, then features and populations have to exactly match those for a given fractions table for that table to be used. Since the default is TRUE, specifying only a subset of a table's features or populations will not match it unless you set exactMatch = FALSE.
- fraction_name
Name of fraction estimate table to use. Should be stored in the obj$fractions list under this name. Can also rely on specifying features and/or populations and having EZget() find it.
- quant_name
Name of transcript isoform quantification table to use. Should be stored in the obj$readcounts list under this name. Use ImportIsoformQuant() to create this table. If quant_name is NULL, it will search for tables containing the string "isoform_quant" in their name, as that is the naming convention used by ImportIsoformQuant(). If more than one such table exists, an error will be thrown and you will have to specify the exact name in quant_name.
- gene_to_transcript
Table with columns transcript_id and all feature related columns that appear in the relevant fractions table. This is only relevant as a hack to deal with the case where STAR includes in its transcriptome alignment transcripts on the opposite strand from where the RNA actually originated. This table will be used to filter out such transcript-feature combinations that should not exist.
- overwrite
If TRUE and a fractions estimate output already exists that would possess the same metadata (features analyzed, populations analyzed, and fraction_design), then it will get overwritten with the new output. Else, it will be saved as a separate output with the same name + "_#" where "#" is a numerical ID to distinguish the similar outputs.
- TPM_min
Minimum TPM for a transcript to be kept in analysis.
- count_min
Minimum expected_count for a transcript to be kept in analysis.
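The quant_name lookup described above can be illustrated with a rough sketch. A plain named list stands in for obj$readcounts, and find_quant_table() and the table names are hypothetical helpers invented for this illustration; this is not EZbakR's actual implementation. Erroring when no table matches is also an assumption, since only the multiple-match case is documented.

```r
# Stand-in for obj$readcounts: a plain named list (hypothetical table names)
readcounts <- list(
  gene_counts = data.frame(),
  rsem_isoform_quant = data.frame()
)

# Hypothetical helper mimicking the documented behavior of quant_name = NULL
find_quant_table <- function(readcounts, quant_name = NULL) {
  # An explicitly supplied name is used as-is
  if (!is.null(quant_name)) {
    return(quant_name)
  }
  # Otherwise look for tables following the "isoform_quant" naming convention
  hits <- grep("isoform_quant", names(readcounts), fixed = TRUE, value = TRUE)
  if (length(hits) == 0) {
    # Assumption: the zero-match case is not described in the documentation
    stop("No table with 'isoform_quant' in its name was found.")
  }
  if (length(hits) > 1) {
    stop("Multiple isoform quantification tables found; set quant_name.")
  }
  hits
}

chosen <- find_quant_table(readcounts)
```

If a second table such as "salmon_isoform_quant" were added to the list, the helper would stop with an error until quant_name is given explicitly, mirroring the behavior described for the real function.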
Value
An EZbakRData object with an additional table under the "fractions" list. It has the same form as the output of EstimateFractions(), and will have the feature column "transcript_id".
Details
EstimateIsoformFractions
expects as input a "fractions" table with estimates
for transcript equivalence class (TEC) fraction news. A transcript equivalence class
is the set of transcript isoforms with which a sequencing read is compatible.
fastq2EZbakR is able to assign
reads to these equivalence classes so that EZbakR can estimate the fraction of
reads in each TEC that are from labeled RNA.
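To make the notion of a TEC concrete, here is a minimal sketch that groups reads by the set of isoforms they are compatible with. The read names, transcript IDs, and the "+"-joined label are illustrative assumptions, not necessarily the format fastq2EZbakR produces.

```r
# Hypothetical per-read compatibility: the isoforms each read could have come from
read_compat <- list(
  read1 = c("tx1", "tx2"),
  read2 = c("tx2", "tx1"),
  read3 = c("tx1"),
  read4 = c("tx1", "tx2", "tx3")
)

# Canonical TEC label: sorted isoform IDs joined by "+"
# (labeling convention assumed here purely for illustration)
tec_label <- vapply(
  read_compat,
  function(tx) paste(sort(tx), collapse = "+"),
  character(1)
)

# Number of reads assigned to each TEC
tec_counts <- table(tec_label)
```

Reads 1 and 2 fall into the same TEC because they are compatible with the same set of isoforms, regardless of the order in which those isoforms are listed.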
EstimateIsoformFractions
estimates transcript isoform fraction news
by fitting a linear mixing model to the TEC fraction new estimates and transcript
isoform abundance estimates. In other words, each TEC fraction new (data) is modeled
as a weighted average of transcript isoform fraction news (parameters to estimate),
with the weights determined by the relative abundances of the transcript isoforms
in the TEC (data). The TEC fraction new is modeled as a Beta distribution with mean
equal to the weighted transcript isoform fraction new average and variance related
to the number of reads in the TEC.
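The mixing model can be sketched for a single TEC as follows. All numbers are made up, and the Beta parameterization (shape parameters equal to the mean, or one minus the mean, times the read count) is only one plausible way to tie the variance to the number of reads; it is an assumption for illustration and may differ from what EZbakR actually fits.

```r
# Hypothetical TEC compatible with three isoforms
abundance  <- c(tx1 = 50,  tx2 = 30,  tx3 = 20)   # isoform abundances (illustrative)
fn_isoform <- c(tx1 = 0.2, tx2 = 0.5, tx3 = 0.9)  # isoform fraction news (parameters)

# Weights are the relative abundances of the isoforms within the TEC
weights <- abundance / sum(abundance)

# Modeled mean of the TEC fraction new: weighted average of isoform fraction news
fn_tec_mean <- sum(weights * fn_isoform)

# Assumed Beta parameterization: more reads in the TEC means a tighter distribution
n_reads <- 200
shape1 <- fn_tec_mean * n_reads
shape2 <- (1 - fn_tec_mean) * n_reads

# Log-likelihood of an observed TEC fraction new estimate under this model
fn_tec_observed <- 0.40
loglik <- dbeta(fn_tec_observed, shape1, shape2, log = TRUE)
```

In a full fit, the isoform fraction news would be chosen to maximize the sum of such log-likelihoods over all TECs that share isoforms, which is what allows isoform-level values to be separated from TEC-level data.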
Transcript isoforms with an estimated TPM less than TPM_min
(1 by default) or an estimated number of read
counts less than count_min
(10 by default) are removed from TECs and will
not have their fraction news estimated.
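The filtering rule can be illustrated on a small, made-up quantification table. The column names TPM and expected_count follow the table built in the Examples; the values are invented, and treating isoforms exactly at a threshold as kept is an assumption.

```r
# Made-up isoform quantification table
quant <- data.frame(
  transcript_id  = c("tx1", "tx2", "tx3", "tx4"),
  TPM            = c(5.0, 0.4, 2.0, 12.0),
  expected_count = c(120, 30, 6, 400),
  stringsAsFactors = FALSE
)

TPM_min   <- 1   # default in EstimateIsoformFractions()
count_min <- 10  # default in EstimateIsoformFractions()

# An isoform is dropped if it fails either threshold,
# so it is kept only if it passes both
kept <- quant[quant$TPM >= TPM_min & quant$expected_count >= count_min, ]
```

Here tx2 is dropped for low TPM and tx3 for low read count, leaving tx1 and tx4.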
Examples
# Load dependencies
library(dplyr)
# Simulates a single sample worth of data
simdata_iso <- SimulateIsoforms(nfeatures = 300)
# We have to manually create the metadf in this case
metadf <- tibble(sample = 'sampleA',
                 tl = 4,
                 condition = 'A')
ezbdo <- EZbakRData(simdata_iso$cB,
                    metadf)
ezbdo <- EstimateFractions(ezbdo)
#> Estimating mutation rates
#> Summarizing data for feature(s) of interest
#> Averaging out the nucleotide counts for improved efficiency
#> Estimating fractions
#> Processing output
### Hack in the true, simulated isoform levels
reads <- simdata_iso$ground_truth %>%
  dplyr::select(transcript_id, true_count, true_TPM) %>%
  dplyr::mutate(sample = 'sampleA',
                effective_length = 10000) %>%
  dplyr::rename(expected_count = true_count,
                TPM = true_TPM)
# Name of table needs to have "isoform_quant" in it
ezbdo[['readcounts']][['simulated_isoform_quant']] <- reads
### Perform deconvolution
ezbdo <- EstimateIsoformFractions(ezbdo)
#> Analyzing sample sampleA...