Average parameter estimates across replicates, and regularize variance estimates
AverageAndRegularize.Rd
AverageAndRegularize fits a generalized linear model to your data to effectively average parameter estimates across replicates and get overall uncertainty estimates for those parameters. The linear model to which your data is fit is specified via an R formula object supplied to the formula_mean parameter. Uncertainty estimates are regularized via a hierarchical modeling strategy originally introduced with bakR, though slightly improved upon since then.
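To see why a linear model can "average across replicates", the following base R sketch fits an intercept-free model to made-up log(kdeg) values for a single feature. The data, the column names, and the use of lm() here are illustrative assumptions, not EZbakR's internal code; the point is only that a formula like ~ treatment without an intercept yields one coefficient per group, equal to that group's mean.

```r
# Hypothetical toy data: log(kdeg) estimates for one feature in four samples
metadf <- data.frame(
  sample = c("s1", "s2", "s3", "s4"),
  treatment = factor(c("ctl", "ctl", "drug", "drug"))
)
log_kdeg <- c(-2.1, -1.9, -1.2, -1.0)

# Dropping the intercept (0 + ...) gives one coefficient per treatment level
fit <- lm(log_kdeg ~ 0 + treatment, data = metadf)
coefs <- coef(fit)

# The same numbers obtained by simple per-group averaging
group_means <- tapply(log_kdeg, metadf$treatment, mean)

print(coefs)
print(group_means)
```

Because no level is absorbed into an intercept, each coefficient can be read directly as the replicate average for its group.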
Usage
AverageAndRegularize(
  obj,
  features = NULL,
  parameter = "log_kdeg",
  type = "kinetics",
  kstrat = NULL,
  populations = NULL,
  fraction_design = NULL,
  exactMatch = TRUE,
  repeatID = NULL,
  formula_mean = NULL,
  sd_grouping_factors = NULL,
  include_all_parameters = TRUE,
  sd_reg_factor = 10,
  error_if_singular = TRUE,
  min_reads = 10,
  convert_tl_to_factor = TRUE,
  force_lm = FALSE,
  force_optim = force_lm,
  conservative = FALSE,
  character_limit = 20,
  feature_lengths = NULL,
  overwrite = TRUE
)
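A minimal sketch of where this call might sit in a workflow is shown below. It assumes the EZbakR package is installed and that EZSimulate() returns a list with cB and metadf elements; EZSimulate() does not appear on this page and is named here as an assumption. The block is guarded so that it does nothing when the package is unavailable, and the argument values simply restate the defaults from the signature above.

```r
# Arguments restating the documented defaults, collected for readability
avg_args <- list(
  parameter = "log_kdeg",
  type = "kinetics",
  sd_reg_factor = 10,
  conservative = FALSE
)

if (requireNamespace("EZbakR", quietly = TRUE)) {
  # Assumption: EZSimulate() produces a small simulated dataset
  simdata <- EZbakR::EZSimulate(30)

  # Build the EZbakRData object, then run the steps that obj requires
  ezbdo <- EZbakR::EZbakRData(simdata$cB, simdata$metadf)
  ezbdo <- EZbakR::EstimateFractions(ezbdo)
  ezbdo <- EZbakR::EstimateKinetics(ezbdo)

  # Average log(kdeg) across replicates and regularize the variances
  ezbdo <- do.call(
    EZbakR::AverageAndRegularize,
    c(list(obj = ezbdo), avg_args)
  )
}
```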
Arguments
- obj
An EZbakRFractions or EZbakRKinetics object, which is an EZbakRData object on which EstimateFractions() or EstimateKinetics() has been run.
- features
Character vector of the set of features you want to stratify reads by and estimate proportions of each RNA population. The default of "all" will use all feature columns in the obj's cB.
- parameter
Parameter to average across replicates of a given condition.
- type
What type of table is the parameter found in? Default is "kinetics", but can also be set to "fractions".
- kstrat
If type == "kinetics", then kstrat specifies the kinetic parameter inference strategy.
- populations
Character vector of the set of mutational populations that you want to infer the fractions of. Only relevant if type == "fractions".
- fraction_design
"Design matrix" specifying which RNA populations exist in your samples. Only relevant if type == "fractions".
- exactMatch
If TRUE, then features and populations have to exactly match those for a given fractions table for that table to be used. Since this is TRUE by default, you cannot specify only a subset of features or populations by default.
- repeatID
If multiple kinetics or fractions tables exist with the same metadata, then this is the numerical index by which they are distinguished.
- formula_mean
An R formula object specifying how the parameter of interest depends on the sample characteristics specified in obj's metadf. The most common formula will be ~ treatment or ~ treatment:duration, where treatment and duration would be replaced with whatever you called the relevant sample characteristics in your metadf. ~ treatment means that an average value of parameter should be estimated for each set of samples with the same value for treatment in the metadf. ~ treatment:duration specifies that an average value of parameter should be estimated for each set of samples with the same combination of treatment and duration values in the metadf. An example of the latter case is a situation where you have two or more treatments (e.g., drug treated and untreated control) which were applied for different durations of time (e.g., 4 and 8 hours).
NOTE: EZbakR automatically removes any intercept terms from the model. That way, there is no ambiguity about what parameter is defined as the reference.
- sd_grouping_factors
What metadf columns should data be grouped by when estimating standard deviations across replicates? If this is NULL, then EZbakR will check to see if the formula_mean specifies a formula that cleanly stratifies samples into disjoint groups. For example, the formula ~ treatment will assign each sample to a single factor (its value for the metadf's treatment column). In this case, standard deviations can be calculated for sets of replicates in each treatment group. If such a stratification does not exist, a single standard deviation will be estimated for each feature (i.e., homoskedasticity will be assumed as in standard linear modeling).
- include_all_parameters
If TRUE, an additional table will be saved with the prefix fullfit_, which includes all of the parameters estimated throughout the course of linear modeling and regularization. This can be nice for visualizing the regularized mean-variance trend.
- sd_reg_factor
Determines how strongly variance estimates are shrunk towards the trend. Higher numbers lead to more regularization. Eventually, this will be replaced with estimation of how much variance there seems to be in the population of variances.
- error_if_singular
If TRUE, linear model will throw an error if parameters cannot be uniquely identified. This is most often caused by parameters that cannot be estimated from the data, e.g., due to limited replicate numbers or correlated sample characteristics (i.e., all treatment As also correspond to batch As, and all treatment Bs correspond to batch Bs).
- min_reads
Minimum number of reads in all samples for a feature to be kept.
- convert_tl_to_factor
If a label time variable is included in the formula_mean, convert its values to factors so as to avoid performing continuous regression on label times. Defaults to TRUE, as including label time in the regression is often meant to stratify samples by their label time if, for example, you are averaging logit(fractions).
- force_lm
Certain formulas lend themselves to efficient approximations of the full call to lm(). Namely, formulas that stratify samples into disjoint groups, where a single parameter of the model is effectively estimated from each group, can be tackled via simple averaging of data from each group. If you would like to force EZbakR to fit the fully rigorous linear model though, set force_lm to TRUE.
- force_optim
Old parameter that is now passed the value force_lm and will be deprecated in later releases.
- conservative
If TRUE, conservative variance regularization will be performed. In this case, variances below the trend will be regularized up to the trend, and variances above the trend will be left unregularized. This avoids underestimation of variances.
- character_limit
Limit on the number of characters of the name given to the output table. Will attempt to concatenate the parameter name with the names of all of the features. If this is too long, only the parameter name will be used.
- feature_lengths
Table of effective lengths for each feature combination in your data. For example, if your analysis includes features named GF and XF, this should be a data frame with columns GF, XF, and length.
- overwrite
If TRUE, identical, existing output will be overwritten.
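The effect of sd_reg_factor and conservative described above can be illustrated with a generic shrinkage rule. The sketch below is not EZbakR's actual estimator: shrink_variance() is a hypothetical helper that takes a weighted average of each observed variance and a trend value, with reg_factor acting as the weight on the trend, and that reads "regularized up to the trend" as taking the larger of the two values. All numbers are made up.

```r
# Hypothetical helper: weighted-average shrinkage of variances toward a trend
shrink_variance <- function(obs_var, trend_var, df_obs, reg_factor = 10,
                            conservative = FALSE) {
  if (conservative) {
    # Raise variances below the trend to the trend; leave the rest untouched
    return(pmax(obs_var, trend_var))
  }
  # Larger reg_factor puts more weight on the trend (more regularization)
  (reg_factor * trend_var + df_obs * obs_var) / (reg_factor + df_obs)
}

# Made-up replicate variances for three features and a flat trend
obs_var <- c(0.01, 0.04, 0.25)
trend_var <- c(0.04, 0.04, 0.04)

weak <- shrink_variance(obs_var, trend_var, df_obs = 2, reg_factor = 1)
strong <- shrink_variance(obs_var, trend_var, df_obs = 2, reg_factor = 10)
cons <- shrink_variance(obs_var, trend_var, df_obs = 2, conservative = TRUE)

print(rbind(obs_var, weak, strong, cons))
```

Under this toy rule, the larger reg_factor pulls every variance closer to the trend, while the conservative variant only ever increases a variance, which matches the stated goal of avoiding underestimation.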