Skip to contents

EZDynamics() estimates parameters of a user-specified dynamical systems model. The dynamical system model is specified through an adjacency matrix, which is an NxN matrix described below (see graph documentation). Modeling can either be done for species all assayed in each sample, or species that are assayed across a set of independent samples (e.g., subcellular fractionation involves assaying different species in different samples).

Usage

EZDynamics(
  obj,
  graph,
  sub_features,
  grouping_features,
  scale_factors = NULL,
  sample_feature = NULL,
  modeled_to_measured = NULL,
  parameter_names = paste0("k", 1:max(graph)),
  unassigned_name = "__no_feature",
  type = "averages",
  features = NULL,
  populations = NULL,
  fraction_design = NULL,
  parameter = NULL,
  repeatID = NULL,
  exactMatch = TRUE,
  feature_lengths = NULL,
  overwrite = TRUE
)

Arguments

obj

Currently must be an EZbakRData object on which AverageAndRegularize has been run. In the future, will also support (in case where all species are assayed in every sample) providing output of just EstimateFractions() as input, acting as a generalization of EstimateKinetics() in that case.

graph

An NxN adjacency matrix, where N represents the number of species being modeled. One of these species must be called "0" and represent the "no RNA" species. This is the species from which some species are synthesized (e.g., 0 -> P, means premature RNA is synthesized from no RNA), and the species to which some species are degraded (e.g., M -> 0 means mature RNA is converted to "no RNA" via degradation). The rows and columns of this matrix must be the names of all modled species, and rownames(graph) == colnames(graph). Entry i,j of the matrix is either 0 if species i cannot be converted into species j under your model, and an integer from 1:npars (where npars = total number of parameters to be estimated) if it can.

For example, the model 0 -> P -> M -> 0 would have the graph: matrix(c(0, 1, 0, 0, 0, 2, 3, 0, 0), nrow = 3, ncol = 3, byrow = TRUE.

sub_features

Which feature columns distinguish between the different measured species? Note, the measured species need not have the same name, and may not be directly equivalent to, the modeled species. The relationship between the modeled species in graph and sub_features needs to be specified in modeled_to_measured if the names are not equivalent though.

grouping_features

Which features are the overarching feature assignments by which sub_features should be grouped? This will usually be the feature columns corresponding to full-gene assignments, as well as any higher order assignments (e.g., chromosome). A sub_feature can be included in grouping_features if it never has the value of unassigned_name ("__no_feature" by default). Only one sub_feature should ever fulfill this criterion though.

scale_factors

Data frame mapping samples to factors by which to multiply read counts so as ensure proper normalization between different RNA populations. Only relevant if you are modeling relationships between distinct RNA populations, for example RNA from nuclear and cytoplasmic fractions. Will eventually be inferred automatically.

sample_feature

If different samples involve assaying different species, this must be the name of the metadf column that distinguishes the different classes of samples. For example, if analyzing a subcellular fractionation dataset, you likely included a column in your metadf called "compartment". This would then be your sample_feature, assuming you ran AverageAndRegularize(), with a mean_formula that included compartment as a term.

modeled_to_measured

If sub_features is not identical to the non "0" species names in graph, then you must specify the relationship between sub_features, sample_feature (if specified), and the species in graph. This is done through a list of formulas, whose left side is a sub_feature and whose right side is a function of species in graph. If sample_feature is not specified, then modeled_to_measured should be a nameless list of formulae. If sample_feature is specified, then modeled_to_measured must be a named list of formulas where the names correspond to unique values of sample_feature. In this latter case, the elements, should be the mapping of measured features (sub_features) to modeled species (names in graph that aren't "0").

For example, if your model is 0 -> P -> M -> 0, where P stands for premature RNA and M stands for mature RNA, and you have a column called GF that corresponds to assignment anywhere in the gene, and XF that corresponds to assignment of exclusively exon mapping reads, then your modeled_to_measured should be list(GF ~ P + M, XF ~ M)..

As another example, if your model is 0 -> N -> C -> 0, where N stands for "nuclear RNA" and C stands for "cytoplasmic RNA", and your sample_feature takes on values of "nuclear", "cytoplasmic", and "total", and you have a single sub_feature called XF, then your modeled_to_measured shouild be list(nuclear = GF ~ N, cytoplasmic = GF ~ C, total = GF ~ C + N). This is interpreted as meaning in groups of samples for which sample_feature == "nuclear", reads assigned to a GF are from the N species (nuclear RNA). When sample_feature == "cytoplasmic", reads assigned to GF correspond to the C species (cytoplasmic RNA). When sample_feature == "total", reads assigned to GF correspond to a combination of N and C (nuclear and cytoplasmic RNA).

parameter_names

Vector of names you would like to give to the estimated parameters. ith element should correspond to name of parameter given the ID i in graph. By default, this is just ki, where i is this numerical index.

unassigned_name

What value will a sub_feature column have if a read was not assigned to said feature? "__no_feature" by default.

type

What type of table would you like to use? Currently only supports "averages", but will support "fractions" in the near future.

features

Character vector of the set of features you want to stratify reads by and estimate proportions of each RNA population. The default of "all" will use all feature columns in the obj's cB.

populations

Mutational populations that were analyzed to generate the fractions table to use. For example, this would be "TC" for a standard s4U-based nucleotide recoding experiment.

fraction_design

"Design matrix" specifying which RNA populations exist in your samples. By default, this will be created automatically and will assume that all combinations of the mutrate_populations you have requested to analyze are present in your data. If this is not the case for your data, then you will have to create one manually. See docs for EstimateFractions (run ?EstimateFractions()) for more details.

parameter

Parameter to average across replicates of a given condition. Has to be "logit_fraction_high", where is the type of mutation modeled in EstimateFractions() (e.g, TC) in this case.

repeatID

If multiple fractions tables exist with the same metadata, then this is the numerical index by which they are distinguished.

exactMatch

If TRUE, then features and populations have to exactly match those for a given fractions table for that table to be used. Means that you can't specify a subset of features or populations by default, since this is TRUE by default.

feature_lengths

Table of effective lengths for each feature combination in your data. For example, if your analysis includes features named GF and XF, this should be a data frame with columns GF, XF, and length.

overwrite

If TRUE and a fractions estimate output already exists that would possess the same metadata (features analyzed, populations analyzed, and fraction_design), then it will get overwritten with the new output. Else, it will be saved as a separate output with the same name + "_#" where "#" is a numerical ID to distinguish the similar outputs.

Details

When running AverageAndRegularize() to produce input for EZDynamics(), you must set parameter to "logit_fraction_high" ( = type of mutation modeled by EstimateFractions(), e.g., TC), and you must include the label time in your regression formula. EZDynamics() models the logit(fraction high <muttype), and this will depend on the label time (longer label time = higher fraction), which is why these two conditions must be met.