Convenient fetching of EZbakR outputs: EZget()

Introduction

This vignette shows how to use the EZget() function provided by EZbakR. When your EZbakRData object contains multiple tables of a particular type, EZget() makes it much easier to extract the table of interest. As a part of this vignette, I will also describe how an EZbakRData object is organized.

library(EZbakR)

EZbakRData objects

Let’s first analyze some simulated data to generate an EZbakRData object whose contents we can explore:

simdata <- EZSimulate(nfeatures = 300, nreps = 2)

# Make initial EZbakRData object
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)

# Estimate fractions twice, and don't overwrite the first analysis
# Second run will use different model; see EstimateFractions vignette for details
ezbdo <- EstimateFractions(ezbdo)
#> Estimating mutation rates
#> Summarizing data for feature(s) of interest
#> Averaging out the nucleotide counts for improved efficiency
#> Estimating fractions
#> Processing output
ezbdo <- EstimateFractions(ezbdo, strategy = 'hierarchical', overwrite = FALSE)
#> Estimating mutation rates
#> Summarizing data for feature(s) of interest
#> Averaging out the nucleotide counts for improved efficiency
#> Estimating fractions
#> FITTING HIERARCHICAL TWO-COMPONENT MIXTURE MODEL:
#> Estimating distribution of feature-specific pnews
#> Estimating fractions with feature-specific pnews
#> Processing output

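# Estimate kinetic parameters three times: the default strategy and "shortfeed"
# on the first fractions table (repeatID = 1), then "shortfeed" on the second
# fractions table (repeatID = 2). See EstimateKinetics vignettes for details.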
ezbdo <- EstimateKinetics(ezbdo, repeatID = 1)
ezbdo <- EstimateKinetics(ezbdo, repeatID = 1, strategy = "shortfeed")
ezbdo <- EstimateKinetics(ezbdo, repeatID = 2, strategy = "shortfeed")

An EZbakRData object is a list that can contain the following items:

cB: The cB table you provided upon object creation.
metadf: The metadf table you provided upon object creation.
fractions: List of fractions estimates generated by EstimateFractions().
kinetics: List of kinetic parameter estimates generated by EstimateKinetics().
averages: List of parameter replicate averages generated by AverageAndRegularize().
comparisons: List of comparisons of parameter averages, generated by CompareParameters().
dynamics: List of dynamical systems model parameter estimates, generated by EZDynamics().
readcounts: List of tables of read counts generated by various EZbakR functions.
metadata: List with elements corresponding to the lists of tables described above. Describes various features of the tables so that they can be fetched with EZget().

As an EZbakRData object is a list, its elements can be accessed in a few ways:

# `$` notation:
ezbdo$fractions$feature
#> # A tibble: 1,800 × 6
#>    sample  feature fraction_highTC logit_fraction_highTC se_logit_fraction_hig…¹
#>    <chr>   <chr>             <dbl>                 <dbl>                   <dbl>
#>  1 sample1 Gene1            0.102                  -2.17                  0.0620
#>  2 sample1 Gene10           0.222                  -1.25                  0.111 
#>  3 sample1 Gene100          0.188                  -1.46                  0.0719
#>  4 sample1 Gene101          0.184                  -1.49                  0.0338
#>  5 sample1 Gene102          0.141                  -1.81                  0.0926
#>  6 sample1 Gene103          0.116                  -2.03                  0.0871
#>  7 sample1 Gene104          0.152                  -1.72                  0.0398
#>  8 sample1 Gene105          0.0827                 -2.41                  0.0875
#>  9 sample1 Gene106          0.177                  -1.54                  0.0893
#> 10 sample1 Gene107          0.161                  -1.65                  0.0466
#> # ℹ 1,790 more rows
#> # ℹ abbreviated name: ¹se_logit_fraction_highTC
#> # ℹ 1 more variable: n <int>

# `[[]]` notation with element names
ezbdo[['fractions']][['feature']]
#> # A tibble: 1,800 × 6
#>    sample  feature fraction_highTC logit_fraction_highTC se_logit_fraction_hig…¹
#>    <chr>   <chr>             <dbl>                 <dbl>                   <dbl>
#>  1 sample1 Gene1            0.102                  -2.17                  0.0620
#>  2 sample1 Gene10           0.222                  -1.25                  0.111 
#>  3 sample1 Gene100          0.188                  -1.46                  0.0719
#>  4 sample1 Gene101          0.184                  -1.49                  0.0338
#>  5 sample1 Gene102          0.141                  -1.81                  0.0926
#>  6 sample1 Gene103          0.116                  -2.03                  0.0871
#>  7 sample1 Gene104          0.152                  -1.72                  0.0398
#>  8 sample1 Gene105          0.0827                 -2.41                  0.0875
#>  9 sample1 Gene106          0.177                  -1.54                  0.0893
#> 10 sample1 Gene107          0.161                  -1.65                  0.0466
#> # ℹ 1,790 more rows
#> # ℹ abbreviated name: ¹se_logit_fraction_highTC
#> # ℹ 1 more variable: n <int>

# `[[]]` notation with numeric indices
ezbdo[[4]][[1]]
#> # A tibble: 1,800 × 6
#>    sample  feature fraction_highTC logit_fraction_highTC se_logit_fraction_hig…¹
#>    <chr>   <chr>             <dbl>                 <dbl>                   <dbl>
#>  1 sample1 Gene1            0.102                  -2.17                  0.0620
#>  2 sample1 Gene10           0.222                  -1.25                  0.111 
#>  3 sample1 Gene100          0.188                  -1.46                  0.0719
#>  4 sample1 Gene101          0.184                  -1.49                  0.0338
#>  5 sample1 Gene102          0.141                  -1.81                  0.0926
#>  6 sample1 Gene103          0.116                  -2.03                  0.0871
#>  7 sample1 Gene104          0.152                  -1.72                  0.0398
#>  8 sample1 Gene105          0.0827                 -2.41                  0.0875
#>  9 sample1 Gene106          0.177                  -1.54                  0.0893
#> 10 sample1 Gene107          0.161                  -1.65                  0.0466
#> # ℹ 1,790 more rows
#> # ℹ abbreviated name: ¹se_logit_fraction_highTC
#> # ℹ 1 more variable: n <int>
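The three access styles above are ordinary R list indexing, so they can be tried on any nested list. The sketch below uses a small toy list as a stand-in (it is not a real EZbakRData object, and its contents are made up for illustration) to show that the three forms return the same table, and that numeric indices depend on element order while names do not.

```r
# Toy stand-in for a nested list laid out like an EZbakRData object.
# NOT a real EZbakRData object; the tables are made up for illustration.
toy <- list(
  cB        = data.frame(sample = "sample1", n = 10L),
  metadf    = data.frame(sample = "sample1", tl = 2),
  fractions = list(
    feature = data.frame(feature = c("Gene1", "Gene10"),
                         fraction_highTC = c(0.102, 0.222))
  )
)

by_dollar <- toy$fractions$feature
by_name   <- toy[["fractions"]][["feature"]]
by_index  <- toy[[3]][[1]]  # 3 because "fractions" is the third element of this toy list

# All three forms return the same table
all_same <- identical(by_dollar, by_name) && identical(by_name, by_index)

# Reordering the list changes what a numeric index points at, but not what a name points at
reordered         <- toy[c("fractions", "cB", "metadf")]
index_still_works <- identical(reordered[[3]][[1]], by_dollar)
name_still_works  <- identical(reordered[["fractions"]][["feature"]], by_dollar)

print(c(all_same = all_same,
        index_still_works = index_still_works,
        name_still_works = name_still_works))
```

Because positions can shift as analyses are added to an object, name-based access (or EZget()) is usually the less fragile choice.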

Using EZget

EZget() provides an alternative strategy for getting a particular table. It has two required arguments:

obj: The EZbakRData object you would like to get a table from.
type: The type of table you are looking for. Options are “fractions”, “kinetics”, “readcounts”, “averages”, and “comparisons”, which correspond to the lists of tables described above.

Most of the remaining parameters are search criteria that you specify. The full list can be seen in the function docs (?EZget()). These all accept strings or vectors of strings as input, and each table’s metadata is checked to see whether the provided string is contained in the respective metadata slot. For example, we can extract the kinetics table generated from the standard analysis like so:

kinetics <- EZget(ezbdo,
                  type = "kinetics",
                  kstrat = "standard")

In some cases, multiple tables with the exact same metadata exist. For example, the metadata for fractions tables consists of:

The feature columns by which reads were grouped. This is “feature” for both of our fractions tables.
The mutational populations analyzed. This is “TC” for both of our fractions tables.
The fraction_design table used. This is the standard fraction_design for a single mutation type analysis for both of our fractions tables.

Since we set overwrite = FALSE in our second run of EstimateFractions(), both tables were saved. What distinguishes them is a final piece of metadata saved for all tables: repeatID. This is a numerical ID that distinguishes multiple instances of the same table. The ID is 1 for the first such table created, 2 for the second, etc. The analysis with the standard mixture model therefore has a repeatID of 1, and the analysis with the hierarchical mixture model has a repeatID of 2. We can access the latter like so:

h_fxn <- EZget(ezbdo, 
               type = 'fractions',
               repeatID = 2)
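If you are not sure how many tables share the same metadata, one possible workflow is to ask EZget() for table names first and then fetch each table by repeatID. The sketch below rebuilds the two fractions tables so that it can be run on its own. It uses only arguments named in this vignette (type, repeatID, and returnNameOnly, which returns just the table names and warns rather than errors when several tables match). The merge() step is my own addition and assumes both tables carry the sample and feature columns shown in the printed output above.

```r
library(EZbakR)

# Rebuild an object holding two fractions tables, as earlier in this vignette
simdata <- EZSimulate(nfeatures = 300, nreps = 2)
ezbdo <- EZbakRData(simdata$cB, simdata$metadf)
ezbdo <- EstimateFractions(ezbdo)
ezbdo <- EstimateFractions(ezbdo, strategy = "hierarchical", overwrite = FALSE)

# Two tables match this search, so request only their names.
# suppressWarnings() hides the multiple-match warning this call is expected to raise.
fxn_names <- suppressWarnings(
  EZget(ezbdo, type = "fractions", returnNameOnly = TRUE)
)
print(fxn_names)

# Fetch each table explicitly by its repeatID
std_fxn <- EZget(ezbdo, type = "fractions", repeatID = 1)
h_fxn   <- EZget(ezbdo, type = "fractions", repeatID = 2)

# Line the two sets of estimates up by sample and feature for comparison
# (assumes both tables have `sample` and `feature` columns)
both <- merge(std_fxn, h_fxn,
              by = c("sample", "feature"),
              suffixes = c("_std", "_hier"))
head(both)
```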

There are three parameters that tune EZget()’s behavior. These are:

returnNameOnly: If TRUE, then only the names of the tables consistent with the search criteria you specify will be returned. This will throw a warning if more than one table passes your criteria, but it will not error in this case. If returnNameOnly is FALSE, then an error is thrown if more than one table matches your search criteria.
exactMatch: The features and populations arguments are the two arguments that can be vectors of strings. Setting exactMatch to TRUE will force the provided features and populations vectors to exactly match those in a table’s metadata for that table to be returned. The alternative (default) behavior is that the provided feature(s) and population(s) only have to all be contained in a table’s metadata.
alwaysCheck: If only a single table of the relevant type is present in your EZbakRData object, EZget() automatically returns that table without checking to see if the search criteria match. If you set alwaysCheck to TRUE, then the table is searched for as normal and will only be returned if its metadata match the search criteria.
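The difference between the default "contained in" matching and exactMatch = TRUE can be illustrated with plain set operations. The helper below, matches_metadata(), is hypothetical and written only for this illustration; it is not EZget()’s source code. It assumes exact matching ignores the order of the strings, and the feature names "gene" and "isoform" are made up.

```r
# Hypothetical helper sketching the matching rule described above.
# Not taken from EZbakR; order-insensitive exact matching is an assumption.
matches_metadata <- function(query, table_meta, exactMatch = FALSE) {
  if (exactMatch) {
    # Query and metadata must contain the same set of strings
    setequal(query, table_meta)
  } else {
    # Every queried string must appear in the metadata; extras are allowed
    all(query %in% table_meta)
  }
}

# Metadata for a made-up table whose reads were grouped by two feature columns
table_features <- c("gene", "isoform")

contained_ok <- matches_metadata("gene", table_features)
exact_fails  <- matches_metadata("gene", table_features, exactMatch = TRUE)
exact_ok     <- matches_metadata(c("isoform", "gene"), table_features,
                                 exactMatch = TRUE)

print(c(contained_ok = contained_ok,
        exact_fails = exact_fails,
        exact_ok = exact_ok))
```

Under these assumptions, a query for "gene" alone finds the table by default but not with exactMatch = TRUE, which is the behavior you want when two tables differ only by an extra feature column.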