Simulate one replicate of multi-label NR-seq data
SimulateMultiLabel.Rd
Generalizes SimulateOneRep() to simulate any combination of mutation types. Currently, no kinetic model relates the simulation parameters to the fraction of reads belonging to each simulated mutational population. Instead, these fractions are drawn from a Dirichlet distribution with gene-specific parameters.
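The Dirichlet draw described above can be sketched in base R. This is an illustration only, not the package's source: it assumes each gene-specific alpha is drawn uniformly between alpha_min and alpha_max, and the number of populations (npops) is a hypothetical choice.

```r
# Minimal sketch (not EZbakR's implementation) of drawing per-feature
# population fractions from a Dirichlet distribution.
set.seed(1)
nfeatures <- 4
npops <- 2        # hypothetical number of mutational populations
alpha_min <- 3    # documented default
alpha_max <- 6    # documented default

# Assumption: each feature/population alpha is uniform on [alpha_min, alpha_max]
alpha <- matrix(runif(nfeatures * npops, alpha_min, alpha_max),
                nrow = nfeatures)

# A Dirichlet draw is a set of independent Gamma(alpha, 1) variates
# divided by their sum; here each row is one feature.
g <- matrix(rgamma(nfeatures * npops, shape = alpha), nrow = nfeatures)
fractions <- g / rowSums(g)

# One row per feature, one column per population, rows sum to 1
fractions
```

The result has the shape documented for fractions_matrix: one row per feature, one column per population, and rows that sum to 1.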
Usage
SimulateMultiLabel(
nfeatures,
populations = c("TC"),
fraction_design = create_fraction_design(populations),
fractions_matrix = NULL,
read_vect = NULL,
sample_name = "sampleA",
feature_prefix = "Gene",
kdeg_vect = NULL,
ksyn_vect = NULL,
logkdeg_mean = -1.9,
logkdeg_sd = 0.7,
logksyn_mean = 2.3,
logksyn_sd = 0.7,
phighs = stats::setNames(rep(0.05, times = length(populations)), populations),
plows = stats::setNames(rep(0.002, times = length(populations)), populations),
seqdepth = nfeatures * 2500,
readlength = 200,
alpha_min = 3,
alpha_max = 6,
Ucont = 0.25,
Acont = 0.25,
Gcont = 0.25,
Ccont = 0.25
)
Arguments
- nfeatures
Number of "features" (e.g., genes) to simulate data for
- populations
Vector of mutation populations you want to simulate.
- fraction_design
Fraction design matrix, specifying which potential mutational populations should actually exist. See ?EstimateFractions for more details.
- fractions_matrix
Matrix of fractions of each mutational population to simulate. If not provided, this will be simulated. One row for each feature, one column for each mutational population, rows should sum to 1.
- read_vect
Vector of length = nfeatures; specifies the number of reads to be simulated for each feature. If this is not provided, the number of reads simulated is equal to round(seqdepth * (ksyn_i/kdeg_i)/sum(ksyn/kdeg)). In other words, the normalized steady-state abundance of a feature is multiplied by the total number of reads to be simulated and rounded to the nearest integer.
- sample_name
Character vector to assign to the sample column of the output simulated data table (the cB table).
- feature_prefix
Name given to the i-th feature is paste0(feature_prefix, i). Shows up in the feature column of the output simulated data table.
- kdeg_vect
Vector of length = nfeatures; specifies the degradation rate constant to use for each feature's simulation. If this is not provided and fn_vect is, then kdeg_vect = -log(1 - fn_vect)/label_time. If neither kdeg_vect nor fn_vect is provided, each feature's kdeg_vect value is drawn from a log-normal distribution with meanlog = logkdeg_mean and sdlog = logkdeg_sd. kdeg_vect is only simulated when read_vect is also not provided, as it is then used to simulate read counts as described above.
- ksyn_vect
Vector of length = nfeatures; specifies the synthesis rate constant to use for each feature's simulation. If neither this nor read_vect is provided, each feature's ksyn_vect value is drawn from a log-normal distribution with meanlog = logksyn_mean and sdlog = logksyn_sd. ksyns do not need to be simulated if read_vect is provided, as they only influence read counts.
- logkdeg_mean
If necessary, meanlog of a log-normal distribution from which kdegs are simulated
- logkdeg_sd
If necessary, sdlog of a log-normal distribution from which kdegs are simulated
- logksyn_mean
If necessary, meanlog of a log-normal distribution from which ksyns are simulated
- logksyn_sd
If necessary, sdlog of a log-normal distribution from which ksyns are simulated
- phighs
Vector of mutation rates in labeled reads, one for each mutation type listed in populations. Should be a named vector whose names are the corresponding elements of populations.
- plows
Vector of mutation rates in unlabeled reads, one for each mutation type listed in populations. Should be a named vector whose names are the corresponding elements of populations.
- seqdepth
Only relevant if read_vect is not provided; in that case, this is the total number of reads to simulate.
- readlength
Length of simulated reads. In this simple simulation, all reads are simulated as being exactly this length.
- alpha_min
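Minimum possible value of each alpha parameter of the Dirichlet distribution from which the population fractions are drawn.
- alpha_max
Maximum possible value of each alpha parameter of the Dirichlet distribution from which the population fractions are drawn.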
- Ucont
Probability that a nucleotide in a simulated read is a U.
- Acont
Probability that a nucleotide in a simulated read is an A.
- Gcont
Probability that a nucleotide in a simulated read is a G.
- Ccont
Probability that a nucleotide in a simulated read is a C.
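The default read-count rule described under read_vect can be sketched in base R. This is an illustration of the documented formula, not the package's source; it uses the default log-normal parameters from the Usage section, and the choice of five features is arbitrary.

```r
# Sketch of the documented default for read_vect:
# round(seqdepth * (ksyn_i / kdeg_i) / sum(ksyn / kdeg))
set.seed(42)
nfeatures <- 5                 # arbitrary, for illustration
seqdepth <- nfeatures * 2500   # documented default

# Rate constants drawn from log-normal distributions (documented defaults)
kdeg <- rlnorm(nfeatures, meanlog = -1.9, sdlog = 0.7)
ksyn <- rlnorm(nfeatures, meanlog = 2.3, sdlog = 0.7)

# Steady-state abundance of each feature is ksyn / kdeg; normalize it
# across features and scale by the total number of reads.
abundance <- ksyn / kdeg
read_vect <- round(seqdepth * abundance / sum(abundance))
read_vect
```

Because each count is rounded separately, sum(read_vect) can differ from seqdepth by a few reads.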
Examples
simdata <- SimulateMultiLabel(3)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
#> `.name_repair` is omitted as of tibble 2.0.0.
#> ℹ Using compatibility `.name_repair`.
#> ℹ The deprecated feature was likely used in the EZbakR package.
#> Please report the issue to the authors.
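A call with more than one mutation type might look like the sketch below. It relies only on the arguments listed in the Usage section; the "GA" label and the specific rate values are illustrative assumptions, and the call is guarded so that it only runs if EZbakR is installed.

```r
# Hypothetical two-population setup; "GA" and the rates are illustrative
populations <- c("TC", "GA")

# phighs and plows must be named by the entries of populations
phighs <- stats::setNames(c(0.05, 0.04), populations)
plows <- stats::setNames(c(0.002, 0.001), populations)

# Only run the simulation when the package is available
if (requireNamespace("EZbakR", quietly = TRUE)) {
  simdata <- EZbakR::SimulateMultiLabel(
    nfeatures = 3,
    populations = populations,
    phighs = phighs,
    plows = plows
  )
}
```

Leaving fraction_design at its default lets create_fraction_design(populations) decide which mutational populations are simulated.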