Generate Toy AIRR-Seq Data — simulateToyData • NAIR

Generates toy data that can be used to test or demonstrate the behavior of functions in the NAIR package. Created as a lightweight tool for use in tests, examples and vignettes. This function is not intended to simulate realistic data.

Usage

simulateToyData(
  samples = 2,
  chains = 1,
  sample_size = 100,
  prefix_length = 7,
  prefix_chars = c("G", "A", "T", "C"),
  prefix_probs = rbind(
    "sample1" = c(12, 4, 1, 1),
    "sample2" = c(4, 12, 1, 1)),
  affixes = c("AATTGG", "AATCGG", "AATTCG",
              "AATTGC", "AATTG", "AATTC"),
  affix_probs = rbind(
    "sample1" = c(10, 4, 2, 2, 1, 1),
    "sample2" = c(1, 1, 1, 2, 2.5, 2.5)),
  num_edits = 0,
  edit_pos_probs = function(seq_length) {
    stats::dnorm(seq(-4, 4, length.out = seq_length))
  },
  edit_ops = c("insertion", "deletion", "transmutation"),
  edit_probs = c(5, 1, 4),
  new_chars = prefix_chars,
  new_probs = prefix_probs,
  output_dir = NULL,
  no_return = FALSE
)

Arguments

samples: The number of distinct samples to include in the data.
chains: The number of chains (either 1 or 2) for which to generate receptor sequences.
sample_size: The number of observations to generate per sample.
prefix_length: The length of the random prefix generated for each observed sequence. Specifically, the number of elements of prefix_chars that are sampled with replacement and concatenated to form each prefix.
prefix_chars: A character vector containing characters or strings from which to sample when generating the prefix for each observed sequence.
prefix_probs: A numeric matrix whose column dimension matches the length of prefix_chars and with row dimension matching the value of samples. The \(i\)th row specifies the relative probability weights assigned to each element of prefix_chars when sampling to form the prefix for each sequence in the \(i\)th sample.
affixes: A character vector containing characters or strings from which to sample when generating the suffix for each observed sequence.
affix_probs: A numeric matrix whose column dimension matches the length of affixes and with row dimension matching the value of samples. The \(i\)th row specifies the relative probability weights assigned to each element of affixes when sampling to form the suffix for each sequence in the \(i\)th sample.
num_edits: A nonnegative integer specifying the number of random edit operations to perform on each observed sequence after its initial generation.
edit_pos_probs: A function that accepts a nonnegative integer (the character length of a sequence) as its argument and returns a vector of this length containing probability weights. Each time an edit operation is performed on a sequence, the character position at which to perform the operation is randomly determined according to the probabilities given by this function.
edit_ops: A character vector specifying the possible operations that can be performed for each edit. The default value includes all valid operations (insertion, deletion, transmutation).
edit_probs: A numeric vector of the same length as edit_ops, specifying the relative probability weights assigned to each edit operation.
new_chars: A character vector containing characters or strings from which to sample when performing an insertion or transmutation edit operation.
new_probs: A numeric matrix whose column dimension matches the length of new_chars and with row dimension matching the value of samples. The \(i\)th row specifies, for the \(i\)th sample, the relative probability weights assigned to each element of new_chars when performing a transmutation or insertion as a random edit operation.
output_dir: An optional character string specifying a directory in which to save the generated data. One file is written per sample.
no_return: A logical flag that can be used to prevent the function from returning the generated data. If TRUE, the function will instead return TRUE once all processes are complete.
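
The probability arguments are described as relative weights, so their rows presumably need not sum to one. The base-R sketch below (not NAIR code) normalizes the default prefix_probs rows to show the implied sampling probabilities, and shows the shape a custom edit_pos_probs function must have: it takes a sequence length and returns one weight per character position. The names implied, uniform_pos and default_pos are hypothetical and used only for illustration.

```r
# Default weights copied from the Usage section; columns follow prefix_chars.
prefix_chars <- c("G", "A", "T", "C")
prefix_probs <- rbind(
  "sample1" = c(12, 4, 1, 1),
  "sample2" = c(4, 12, 1, 1))
colnames(prefix_probs) <- prefix_chars

# Normalize each row to see the implied sampling probabilities.
implied <- prop.table(prefix_probs, margin = 1)
print(round(implied, 3))

# Hypothetical alternative to the default edit_pos_probs:
# every character position is equally likely to be edited.
uniform_pos <- function(seq_length) {
  rep(1, seq_length)
}

# The default (copied from Usage) puts the most weight on middle positions.
default_pos <- function(seq_length) {
  stats::dnorm(seq(-4, 4, length.out = seq_length))
}
pos_weights <- default_pos(13)
print(round(pos_weights / sum(pos_weights), 3))
```

Under these assumptions, a function such as uniform_pos could be supplied as edit_pos_probs to spread edits evenly along each sequence instead of concentrating them near the center.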

Details

Each observed sequence is obtained by separately generating a prefix and suffix according to the specified settings, then joining the two and performing sequential rounds of edit operations randomized according to the user's specifications.
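
The prefix and suffix step described above can be sketched in base R as follows. This is an illustrative re-implementation under the assumptions stated in the comments, not the package's actual code; toy_sequence() is a hypothetical helper, and the edit rounds are omitted (equivalent to the default num_edits = 0).

```r
# Hypothetical helper: build one toy sequence from a random prefix and suffix.
toy_sequence <- function(prefix_length, prefix_chars, prefix_weights,
                         affixes, affix_weights) {
  # Sample prefix elements with replacement and concatenate them.
  prefix <- paste(
    sample(prefix_chars, size = prefix_length, replace = TRUE,
           prob = prefix_weights),
    collapse = "")
  # Sample a single suffix from the candidate affixes.
  suffix <- sample(affixes, size = 1, prob = affix_weights)
  paste0(prefix, suffix)
}

affixes <- c("AATTGG", "AATCGG", "AATTCG", "AATTGC", "AATTG", "AATTC")

set.seed(1)
seqs <- replicate(
  5,
  toy_sequence(
    prefix_length = 7,
    prefix_chars = c("G", "A", "T", "C"),
    prefix_weights = c(12, 4, 1, 1),       # "sample1" row of prefix_probs
    affixes = affixes,
    affix_weights = c(10, 4, 2, 2, 1, 1))) # "sample1" row of affix_probs
print(seqs)
```

Each generated string is a 7-character prefix drawn from G, A, T and C, followed by one of the six default affixes.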

Count data is also generated for each observation. The counts are drawn independently of the observed sequences and bear no relationship to them.

Value

If no_return = FALSE (the default), a data.frame whose contents depend on the value of the chains argument.

For chains = 1, the data frame contains the following variables:

CloneSeq: The "receptor sequence" for each observation.
CloneFrequency: The "clone frequency" for each observation (clone count as a proportion of the aggregate clone count within each sample).
CloneCount: The "clone count" for each observation.
SampleID: The sample ID for each observation.

For chains = 2, the data frame contains the following variables:

AlphaSeq: The "alpha chain" receptor sequence for each observation.
BetaSeq: The "beta chain" receptor sequence for each observation.
UMIs: The "unique molecular identifier count" for each observation.
Count: The "count" for each observation.
SampleID: The sample ID for each observation.

If no_return = TRUE, the function returns TRUE upon completion.
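
To make the CloneFrequency definition concrete, the base-R sketch below computes clone count as a proportion of the total clone count within each sample. The table toy and its values (including the SampleID labels) are made up for illustration and only mimic the column names of the chains = 1 output; this is not output from simulateToyData().

```r
# Hypothetical toy table using the chains = 1 column names.
toy <- data.frame(
  CloneSeq   = c("GGAATTGG", "GAAATCGG", "AGAATTG", "AAAATTC"),
  CloneCount = c(30, 10, 5, 15),
  SampleID   = c("Sample1", "Sample1", "Sample2", "Sample2"))

# Clone count divided by the total clone count of the same sample.
sample_totals <- ave(toy$CloneCount, toy$SampleID, FUN = sum)
toy$CloneFrequency <- toy$CloneCount / sample_totals
print(toy)

# Frequencies sum to one within each sample.
print(tapply(toy$CloneFrequency, toy$SampleID, sum))
```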

Author

Brian Neal (Brian.Neal@ucsf.edu)

Examples

set.seed(42)

# Bulk data from two samples
dat1 <- simulateToyData()

# Single-cell data with alpha and beta chain sequences
dat2 <- simulateToyData(chains = 2)

# Write data to file; return TRUE instead of the data
simulateToyData(sample_size = 500,
                num_edits = 10,
                no_return = TRUE,
                output_dir = tempdir())
#> [1] TRUE

# Example customization
dat4 <-
  simulateToyData(
    samples = 5,
    sample_size = 50,
    prefix_length = 0,
    prefix_chars = "",
    prefix_probs = matrix(1, nrow = 5),
    affixes = c("CASSLGYEQYF", "CASSLGETQYF",
                "CASSLGTDTQYF", "CASSLGTEAFF",
                "CASSLGGTEAFF", "CAGLGGRDQETQYF",
                "CASSQETQYF", "CASSLTDTQYF",
                "CANYGYTF", "CANTGELFF",
                "CSANYGYTF"),
    affix_probs = matrix(1, ncol = 11, nrow = 5)
  )

## Simulate 30 samples with a mix of public/private sequences ##
samples <- 30
sample_size <- 30 # (seqs per sample)
base_seqs <- c(
  "CASSIEGQLSTDTQYF", "CASSEEGQLSTDTQYF", "CASSSVETQYF",
  "CASSPEGQLSTDTQYF", "RASSLAGNTEAFF", "CASSHRGTDTQYF", "CASDAGVFQPQHF",
  "CASSLTSGYNEQFF", "CASSETGYNEQFF", "CASSLTGGNEQFF", "CASSYLTGYNEQFF",
  "CASSLTGNEQFF", "CASSLNGYNEQFF", "CASSFPWDGYGYTF", "CASTLARQGGELFF",
  "CASTLSRQGGELFF", "CSVELLPTGPLETSYNEQFF", "CSVELLPTGPSETSYNEQFF",
  "CVELLPTGPSETSYNEQFF", "CASLAGGRTQETQYF", "CASRLAGGRTQETQYF",
  "CASSLAGGRTETQYF", "CASSLAGGRTQETQYF", "CASSRLAGGRTQETQYF",
  "CASQYGGGNQPQHF", "CASSLGGGNQPQHF", "CASSNGGGNQPQHF", "CASSYGGGGNQPQHF",
  "CASSYGGGQPQHF", "CASSYKGGNQPQHF", "CASSYTGGGNQPQHF",
  "CAWSSQETQYF", "CASSSPETQYF", "CASSGAYEQYF", "CSVDLGKGNNEQFF")
# Relative generation probabilities
pgen <- cbind(
  stats::toeplitz(0.6^(0:(samples - 1))),
  matrix(1, nrow = samples, ncol = length(base_seqs) - samples))
dat5 <-
  simulateToyData(
    samples = samples,
    sample_size = sample_size,
    prefix_length = 1,
    prefix_chars = c("", ""),
    prefix_probs = cbind(rep(1, samples), rep(0, samples)),
    affixes = base_seqs,
    affix_probs = pgen,
    num_edits = 0
  )

## Simulate 30 samples from two groups (treatment/control) ##
samples_c <- samples_t <- 15 # Number of samples by control/treatment group
samples <- samples_c + samples_t
sample_size <- 30 # (seqs per sample)
base_seqs <- # first five are associated with treatment
  c("CASSGAYEQYF", "CSVDLGKGNNEQFF", "CASSIEGQLSTDTQYF",
    "CASSEEGQLSTDTQYF", "CASSPEGQLSTDTQYF",
    "RASSLAGNTEAFF", "CASSHRGTDTQYF", "CASDAGVFQPQHF")
# Relative generation probabilities by control/treatment group
pgen_c <- matrix(rep(c(rep(1, 5), rep(30, 3)), times = samples_c),
                 nrow = samples_c, byrow = TRUE)
pgen_t <- matrix(rep(c(1, 1, rep(1/3, 3), rep(2, 3)), times = samples_t),
                 nrow = samples_t, byrow = TRUE)
pgen <- rbind(pgen_c, pgen_t)
dat6 <-
  simulateToyData(
    samples = samples,
    sample_size = sample_size,
    prefix_length = 1,
    prefix_chars = c("", ""),
    prefix_probs =
      cbind(rep(1, samples), rep(0, samples)),
    affixes = base_seqs,
    affix_probs = pgen,
    num_edits = 0
  )