Given a data frame with a column containing receptor sequences, filter data rows by sequence length and sequence content. Keep all data columns or choose which columns to keep.

Usage

filterInputData(
  data,
  seq_col,
  min_seq_length = NULL,
  drop_matches = NULL,
  subset_cols = NULL,
  count_col = NULL,
  verbose = FALSE
)

Arguments

data

A data frame.

seq_col

Specifies the column(s) of data containing the receptor sequences. Accepts a character or numeric vector of length 1 or 2, containing either column names or column indices. Each column specified will be coerced to a character vector. Data rows containing a value of NA in any of the specified columns will be dropped.

min_seq_length

Observations whose receptor sequences have fewer than min_seq_length characters are dropped.

drop_matches

Accepts a character string containing a regular expression (see regex). Checks values in the receptor sequence column for a pattern match using grep(). Rows in which a match is found are dropped.

subset_cols

Specifies which columns of the AIRR-Seq data are included in the output. Accepts a character vector of column names or a numeric vector of column indices. The default NULL includes all columns. The receptor sequence column is always included regardless of this argument's value.

count_col

Optional. Specifies the column of data containing a measure of abundance, e.g., clone count or unique molecular identifier (UMI) count. Accepts either the column name as a character string or the numeric column index. If provided, data rows with NA count values will be removed.

verbose

Logical. If TRUE, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to stderr().
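
The row filters described above can be approximated in base R. The following is only a minimal sketch under stated assumptions, not the NAIR implementation: `filter_rows_sketch` is a hypothetical helper, it handles a single sequence column given by name, it ignores `count_col` and `verbose`, and it uses `grepl()` where the documentation mentions `grep()`.

```r
# Hypothetical helper, not part of NAIR: mimics the documented row filters
# for a single sequence column (given by name) using base R only.
filter_rows_sketch <- function(data, seq_col, min_seq_length = NULL,
                               drop_matches = NULL, subset_cols = NULL) {
  # Coerce the sequence column to character and drop rows with NA sequences.
  data[[seq_col]] <- as.character(data[[seq_col]])
  data <- data[!is.na(data[[seq_col]]), , drop = FALSE]

  # Drop sequences with fewer than min_seq_length characters.
  if (!is.null(min_seq_length)) {
    data <- data[nchar(data[[seq_col]]) >= min_seq_length, , drop = FALSE]
  }

  # Drop sequences that match the regular expression.
  if (!is.null(drop_matches)) {
    data <- data[!grepl(drop_matches, data[[seq_col]]), , drop = FALSE]
  }

  # Keep the requested columns, always including the sequence column.
  if (!is.null(subset_cols)) {
    data <- data[, union(seq_col, subset_cols), drop = FALSE]
  }
  data
}

# Made-up toy data for illustration only.
toy <- data.frame(
  CloneSeq = c("AAAGGGGTTTCCC", "ACGT", NA, "ACGTACGTACGTA"),
  CloneCount = c(10, 5, 3, 8),
  SampleID = "Sample1"
)
result <- filter_rows_sketch(
  toy, "CloneSeq",
  min_seq_length = 13,
  drop_matches = "GGGG",
  subset_cols = c("CloneSeq", "SampleID")
)
print(result)
```

In this toy input, the first row is dropped by the pattern match, the second by the length threshold and the third because its sequence is NA, so only the fourth row remains.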

Value

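A data frame containing the rows of data that remain after filtering. If subset_cols is specified, only those columns and the receptor sequence column(s) are included; otherwise all columns are kept.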

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Author

Brian Neal (Brian.Neal@ucsf.edu)

Examples

library(NAIR)

set.seed(42)
raw_data <- simulateToyData()

# Remove sequences shorter than 13 characters,
# as well as sequences containing the subsequence "GGGG".
# Keep variables for clone sequence, clone frequency and sample ID
filterInputData(
  raw_data,
  seq_col = "CloneSeq",
  min_seq_length = 13,
  drop_matches = "GGGG",
  subset_cols =
    c("CloneSeq", "CloneFrequency", "SampleID"),
  verbose = TRUE
)
#> Input data contains 200 rows.
#> Removing sequences with length fewer than 13 characters...
#>  Done. 136 rows remaining.
#> Removing sequences containing matches to “GGGG”...
#>  Done. 105 rows remaining.
#>          CloneSeq CloneFrequency SampleID
#> 1   TTGAGGAAATTCG    0.007873775  Sample1
#> 2   GGAGATGAATCGG    0.007777102  Sample1
#> 3   GTCGGGTAATTGG    0.009094910  Sample1
#> 4   GCCGGGTAATTCG    0.010160859  Sample1
#> 5   GAAAGAGAATTCG    0.009336593  Sample1
#> 6   AGGTGGGAATTCG    0.010369470  Sample1
#> 7   GCGCAGCAATTGG    0.007939920  Sample1
#> 9   AGGGACAAATTGG    0.008362229  Sample1
#> 10  GAGGAAGAATCGG    0.012432679  Sample1
#> 14  GGTTAGGAATTCG    0.011582972  Sample1
#> 16  CAAGGGAAATTCG    0.012557336  Sample1
#> 19  TCGATGGAATTGG    0.014465359  Sample1
#> 20  TAGAGAGAATCGG    0.011305673  Sample1
#> 21  GGGAAAGAATTGG    0.010964773  Sample1
#> 22  CGGAGAGAATCGG    0.011819567  Sample1
#> 23  AAGGGATAATTGG    0.008166339  Sample1
#> 26  GAGAATAAATTGC    0.008418198  Sample1
#> 27  GGAATAGAATTGG    0.013554596  Sample1
#> 29  GGAAAGAAATTGG    0.011921328  Sample1
#> 30  GGGCGGGAATCGG    0.010537376  Sample1
#> 37  GGGAGGGAATTGG    0.006339725  Sample1
#> 38  GAGGCGGAATCGG    0.007074950  Sample1
#> 39  AGTAGAGAATCGG    0.014307629  Sample1
#> 42  GGGAAGGAATCGG    0.012397062  Sample1
#> 45  GGGCGGGAATTGC    0.006779842  Sample1
#> 46  TGGTCGGAATTGG    0.012854988  Sample1
#> 47  GGGTGAAAATTGG    0.010926612  Sample1
#> 48  ATGGGAGAATTCG    0.008573384  Sample1
#> 50  GGGCGATAATTGG    0.009135615  Sample1
#> 53  GGGTGGGAATTGG    0.013697062  Sample1
#> 54  GAGACGGAATTGG    0.011175927  Sample1
#> 56  AGCGGAGAATTGG    0.011249704  Sample1
#> 57  GGAGCTGAATCGG    0.013335810  Sample1
#> 64  TGAGGGAAATTGG    0.004574167  Sample1
#> 66  GGATAGGAATCGG    0.006746770  Sample1
#> 68  AGGTCGGAATTGG    0.008301173  Sample1
#> 69  GTGAGGAAATTGG    0.012079058  Sample1
#> 70  GTGTGGGAATCGG    0.005960664  Sample1
#> 71  CGAGGGAAATTGG    0.010135419  Sample1
#> 73  GTGGTGGAATTGG    0.008352053  Sample1
#> 75  GGATAGGAATTCG    0.007827983  Sample1
#> 76  AGTGGAGAATTGG    0.008003521  Sample1
#> 79  AGGGTGAAATTCG    0.007204695  Sample1
#> 82  GGAGGCGAATCGG    0.012758315  Sample1
#> 84  TAGGGCCAATTGG    0.009295889  Sample1
#> 85  GAGAGCAAATCGG    0.009331505  Sample1
#> 87  AGGGAGGAATTGG    0.009099998  Sample1
#> 88  GGGAGATAATCGG    0.009171231  Sample1
#> 90  AGAGGAGAATTGG    0.008303717  Sample1
#> 92  GAAGGGAAATTGG    0.012023090  Sample1
#> 93  GAGGCTGAATCGG    0.011318393  Sample1
#> 95  GGATGAGAATCGG    0.011381994  Sample1
#> 96  AGGAGGAAATTGG    0.008306261  Sample1
#> 98  GGAACGAAATTGG    0.009013501  Sample1
#> 99  GGAGGGAAATTGC    0.007718589  Sample1
#> 100 AGAGCTGAATTGG    0.007046965  Sample1
#> 102 AAAATAAAATTGG    0.011245432  Sample2
#> 104 AAAACAAAATTGG    0.009633772  Sample2
#> 106 AAGTAGGAATTGC    0.010566252  Sample2
#> 107 AAAAAAGAATTGC    0.012219664  Sample2
#> 111 GGAGAGAAATTGC    0.012035952  Sample2
#> 113 TAAAGGAAATTGC    0.014154213  Sample2
#> 115 GGAAAAAAATTGG    0.013889778  Sample2
#> 117 CTGGCAAAATTGC    0.012369975  Sample2
#> 119 AAAAACTAATTGC    0.011498732  Sample2
#> 120 TGAAAAGAATTGG    0.007715925  Sample2
#> 121 AAAAGAAAATTCG    0.009733979  Sample2
#> 124 AAGATAAAATTGC    0.003908066  Sample2
#> 127 AGAAGAAAATTGC    0.006833548  Sample2
#> 128 GGAGGAAAATTCG    0.007493243  Sample2
#> 129 AAAAGAGAATTCG    0.007888503  Sample2
#> 130 AAAAGAGAATCGG    0.010104187  Sample2
#> 132 AGAGAAAAATCGG    0.005962306  Sample2
#> 134 AAAAATAAATTGC    0.005619932  Sample2
#> 136 AAGCAGAAATTGG    0.009703360  Sample2
#> 139 TAAAAAAAATTGC    0.008868303  Sample2
#> 140 CAGTAAAAATTCG    0.005539210  Sample2
#> 141 ACAGGACAATTGG    0.004926835  Sample2
#> 142 AGATAGAAATTCG    0.009060366  Sample2
#> 146 AGGATGTAATCGG    0.014257203  Sample2
#> 148 AAGGAAAAATTGC    0.014524421  Sample2
#> 149 AAAAATAAATCGG    0.014911331  Sample2
#> 150 GAAAAAAAATTGC    0.012589873  Sample2
#> 152 AGAAAAAAATCGG    0.007958091  Sample2
#> 153 AAAAAAAAATTGC    0.007571182  Sample2
#> 155 ACAAAAGAATTGC    0.012208530  Sample2
#> 156 GAAATAGAATTCG    0.012308737  Sample2
#> 160 AGAAAAGAATTGC    0.012266984  Sample2
#> 162 AGAGAAGAATTGG    0.010986564  Sample2
#> 163 GAAAAAGAATTCG    0.007799430  Sample2
#> 166 AAAATGGAATTCG    0.013664313  Sample2
#> 168 AAGAAGGAATTGG    0.005739624  Sample2
#> 169 AGAAGAAAATTCG    0.010374189  Sample2
#> 172 AGAGCAAAATTGG    0.007610151  Sample2
#> 175 GAAATTAAATTGC    0.014028954  Sample2
#> 176 GGGTAAAAATTGC    0.006410453  Sample2
#> 179 AAGAAGAAATTGG    0.013230083  Sample2
#> 181 AGAAGAAAATTGC    0.012854308  Sample2
#> 183 AAAAACAAATTGC    0.009338719  Sample2
#> 185 AAATAAAAATCGG    0.013355342  Sample2
#> 189 GAAGAACAATTGC    0.009104903  Sample2
#> 192 CAAAATAAATTCG    0.012854308  Sample2
#> 196 AAAAAGAAATTGC    0.011674094  Sample2
#> 197 AGAAGAAAATTGC    0.012768018  Sample2
#> 200 AAAAGAAAATTGC    0.010546767  Sample2
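
The row labels in the printed output above skip values (for example, 7 is followed by 9), which suggests that the remaining rows keep their original row names. This is how base R subsetting of a data frame behaves. The sketch below uses a made-up data frame rather than NAIR output to show that behavior and one way to renumber the rows.

```r
# Made-up data for illustration; not output from filterInputData().
df <- data.frame(CloneSeq = c("AAAA", "CC", "GGGG", "TTTT"))

# Subsetting rows keeps the original row names of the retained rows.
kept <- df[nchar(df$CloneSeq) >= 4, , drop = FALSE]
original_names <- rownames(kept)

# Setting row names to NULL renumbers the rows consecutively from 1.
rownames(kept) <- NULL
reset_names <- rownames(kept)

print(original_names)
print(reset_names)
```

Renumbering is optional. Keeping the original row names can make it easier to trace filtered rows back to the input data.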