Filter Data Rows and Subset Data Columns
Given a data frame with a column containing receptor sequences, filter data rows by sequence length and sequence content. Keep all data columns or choose which columns to keep.
Usage
filterInputData(
  data,
  seq_col,
  min_seq_length = NULL,
  drop_matches = NULL,
  subset_cols = NULL,
  count_col = NULL,
  verbose = FALSE
)
Arguments
- data
A data frame.
- seq_col
Specifies the column(s) of data containing the receptor sequences. Accepts a character or numeric vector of length 1 or 2, containing either column names or column indices. Each column specified will be coerced to a character vector. Data rows containing a value of NA in any of the specified columns will be dropped.
- min_seq_length
Observations whose receptor sequences have fewer than min_seq_length characters are dropped.
- drop_matches
Accepts a character string containing a regular expression (see regex). Checks values in the receptor sequence column for a pattern match using grep(). Rows in which a match is found are dropped.
- subset_cols
Specifies which columns of the AIRR-Seq data are included in the output. Accepts a character vector of column names or a numeric vector of column indices. The default NULL includes all columns. The receptor sequence column is always included regardless of this argument's value.
- count_col
Optional. Specifies the column of data containing a measure of abundance, e.g., clone count or unique molecular identifier (UMI) count. Accepts either the column name as a character string or the numeric column index. If provided, data rows with NA count values will be removed.
- verbose
Logical. If TRUE, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to stderr().
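The filtering rules described by the arguments above can be approximated in base R. The following is a minimal sketch for illustration only, not the NAIR source code: the helper sketch_filter() and the toy data frame are hypothetical, it handles a single sequence column only, it uses grepl() (the logical counterpart of grep()) for the pattern match, and it assumes the original column order is kept when subsetting.

```r
# Hypothetical helper: a base R approximation of the documented filtering rules.
# Assumptions: one sequence column only; original column order is preserved.
sketch_filter <- function(data, seq_col, min_seq_length = NULL,
                          drop_matches = NULL, subset_cols = NULL,
                          count_col = NULL) {
  # Coerce the sequence column to character and drop rows where it is NA
  data[[seq_col]] <- as.character(data[[seq_col]])
  data <- data[!is.na(data[[seq_col]]), , drop = FALSE]

  # Drop sequences with fewer than min_seq_length characters
  if (!is.null(min_seq_length)) {
    keep <- nchar(data[[seq_col]]) >= min_seq_length
    data <- data[keep, , drop = FALSE]
  }

  # Drop sequences that match the regular expression
  if (!is.null(drop_matches)) {
    matched <- grepl(drop_matches, data[[seq_col]])
    data <- data[!matched, , drop = FALSE]
  }

  # Drop rows whose abundance value is NA
  if (!is.null(count_col)) {
    data <- data[!is.na(data[[count_col]]), , drop = FALSE]
  }

  # Keep the requested columns, always including the sequence column
  if (!is.null(subset_cols)) {
    to_names <- function(x) if (is.numeric(x)) names(data)[x] else x
    wanted <- union(to_names(seq_col), to_names(subset_cols))
    data <- data[, names(data) %in% wanted, drop = FALSE]
  }
  data
}

# Hypothetical toy data, not produced by simulateToyData()
toy <- data.frame(
  seq = c("CASSLGQGYEQYF", "CASSGGGGYEQYF", "CASS", NA, "CASSPDRGNTEAFF"),
  count = c(5, 2, 7, 1, NA),
  sample = c("A", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

result <- sketch_filter(
  toy,
  seq_col = "seq",
  min_seq_length = 6,
  drop_matches = "GGGG",
  subset_cols = "sample",
  count_col = "count"
)
print(result)
```

In this toy input, only the first row passes every filter: the others are removed for an NA sequence, a short sequence, a "GGGG" match, or an NA count. The seq column appears in the result even though subset_cols names only sample.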
References
Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825
Author
Brian Neal (Brian.Neal@ucsf.edu)
Examples
set.seed(42)
raw_data <- simulateToyData()
# Remove sequences shorter than 13 characters,
# as well as sequences containing the subsequence "GGGG".
# Keep variables for clone sequence, clone frequency and sample ID
filterInputData(
  raw_data,
  seq_col = "CloneSeq",
  min_seq_length = 13,
  drop_matches = "GGGG",
  subset_cols = c("CloneSeq", "CloneFrequency", "SampleID"),
  verbose = TRUE
)
#> Input data contains 200 rows.
#> Removing sequences with length fewer than 13 characters...
#> Done. 136 rows remaining.
#> Removing sequences containing matches to “GGGG”...
#> Done. 105 rows remaining.
#> CloneSeq CloneFrequency SampleID
#> 1 TTGAGGAAATTCG 0.007873775 Sample1
#> 2 GGAGATGAATCGG 0.007777102 Sample1
#> 3 GTCGGGTAATTGG 0.009094910 Sample1
#> 4 GCCGGGTAATTCG 0.010160859 Sample1
#> 5 GAAAGAGAATTCG 0.009336593 Sample1
#> 6 AGGTGGGAATTCG 0.010369470 Sample1
#> 7 GCGCAGCAATTGG 0.007939920 Sample1
#> 9 AGGGACAAATTGG 0.008362229 Sample1
#> 10 GAGGAAGAATCGG 0.012432679 Sample1
#> 14 GGTTAGGAATTCG 0.011582972 Sample1
#> 16 CAAGGGAAATTCG 0.012557336 Sample1
#> 19 TCGATGGAATTGG 0.014465359 Sample1
#> 20 TAGAGAGAATCGG 0.011305673 Sample1
#> 21 GGGAAAGAATTGG 0.010964773 Sample1
#> 22 CGGAGAGAATCGG 0.011819567 Sample1
#> 23 AAGGGATAATTGG 0.008166339 Sample1
#> 26 GAGAATAAATTGC 0.008418198 Sample1
#> 27 GGAATAGAATTGG 0.013554596 Sample1
#> 29 GGAAAGAAATTGG 0.011921328 Sample1
#> 30 GGGCGGGAATCGG 0.010537376 Sample1
#> 37 GGGAGGGAATTGG 0.006339725 Sample1
#> 38 GAGGCGGAATCGG 0.007074950 Sample1
#> 39 AGTAGAGAATCGG 0.014307629 Sample1
#> 42 GGGAAGGAATCGG 0.012397062 Sample1
#> 45 GGGCGGGAATTGC 0.006779842 Sample1
#> 46 TGGTCGGAATTGG 0.012854988 Sample1
#> 47 GGGTGAAAATTGG 0.010926612 Sample1
#> 48 ATGGGAGAATTCG 0.008573384 Sample1
#> 50 GGGCGATAATTGG 0.009135615 Sample1
#> 53 GGGTGGGAATTGG 0.013697062 Sample1
#> 54 GAGACGGAATTGG 0.011175927 Sample1
#> 56 AGCGGAGAATTGG 0.011249704 Sample1
#> 57 GGAGCTGAATCGG 0.013335810 Sample1
#> 64 TGAGGGAAATTGG 0.004574167 Sample1
#> 66 GGATAGGAATCGG 0.006746770 Sample1
#> 68 AGGTCGGAATTGG 0.008301173 Sample1
#> 69 GTGAGGAAATTGG 0.012079058 Sample1
#> 70 GTGTGGGAATCGG 0.005960664 Sample1
#> 71 CGAGGGAAATTGG 0.010135419 Sample1
#> 73 GTGGTGGAATTGG 0.008352053 Sample1
#> 75 GGATAGGAATTCG 0.007827983 Sample1
#> 76 AGTGGAGAATTGG 0.008003521 Sample1
#> 79 AGGGTGAAATTCG 0.007204695 Sample1
#> 82 GGAGGCGAATCGG 0.012758315 Sample1
#> 84 TAGGGCCAATTGG 0.009295889 Sample1
#> 85 GAGAGCAAATCGG 0.009331505 Sample1
#> 87 AGGGAGGAATTGG 0.009099998 Sample1
#> 88 GGGAGATAATCGG 0.009171231 Sample1
#> 90 AGAGGAGAATTGG 0.008303717 Sample1
#> 92 GAAGGGAAATTGG 0.012023090 Sample1
#> 93 GAGGCTGAATCGG 0.011318393 Sample1
#> 95 GGATGAGAATCGG 0.011381994 Sample1
#> 96 AGGAGGAAATTGG 0.008306261 Sample1
#> 98 GGAACGAAATTGG 0.009013501 Sample1
#> 99 GGAGGGAAATTGC 0.007718589 Sample1
#> 100 AGAGCTGAATTGG 0.007046965 Sample1
#> 102 AAAATAAAATTGG 0.011245432 Sample2
#> 104 AAAACAAAATTGG 0.009633772 Sample2
#> 106 AAGTAGGAATTGC 0.010566252 Sample2
#> 107 AAAAAAGAATTGC 0.012219664 Sample2
#> 111 GGAGAGAAATTGC 0.012035952 Sample2
#> 113 TAAAGGAAATTGC 0.014154213 Sample2
#> 115 GGAAAAAAATTGG 0.013889778 Sample2
#> 117 CTGGCAAAATTGC 0.012369975 Sample2
#> 119 AAAAACTAATTGC 0.011498732 Sample2
#> 120 TGAAAAGAATTGG 0.007715925 Sample2
#> 121 AAAAGAAAATTCG 0.009733979 Sample2
#> 124 AAGATAAAATTGC 0.003908066 Sample2
#> 127 AGAAGAAAATTGC 0.006833548 Sample2
#> 128 GGAGGAAAATTCG 0.007493243 Sample2
#> 129 AAAAGAGAATTCG 0.007888503 Sample2
#> 130 AAAAGAGAATCGG 0.010104187 Sample2
#> 132 AGAGAAAAATCGG 0.005962306 Sample2
#> 134 AAAAATAAATTGC 0.005619932 Sample2
#> 136 AAGCAGAAATTGG 0.009703360 Sample2
#> 139 TAAAAAAAATTGC 0.008868303 Sample2
#> 140 CAGTAAAAATTCG 0.005539210 Sample2
#> 141 ACAGGACAATTGG 0.004926835 Sample2
#> 142 AGATAGAAATTCG 0.009060366 Sample2
#> 146 AGGATGTAATCGG 0.014257203 Sample2
#> 148 AAGGAAAAATTGC 0.014524421 Sample2
#> 149 AAAAATAAATCGG 0.014911331 Sample2
#> 150 GAAAAAAAATTGC 0.012589873 Sample2
#> 152 AGAAAAAAATCGG 0.007958091 Sample2
#> 153 AAAAAAAAATTGC 0.007571182 Sample2
#> 155 ACAAAAGAATTGC 0.012208530 Sample2
#> 156 GAAATAGAATTCG 0.012308737 Sample2
#> 160 AGAAAAGAATTGC 0.012266984 Sample2
#> 162 AGAGAAGAATTGG 0.010986564 Sample2
#> 163 GAAAAAGAATTCG 0.007799430 Sample2
#> 166 AAAATGGAATTCG 0.013664313 Sample2
#> 168 AAGAAGGAATTGG 0.005739624 Sample2
#> 169 AGAAGAAAATTCG 0.010374189 Sample2
#> 172 AGAGCAAAATTGG 0.007610151 Sample2
#> 175 GAAATTAAATTGC 0.014028954 Sample2
#> 176 GGGTAAAAATTGC 0.006410453 Sample2
#> 179 AAGAAGAAATTGG 0.013230083 Sample2
#> 181 AGAAGAAAATTGC 0.012854308 Sample2
#> 183 AAAAACAAATTGC 0.009338719 Sample2
#> 185 AAATAAAAATCGG 0.013355342 Sample2
#> 189 GAAGAACAATTGC 0.009104903 Sample2
#> 192 CAAAATAAATTCG 0.012854308 Sample2
#> 196 AAAAAGAAATTGC 0.011674094 Sample2
#> 197 AGAAGAAAATTGC 0.012768018 Sample2
#> 200 AAAAGAAAATTGC 0.010546767 Sample2