Filter Data Rows and Subset Data Columns
Given a data frame with a column containing receptor sequences, filter data rows by sequence length and sequence content. Keep all data columns or choose which columns to keep.
Usage
filterInputData(
  data,
  seq_col,
  min_seq_length = NULL,
  drop_matches = NULL,
  subset_cols = NULL,
  count_col = NULL,
  verbose = FALSE
)
Arguments
- data
A data frame.
- seq_col
Specifies the column(s) of data containing the receptor sequences. Accepts a character or numeric vector of length 1 or 2, containing either column names or column indices. Each column specified will be coerced to a character vector. Data rows containing a value of NA in any of the specified columns will be dropped.
- min_seq_length
Observations whose receptor sequences have fewer than min_seq_length characters are dropped.
- drop_matches
Accepts a character string containing a regular expression (see regex). Checks values in the receptor sequence column for a pattern match using grep(). Rows in which a match is found are dropped.
- subset_cols
Specifies which columns of the AIRR-Seq data are included in the output. Accepts a character vector of column names or a numeric vector of column indices. The default NULL includes all columns. The receptor sequence column is always included regardless of this argument's value.
- count_col
Optional. Specifies the column of data containing a measure of abundance, e.g., clone count or unique molecular identifier (UMI) count. Accepts either the column name as a character string or the numeric column index. If provided, data rows with NA count values will be removed.
- verbose
Logical. If TRUE, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to stderr().
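The filtering steps described by these arguments can be approximated in base R. The sketch below is not the package's implementation: sketch_filter() and the toy data frame are hypothetical, it assumes seq_col and subset_cols are given as column names (not indices), it handles a single sequence column, and it uses grepl() rather than grep() to obtain a logical mask. The column order of the real function's output is not assumed to match.

```r
# Hypothetical base-R approximation of the documented filtering steps.
# Assumes seq_col and subset_cols are column names, not numeric indices.
sketch_filter <- function(data, seq_col, min_seq_length = NULL,
                          drop_matches = NULL, subset_cols = NULL) {
  # Coerce the sequence column to character and drop rows with NA sequences
  data[[seq_col]] <- as.character(data[[seq_col]])
  data <- data[!is.na(data[[seq_col]]), , drop = FALSE]

  # Drop sequences with fewer than min_seq_length characters
  if (!is.null(min_seq_length)) {
    data <- data[nchar(data[[seq_col]]) >= min_seq_length, , drop = FALSE]
  }

  # Drop rows whose sequence matches the regular expression
  if (!is.null(drop_matches)) {
    data <- data[!grepl(drop_matches, data[[seq_col]]), , drop = FALSE]
  }

  # Keep the requested columns, always including the sequence column
  if (!is.null(subset_cols)) {
    data <- data[, union(seq_col, subset_cols), drop = FALSE]
  }
  data
}

# Made-up data for illustration only
toy <- data.frame(
  CloneSeq = c("AAAT", "GGGGAATTC", NA, "ACGTACGTA"),
  CloneCount = c(5, 3, 2, 7),
  SampleID = c("S1", "S1", "S2", "S2"),
  stringsAsFactors = FALSE
)

result <- sketch_filter(
  toy,
  seq_col = "CloneSeq",
  min_seq_length = 6,
  drop_matches = "GGGG",
  subset_cols = "SampleID"
)
print(result)
```

In this toy case, the NA row, the 4-character sequence and the sequence containing "GGGG" are each removed by a different step, so only the last row survives.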
References
Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825
Author
Brian Neal (Brian.Neal@ucsf.edu)
Examples
set.seed(42)
raw_data <- simulateToyData()
# Remove sequences shorter than 13 characters,
# as well as sequences containing the subsequence "GGGG".
# Keep variables for clone sequence, clone frequency and sample ID
filterInputData(
  raw_data,
  seq_col = "CloneSeq",
  min_seq_length = 13,
  drop_matches = "GGGG",
  subset_cols =
    c("CloneSeq", "CloneFrequency", "SampleID"),
  verbose = TRUE
)
#> Input data contains 200 rows.
#> Removing sequences with length fewer than 13 characters...
#> Done. 136 rows remaining.
#> Removing sequences containing matches to “GGGG”...
#> Done. 105 rows remaining.
#> CloneSeq CloneFrequency SampleID
#> 1 TTGAGGAAATTCG 0.007873775 Sample1
#> 2 GGAGATGAATCGG 0.007777102 Sample1
#> 3 GTCGGGTAATTGG 0.009094910 Sample1
#> 4 GCCGGGTAATTCG 0.010160859 Sample1
#> 5 GAAAGAGAATTCG 0.009336593 Sample1
#> 6 AGGTGGGAATTCG 0.010369470 Sample1
#> 7 GCGCAGCAATTGG 0.007939920 Sample1
#> 9 AGGGACAAATTGG 0.008362229 Sample1
#> 10 GAGGAAGAATCGG 0.012432679 Sample1
#> 14 GGTTAGGAATTCG 0.011582972 Sample1
#> 16 CAAGGGAAATTCG 0.012557336 Sample1
#> 19 TCGATGGAATTGG 0.014465359 Sample1
#> 20 TAGAGAGAATCGG 0.011305673 Sample1
#> 21 GGGAAAGAATTGG 0.010964773 Sample1
#> 22 CGGAGAGAATCGG 0.011819567 Sample1
#> 23 AAGGGATAATTGG 0.008166339 Sample1
#> 26 GAGAATAAATTGC 0.008418198 Sample1
#> 27 GGAATAGAATTGG 0.013554596 Sample1
#> 29 GGAAAGAAATTGG 0.011921328 Sample1
#> 30 GGGCGGGAATCGG 0.010537376 Sample1
#> 37 GGGAGGGAATTGG 0.006339725 Sample1
#> 38 GAGGCGGAATCGG 0.007074950 Sample1
#> 39 AGTAGAGAATCGG 0.014307629 Sample1
#> 42 GGGAAGGAATCGG 0.012397062 Sample1
#> 45 GGGCGGGAATTGC 0.006779842 Sample1
#> 46 TGGTCGGAATTGG 0.012854988 Sample1
#> 47 GGGTGAAAATTGG 0.010926612 Sample1
#> 48 ATGGGAGAATTCG 0.008573384 Sample1
#> 50 GGGCGATAATTGG 0.009135615 Sample1
#> 53 GGGTGGGAATTGG 0.013697062 Sample1
#> 54 GAGACGGAATTGG 0.011175927 Sample1
#> 56 AGCGGAGAATTGG 0.011249704 Sample1
#> 57 GGAGCTGAATCGG 0.013335810 Sample1
#> 64 TGAGGGAAATTGG 0.004574167 Sample1
#> 66 GGATAGGAATCGG 0.006746770 Sample1
#> 68 AGGTCGGAATTGG 0.008301173 Sample1
#> 69 GTGAGGAAATTGG 0.012079058 Sample1
#> 70 GTGTGGGAATCGG 0.005960664 Sample1
#> 71 CGAGGGAAATTGG 0.010135419 Sample1
#> 73 GTGGTGGAATTGG 0.008352053 Sample1
#> 75 GGATAGGAATTCG 0.007827983 Sample1
#> 76 AGTGGAGAATTGG 0.008003521 Sample1
#> 79 AGGGTGAAATTCG 0.007204695 Sample1
#> 82 GGAGGCGAATCGG 0.012758315 Sample1
#> 84 TAGGGCCAATTGG 0.009295889 Sample1
#> 85 GAGAGCAAATCGG 0.009331505 Sample1
#> 87 AGGGAGGAATTGG 0.009099998 Sample1
#> 88 GGGAGATAATCGG 0.009171231 Sample1
#> 90 AGAGGAGAATTGG 0.008303717 Sample1
#> 92 GAAGGGAAATTGG 0.012023090 Sample1
#> 93 GAGGCTGAATCGG 0.011318393 Sample1
#> 95 GGATGAGAATCGG 0.011381994 Sample1
#> 96 AGGAGGAAATTGG 0.008306261 Sample1
#> 98 GGAACGAAATTGG 0.009013501 Sample1
#> 99 GGAGGGAAATTGC 0.007718589 Sample1
#> 100 AGAGCTGAATTGG 0.007046965 Sample1
#> 102 AAAATAAAATTGG 0.011245432 Sample2
#> 104 AAAACAAAATTGG 0.009633772 Sample2
#> 106 AAGTAGGAATTGC 0.010566252 Sample2
#> 107 AAAAAAGAATTGC 0.012219664 Sample2
#> 111 GGAGAGAAATTGC 0.012035952 Sample2
#> 113 TAAAGGAAATTGC 0.014154213 Sample2
#> 115 GGAAAAAAATTGG 0.013889778 Sample2
#> 117 CTGGCAAAATTGC 0.012369975 Sample2
#> 119 AAAAACTAATTGC 0.011498732 Sample2
#> 120 TGAAAAGAATTGG 0.007715925 Sample2
#> 121 AAAAGAAAATTCG 0.009733979 Sample2
#> 124 AAGATAAAATTGC 0.003908066 Sample2
#> 127 AGAAGAAAATTGC 0.006833548 Sample2
#> 128 GGAGGAAAATTCG 0.007493243 Sample2
#> 129 AAAAGAGAATTCG 0.007888503 Sample2
#> 130 AAAAGAGAATCGG 0.010104187 Sample2
#> 132 AGAGAAAAATCGG 0.005962306 Sample2
#> 134 AAAAATAAATTGC 0.005619932 Sample2
#> 136 AAGCAGAAATTGG 0.009703360 Sample2
#> 139 TAAAAAAAATTGC 0.008868303 Sample2
#> 140 CAGTAAAAATTCG 0.005539210 Sample2
#> 141 ACAGGACAATTGG 0.004926835 Sample2
#> 142 AGATAGAAATTCG 0.009060366 Sample2
#> 146 AGGATGTAATCGG 0.014257203 Sample2
#> 148 AAGGAAAAATTGC 0.014524421 Sample2
#> 149 AAAAATAAATCGG 0.014911331 Sample2
#> 150 GAAAAAAAATTGC 0.012589873 Sample2
#> 152 AGAAAAAAATCGG 0.007958091 Sample2
#> 153 AAAAAAAAATTGC 0.007571182 Sample2
#> 155 ACAAAAGAATTGC 0.012208530 Sample2
#> 156 GAAATAGAATTCG 0.012308737 Sample2
#> 160 AGAAAAGAATTGC 0.012266984 Sample2
#> 162 AGAGAAGAATTGG 0.010986564 Sample2
#> 163 GAAAAAGAATTCG 0.007799430 Sample2
#> 166 AAAATGGAATTCG 0.013664313 Sample2
#> 168 AAGAAGGAATTGG 0.005739624 Sample2
#> 169 AGAAGAAAATTCG 0.010374189 Sample2
#> 172 AGAGCAAAATTGG 0.007610151 Sample2
#> 175 GAAATTAAATTGC 0.014028954 Sample2
#> 176 GGGTAAAAATTGC 0.006410453 Sample2
#> 179 AAGAAGAAATTGG 0.013230083 Sample2
#> 181 AGAAGAAAATTGC 0.012854308 Sample2
#> 183 AAAAACAAATTGC 0.009338719 Sample2
#> 185 AAATAAAAATCGG 0.013355342 Sample2
#> 189 GAAGAACAATTGC 0.009104903 Sample2
#> 192 CAAAATAAATTCG 0.012854308 Sample2
#> 196 AAAAAGAAATTGC 0.011674094 Sample2
#> 197 AGAAGAAAATTGC 0.012768018 Sample2
#> 200 AAAAGAAAATTGC 0.010546767 Sample2
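The printed result keeps the row names of the input data, which is why the numbering has gaps (for example, row 8 is absent). If consecutive row names are preferred, they can be reset in base R. The following is a minimal sketch on a made-up data frame, not on the output of simulateToyData():

```r
# Made-up data for illustration only
toy <- data.frame(
  CloneSeq = c("AAAT", "ACGTACGTA", "TTGACCA"),
  stringsAsFactors = FALSE
)

# Row subsetting keeps the original row names
kept <- toy[nchar(toy$CloneSeq) >= 6, , drop = FALSE]
names_before <- rownames(kept)

# Assigning NULL resets row names to consecutive integers
rownames(kept) <- NULL
names_after <- rownames(kept)

print(names_before)
print(names_after)
```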