Aggregate Counts/Frequencies for Clones With Identical Receptor Sequences

Given bulk Adaptive Immune Receptor Repertoire Sequencing (AIRR-Seq) data with clones indexed by row, returns a data frame containing one row for each unique receptor sequence. The result includes the number of clones sharing each sequence, along with the aggregate clone count and aggregate clone frequency across those clones. Clones can optionally be grouped according to metadata, in which case aggregation is performed within (but not across) groups.

Usage

aggregateIdenticalClones(
  data,
  clone_col,
  count_col,
  freq_col,
  grouping_cols = NULL,
  verbose = FALSE
)

Arguments

data: A data frame containing the bulk AIRR-Seq data, with clones indexed by row.
clone_col: Specifies the column of data containing the receptor sequences. Accepts a character string containing the column name or a numeric scalar containing the column index.
count_col: Specifies the column of data containing the clone counts. Accepts a character string containing the column name or a numeric scalar containing the column index.
freq_col: Specifies the column of data containing the clone frequencies. Accepts a character string containing the column name or a numeric scalar containing the column index.
grouping_cols: An optional character vector of column names or numeric vector of column indices, specifying one or more columns of data used to assign clones to groups. If provided, aggregation occurs within groups, but not across groups. See Details.
verbose: Logical. If TRUE, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to stderr().

Details

If grouping_cols is left unspecified, the returned data frame will contain one row for each unique receptor sequence appearing in data.

If one or more columns of data are specified using the grouping_cols argument, then each clone (row) in data is assigned to a group based on its combination of values in these columns. If two clones share the same receptor sequence but belong to different groups, their receptor sequence will appear multiple times in the returned data frame, with one row for each group in which the sequence appears. In each such row, the aggregate clone count, aggregate clone frequency, and number of clones sharing the sequence are reported within the group for that row.
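
The grouping behavior described above can be illustrated with a minimal base R sketch. It is not the package's implementation: it assumes that "aggregate" means a within-group sum (as the Examples suggest), and the data frame toy and its values are hypothetical.

```r
# Hypothetical toy data: two sequences observed at two time points.
toy <- data.frame(
  clone_seq   = c("ACAC", "ACAC", "GGGG", "GGGG", "GGGG"),
  clone_count = c(3, 5, 2, 2, 1),
  clone_freq  = c(0.15, 0.25, 0.10, 0.10, 0.05),
  time_point  = c("t_0", "t_1", "t_0", "t_0", "t_1")
)

# One output row per (sequence, group) combination.
keys <- toy[c("clone_seq", "time_point")]

# Sum counts and frequencies within each combination (assumed meaning of "aggregate").
sums <- aggregate(toy[c("clone_count", "clone_freq")], by = keys, FUN = sum)

# Count the rows (clones) that fall into each combination.
n_clones <- aggregate(
  data.frame(UniqueCloneCount = toy$clone_count),
  by = keys,
  FUN = length
)

# Combine the two summaries and use the output column names documented under Value.
result <- merge(sums, n_clones, by = c("clone_seq", "time_point"))
names(result)[3:4] <- c("AggregatedCloneCount", "AggregatedCloneFrequency")
result
```

Here "GGGG" appears in two rows of the result, one per time point, because its three clones are split across the groups t_0 and t_1 and are never combined across them.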

Value

A data frame whose first column contains the receptor sequences and has the same name as the column of data specified by clone_col. If grouping_cols is used, one additional column is present for each grouping column, keeping its original column name. The remaining columns are as follows:

AggregatedCloneCount: The aggregate clone count across all clones (within the same group, if applicable) that share the receptor sequence in that row.
AggregatedCloneFrequency: The aggregate clone frequency across all clones (within the same group, if applicable) that share the receptor sequence in that row.
UniqueCloneCount: The number of clones (rows) in data (within the same group, if applicable) possessing the receptor sequence for the current row.
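
As a rough consistency check on these three columns, the following base R sketch computes them by hand for the ungrouped case and verifies that aggregation preserves totals. It assumes the aggregate values are sums, and the data frame clones is hypothetical; it does not call the package.

```r
# Hypothetical input: three clones, two of which share a receptor sequence.
clones <- data.frame(
  clone_seq   = c("ATCG", "ACAC", "ACAC"),
  clone_count = c(10, 4, 6),
  clone_freq  = c(0.5, 0.2, 0.3)
)

# Hand-computed analogues of the three value columns, one row per unique sequence.
# tapply() and table() both order their results by the sorted unique sequences.
agg <- data.frame(
  clone_seq                = sort(unique(clones$clone_seq)),
  AggregatedCloneCount     = as.vector(tapply(clones$clone_count, clones$clone_seq, sum)),
  AggregatedCloneFrequency = as.vector(tapply(clones$clone_freq, clones$clone_seq, sum)),
  UniqueCloneCount         = as.vector(table(clones$clone_seq))
)

# Summing within sequences regroups rows but leaves the overall totals unchanged.
stopifnot(
  sum(agg$AggregatedCloneCount) == sum(clones$clone_count),
  isTRUE(all.equal(sum(agg$AggregatedCloneFrequency), sum(clones$clone_freq))),
  sum(agg$UniqueCloneCount) == nrow(clones)
)
```

Under the same assumption, checks of this kind can be applied to the function's actual output to confirm that no clones were dropped or double-counted.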

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Author

Brian Neal (Brian.Neal@ucsf.edu)

Examples

my_data <- data.frame(
  clone_seq = c("ATCG", rep("ACAC", 2), rep("GGGG", 4)),
  clone_count = rep(1, 7),
  clone_freq = rep(1/7, 7),
  time_point = c("t_0", rep(c("t_0", "t_1"), 3)),
  subject_id = c(rep(1, 5), rep(2, 2))
)
my_data
#>   clone_seq clone_count clone_freq time_point subject_id
#> 1      ATCG           1  0.1428571        t_0          1
#> 2      ACAC           1  0.1428571        t_0          1
#> 3      ACAC           1  0.1428571        t_1          1
#> 4      GGGG           1  0.1428571        t_0          1
#> 5      GGGG           1  0.1428571        t_1          1
#> 6      GGGG           1  0.1428571        t_0          2
#> 7      GGGG           1  0.1428571        t_1          2

# aggregate across all clones (no grouping)
aggregateIdenticalClones(
  my_data,
  "clone_seq",
  "clone_count",
  "clone_freq"
)
#>   clone_seq AggregatedCloneCount AggregatedCloneFrequency UniqueCloneCount
#> 1      ACAC                    2                0.2857143                2
#> 2      ATCG                    1                0.1428571                1
#> 3      GGGG                    4                0.5714286                4

# group clones by time point
aggregateIdenticalClones(
  my_data,
  "clone_seq",
  "clone_count",
  "clone_freq",
  grouping_cols = "time_point"
)
#>   clone_seq time_point AggregatedCloneCount AggregatedCloneFrequency
#> 1      ACAC        t_0                    1                0.1428571
#> 2      ACAC        t_1                    1                0.1428571
#> 3      ATCG        t_0                    1                0.1428571
#> 4      GGGG        t_0                    2                0.2857143
#> 5      GGGG        t_1                    2                0.2857143
#>   UniqueCloneCount
#> 1                1
#> 2                1
#> 3                1
#> 4                2
#> 5                2

# group clones by subject ID
aggregateIdenticalClones(
  my_data,
  "clone_seq",
  "clone_count",
  "clone_freq",
  grouping_cols = "subject_id"
)
#>   clone_seq subject_id AggregatedCloneCount AggregatedCloneFrequency
#> 1      ACAC          1                    2                0.2857143
#> 2      ATCG          1                    1                0.1428571
#> 3      GGGG          1                    2                0.2857143
#> 4      GGGG          2                    2                0.2857143
#>   UniqueCloneCount
#> 1                2
#> 2                1
#> 3                2
#> 4                2

# group clones by time point and subject ID
aggregateIdenticalClones(
  my_data,
  "clone_seq",
  "clone_count",
  "clone_freq",
  grouping_cols =
    c("subject_id", "time_point")
)
#>   clone_seq subject_id time_point AggregatedCloneCount AggregatedCloneFrequency
#> 1      ACAC          1        t_0                    1                0.1428571
#> 2      ACAC          1        t_1                    1                0.1428571
#> 3      ATCG          1        t_0                    1                0.1428571
#> 4      GGGG          1        t_0                    1                0.1428571
#> 5      GGGG          1        t_1                    1                0.1428571
#> 6      GGGG          2        t_0                    1                0.1428571
#> 7      GGGG          2        t_1                    1                0.1428571
#>   UniqueCloneCount
#> 1                1
#> 2                1
#> 3                1
#> 4                1
#> 5                1
#> 6                1
#> 7                1