Aggregate Counts/Frequencies for Clones With Identical Receptor Sequences
Given bulk Adaptive Immune Receptor Repertoire Sequencing (AIRR-Seq) data with clones indexed by row, returns a data frame containing one row for each unique receptor sequence. Includes the number of clones sharing each sequence, as well as aggregate values for clone count and clone frequency across all clones sharing each sequence. Clones can be grouped according to metadata, in which case aggregation is performed within (but not across) groups.
Usage
aggregateIdenticalClones(
  data,
  clone_col,
  count_col,
  freq_col,
  grouping_cols = NULL,
  verbose = FALSE
)
Arguments
- data
A data frame containing the bulk AIRR-Seq data, with clones indexed by row.
- clone_col
Specifies the column of data containing the receptor sequences. Accepts a character string containing the column name or a numeric scalar containing the column index.
- count_col
Specifies the column of data containing the clone counts. Accepts a character string containing the column name or a numeric scalar containing the column index.
- freq_col
Specifies the column of data containing the clone frequencies. Accepts a character string containing the column name or a numeric scalar containing the column index.
- grouping_cols
An optional character vector of column names or numeric vector of column indices, specifying one or more columns of data used to assign clones to groups. If provided, aggregation occurs within groups, but not across groups. See Details.
- verbose
Logical. If TRUE, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to stderr().
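The column arguments above accept either a name or an index. As a rough illustration of what that equivalence means, the following base R sketch resolves both forms to the same column name. The helper resolve_column and the toy data are hypothetical and are not part of the package; how the package validates these arguments internally is not shown here.

```r
# Toy input; the column names are illustrative assumptions.
toy <- data.frame(
  clone_seq   = c("ATCG", "ACAC"),
  clone_count = c(3, 5)
)

# Hypothetical helper (not part of the package): converts a column
# name or a column index into a validated column name.
resolve_column <- function(data, col) {
  if (is.character(col) && length(col) == 1 && col %in% names(data)) {
    return(col)
  }
  if (is.numeric(col) && length(col) == 1 &&
      col >= 1 && col <= ncol(data) && col == round(col)) {
    return(names(data)[col])
  }
  stop("column must be a valid name or index")
}

by_name  <- resolve_column(toy, "clone_count")  # specified by name
by_index <- resolve_column(toy, 2)              # specified by index
identical(by_name, by_index)
```

Specifying columns by name is usually the more robust choice, since an index silently points at a different column if the column order of data changes.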
Details
If grouping_cols is left unspecified, the returned data frame will contain one row for each unique receptor sequence appearing in data.
If one or more columns of data are specified using the grouping_cols argument, then each clone (row) in data is assigned to a group based on its combination of values in these columns. If two clones share the same receptor sequence but belong to different groups, their receptor sequence will appear multiple times in the returned data frame, with one row for each group in which the sequence appears. In each such row, the aggregate clone count, aggregate clone frequency, and number of clones sharing the sequence are reported within the group for that row.
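The grouping behavior described above can be approximated in base R. The sketch below is not the package's implementation: aggregate_sketch is a hypothetical helper built on stats::aggregate(), the toy data and its column names are assumptions, and the number of clones sharing a sequence is taken to be the number of rows per sequence (within each group).

```r
# Toy input; the column names are illustrative assumptions.
toy <- data.frame(
  clone_seq   = c("ATCG", "ACAC", "ACAC", "GGGG", "GGGG"),
  clone_count = c(1, 2, 3, 4, 5),
  clone_freq  = c(1, 2, 3, 4, 5) / 15,
  time_point  = c("t_0", "t_0", "t_1", "t_0", "t_0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of the package): sums counts and
# frequencies, and counts rows, per sequence within each group.
aggregate_sketch <- function(data, clone_col, count_col, freq_col,
                             grouping_cols = NULL) {
  # Grouping keys: the sequence column plus any grouping columns
  keys <- data[, c(clone_col, grouping_cols), drop = FALSE]
  vals <- data.frame(
    AggregatedCloneCount     = data[[count_col]],
    AggregatedCloneFrequency = data[[freq_col]],
    UniqueCloneCount         = rep(1L, nrow(data))  # one per row
  )
  stats::aggregate(vals, by = keys, FUN = sum)
}

# No grouping: one row per unique sequence
ungrouped <- aggregate_sketch(toy, "clone_seq", "clone_count", "clone_freq")

# Grouping by time point: "ACAC" appears once per time point
grouped <- aggregate_sketch(toy, "clone_seq", "clone_count", "clone_freq",
                            grouping_cols = "time_point")
```

In this toy data, "ACAC" occurs at both time points, so it collapses to a single row without grouping but keeps one row per time point when time_point is used as a grouping column.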
Value
A data frame whose first column contains the receptor sequences and has the same name as the column of data specified by clone_col. One additional column will be present for each column of data that is specified using the grouping_cols argument, with each having the same column name. The remaining columns are as follows:
- AggregatedCloneCount
The aggregate clone count across all clones (within the same group, if applicable) that share the receptor sequence in that row.
- AggregatedCloneFrequency
The aggregate clone frequency across all clones (within the same group, if applicable) that share the receptor sequence in that row.
- UniqueCloneCount
The number of clones (rows) in data (within the same group, if applicable) possessing the receptor sequence for the current row.
References
Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825
Author
Brian Neal (Brian.Neal@ucsf.edu)
Examples
my_data <- data.frame(
  clone_seq = c("ATCG", rep("ACAC", 2), rep("GGGG", 4)),
  clone_count = rep(1, 7),
  clone_freq = rep(1/7, 7),
  time_point = c("t_0", rep(c("t_0", "t_1"), 3)),
  subject_id = c(rep(1, 5), rep(2, 2))
)
my_data
#>   clone_seq clone_count clone_freq time_point subject_id
#> 1      ATCG           1  0.1428571        t_0          1
#> 2      ACAC           1  0.1428571        t_0          1
#> 3      ACAC           1  0.1428571        t_1          1
#> 4      GGGG           1  0.1428571        t_0          1
#> 5      GGGG           1  0.1428571        t_1          1
#> 6      GGGG           1  0.1428571        t_0          2
#> 7      GGGG           1  0.1428571        t_1          2
aggregateIdenticalClones(
  my_data,
  "clone_seq",
  "clone_count",
  "clone_freq"
)
#>   clone_seq AggregatedCloneCount AggregatedCloneFrequency UniqueCloneCount
#> 1      ACAC                    2                0.2857143                1
#> 2      ATCG                    1                0.1428571                1
#> 3      GGGG                    4                0.5714286                1
# group clones by time point
aggregateIdenticalClones(
  my_data,
  "clone_seq",
  "clone_count",
  "clone_freq",
  grouping_cols = "time_point"
)
#>   clone_seq time_point AggregatedCloneCount AggregatedCloneFrequency
#> 1      ACAC        t_0                    1                0.1428571
#> 2      ACAC        t_1                    1                0.1428571
#> 3      ATCG        t_0                    1                0.1428571
#> 4      GGGG        t_0                    2                0.2857143
#> 5      GGGG        t_1                    2                0.2857143
#>   UniqueCloneCount
#> 1                1
#> 2                1
#> 3                1
#> 4                1
#> 5                1
# group clones by subject ID
aggregateIdenticalClones(
  my_data,
  "clone_seq",
  "clone_count",
  "clone_freq",
  grouping_cols = "subject_id"
)
#>   clone_seq subject_id AggregatedCloneCount AggregatedCloneFrequency
#> 1      ACAC          1                    2                0.2857143
#> 2      ATCG          1                    1                0.1428571
#> 3      GGGG          1                    2                0.2857143
#> 4      GGGG          2                    2                0.2857143
#>   UniqueCloneCount
#> 1                1
#> 2                1
#> 3                1
#> 4                1
# group clones by time point and subject ID
aggregateIdenticalClones(
  my_data,
  "clone_seq",
  "clone_count",
  "clone_freq",
  grouping_cols = c("subject_id", "time_point")
)
#>   clone_seq subject_id time_point AggregatedCloneCount AggregatedCloneFrequency
#> 1      ACAC          1        t_0                    1                0.1428571
#> 2      ACAC          1        t_1                    1                0.1428571
#> 3      ATCG          1        t_0                    1                0.1428571
#> 4      GGGG          1        t_0                    1                0.1428571
#> 5      GGGG          1        t_1                    1                0.1428571
#> 6      GGGG          2        t_0                    1                0.1428571
#> 7      GGGG          2        t_1                    1                0.1428571
#>   UniqueCloneCount
#> 1                1
#> 2                1
#> 3                1
#> 4                1
#> 5                1
#> 6                1
#> 7                1