Find Public Clusters Among RepSeq Samples
Part of the workflow Searching for Public TCR/BCR Clusters.
Given multiple samples of bulk Adaptive Immune Receptor Repertoire Sequencing (AIRR-Seq) data, construct the repertoire network for each sample. Within each sample's network, perform cluster analysis and filter the clusters based on node count and aggregate clone count.
Usage
findPublicClusters(
  ## Input ##
  file_list,
  input_type,
  data_symbols = NULL,
  header, sep, read.args,
  sample_ids =
    paste0("Sample", 1:length(file_list)),
  seq_col,
  count_col = NULL,
  ## Search Criteria ##
  min_seq_length = 3,
  drop_matches = "[*|_]",
  top_n_clusters = 20,
  min_node_count = 10,
  min_clone_count = 100,
  ## Optional Visualization ##
  plots = FALSE,
  print_plots = FALSE,
  plot_title = "auto",
  color_nodes_by = "cluster_id",
  ## Output ##
  output_dir,
  output_type = "rds",
  ## Optional Output ##
  output_dir_unfiltered = NULL,
  output_type_unfiltered = "rds",
  verbose = FALSE,
  ...
)
Arguments
- file_list
A character vector of file paths, or a list containing connections and file paths. Each element corresponds to a single file containing the data for a single sample. Passed to loadDataFromFileList().
- input_type
A character string specifying the file format of the sample data files. Options are "table", "txt", "tsv", "csv", "rds" and "rda". Passed to loadDataFromFileList().
- data_symbols
Used when input_type = "rda". Specifies the name of each sample's data frame within its respective Rdata file. Passed to loadDataFromFileList().
- header
For values of input_type other than "rds" and "rda", this argument can be used to specify a non-default value of the header argument to read.table(), read.csv(), etc.
- sep
For values of input_type other than "rds" and "rda", this argument can be used to specify a non-default value of the sep argument to read.table(), read.csv(), etc.
- read.args
For values of input_type other than "rds" and "rda", this argument can be used to specify non-default values of optional arguments to read.table(), read.csv(), etc. Accepts a named list of argument values. Values of header and sep in this list take precedence over values specified via the header and sep arguments.
- sample_ids
A character or numeric vector of sample IDs, whose length matches that of file_list. The values should be valid for use as filenames and should avoid using the forward slash or backslash characters (/ or \).
- seq_col
Specifies the column of each sample's data frame containing the TCR/BCR sequences. Accepts a character string containing the column name or a numeric scalar containing the column index.
- count_col
Specifies the column of each sample's data frame containing the clone count (measure of clonal abundance). Accepts a character string containing the column name or a numeric scalar containing the column index. If NULL, the clusters in each sample's network will be selected solely based upon node count.
- min_seq_length
Passed to buildRepSeqNetwork() when constructing the network for each sample.
- drop_matches
Passed to buildRepSeqNetwork() when constructing the network for each sample. Accepts a character string containing a regular expression (see regex). Checks TCR/BCR sequences for a pattern match using grep(). Those returning a match are dropped. By default, sequences containing any of the characters *, | or _ are dropped.
- top_n_clusters
The number of clusters from each sample to be automatically included among the filtered clusters, based on greatest node count.
- min_node_count
Clusters with at least this many nodes will be included among the filtered clusters.
- min_clone_count
Clusters with an aggregate clone count of at least this value will be included among the filtered clusters. A value of NULL ignores this criterion and does not select additional clusters based on clone count.
- plots
Passed to buildRepSeqNetwork() when constructing the network for each sample.
- print_plots
Passed to buildRepSeqNetwork() when constructing the network for each sample.
- plot_title
Passed to buildRepSeqNetwork() when constructing the network for each sample.
- color_nodes_by
Passed to buildRepSeqNetwork() when constructing the network for each sample.
- output_dir
The file path of the directory for saving the output. The directory will be created if it does not already exist.
- output_type
A character string specifying the file format to use for saving the output. Valid options include "csv", "rds" and "rda".
- output_dir_unfiltered
An optional directory for saving the unfiltered network data for each sample. By default, only the filtered results are saved.
- output_type_unfiltered
A character string specifying the file format to use for saving the unfiltered network data for each sample. Only applicable if output_dir_unfiltered is non-null. Passed to buildRepSeqNetwork() when constructing the network for each sample.
- verbose
Logical. If TRUE, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to stderr().
- ...
Other arguments to buildRepSeqNetwork() when constructing the network for each sample, not including node_stats, stats_to_include, cluster_stats, cluster_id_name or output_name (see details).
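The effect of min_seq_length and drop_matches on the input sequences can be pictured with a small base R sketch. This is an illustration only, not the package's internal code: the vector seqs is made up, grepl() is used as the logical counterpart of grep(), and min_seq_length is assumed to be a lower bound on the number of characters in a sequence.

```r
# Hypothetical toy sequences; only the first is long enough and free of
# the characters *, | and _.
seqs <- c("CASSLGF", "CAS*LGF", "CA_SLGF", "CASS|GF", "CA")

min_seq_length <- 3        # default shown in Usage
drop_matches   <- "[*|_]"  # default shown in Usage

# Assumption: sequences shorter than min_seq_length are removed.
long_enough <- nchar(seqs) >= min_seq_length

# Sequences matching the regular expression are dropped.
has_match <- grepl(drop_matches, seqs)

kept <- seqs[long_enough & !has_match]
kept
```

Setting either argument to NULL, as in the Examples section, would correspond to skipping the respective check.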
Details
Each sample's network is constructed using an individual call to buildNet() with
node_stats = TRUE, stats_to_include = "all", cluster_stats = TRUE and
cluster_id_name = "ClusterIDInSample".
The node-level properties are renamed to reflect their
correspondence to the sample-level network. Specifically, the properties are named:
SampleLevelNetworkDegree
SampleLevelTransitivity
SampleLevelCloseness
SampleLevelCentralityByCloseness
SampleLevelCentralityByEigen
SampleLevelEigenCentrality
SampleLevelBetweenness
SampleLevelCentralityByBetweenness
SampleLevelAuthorityScore
SampleLevelCoreness
SampleLevelPageRank
A variable SampleID is added to both the node-level and cluster-level metadata for each sample.
After the clusters in each sample are filtered, the node-level and cluster-level
metadata are saved in the respective subdirectories node_meta_data and
cluster_meta_data of the output directory specified by output_dir.
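The filtering step can be read as a union of the three criteria described under Arguments: a cluster is kept if it ranks among the top_n_clusters by node count, or has at least min_node_count nodes, or has an aggregate clone count of at least min_clone_count. The following base R sketch illustrates that reading on a made-up table of cluster statistics. The column names and the tie-breaking rule are assumptions for illustration and are not taken from the package.

```r
# Hypothetical cluster-level summary for one sample
clusters <- data.frame(
  cluster_id      = 1:6,
  node_count      = c(40, 12, 9, 4, 3, 2),
  agg_clone_count = c(500, 80, 60, 150, 20, 10)
)

top_n_clusters  <- 2
min_node_count  <- 10
min_clone_count <- 100

# Criterion 1: the top_n_clusters largest clusters by node count
# (ties broken by row order; an assumption made for this sketch)
by_rank <- rank(-clusters$node_count, ties.method = "first") <= top_n_clusters

# Criterion 2: at least min_node_count nodes
by_nodes <- clusters$node_count >= min_node_count

# Criterion 3: aggregate clone count of at least min_clone_count;
# a NULL value disables this criterion
by_clones <- if (is.null(min_clone_count)) {
  rep(FALSE, nrow(clusters))
} else {
  clusters$agg_clone_count >= min_clone_count
}

# A cluster passes the filter if it meets any one of the criteria
filtered <- clusters[by_rank | by_nodes | by_clones, ]
filtered$cluster_id
```

In this toy table, cluster 4 has only 4 nodes but is retained because of its clone count, which shows why supplying count_col can change the set of selected clusters.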
The unfiltered network results for each sample can also be saved by supplying a
directory to output_dir_unfiltered, if these results are desired for
downstream analysis. Each sample's unfiltered network results will then be saved
to its own subdirectory created within this directory.
The files containing the node-level metadata for the filtered clusters can be
supplied to buildPublicClusterNetwork()
in order to construct a global
network of public clusters. If the full global network is too large to practically
construct, the files containing the cluster-level metadata for the filtered
clusters can be supplied to
buildPublicClusterNetworkByRepresentative()
to build a global network using only a single representative sequence from each
cluster. This allows prominent public clusters to still be identified.
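As a rough sketch of how the saved files might be collected for these downstream functions, the base R code below lists the contents of the two subdirectories. It does not call the package: the directory out_dir and the placeholder files named Sample1.rds and Sample2.rds are stand-ins created only so that the sketch runs on its own, and the actual file names and contents written by findPublicClusters() may differ.

```r
# Stand-in for the directory that was passed to output_dir (hypothetical)
out_dir     <- file.path(tempdir(), "public_clusters_demo")
node_dir    <- file.path(out_dir, "node_meta_data")
cluster_dir <- file.path(out_dir, "cluster_meta_data")
dir.create(node_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(cluster_dir, recursive = TRUE, showWarnings = FALSE)

# Placeholder files so the sketch is self-contained; real output files
# hold the filtered node-level and cluster-level metadata instead.
for (id in c("Sample1", "Sample2")) {
  saveRDS(data.frame(SampleID = id), file.path(node_dir, paste0(id, ".rds")))
  saveRDS(data.frame(SampleID = id), file.path(cluster_dir, paste0(id, ".rds")))
}

# File lists of the kind that could be supplied to the downstream functions
node_files    <- list.files(node_dir, full.names = TRUE)
cluster_files <- list.files(cluster_dir, full.names = TRUE)

# The per-sample files can also be inspected directly
node_meta <- do.call(rbind, lapply(node_files, readRDS))
unique(node_meta$SampleID)

# Remove the stand-in directory
unlink(out_dir, recursive = TRUE)
```

The node-level file list corresponds to the input for the full global network, and the cluster-level file list to the input for the representative-sequence variant.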
See the Searching for Public TCR/BCR Clusters article on the package website.
References
Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825
Author
Brian Neal (Brian.Neal@ucsf.edu)
Examples
set.seed(42)
## Simulate 30 samples with a mix of public/private sequences ##
samples <- 30
sample_size <- 30 # (seqs per sample)
base_seqs <- c(
  "CASSIEGQLSTDTQYF", "CASSEEGQLSTDTQYF", "CASSSVETQYF",
  "CASSPEGQLSTDTQYF", "RASSLAGNTEAFF", "CASSHRGTDTQYF", "CASDAGVFQPQHF",
  "CASSLTSGYNEQFF", "CASSETGYNEQFF", "CASSLTGGNEQFF", "CASSYLTGYNEQFF",
  "CASSLTGNEQFF", "CASSLNGYNEQFF", "CASSFPWDGYGYTF", "CASTLARQGGELFF",
  "CASTLSRQGGELFF", "CSVELLPTGPLETSYNEQFF", "CSVELLPTGPSETSYNEQFF",
  "CVELLPTGPSETSYNEQFF", "CASLAGGRTQETQYF", "CASRLAGGRTQETQYF",
  "CASSLAGGRTETQYF", "CASSLAGGRTQETQYF", "CASSRLAGGRTQETQYF",
  "CASQYGGGNQPQHF", "CASSLGGGNQPQHF", "CASSNGGGNQPQHF", "CASSYGGGGNQPQHF",
  "CASSYGGGQPQHF", "CASSYKGGNQPQHF", "CASSYTGGGNQPQHF",
  "CAWSSQETQYF", "CASSSPETQYF", "CASSGAYEQYF", "CSVDLGKGNNEQFF"
)
# Relative generation probabilities
pgen <- cbind(
  stats::toeplitz(0.6^(0:(sample_size - 1))),
  matrix(1, nrow = samples, ncol = length(base_seqs) - samples)
)
simulateToyData(
  samples = samples,
  sample_size = sample_size,
  prefix_length = 1,
  prefix_chars = c("", ""),
  prefix_probs = cbind(rep(1, samples), rep(0, samples)),
  affixes = base_seqs,
  affix_probs = pgen,
  num_edits = 0,
  output_dir = tempdir(),
  no_return = TRUE
)
#> [1] TRUE
sample_files <-
  file.path(tempdir(),
            paste0("Sample", 1:samples, ".rds")
  )
findPublicClusters(
  file_list = sample_files,
  input_type = "rds",
  seq_col = "CloneSeq",
  count_col = "CloneCount",
  min_seq_length = NULL,
  drop_matches = NULL,
  top_n_clusters = 3,
  min_node_count = 5,
  min_clone_count = 15000,
  output_dir = tempdir()
)
# \dontshow{
# Clean up temporary files
file.remove(
file.path(tempdir(),
c(paste0("Sample", 1:samples, ".rds"))
)
)
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
unlink(
file.path(tempdir(), c("node_meta_data", "cluster_meta_data")),
recursive = TRUE
)
# }