Compute Cluster-Level Network Properties
addClusterStats.Rd
Given a list of network objects returned by buildRepSeqNetwork() or generateNetworkObjects(), computes cluster-level network properties, performing clustering first if needed. The list of network objects is returned with the cluster properties added as a data frame.
Usage
addClusterStats(
  net,
  cluster_id_name = "cluster_id",
  seq_col = NULL,
  count_col = NULL,
  degree_col = "degree",
  cluster_fun = "fast_greedy",
  overwrite = FALSE,
  verbose = FALSE,
  ...
)
Arguments
- net
A list of network objects conforming to the output of buildRepSeqNetwork() or generateNetworkObjects(). See details.
- cluster_id_name
A character string specifying the name of the cluster membership variable in net$node_data that identifies the cluster to which each node belongs. If the variable does not exist, it will be added by calling addClusterMembership(). If the variable does exist, its values will be used unless overwrite = TRUE, in which case its values will be overwritten and the new values used.
- seq_col
Specifies the column(s) of net$node_data containing the receptor sequences upon whose similarity the network is based. Accepts a character or numeric vector of length 1 or 2, containing either column names or column indices. If provided, related cluster-level properties will be computed. The default NULL will use the value contained in net$details$seq_col if it exists and is valid.
- count_col
Specifies the column of net$node_data containing a measure of abundance (such as clone count or UMI count). Accepts a character string containing the column name or a numeric scalar containing the column index. If provided, related cluster-level properties will be computed.
- degree_col
Specifies the column of net$node_data containing the network degree of each node. Accepts a character string containing the column name. If the column does not exist, it will be added.
- cluster_fun
A character string specifying the clustering algorithm to use when adding or overwriting the cluster membership variable in net$node_data specified by cluster_id_name. Passed to addClusterMembership().
- overwrite
Logical. If TRUE and net already contains an element named cluster_data, it will be overwritten. Similarly, if overwrite = TRUE and net$node_data contains a variable whose name matches the value of cluster_id_name, then its values will be overwritten with new cluster membership values (obtained using addClusterMembership() with the specified value of cluster_fun), and cluster properties will be computed based on the new values.
- verbose
Logical. If TRUE, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to stderr().
- ...
Named optional arguments to the function specified by cluster_fun.
Details
The list net must contain the named elements igraph (of class igraph), adjacency_matrix (a matrix or dgCMatrix encoding edge connections), and node_data (a data.frame containing node metadata), all corresponding to the same network. The lists returned by buildRepSeqNetwork() and generateNetworkObjects() are examples of valid inputs for the net argument.
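The structural requirements on net can be checked before calling the function. The following is a minimal base-R sketch; is_valid_net() is a hypothetical helper introduced here for illustration and is not part of the package, and the mock list uses a stand-in object in place of a real graph.

```r
# Hypothetical helper (not part of the package): checks that a list has the
# three named elements described above, with the expected classes.
is_valid_net <- function(net) {
  is.list(net) &&
    all(c("igraph", "adjacency_matrix", "node_data") %in% names(net)) &&
    inherits(net$igraph, "igraph") &&
    (is.matrix(net$adjacency_matrix) ||
       inherits(net$adjacency_matrix, "dgCMatrix")) &&
    is.data.frame(net$node_data)
}

# Mock input for illustration only; the igraph element is an empty stand-in
# carrying the right class attribute, not a usable graph.
mock_net <- list(
  igraph = structure(list(), class = "igraph"),
  adjacency_matrix = matrix(c(0, 1, 1, 0), nrow = 2),
  node_data = data.frame(CloneSeq = c("AAAT", "AAAG"))
)

ok <- is_valid_net(mock_net)
missing_matrix <- is_valid_net(mock_net[c("igraph", "node_data")])
```

This check covers only names and classes; it does not verify that the three elements describe the same network.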
If the network graph has previously been partitioned into clusters using addClusterMembership() and the user wishes to compute network properties for these clusters, the name of the cluster membership variable in net$node_data should be provided to the cluster_id_name argument.
If the value of cluster_id_name is not the name of a variable in net$node_data, then clustering is performed using addClusterMembership() with the specified value of cluster_fun, and the cluster membership values are written to net$node_data using the value of cluster_id_name as the variable name. If overwrite = TRUE, this is done even if this variable already exists.
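The rule described above for when clustering is performed can be sketched in base R. needs_clustering() is a hypothetical helper written only to illustrate the rule; it is not the package's implementation, and it reflects only the condition stated in this section.

```r
# Hypothetical helper: TRUE when membership values would be (re)computed,
# i.e. the variable is absent from node_data or overwrite is TRUE.
needs_clustering <- function(node_data, cluster_id_name, overwrite = FALSE) {
  !(cluster_id_name %in% names(node_data)) || isTRUE(overwrite)
}

# Made-up node metadata with an existing membership variable.
node_data <- data.frame(
  CloneSeq = c("AAAT", "AAAG"),
  cluster_id = c(1, 1)
)

existing_kept <- needs_clustering(node_data, "cluster_id")
existing_overwritten <- needs_clustering(node_data, "cluster_id", overwrite = TRUE)
new_variable <- needs_clustering(node_data, "cluster_id_louvain")
```

Here existing_kept is FALSE, while the other two results are TRUE.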
Value
A modified copy of net, with cluster properties contained in the element cluster_data. This is a data.frame containing one row for each cluster in the network and the following variables:
- cluster_id
The cluster ID number.
- node_count
The number of nodes in the cluster.
- mean_seq_length
The mean sequence length in the cluster. Only present when length(seq_col) == 1.
- A_mean_seq_length
The mean first sequence length in the cluster. Only present when length(seq_col) == 2.
- B_mean_seq_length
The mean second sequence length in the cluster. Only present when length(seq_col) == 2.
- mean_degree
The mean network degree in the cluster.
- max_degree
The maximum network degree in the cluster.
- seq_w_max_degree
The receptor sequence possessing the maximum degree within the cluster. Only present when length(seq_col) == 1.
- A_seq_w_max_degree
The first sequence of the node possessing the maximum degree within the cluster. Only present when length(seq_col) == 2.
- B_seq_w_max_degree
The second sequence of the node possessing the maximum degree within the cluster. Only present when length(seq_col) == 2.
- agg_count
The aggregate count among all nodes in the cluster (based on the counts in count_col).
- max_count
The maximum count among all nodes in the cluster (based on the counts in count_col).
- seq_w_max_count
The receptor sequence possessing the maximum count within the cluster. Only present when length(seq_col) == 1.
- A_seq_w_max_count
The first sequence of the node possessing the maximum count within the cluster. Only present when length(seq_col) == 2.
- B_seq_w_max_count
The second sequence of the node possessing the maximum count within the cluster. Only present when length(seq_col) == 2.
- diameter_length
The longest geodesic distance in the cluster, computed as the length of the vector returned by get_diameter().
- assortativity
The assortativity coefficient of the cluster's graph, based on the degree (minus one) of each node in the cluster (with the degree computed based only upon the nodes within the cluster). Computed using assortativity_degree().
- global_transitivity
The transitivity (i.e., clustering coefficient) for the cluster's graph, which estimates the probability that adjacent vertices are connected. Computed using transitivity() with type = "global".
- edge_density
The number of edges in the cluster as a fraction of the maximum possible number of edges. Computed using edge_density().
- degree_centrality_index
The centrality index of the cluster's graph based on within-cluster network degree. Computed as the centralization element of the output from centr_degree().
- closeness_centrality_index
The centrality index of the cluster's graph based on closeness, i.e., distance to other nodes in the cluster. Computed using centralization().
- eigen_centrality_index
The centrality index of the cluster's graph based on the eigenvector centrality scores, i.e., values of the first eigenvector of the adjacency matrix for the cluster. Computed as the centralization element of the output from centr_eigen().
- eigen_centrality_eigenvalue
The eigenvalue corresponding to the first eigenvector of the adjacency matrix for the cluster. Computed as the value element of the output from eigen_centrality().
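Several of the properties listed above can be reproduced by hand with igraph on the subgraph induced by one cluster. The sketch below assumes the igraph package is installed; the toy graph, the membership vector, and the helper cluster_props() are made up for illustration and are not the package's implementation. Only a subset of the listed variables is computed.

```r
library(igraph)

# Toy graph (an assumption for illustration): a 5-node star and a separate
# triangle, so the graph has two connected components.
g <- disjoint_union(make_star(5, mode = "undirected"), make_full_graph(3))

# Hypothetical membership vector: nodes 1-5 are cluster 1, nodes 6-8 cluster 2.
membership <- c(rep(1, 5), rep(2, 3))

# Hypothetical helper: computes a few of the documented properties for the
# cluster with the given id, using only the nodes within that cluster.
cluster_props <- function(graph, membership, id) {
  sub <- induced_subgraph(graph, which(membership == id))
  deg <- degree(sub)  # within-cluster degree
  data.frame(
    cluster_id = id,
    node_count = vcount(sub),
    mean_degree = mean(deg),
    max_degree = max(deg),
    diameter_length = length(get_diameter(sub)),
    global_transitivity = transitivity(sub, type = "global"),
    edge_density = edge_density(sub),
    degree_centrality_index = centr_degree(sub)$centralization
  )
}

# One row per cluster, mirroring the layout of cluster_data.
props <- do.call(
  rbind,
  lapply(1:2, cluster_props, graph = g, membership = membership)
)
```

Note that diameter_length counts the nodes on the longest shortest path, so the star (leaf, center, leaf) yields 3 and the triangle yields 2.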
If net$node_data did not previously contain a variable whose name matches the value of cluster_id_name, then this variable will be present and will contain values for cluster membership, obtained through a call to addClusterMembership() using the clustering algorithm specified by cluster_fun.
If net$node_data did previously contain a variable whose name matches the value of cluster_id_name and overwrite = TRUE, then the values of this variable will be overwritten with new values for cluster membership, obtained as above based on cluster_fun.
If net$node_data did not previously contain a variable whose name matches the value of degree_col, then this variable will be present and will contain values for network degree.
Additionally, if net contains a list named details, then the following elements will be added to net$details, or overwritten if they already exist:
- cluster_data_goes_with
A character string containing the value of cluster_id_name. When net$node_data contains multiple cluster membership variables (e.g., from applying different clustering methods), cluster_data_goes_with allows the user to distinguish which of these variables corresponds to net$cluster_data.
- count_col_for_cluster_data
A character string containing the value of count_col. If net$node_data contains multiple count variables, this allows the user to distinguish which of these variables corresponds to the count-related properties in net$cluster_data, such as max_count. If count_col = NULL, then the value will be NA.
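One use of cluster_data_goes_with is to look up which membership variable in net$node_data matches net$cluster_data before combining the two. The base-R sketch below works on a mock list with made-up values; the column name cluster_node_count is hypothetical and chosen only for this illustration.

```r
# Mock of the relevant parts of net (values are made up for illustration).
net <- list(
  node_data = data.frame(
    CloneSeq = c("AAAT", "AAAG", "GGGC"),
    cluster_id = c(1, 1, 2),
    cluster_id_louvain = c(1, 2, 2)
  ),
  cluster_data = data.frame(cluster_id = c(1, 2), node_count = c(1, 2)),
  details = list(cluster_data_goes_with = "cluster_id_louvain")
)

# Name of the membership variable that net$cluster_data was computed from.
id_var <- net$details$cluster_data_goes_with

# For each node, find the row of cluster_data describing its cluster,
# then copy a cluster-level property onto the node-level data.
rows <- match(net$node_data[[id_var]], net$cluster_data$cluster_id)
net$node_data$cluster_node_count <- net$cluster_data$node_count[rows]
```

Using the stored name rather than a hard-coded "cluster_id" avoids pairing cluster_data with membership values from a different clustering method.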
References
Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825
Author
Brian Neal (Brian.Neal@ucsf.edu)
Examples
set.seed(42)
toy_data <- simulateToyData()
net <- generateNetworkObjects(
  toy_data, "CloneSeq"
)
net <- addClusterStats(
  net,
  count_col = "CloneCount"
)
head(net$cluster_data)
#> cluster_id node_count eigen_centrality_eigenvalue eigen_centrality_index
#> 1 1 14 3.627940 0.6572455
#> 2 2 28 11.831606 0.5524239
#> 3 3 9 2.238772 0.6748055
#> 4 4 6 2.278414 0.5237142
#> 5 5 6 2.228328 0.5707806
#> 6 6 25 5.885769 0.6291788
#> closeness_centrality_index degree_centrality_index edge_density
#> 1 0.5584465 0.3076923 0.2307692
#> 2 0.4703335 0.3333333 0.2962963
#> 3 0.2311674 0.1250000 0.2500000
#> 4 0.4266234 0.2000000 0.4000000
#> 5 0.3301948 0.2000000 0.4000000
#> 6 0.4012791 0.1983333 0.1766667
#> global_transitivity assortativity diameter_length max_degree mean_degree
#> 1 0.5454545 -0.13886606 5 9 3.36
#> 2 0.6084437 -0.05857037 6 18 8.43
#> 3 0.2727273 -0.68750000 7 4 2.22
#> 4 0.3750000 -0.50000000 4 9 3.33
#> 5 0.4285714 -0.09090909 5 3 2.17
#> 6 0.3435115 -0.14219251 6 10 4.60
#> mean_seq_length seq_w_max_degree max_count agg_count seq_w_max_count
#> 1 13.00 AAAAAAAAATTGC 4618 52760 AGAAGAAAATTGC
#> 2 12.96 GGGGGGGAATTGG 6526 115851 GGGGGGGAATTGG
#> 3 12.67 AGAAGAAAATTC 4422 28477 GAAATAGAATTCG
#> 4 13.00 GGGGGGAAATTGG 5873 23120 AGGGGGAAATTGG
#> 5 12.00 AGGGAGGAATTC 5728 24291 AGGGGGGAATTC
#> 6 12.00 AAAAAAAAATTG 5393 89616 GAAAAAAAATTC
net$details
#> $seq_col
#> [1] "CloneSeq"
#>
#> $dist_type
#> [1] "hamming"
#>
#> $dist_cutoff
#> [1] 1
#>
#> $drop_isolated_nodes
#> [1] TRUE
#>
#> $nodes_in_network
#> [1] 122
#>
#> $clusters_in_network
#> fast_greedy
#> 20
#>
#> $cluster_id_variable
#> fast_greedy
#> "cluster_id"
#>
#> $cluster_data_goes_with
#> [1] "cluster_id"
#>
#> $count_col_for_cluster_data
#> [1] "CloneCount"
#>
# won't change net since net$cluster_data exists
net <- addClusterStats(
  net,
  count_col = "CloneCount",
  cluster_fun = "leiden",
  verbose = TRUE
)
#> Obtaining cluster properties...
#> ‘net$cluster_data’ already exists.
#> To overwrite, call ‘addClusterStats()’ with ‘overwrite = TRUE’
# overwrites values in net$cluster_data
# and cluster membership values in net$node_data$cluster_id
# with values obtained using "cluster_leiden" algorithm
net <- addClusterStats(
  net,
  count_col = "CloneCount",
  cluster_fun = "leiden",
  overwrite = TRUE
)
net$details
#> $seq_col
#> [1] "CloneSeq"
#>
#> $dist_type
#> [1] "hamming"
#>
#> $dist_cutoff
#> [1] 1
#>
#> $drop_isolated_nodes
#> [1] TRUE
#>
#> $nodes_in_network
#> [1] 122
#>
#> $clusters_in_network
#> leiden
#> 57
#>
#> $cluster_id_variable
#> leiden
#> "cluster_id"
#>
#> $cluster_data_goes_with
#> [1] "cluster_id"
#>
#> $count_col_for_cluster_data
#> [1] "CloneCount"
#>
# overwrites existing values in net$cluster_data
# with values obtained using "cluster_louvain" algorithm
# saves cluster membership values to net$node_data$cluster_id_louvain
# (net$node_data$cluster_id retains membership values from "cluster_leiden")
net <- addClusterStats(
  net,
  count_col = "CloneCount",
  cluster_fun = "louvain",
  cluster_id_name = "cluster_id_louvain",
  overwrite = TRUE
)
net$details
#> $seq_col
#> [1] "CloneSeq"
#>
#> $dist_type
#> [1] "hamming"
#>
#> $dist_cutoff
#> [1] 1
#>
#> $drop_isolated_nodes
#> [1] TRUE
#>
#> $nodes_in_network
#> [1] 122
#>
#> $clusters_in_network
#> leiden louvain
#> 57 19
#>
#> $cluster_id_variable
#> leiden louvain
#> "cluster_id" "cluster_id_louvain"
#>
#> $cluster_data_goes_with
#> [1] "cluster_id_louvain"
#>
#> $count_col_for_cluster_data
#> [1] "CloneCount"
#>
# perform clustering using "cluster_fast_greedy" algorithm,
# save cluster membership values to net$node_data$cluster_id_greedy
net <- addClusterMembership(
  net,
  cluster_fun = "fast_greedy",
  cluster_id_name = "cluster_id_greedy"
)
# compute cluster properties for the clusters from the previous step;
# overwrites values in net$cluster_data
# (count_col is not supplied, so count-related properties are not computed)
net <- addClusterStats(
  net,
  cluster_id_name = "cluster_id_greedy",
  overwrite = TRUE
)
net$details
#> $seq_col
#> [1] "CloneSeq"
#>
#> $dist_type
#> [1] "hamming"
#>
#> $dist_cutoff
#> [1] 1
#>
#> $drop_isolated_nodes
#> [1] TRUE
#>
#> $nodes_in_network
#> [1] 122
#>
#> $clusters_in_network
#> leiden louvain fast_greedy
#> 57 19 20
#>
#> $cluster_id_variable
#> leiden louvain fast_greedy
#> "cluster_id" "cluster_id_louvain" "cluster_id_greedy"
#>
#> $cluster_data_goes_with
#> [1] "cluster_id_greedy"
#>
#> $count_col_for_cluster_data
#> [1] NA
#>