Compute Cluster-Level Network Properties

Given a list of network objects returned by buildRepSeqNetwork() or generateNetworkObjects(), computes cluster-level network properties, performing clustering first if needed. The list of network objects is returned with the cluster properties added as a data frame.

Usage

addClusterStats(
  net,
  cluster_id_name = "cluster_id",
  seq_col = NULL,
  count_col = NULL,
  degree_col = "degree",
  cluster_fun = "fast_greedy",
  overwrite = FALSE,
  verbose = FALSE,
  ...
)

Arguments

net: A list of network objects conforming to the output of buildRepSeqNetwork() or generateNetworkObjects(). See details.
cluster_id_name: A character string specifying the name of the cluster membership variable in net$node_data that identifies the cluster to which each node belongs. If the variable does not exist, it will be added by calling addClusterMembership(). If the variable does exist, its values will be used unless overwrite = TRUE, in which case its values will be overwritten and the new values used.
seq_col: Specifies the column(s) of net$node_data containing the receptor sequences upon whose similarity the network is based. Accepts a character or numeric vector of length 1 or 2, containing either column names or column indices. If provided, related cluster-level properties will be computed. The default NULL will use the value contained in net$details$seq_col if it exists and is valid.
count_col: Specifies the column of net$node_data containing a measure of abundance (such as clone count or UMI count). Accepts a character string containing the column name or a numeric scalar containing the column index. If provided, related cluster-level properties will be computed.
degree_col: Specifies the column of net$node_data containing the network degree of each node. Accepts a character string containing the column name. If the column does not exist, it will be added.
cluster_fun: A character string specifying the clustering algorithm to use when adding or overwriting the cluster membership variable in net$node_data specified by cluster_id_name. Passed to addClusterMembership().
overwrite: Logical. If TRUE and net already contains an element named cluster_data, it will be overwritten. Similarly, if overwrite = TRUE and net$node_data contains a variable whose name matches the value of cluster_id_name, then its values will be overwritten with new cluster membership values (obtained using addClusterMembership() with the specified value of cluster_fun), and cluster properties will be computed based on the new values.
verbose: Logical. If TRUE, generates messages about the tasks performed and their progress, as well as relevant properties of intermediate outputs. Messages are sent to stderr().
...: Named optional arguments to the function specified by cluster_fun.
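
As a rough illustration of the cluster-level properties that seq_col and count_col enable, the sketch below computes a few of them by hand in base R. The data frame, its column names and its values are invented for illustration, and breaking ties with which.max() (the first maximum wins) is an assumption rather than a statement about how addClusterStats() behaves.

```r
# Toy node-level data (hypothetical values, not produced by NAIR)
node_data <- data.frame(
  CloneSeq   = c("AAA", "AAT", "GGG", "GGC", "GGT"),
  CloneCount = c(10, 5, 7, 20, 1),
  cluster_id = c(1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)

# One row per cluster, built from the nodes belonging to that cluster
count_stats <- do.call(rbind, lapply(
  split(node_data, node_data$cluster_id),
  function(d) {
    data.frame(
      cluster_id      = d$cluster_id[1],
      node_count      = nrow(d),
      mean_seq_length = mean(nchar(d$CloneSeq)),         # needs seq_col
      agg_count       = sum(d$CloneCount),               # needs count_col
      max_count       = max(d$CloneCount),               # needs count_col
      seq_w_max_count = d$CloneSeq[which.max(d$CloneCount)],  # needs both
      stringsAsFactors = FALSE
    )
  }
))

print(count_stats)
```

When either column is not supplied, the properties that depend on it have nothing to be computed from, which is why they are described as optional.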

Details

The list net must contain the named elements igraph (of class igraph), adjacency_matrix (a matrix or dgCMatrix encoding edge connections), and node_data (a data.frame containing node metadata), all corresponding to the same network. The lists returned by buildRepSeqNetwork() and generateNetworkObjects() are examples of valid inputs for the net argument.
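
To make the required structure concrete, here is a minimal sketch that assembles such a list by hand with igraph and checks it. The helper has_required_elements() is hypothetical and is not part of NAIR; the package's own input validation may be stricter or differ in detail.

```r
library(igraph)

# A small ring graph standing in for a sequence-similarity network
g <- make_ring(4)

net <- list(
  igraph = g,
  adjacency_matrix = as_adjacency_matrix(g, sparse = FALSE),
  node_data = data.frame(
    CloneSeq = c("AAAA", "AAAT", "AATT", "ATTT"),  # invented sequences
    stringsAsFactors = FALSE
  )
)

# Hypothetical check: the three elements exist, have the expected classes,
# and agree on the number of nodes
has_required_elements <- function(net) {
  is.list(net) &&
    inherits(net$igraph, "igraph") &&
    (is.matrix(net$adjacency_matrix) ||
       inherits(net$adjacency_matrix, "dgCMatrix")) &&
    is.data.frame(net$node_data) &&
    nrow(net$node_data) == vcount(net$igraph) &&
    nrow(net$adjacency_matrix) == vcount(net$igraph)
}

has_required_elements(net)
```

In practice the list would come from buildRepSeqNetwork() or generateNetworkObjects() rather than being built manually.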

If the network graph has previously been partitioned into clusters using addClusterMembership() and the user wishes to compute network properties for these clusters, the name of the cluster membership variable in net$node_data should be provided to the cluster_id_name argument.

If the value of cluster_id_name is not the name of a variable in net$node_data, then clustering is performed using addClusterMembership() with the specified value of cluster_fun, and the cluster membership values are written to net$node_data using the value of cluster_id_name as the variable name. If overwrite = TRUE, this is done even if this variable already exists.
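
The rule described above can be summarized in a few lines of base R. The helper needs_clustering() is hypothetical and only mirrors the documented behavior; it is not NAIR's internal code.

```r
# Hypothetical helper (not part of NAIR): decides whether cluster membership
# must be (re)computed before cluster properties are derived
needs_clustering <- function(node_data, cluster_id_name, overwrite = FALSE) {
  variable_exists <- cluster_id_name %in% names(node_data)
  !variable_exists || isTRUE(overwrite)
}

# Invented node data that already holds a membership variable
node_data <- data.frame(CloneSeq = c("AAA", "AAT"), cluster_id = c(1, 1))

needs_clustering(node_data, "cluster_id")                    # existing values are used
needs_clustering(node_data, "cluster_id", overwrite = TRUE)  # values are recomputed
needs_clustering(node_data, "cluster_id_louvain")            # variable is missing
```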

Value

A modified copy of net, with cluster properties contained in the element cluster_data. This is a data.frame containing one row for each cluster in the network and the following variables:

cluster_id: The cluster ID number.
node_count: The number of nodes in the cluster.
mean_seq_length: The mean sequence length in the cluster. Only present when length(seq_col) == 1.
A_mean_seq_length: The mean first sequence length in the cluster. Only present when length(seq_col) == 2.
B_mean_seq_length: The mean second sequence length in the cluster. Only present when length(seq_col) == 2.
mean_degree: The mean network degree in the cluster.
max_degree: The maximum network degree in the cluster.
seq_w_max_degree: The receptor sequence possessing the maximum degree within the cluster. Only present when length(seq_col) == 1.
A_seq_w_max_degree: The first sequence of the node possessing the maximum degree within the cluster. Only present when length(seq_col) == 2.
B_seq_w_max_degree: The second sequence of the node possessing the maximum degree within the cluster. Only present when length(seq_col) == 2.
agg_count: The aggregate count among all nodes in the cluster (based on the counts in count_col).
max_count: The maximum count among all nodes in the cluster (based on the counts in count_col).
seq_w_max_count: The receptor sequence possessing the maximum count within the cluster. Only present when length(seq_col) == 1.
A_seq_w_max_count: The first sequence of the node possessing the maximum count within the cluster. Only present when length(seq_col) == 2.
B_seq_w_max_count: The second sequence of the node possessing the maximum count within the cluster. Only present when length(seq_col) == 2.
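diameter_length: The length of the longest geodesic (the longest of the shortest paths between pairs of nodes) in the cluster, computed as the length of the vector returned by get_diameter(), i.e., the number of nodes along that path.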
assortativity: The assortativity coefficient of the cluster's graph, based on the degree (minus one) of each node in the cluster (with the degree computed based only upon the nodes within the cluster). Computed using assortativity_degree().
global_transitivity: The transitivity (i.e., clustering coefficient) for the cluster's graph, which estimates the probability that two neighbors of a vertex are themselves connected. Computed using transitivity() with type = "global".
edge_density: The number of edges in the cluster as a fraction of the maximum possible number of edges. Computed using edge_density().
degree_centrality_index: The centrality index of the cluster's graph based on within-cluster network degree. Computed as the centralization element of the output from centr_degree().
closeness_centrality_index: The centrality index of the cluster's graph based on closeness, i.e., distance to other nodes in the cluster. Computed using centralization().
eigen_centrality_index: The centrality index of the cluster's graph based on the eigenvector centrality scores, i.e., values of the first eigenvector of the adjacency matrix for the cluster. Computed as the centralization element of the output from centr_eigen().
eigen_centrality_eigenvalue: The eigenvalue corresponding to the first eigenvector of the adjacency matrix for the cluster. Computed as the value element of the output from eigen_centrality().
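
As a sketch of how several of the graph-based properties above can be obtained, the code below applies the named igraph functions to the induced subgraph of each cluster. The toy graph is invented, connected components are used as a stand-in for the membership that addClusterMembership() would produce, and only a subset of the properties is shown; NAIR's internal implementation may differ.

```r
library(igraph)

# Toy network: a 5-node star and a 4-node complete graph, not connected
g <- disjoint_union(make_star(5, mode = "undirected"), make_full_graph(4))

# Stand-in for cluster membership (assumption: one cluster per component)
membership <- components(g)$membership

cluster_stats <- do.call(rbind, lapply(
  sort(unique(membership)),
  function(id) {
    # Restrict the graph to the nodes of this cluster
    sub <- induced_subgraph(g, which(membership == id))
    data.frame(
      cluster_id = id,
      node_count = vcount(sub),
      diameter_length = length(get_diameter(sub)),
      global_transitivity = transitivity(sub, type = "global"),
      edge_density = edge_density(sub),
      degree_centrality_index = centr_degree(sub)$centralization,
      eigen_centrality_eigenvalue = eigen_centrality(sub)$value
    )
  }
))

print(cluster_stats)
```

Because each property is computed on the cluster's own subgraph, edges leading to nodes outside the cluster do not contribute to these values.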

If net$node_data did not previously contain a variable whose name matches the value of cluster_id_name, then this variable will be present and will contain values for cluster membership, obtained through a call to addClusterMembership() using the clustering algorithm specified by cluster_fun.

If net$node_data did previously contain a variable whose name matches the value of cluster_id_name and overwrite = TRUE, then the values of this variable will be overwritten with new values for cluster membership, obtained as above based on cluster_fun.

If net$node_data did not previously contain a variable whose name matches the value of degree_col, then this variable will be present and will contain values for network degree.

Additionally, if net contains a list named details, then the following elements will be added to net$details, or overwritten if they already exist:

cluster_data_goes_with: A character string containing the value of cluster_id_name. When net$node_data contains multiple cluster membership variables (e.g., from applying different clustering methods), cluster_data_goes_with allows the user to distinguish which of these variables corresponds to net$cluster_data.
count_col_for_cluster_data: A character string containing the value of count_col. If net$node_data contains multiple count variables, this allows the user to distinguish which of these variables corresponds to the count-related properties in net$cluster_data, such as max_count. If count_col = NULL, then the value will be NA.
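
The following sketch shows one way these two elements might be used after the fact. The list net is a hand-built mock with invented values, not real output of addClusterStats().

```r
# Mock object with two membership variables (hypothetical values)
net <- list(
  node_data = data.frame(
    CloneSeq = c("AAA", "AAT", "GGG", "GGC"),
    cluster_id = c(1, 2, 3, 3),
    cluster_id_louvain = c(1, 1, 2, 2),
    stringsAsFactors = FALSE
  ),
  cluster_data = data.frame(cluster_id = c(1, 2), node_count = c(2, 2)),
  details = list(
    cluster_data_goes_with = "cluster_id_louvain",
    count_col_for_cluster_data = NA
  )
)

# Which membership variable do the rows of net$cluster_data describe?
membership_var <- net$details$cluster_data_goes_with

# Cross-check: node counts per cluster should agree with cluster_data
node_counts <- table(net$node_data[[membership_var]])
counts_match <- all(
  as.vector(node_counts[as.character(net$cluster_data$cluster_id)]) ==
    net$cluster_data$node_count
)

# NA here means no count column was used, so count-based properties
# such as max_count are not available in cluster_data
has_count_stats <- !is.na(net$details$count_col_for_cluster_data)
```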

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Author

Brian Neal (Brian.Neal@ucsf.edu)

Examples

set.seed(42)
toy_data <- simulateToyData()

net <- generateNetworkObjects(
  toy_data, "CloneSeq"
)

net <- addClusterStats(
  net,
  count_col = "CloneCount"
)

head(net$cluster_data)
#>   cluster_id node_count eigen_centrality_eigenvalue eigen_centrality_index
#> 1          1         14                    3.627940              0.6572455
#> 2          2         28                   11.831606              0.5524239
#> 3          3          9                    2.238772              0.6748055
#> 4          4          6                    2.278414              0.5237142
#> 5          5          6                    2.228328              0.5707806
#> 6          6         25                    5.885769              0.6291788
#>   closeness_centrality_index degree_centrality_index edge_density
#> 1                  0.5584465               0.3076923    0.2307692
#> 2                  0.4703335               0.3333333    0.2962963
#> 3                  0.2311674               0.1250000    0.2500000
#> 4                  0.4266234               0.2000000    0.4000000
#> 5                  0.3301948               0.2000000    0.4000000
#> 6                  0.4012791               0.1983333    0.1766667
#>   global_transitivity assortativity diameter_length max_degree mean_degree
#> 1           0.5454545   -0.13886606               5          9        3.36
#> 2           0.6084437   -0.05857037               6         18        8.43
#> 3           0.2727273   -0.68750000               7          4        2.22
#> 4           0.3750000   -0.50000000               4          9        3.33
#> 5           0.4285714   -0.09090909               5          3        2.17
#> 6           0.3435115   -0.14219251               6         10        4.60
#>   mean_seq_length seq_w_max_degree max_count agg_count seq_w_max_count
#> 1           13.00    AAAAAAAAATTGC      4618     52760   AGAAGAAAATTGC
#> 2           12.96    GGGGGGGAATTGG      6526    115851   GGGGGGGAATTGG
#> 3           12.67     AGAAGAAAATTC      4422     28477   GAAATAGAATTCG
#> 4           13.00    GGGGGGAAATTGG      5873     23120   AGGGGGAAATTGG
#> 5           12.00     AGGGAGGAATTC      5728     24291    AGGGGGGAATTC
#> 6           12.00     AAAAAAAAATTG      5393     89616    GAAAAAAAATTC
net$details
#> $seq_col
#> [1] "CloneSeq"
#> 
#> $dist_type
#> [1] "hamming"
#> 
#> $dist_cutoff
#> [1] 1
#> 
#> $drop_isolated_nodes
#> [1] TRUE
#> 
#> $nodes_in_network
#> [1] 122
#> 
#> $clusters_in_network
#> fast_greedy 
#>          20 
#> 
#> $cluster_id_variable
#>  fast_greedy 
#> "cluster_id" 
#> 
#> $cluster_data_goes_with
#> [1] "cluster_id"
#> 
#> $count_col_for_cluster_data
#> [1] "CloneCount"
#> 

# won't change net since net$cluster_data exists
net <- addClusterStats(
  net,
  count_col = "CloneCount",
  cluster_fun = "leiden",
  verbose = TRUE
)
#> Obtaining cluster properties...
#> ‘net$cluster_data’ already exists.
#> To overwrite, call ‘addClusterStats()’ with ‘overwrite = TRUE’

# overwrites values in net$cluster_data
# and cluster membership values in net$node_data$cluster_id
# with values obtained using "cluster_leiden" algorithm
net <- addClusterStats(
  net,
  count_col = "CloneCount",
  cluster_fun = "leiden",
  overwrite = TRUE
)

net$details
#> $seq_col
#> [1] "CloneSeq"
#> 
#> $dist_type
#> [1] "hamming"
#> 
#> $dist_cutoff
#> [1] 1
#> 
#> $drop_isolated_nodes
#> [1] TRUE
#> 
#> $nodes_in_network
#> [1] 122
#> 
#> $clusters_in_network
#> leiden 
#>     57 
#> 
#> $cluster_id_variable
#>       leiden 
#> "cluster_id" 
#> 
#> $cluster_data_goes_with
#> [1] "cluster_id"
#> 
#> $count_col_for_cluster_data
#> [1] "CloneCount"
#> 

# overwrites existing values in net$cluster_data
# with values obtained using "cluster_louvain" algorithm
# saves cluster membership values to net$node_data$cluster_id_louvain
# (net$node_data$cluster_id retains membership values from "cluster_leiden")
net <- addClusterStats(
  net,
  count_col = "CloneCount",
  cluster_fun = "louvain",
  cluster_id_name = "cluster_id_louvain",
  overwrite = TRUE
)

net$details
#> $seq_col
#> [1] "CloneSeq"
#> 
#> $dist_type
#> [1] "hamming"
#> 
#> $dist_cutoff
#> [1] 1
#> 
#> $drop_isolated_nodes
#> [1] TRUE
#> 
#> $nodes_in_network
#> [1] 122
#> 
#> $clusters_in_network
#>  leiden louvain 
#>      57      19 
#> 
#> $cluster_id_variable
#>               leiden              louvain 
#>         "cluster_id" "cluster_id_louvain" 
#> 
#> $cluster_data_goes_with
#> [1] "cluster_id_louvain"
#> 
#> $count_col_for_cluster_data
#> [1] "CloneCount"
#> 

# perform clustering using "cluster_fast_greedy" algorithm,
# save cluster membership values to net$node_data$cluster_id_greedy
net <- addClusterMembership(
  net,
  cluster_fun = "fast_greedy",
  cluster_id_name = "cluster_id_greedy"
)

# compute cluster properties for the clusters from previous step
# overwrites values in net$cluster_data
net <- addClusterStats(
  net,
  cluster_id_name = "cluster_id_greedy",
  overwrite = TRUE
)

net$details
#> $seq_col
#> [1] "CloneSeq"
#> 
#> $dist_type
#> [1] "hamming"
#> 
#> $dist_cutoff
#> [1] 1
#> 
#> $drop_isolated_nodes
#> [1] TRUE
#> 
#> $nodes_in_network
#> [1] 122
#> 
#> $clusters_in_network
#>      leiden     louvain fast_greedy 
#>          57          19          20 
#> 
#> $cluster_id_variable
#>               leiden              louvain          fast_greedy 
#>         "cluster_id" "cluster_id_louvain"  "cluster_id_greedy" 
#> 
#> $cluster_data_goes_with
#> [1] "cluster_id_greedy"
#> 
#> $count_col_for_cluster_data
#> [1] NA
#>