Bounded Computation of Hamming Distance

Computes the Hamming distance between two strings subject to a specified upper bound.

Usage

hamDistBounded(a, b, k)

Arguments

a: A character string.
b: A character string to be compared to a.
k: The upper bound on the Hamming distance between a and b.

Details

For two character strings of equal length, the Hamming distance measures the total number of character differences between characters in corresponding positions. That is, for each position in one string, the character in that position is checked to see whether it differs from the character in the same position of the other string.

For two character strings of different lengths, the Hamming distance is not defined. However, hamDistBounded() will accommodate strings of different lengths, doing so in a conservative fashion that seeks to yield a meaningful result for the purpose of checking whether two strings are sufficiently similar. If the two strings differ in length, placeholder characters are appended to the shorter string until its length matches that of the longer string. Each appended placeholder character is treated as different from the character in the corresponding position of the longer string. This is effectively the same as truncating the end of the longer string and adding the number of deleted characters to the Hamming distance between the shorter string and the truncated longer string (which is what is actually done in practice, as the computation is faster).

The above method used by hamDistBounded() to accommodate unequal string lengths results in distance values whose meaning may be questionable, depending on context, when the two strings have different lengths. The decision to append placeholder characters to the end of the shorter string (as opposed to prepending them to the beginning) is ad hoc and somewhat arbitrary. In effect, it allows two strings of different lengths to be considered sufficiently similar if the content of the shorter string sufficiently matches the beginning content of the longer string and the difference in string length is not too great.

For comparing sequences of different lengths, the Levenshtein distance (see levDistBounded()) is more appropriate and meaningful than using hamDistBounded(), but comes at the cost of greater computational burden.

Computation is aborted early if the Hamming distance is determined to exceed the specified upper bound. This functionality is designed for cases when distinguishing between values above the upper bound is not meaningful, taking advantage of this fact to reduce the computational burden.

Value

An integer. If the Hamming distance exceeds the specified upper bound k, then a value of -1 is returned. Otherwise, returns the Hamming distance between a and b.

Note

The computed value may be invalid when the length of either string is close to or greater than the value of INT_MAX in the compiler that was used at build time (typically 2147483647).

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Author

Brian Neal (Brian.Neal@ucsf.edu)

Examples

# using an upper bound of 3
# (trivial since strings have length 3)
hamDistBounded("foo", "foo", 3)
#> [1] 0
hamDistBounded("foo", "fee", 3)
#> [1] 2
hamDistBounded("foo", "fie", 3)
#> [1] 2
hamDistBounded("foo", "foe", 3)
#> [1] 1
hamDistBounded("foo", "fum", 3)
#> [1] 2
hamDistBounded("foo", "bar", 3)
#> [1] 3

# using an upper bound of 1
# (most distances exceed the upper bound)
hamDistBounded("foo", "fee", 1)
#> [1] -1
hamDistBounded("foo", "fie", 1)
#> [1] -1
hamDistBounded("foo", "foe", 1)
#> [1] 1
hamDistBounded("foo", "fum", 1)
#> [1] -1
hamDistBounded("foo", "bar", 1)
#> [1] -1

# comparing strings of nonmatching length
hamDistBounded("foo", "fubar", 10)
#> [1] 4
hamDistBounded("foo", "foobar", 10)
#> [1] 3
hamDistBounded("foo", "barfoo", 10)
#> [1] 6