Bounded Computation of Hamming Distance
hamDistBounded.Rd
Computes the Hamming distance between two strings subject to a specified upper bound.
Details
For two character strings of equal length, the Hamming distance measures the total number of character differences between characters in corresponding positions. That is, for each position in one string, the character in that position is checked to see whether it differs from the character in the same position of the other string.
For two character strings of different lengths, the Hamming distance is not defined.
However, hamDistBounded()
will accommodate strings of different
lengths, doing so in a conservative fashion that seeks to yield a meaningful result
for the purpose of checking whether two strings are sufficiently similar. If the two
strings differ in length, placeholder characters are appended to the shorter string
until its length matches that of the longer string. Each appended placeholder
character is treated as different from the character in the corresponding position
of the longer string. This is effectively the same as truncating the end of the
longer string and adding the number of deleted characters to the Hamming distance
between the shorter string and the truncated longer string (which is what is
actually done in practice, as the computation is faster).
The above method used by hamDistBounded()
to accommodate unequal string
lengths results in distance values whose meaning may be questionable, depending
on context, when the two strings have different lengths. The decision to append
placeholder characters to the end of the shorter string (as opposed to prepending
them to the beginning) is ad hoc and somewhat arbitrary. In effect, it allows two
strings of different lengths to be considered sufficiently similar if the content
of the shorter string sufficiently matches the beginning content of the longer
string and the difference in string length is not too great.
For comparing sequences of different lengths, the Levenshtein distance (see
levDistBounded()
)
is more appropriate and meaningful than using
hamDistBounded()
, but comes at the cost of greater computational burden.
Computation is aborted early if the Hamming distance is determined to exceed the specified upper bound. This functionality is designed for cases when distinguishing between values above the upper bound is not meaningful, taking advantage of this fact to reduce the computational burden.
Value
An integer. If the Hamming distance exceeds the specified upper bound k
,
then a value of -1
is returned. Otherwise, returns the Hamming distance
between a
and b
.
Note
The computed value may be invalid when the length of either string is close to
or greater than the value of INT_MAX
in the compiler that was used at
build time (typically 2147483647).
References
Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825
Author
Brian Neal (Brian.Neal@ucsf.edu)
Examples
# using an upper bound of 3
# (trivial since strings have length 3)
hamDistBounded("foo", "foo", 3)
#> [1] 0
hamDistBounded("foo", "fee", 3)
#> [1] 2
hamDistBounded("foo", "fie", 3)
#> [1] 2
hamDistBounded("foo", "foe", 3)
#> [1] 1
hamDistBounded("foo", "fum", 3)
#> [1] 2
hamDistBounded("foo", "bar", 3)
#> [1] 3
# using an upper bound of 1
# (most distances exceed the upper bound)
hamDistBounded("foo", "fee", 1)
#> [1] -1
hamDistBounded("foo", "fie", 1)
#> [1] -1
hamDistBounded("foo", "foe", 1)
#> [1] 1
hamDistBounded("foo", "fum", 1)
#> [1] -1
hamDistBounded("foo", "bar", 1)
#> [1] -1
# comparing strings of nonmatching length
hamDistBounded("foo", "fubar", 10)
#> [1] 4
hamDistBounded("foo", "foobar", 10)
#> [1] 3
hamDistBounded("foo", "barfoo", 10)
#> [1] 6