Bounded Computation of Levenshtein Distance

Computes the Levenshtein distance between two strings subject to a specified upper bound.

Usage

levDistBounded(a, b, k)

Arguments

a: A character string.
b: A character string to be compared to a.
k: An integer specifying the upper bound on the Levenshtein distance between a and b.

Details

The Levenshtein distance (sometimes referred to as edit distance) between two character strings measures the minimum number of single-character edits (insertions, deletions and transformations) needed to transform one string into the other.

Compared to the Hamming distance (see hamDistBounded()), the Levenshtein distance is particularly useful for comparing sequences of different lengths, as it can account for insertions and deletions, whereas the Hamming distance only accounts for single-character transformations. However, the computational burden for the Levenshtein distance can be significantly greater than for the Hamming distance.

Computation is aborted early if the Levenshtein distance is determined to exceed the specified upper bound. This functionality is designed for cases when distinguishing between values above the upper bound is not meaningful, taking advantage of this fact to reduce the computational burden.

Value

An integer. If the Levenshtein distance exceeds the specified upper bound k, then a value of -1 is returned. Otherwise, returns the Levenshtein distance between a and b.

Note

The computed value may be invalid when the length of either string is close to or greater than the value of INT_MAX in the compiler that was used at build time (typically 2147483647).

References

Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825

Webpage for the NAIR package

Author

Brian Neal (Brian.Neal@ucsf.edu)

Examples

# equal string lengths,
# character transmutations only
levDistBounded("foo", "bar", 3)
#> [1] 3
hamDistBounded("foo", "bar", 3) # agrees with Hamming distance
#> [1] 3

# one insertion, one deletion
levDistBounded("1234567", "1.23457", 7)
#> [1] 2
hamDistBounded("1234567", "1.23457", 7) # compare to Hamming distance
#> [1] 5

# same as above, but with a different lower bound
levDistBounded("1234567", "1.23457", 3) # within the bound
#> [1] 2
hamDistBounded("1234567", "1.23457", 3) # exceeds the bound
#> [1] -1

# one deletion (last position)
levDistBounded("1234567890", "123456789", 10)
#> [1] 1
hamDistBounded("1234567890", "123456789", 10)
#> [1] 1

# note the Hamming distance agrees with the Levenshtein distance
# for the above example, since the deletion occurs in the final
# character position. This is due to how hamDistBounded() handles
# strings of different lengths. In the example below, however...

# one deletion (first position)
levDistBounded("1234567890", "234567890", 10)
#> [1] 1
hamDistBounded("1234567890", "234567890", 10) # compare to Hamming distance
#> [1] 10

# one deletion, one transmutation
levDistBounded("foobar", "fubar", 6)
#> [1] 2
hamDistBounded("foobar", "fubar", 6) # compare to Hamming distance
#> [1] 5