Bounded Computation of Levenshtein Distance
levDistBounded.Rd
Computes the Levenshtein distance between two strings subject to a specified upper bound.
Details
The Levenshtein distance (sometimes referred to as edit distance) between two character strings measures the minimum number of single-character edits (insertions, deletions and transformations) needed to transform one string into the other.
Compared to the Hamming distance
(see hamDistBounded()
), the
Levenshtein distance is particularly useful for comparing sequences of different
lengths, as it can account for insertions and deletions, whereas the Hamming
distance only accounts for single-character transformations. However, the
computational burden for the Levenshtein distance can be significantly greater
than for the Hamming distance.
Computation is aborted early if the Levenshtein distance is determined to exceed the specified upper bound. This functionality is designed for cases when distinguishing between values above the upper bound is not meaningful, taking advantage of this fact to reduce the computational burden.
Value
An integer. If the Levenshtein distance exceeds the specified upper bound
k
, then a value of -1
is returned. Otherwise, returns the
Levenshtein distance between a
and b
.
Note
The computed value may be invalid when the length of either string is close to
or greater than the value of INT_MAX
in the compiler that was used at
build time (typically 2147483647).
References
Hai Yang, Jason Cham, Brian Neal, Zenghua Fan, Tao He and Li Zhang. (2023). NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology, vol. 14. doi: 10.3389/fimmu.2023.1181825
Author
Brian Neal (Brian.Neal@ucsf.edu)
Examples
# equal string lengths,
# character transmutations only
levDistBounded("foo", "bar", 3)
#> [1] 3
hamDistBounded("foo", "bar", 3) # agrees with Hamming distance
#> [1] 3
# one insertion, one deletion
levDistBounded("1234567", "1.23457", 7)
#> [1] 2
hamDistBounded("1234567", "1.23457", 7) # compare to Hamming distance
#> [1] 5
# same as above, but with a different lower bound
levDistBounded("1234567", "1.23457", 3) # within the bound
#> [1] 2
hamDistBounded("1234567", "1.23457", 3) # exceeds the bound
#> [1] -1
# one deletion (last position)
levDistBounded("1234567890", "123456789", 10)
#> [1] 1
hamDistBounded("1234567890", "123456789", 10)
#> [1] 1
# note the Hamming distance agrees with the Levenshtein distance
# for the above example, since the deletion occurs in the final
# character position. This is due to how hamDistBounded() handles
# strings of different lengths. In the example below, however...
# one deletion (first position)
levDistBounded("1234567890", "234567890", 10)
#> [1] 1
hamDistBounded("1234567890", "234567890", 10) # compare to Hamming distance
#> [1] 10
# one deletion, one transmutation
levDistBounded("foobar", "fubar", 6)
#> [1] 2
hamDistBounded("foobar", "fubar", 6) # compare to Hamming distance
#> [1] 5