KMP
KMP is similar to BM.
It uses the concepts of bad character and good suffix.
It pays more attention on pattern. It studies pattern to decide how to slide when mismatch is met.
KMP pre-process the pattern. It stocks the result of the "study" in an array, called PTM (Partial Match Table).
Algo
When we meet bad character, we take all the pattern before bad character out as arr.
We need to find the longest match between arr's prefix and suffix, which is the "study" metioned previously.
For exemple,
main: a b a b a b a b c a
|
pattern: a b a b a b c a
The mismatch is found, let's slide the pattern.
main: a b a b a b a b c a
|
pattern: > > a b a b a b c a
We slide 2 characters and find the match. The question is how we know we should slide 2 characters.
PTM (Partial Match Table)
The array stored the "study" result is call PTM (Partial Match Table). PTM[i] means the length of the matched string.
pattern : a b a b a b c a
PTM : 0 0 1 2 3 4 0 1
Explanation:
-
PTM[0] = 0No prefix to match
-
PTM[1] = 0In the string
ab, the lastbdoesn't matche the firsta. So the matched string length is0. -
PTM[2] = 1In the string
aba, the lastamatches the firsta. So the matched string length is1. -
PTM[3] = 2In the string
abab, the lastabmatches the lastab. the matched string length is2. -
... (same to the
PTM[3] = 2) -
PTM[7] = 0In the string
abababc, the lastcdoesn't matche the firsta. So the matched string length is0. -
PTM[8] = 1In the string
abababca, the lastamatches the firsta, andcadoesn't matches the firstab. So the matched string length is1.
Back to the question, how we know we should slide 2 characters. Since c == pattern[7], we should check PTM[0:7]. PTM[6] tells us the length of the matched string is 4, so we need to slide (7-1) - 4 = 2 characters. 7-1 means the length of the pattern before c (ababab).
In conclusion, if pattern[i] is the bad character, we need to check PTM[i-1] the length of the matched string. Finally, the number of moving character is i - PTM[i-1] + 1.
Now, we move one space back in PTM and create the new array, next. It makes the generation of the PTM and usage of PTM simpler. It's tricky.
pattern : a b a b a b c a
PTM : 0 0 1 2 3 4 0 1
next : -1 0 0 1 2 3 4 0 1
Usage of PTM
KMP is the usage of PTM.
Suppose PTM is created, we use next array in the codes.
def kmp(main: str, pattern: str) -> int:
n = len(main)
m = len(pattern)
next_arr = generate_next_from_PTM(pattern)
# use fast-slow pointers
i = 0
j = 0
while i < n and j < m:
if j == -1 or main[i] == pattern[j]:
# check the next character in the main string
i += 1
j += 1
else:
# Thanks to moving one space back in PTM
# If pattern[j] is the bad character, (main[i] != pattern[j])
# We check next_arr[j] instead of PTM[j-1], which are the same.
j = next_arr[j]
# This line also implies the slide of pattern.
# Because the pointer i is not moved, the slide of pattern is i - j
Generation of PTM
def generate_next_from_PTM(pattern: str) -> List[int]:
m = len(pattern)
next_arr = [-1 for _ in range(m+1)]
i = 0
j = -1
while i < m:
if j == -1 or pattern[i] == pattern[j]:
i += 1
j += 1
next_arr[i] = j
else:
# key point
j = next_arr[j]
return next_arr