| This document explains Crochemore and Perrin's Two-Way string matching |
| algorithm, in which a smaller string (the "pattern" or "needle") |
| is searched for in a longer string (the "text" or "haystack"), |
| determining whether the needle is a substring of the haystack, and if |
| so, at what index(es). It is to be used by Python's string |
| (and bytes-like) objects when calling `find`, `index`, `__contains__`, |
| or implicitly in methods like `replace` or `partition`. |
| |
| This is essentially a re-telling of the paper |
| |
| Crochemore M., Perrin D., 1991, Two-way string-matching, |
| Journal of the ACM 38(3):651-675. |
| |
| focused more on understanding and examples than on rigor. See also |
| the code sample here: |
| |
| http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260 |
| |
| The algorithm runs in O(len(needle) + len(haystack)) time and with |
| O(1) space. However, since there is a larger preprocessing cost than |
| simpler algorithms, this Two-Way algorithm is to be used only when the |
| needle and haystack lengths meet certain thresholds. |
| |
| |
| These are the basic steps of the algorithm: |
| |
| * "Very carefully" cut the needle in two. |
| * For each alignment attempted: |
| 1. match the right part |
| * On failure, jump by the amount matched + 1 |
| 2. then match the left part. |
| * On failure jump by max(len(left), len(right)) + 1 |
| * If the needle is periodic, don't re-do comparisons; maintain |
| a "memory" of how many characters you already know match. |
| |
| |
| -------- Matching the right part -------- |
| |
| We first scan the right part of the needle to check if it matches the |
| the aligned characters in the haystack. We scan left-to-right, |
| and if a mismatch occurs, we jump ahead by the amount matched plus 1. |
| |
| Example: |
| |
| text: ........EFGX................... |
| pattern: ....abcdEFGH.... |
| cut: <<<<>>>> |
| |
| Matched 3, so jump ahead by 4: |
| |
| text: ........EFGX................... |
| pattern: ....abcdEFGH.... |
| cut: <<<<>>>> |
| |
| Why are we allowed to do this? Because we cut the needle very |
| carefully, in such a way that if the cut is ...abcd + EFGH... then |
| we have |
| |
| d != E |
| cd != EF |
| bcd != EFG |
| abcd != EFGH |
| ... and so on. |
| |
| If this is true for every pair of equal-length substrings around the |
| cut, then the following alignments do not work, so we can skip them: |
| |
| text: ........EFG.................... |
| pattern: ....abcdEFGH.... |
| ^ (Bad because d != E) |
| text: ........EFG.................... |
| pattern: ....abcdEFGH.... |
| ^^ (Bad because cd != EF) |
| text: ........EFG.................... |
| pattern: ....abcdEFGH.... |
| ^^^ (Bad because bcd != EFG) |
| |
| Skip 3 alignments => increment alignment by 4. |
| |
| |
| -------- If len(left_part) < len(right_part) -------- |
| |
| Above is the core idea, and it begins to suggest how the algorithm can |
| be linear-time. There is one bit of subtlety involving what to do |
| around the end of the needle: if the left half is shorter than the |
| right, then we could run into something like this: |
| |
| text: .....EFG...... |
| pattern: cdEFGH |
| |
| The same argument holds that we can skip ahead by 4, so long as |
| |
| d != E |
| cd != EF |
| ?cd != EFG |
| ??cd != EFGH |
| etc. |
| |
| The question marks represent "wildcards" that always match; they're |
| outside the limits of the needle, so there's no way for them to |
| invalidate a match. To ensure that the inequalities above are always |
| true, we need them to be true for all possible '?' values. We thus |
| need cd != FG and cd != GH, etc. |
| |
| |
| -------- Matching the left part -------- |
| |
| Once we have ensured the right part matches, we scan the left part |
| (order doesn't matter, but traditionally right-to-left), and if we |
| find a mismatch, we jump ahead by |
| max(len(left_part), len(right_part)) + 1. That we can jump by |
| at least len(right_part) + 1 we have already seen: |
| |
| text: .....EFG..... |
| pattern: abcdEFG |
| Matched 3, so jump by 4, |
| using the fact that d != E, cd != EF, and bcd != EFG. |
| |
| But we can also jump by at least len(left_part) + 1: |
| |
| text: ....cdEF..... |
| pattern: abcdEF |
| Jump by len('abcd') + 1 = 5. |
| |
| Skip the alignments: |
| text: ....cdEF..... |
| pattern: abcdEF |
| text: ....cdEF..... |
| pattern: abcdEF |
| text: ....cdEF..... |
| pattern: abcdEF |
| text: ....cdEF..... |
| pattern: abcdEF |
| |
| This requires the following facts: |
| d != E |
| cd != EF |
| bcd != EF? |
| abcd != EF?? |
| etc., for all values of ?s, as above. |
| |
| If we have both sets of inequalities, then we can indeed jump by |
| max(len(left_part), len(right_part)) + 1. Under the assumption of such |
| a nice splitting of the needle, we now have enough to prove linear |
| time for the search: consider the forward-progress/comparisons ratio |
| at each alignment position. If a mismatch occurs in the right part, |
| the ratio is 1 position forward per comparison. On the other hand, |
| if a mismatch occurs in the left half, we advance by more than |
| len(needle)//2 positions for at most len(needle) comparisons, |
| so this ratio is more than 1/2. This average "movement speed" is |
| bounded below by the constant "1 position per 2 comparisons", so we |
| have linear time. |
| |
| |
| -------- The periodic case -------- |
| |
| The sets of inequalities listed so far seem too good to be true in |
| the general case. Indeed, they fail when a needle is periodic: |
| there's no way to split 'AAbAAbAAbA' in two such that |
| |
| (the stuff n characters to the left of the split) |
| cannot equal |
| (the stuff n characters to the right of the split) |
| for all n. |
| |
| This is because no matter how you cut it, you'll get |
| s[cut-3:cut] == s[cut:cut+3]. So what do we do? We still cut the |
| needle in two so that n can be as big as possible. If we were to |
| split it as |
| |
| AAbA + AbAAbA |
| |
| then A == A at the split, so this is bad (we failed at length 1), but |
| if we split it as |
| |
| AA + bAAbAAbA |
| |
| we at least have A != b and AA != bA, and we fail at length 3 |
| since ?AA == bAA. We already knew that a cut to make length-3 |
| mismatch was impossible due to the period, but we now see that the |
| bound is sharp; we can get length-1 and length-2 to mismatch. |
| |
| This is exactly the content of the *critical factorization theorem*: |
| that no matter the period of the original needle, you can cut it in |
| such a way that (with the appropriate question marks), |
| needle[cut-k:cut] mismatches needle[cut:cut+k] for all k < the period. |
| |
| Even "non-periodic" strings are periodic with a period equal to |
| their length, so for such needles, the CFT already guarantees that |
| the algorithm described so far will work, since we can cut the needle |
| so that the length-k chunks on either side of the cut mismatch for all |
| k < len(needle). Looking closer at the algorithm, we only actually |
| require that k go up to max(len(left_part), len(right_part)). |
| So long as the period exceeds that, we're good. |
| |
| The more general shorter-period case is a bit harder. The essentials |
| are the same, except we use the periodicity to our advantage by |
| "remembering" periods that we've already compared. In our running |
| example, say we're computing |
| |
| "AAbAAbAAbA" in "bbbAbbAAbAAbAAbbbAAbAAbAAbAA". |
| |
| We cut as AA + bAAbAAbA, and then the algorithm runs as follows: |
| |
| First alignment: |
| bbbAbbAAbAAbAAbbbAAbAAbAAbAA |
| AAbAAbAAbA |
| ^^X |
| - Mismatch at third position, so jump by 3. |
| - This requires that A!=b and AA != bA. |
| |
| Second alignment: |
| bbbAbbAAbAAbAAbbbAAbAAbAAbAA |
| AAbAAbAAbA |
| ^^^^^^^^ |
| X |
| - Matched entire right part |
| - Mismatch at left part. |
| - Jump forward a period, remembering the existing comparisons |
| |
| Third alignment: |
| bbbAbbAAbAAbAAbbbAAbAAbAAbAA |
| AAbAAbAAbA |
| mmmmmmm^^X |
| - There's "memory": a bunch of characters were already matched. |
| - Two more characters match beyond that. |
| - The 8th character of the right part mismatched, so jump by 8 |
| - The above rule is more complicated than usual: we don't have |
| the right inequalities for lengths 1 through 7, but we do have |
| shifted copies of the length-1 and length-2 inequalities, |
| along with knowledge of the mismatch. We can skip all of these |
| alignments at once: |
| |
| bbbAbbAAbAAbAAbbbAAbAAbAAbAA |
| AAbAAbAAbA |
| ~ A != b at the cut |
| bbbAbbAAbAAbAAbbbAAbAAbAAbAA |
| AAbAAbAAbA |
| ~~ AA != bA at the cut |
| bbbAbbAAbAAbAAbbbAAbAAbAAbAA |
| AAbAAbAAbA |
| ^^^^X 7-3=4 match, and the 5th misses. |
| bbbAbbAAbAAbAAbbbAAbAAbAAbAA |
| AAbAAbAAbA |
| ~ A != b at the cut |
| bbbAbbAAbAAbAAbbbAAbAAbAAbAA |
| AAbAAbAAbA |
| ~~ AA != bA at the cut |
| bbbAbbAAbAAbAAbbbAAbAAbAAbAA |
| AAbAAbAAbA |
| ^X 7-3-3=1 match and the 2nd misses. |
| bbbAbbAAbAAbAAbbbAAbAAbAAbAA |
| AAbAAbAAbA |
| ~ A != b at the cut |
| |
| Fourth alignment: |
| bbbAbbAAbAAbAAbbbAAbAAbAAbAA |
| AAbAAbAAbA |
| ^X |
| - Second character mismatches, so jump by 2. |
| |
| Fifth alignment: |
| bbbAbbAAbAAbAAbbbAAbAAbAAbAA |
| AAbAAbAAbA |
| ^^^^^^^^ |
| X |
| - Right half matches, so use memory and skip ahead by period=3 |
| |
| Sixth alignment: |
| bbbAbbAAbAAbAAbbbAAbAAbAAbAA |
| AAbAAbAAbA |
| mmmmmmmm^^ |
| - Right part matches, left part is remembered, found a match! |
| |
| The one tricky skip by 8 here generalizes: if we have a period of p, |
| then the CFT says we can ensure the cut has the inequality property |
| for lengths 1 through p-1, and jumping by p would line up the |
| matching characters and mismatched character one period earlier. |
| Inductively, this proves that we can skip by the number of characters |
| matched in the right half, plus 1, just as in the original algorithm. |
| |
| To make it explicit, the memory is set whenever the entire right part |
| is matched and is then used as a starting point in the next alignment. |
| In such a case, the alignment jumps forward one period, and the right |
| half matches all except possibly the last period. Additionally, |
| if we cut so that the left part has a length strictly less than the |
| period (we always can!), then we can know that the left part already |
| matches. The memory is reset to 0 whenever there is a mismatch in the |
| right part. |
| |
| To prove linearity for the periodic case, note that if a right-part |
| character mismatches, then we advance forward 1 unit per comparison. |
| On the other hand, if the entire right part matches, then the skipping |
| forward by one period "defers" some of the comparisons to the next |
| alignment, where they will then be spent at the usual rate of |
| one comparison per step forward. Even if left-half comparisons |
| are always "wasted", they constitute less than half of all |
| comparisons, so the average rate is certainly at least 1 move forward |
| per 2 comparisons. |
| |
| |
| -------- When to choose the periodic algorithm --------- |
| |
| The periodic algorithm is always valid but has an overhead of one |
| more "memory" register and some memory computation steps, so the |
| here-described-first non-periodic/long-period algorithm -- skipping by |
| max(len(left_part), len(right_part)) + 1 rather than the period -- |
| should be preferred when possible. |
| |
| Interestingly, the long-period algorithm does not require an exact |
| computation of the period; it works even with some long-period, but |
| undeniably "periodic" needles: |
| |
| Cut: AbcdefAbc == Abcde + fAbc |
| |
| This cut gives these inequalities: |
| |
| e != f |
| de != fA |
| cde != fAb |
| bcde != fAbc |
| Abcde != fAbc? |
| The first failure is a period long, per the CFT: |
| ?Abcde == fAbc?? |
| |
| A sufficient condition for using the long-period algorithm is having |
| the period of the needle be greater than |
| max(len(left_part), len(right_part)). This way, after choosing a good |
| split, we get all of the max(len(left_part), len(right_part)) |
| inequalities around the cut that were required in the long-period |
| version of the algorithm. |
| |
| With all of this in mind, here's how we choose: |
| |
| (1) Choose a "critical factorization" of the needle -- a cut |
| where we have period minus 1 inequalities in a row. |
| More specifically, choose a cut so that the left_part |
| is less than one period long. |
| (2) Determine the period P_r of the right_part. |
| (3) Check if the left part is just an extension of the pattern of |
| the right part, so that the whole needle has period P_r. |
| Explicitly, check if |
| needle[0:cut] == needle[0+P_r:cut+P_r] |
| If so, we use the periodic algorithm. If not equal, we use the |
| long-period algorithm. |
| |
| Note that if equality holds in (3), then the period of the whole |
| string is P_r. On the other hand, suppose equality does not hold. |
| The period of the needle is then strictly greater than P_r. Here's |
| a general fact: |
| |
| If p is a substring of s and p has period r, then the period |
| of s is either equal to r or greater than len(p). |
| |
| We know that needle_period != P_r, |
| and therefore needle_period > len(right_part). |
| Additionally, we'll choose the cut (see below) |
| so that len(left_part) < needle_period. |
| |
| Thus, in the case where equality does not hold, we have that |
| needle_period >= max(len(left_part), len(right_part)) + 1, |
| so the long-period algorithm works, but otherwise, we know the period |
| of the needle. |
| |
| Note that this decision process doesn't always require an exact |
| computation of the period -- we can get away with only computing P_r! |
| |
| |
| -------- Computing the cut -------- |
| |
| Our remaining tasks are now to compute a cut of the needle with as |
| many inequalities as possible, ensuring that cut < needle_period. |
| Meanwhile, we must also compute the period P_r of the right_part. |
| |
| The computation is relatively simple, essentially doing this: |
| |
| suffix1 = max(needle[i:] for i in range(len(needle))) |
| suffix2 = ... # the same as above, but invert the alphabet |
| cut1 = len(needle) - len(suffix1) |
| cut2 = len(needle) - len(suffix2) |
| cut = max(cut1, cut2) # the later cut |
| |
| For cut2, "invert the alphabet" is different than saying min(...), |
| since in lexicographic order, we still put "py" < "python", even |
| if the alphabet is inverted. Computing these, along with the method |
| of computing the period of the right half, is easiest to read directly |
| from the source code in fastsearch.h, in which these are computed |
| in linear time. |
| |
| Crochemore & Perrin's Theorem 3.1 give that "cut" above is a |
| critical factorization less than the period, but a very brief sketch |
| of their proof goes something like this (this is far from complete): |
| |
| * If this cut splits the needle as some |
| needle == (a + w) + (w + b), meaning there's a bad equality |
| w == w, it's impossible for w + b to be bigger than both |
| b and w + w + b, so this can't happen. We thus have all of |
| the inequalities with no question marks. |
| * By maximality, the right part is not a substring of the left |
| part. Thus, we have all of the inequalities involving no |
| left-side question marks. |
| * If you have all of the inequalities without right-side question |
| marks, we have a critical factorization. |
| * If one such inequality fails, then there's a smaller period, |
| but the factorization is nonetheless critical. Here's where |
| you need the redundancy coming from computing both cuts and |
| choosing the later one. |
| |
| |
| -------- Some more Bells and Whistles -------- |
| |
| Beyond Crochemore & Perrin's original algorithm, we can use a couple |
| more tricks for speed in fastsearch.h: |
| |
| 1. Even though C&P has a best-case O(n/m) time, this doesn't occur |
| very often, so we add a Boyer-Moore bad character table to |
| achieve sublinear time in more cases. |
| |
| 2. The prework of computing the cut/period is expensive per |
| needle character, so we shouldn't do it if it won't pay off. |
| For this reason, if the needle and haystack are long enough, |
| only automatically start with two-way if the needle's length |
| is a small percentage of the length of the haystack. |
| |
| 3. In cases where the needle and haystack are large but the needle |
| makes up a significant percentage of the length of the |
| haystack, don't pay the expensive two-way preprocessing cost |
| if you don't need to. Instead, keep track of how many |
| character comparisons are equal, and if that exceeds |
| O(len(needle)), then pay that cost, since the simpler algorithm |
| isn't doing very well. |