[LeetCode] #28 Implement strStr()

2017-04-14

287
0
LeetCode
2017-05-16

#28 Implement strStr()

Question

Implement strStr().

Returns the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.

Thinking

假設今天我們要比對兩組字串，分別是主字串(haystack)、子字串(needle)

當中最簡單便是使用暴力字串比對法(Naive String-Matching Algorithm)，即主字串每個字元都作為子字串的起頭，再利用迴圈一一巡覽完為止，其概念如下圖

若主字串長度為n、子字串長度為m，則此方法的時間複雜度為O((n-m+1)*m)，實際Code則如下

public class Solution {
    public int StrStr(string haystack, string needle) {
        
        if (needle.Length == 0) return 0;
        if (haystack.Length < needle.Length) return -1;
        
        var currentHaystackIndex = 0;
        var currentNeedleIndex = 0;
        while(currentHaystackIndex < haystack.Length 
             && currentNeedleIndex < needle.Length)
        {
            if (haystack[currentHaystackIndex] == needle[currentNeedleIndex])
            {
                currentHaystackIndex++;
                currentNeedleIndex++;
            }
            else
            {
                currentHaystackIndex = 
                    currentHaystackIndex - currentNeedleIndex + 1;
                currentNeedleIndex = 0;
            }
        }
        return (currentNeedleIndex == needle.Length) 
               ? currentHaystackIndex - needle.Length 
               : -1;
    }
}

但許多人認為這樣的比對效率是非常低的，為什麼呢？

首先，子字串「ABCABE」中的needle[0]、needle[1]、needle[2]均不相等，因此在概念圖的步驟(1)比對過後，很明顯發現(2)(3)是可以省略的
再仔細看，needle[0]與needle[3]是相等的(needle[1]與needle[4]亦是)，在步驟(1)中，needle[3]已經與主字串的對應位置haystack[3]比對過且是相等的，那就代表與needle[3]相等的needle[0]下一回合也不需要再與haystack[3]比對了，因此步驟(4)第一組字元比對也是可以省略的

最後我們得到下圖

在暴力字串比對法中，haystackIndex是會不斷來回變動的，而從上圖簡化後發現這些變動其實是不必要的，因此科學家們把腦筋動到needleIndex上，這種只專注於needleIndex變動的方法便是Knuth-Morris-Pratt演算法(KMP Algorithm)

當發現主字串與子字串字元比對不相等時，在haystackIndex不變動的情況下，我們希望能將needleIndex退回可能的位置，而不是直接從頭(needle[0])開始比對。那該怎麼做呢？首先，我們需要建立一個陣列next，該陣列是用來儲存needleIndex的變化，其推導如下

needleIndex	0	1	2	3	4	5
needle	A	B	C	A	B	E
next[needleIndex]	-1	0	0	0	1	2

next[0]，沒有前後綴能比較，為例外情況，我們定義為-1
next[1]，前綴為「A」，但它沒有後綴，因此為0
next[2]、next[3]，同上皆沒有後綴，因此為0
next[4]，比較第一個字元與最後一個字元，可找到前綴「A」與後綴「A」，一個字元相等，因此為1
next[5]，可找到最大前綴「AB」與後綴「AB」，兩個字元相等，因此為2

可以發現，next陣列值的計算方式是找出相等的最大前綴與後綴詞，找出來的字元數代表著進行下一次比對時與needle[0]的初始距離，這距離其實就是needleIndex應退回到的位置(ex：距離為1時，代表從needle[1]開始)。若遇到-1的例外情況，那代表needleIndex已經退無可退，是時候將haystackIndex再往下推移了

若主字串長度為n、子字串長度為m，則此方法的時間複雜度為O(n+m)，實際Code如下

public class Solution {
    public int StrStr(string haystack, string needle) {
        
        if (needle.Length == 0) return 0;
        if (haystack.Length < needle.Length) return -1;
        
        var currentHaystackIndex = 0;
        var currentNeedleIndex = 0;
        var next = GetNext(needle);
        while(currentHaystackIndex < haystack.Length 
             && currentNeedleIndex < needle.Length)
        {
            if (currentNeedleIndex == -1 || 
               haystack[currentHaystackIndex] == needle[currentNeedleIndex])
            {
                currentHaystackIndex++;
                currentNeedleIndex++;
            }
            else
            {
                currentNeedleIndex = next[currentNeedleIndex];
            }
        }
        return (currentNeedleIndex == needle.Length) 
               ? currentHaystackIndex - needle.Length 
               : -1;
    }
    
    private int[] GetNext(string needle)
    {
        var index = 0;
        var indexValue = -1;
        var next = new int[needle.Length];
        next[index] = indexValue;
        while(index < needle.Length - 1)
        {
            if (indexValue == -1 || needle[index] == needle[indexValue])
            {
                index++;
                indexValue++;
                next[index] = indexValue;
            }
            else
            {
                indexValue = next[indexValue];
            }
        }
        return next;
    }
}

不過有人發現KMP演算法仍有改良的空間。如前圖步驟(1)中，我們得知haystack[4]與needle[4]不相等，也知道子字串「ABCABE」存在前後綴「AB」，依據這些條件，步驟(2)其實可以再更進一步簡化成下圖

我們需要建立一個新的next陣列，推導如下

needleIndex	0	1	2	3	4	5
needle	A	B	C	A	B	E
nextOld[needleIndex]	-1	0	0	0	1	2
nextNew[needleIndex]	-1	0	0	-1	0	2

先計算出nextOld
nextNew[0]，為例外情況，我們一樣定義為-1
nextNew[1]，因為nextOld[1]值為0，所以比較needle[0]的字元「A」與needle[1]的字元「B」，不相等則照舊，因此為0
nextNew[2]，因為nextOld[2]值為0，所以比較needle[0]的字元「A」與needle[2]的字元「C」，不相等則照舊，因此為0
nextNew[3]，因為nextOld[3]值為0，所以比較needle[0]的字元「A」與needle[3]的字元「A」，相等則被nextNew[0]的值取代，因此為-1
nextNew[4]，因為nextOld[4]值為1，所以比較needle[1]的字元「B」與needle[4]的字元「B」，相等則被nextNew[1]的值取代，因此為0
nextNew[5]，因為nextOld[5]值為2，所以比較needle[2]的字元「C」與needle[5]的字元「E」，不相等則照舊，因此為2

概念其實很簡單，若nextOld[a]的值b所指向的字元(needle[b])與needle[a]的字元相同，則nextNew[a]的值應該被nextNew[b]的值取代；反之nextNew[a]的值就是nextOld[a]的值。這樣的作法可以更精準定位到needleIndex應退回的位置，進而再減少運算次數，有了這概念，最後的解法如下

My C# Solution

public class Solution {
    public int StrStr(string haystack, string needle) {
        
        if (needle.Length == 0) return 0;
        if (haystack.Length < needle.Length) return -1;
        
        var currentHaystackIndex = 0;
        var currentNeedleIndex = 0;
        var next = GetNext(needle);
        while(currentHaystackIndex < haystack.Length 
             && currentNeedleIndex < needle.Length)
        {
            if (currentNeedleIndex == -1 ||
               haystack[currentHaystackIndex] == needle[currentNeedleIndex])
            {
                currentHaystackIndex++;
                currentNeedleIndex++;
            }
            else
            {
                currentNeedleIndex = next[currentNeedleIndex];
            }
        }
        return (currentNeedleIndex == needle.Length) 
               ? currentHaystackIndex - needle.Length 
               : -1;
    }
    
    private int[] GetNext(string needle)
    {
        var index = 0;
        var indexValue = -1;
        var next = new int[needle.Length];
        next[index] = indexValue;
        while(index < needle.Length - 1)
        {
            if (indexValue == -1 
               || needle[index] == needle[indexValue])
            {
                index++;
                indexValue++;
                if (needle[index] != needle[indexValue])
                {
                    next[index] = indexValue;
                }
                else
                {
                    next[index] = next[indexValue];                
                }
            }
            else
            {
                indexValue = next[indexValue];
            }
        }
        return next;
    }
}

Reference：《大話資料結構》Chapter 05 字串

回首頁

tsanmmmm2000's Tech Blog

A course of taking down my experience and life