算法擂台——查找在一本书中重复出现的一句话

后缀数组nlogn应该可以。排序和查找多核应该没问题～所以 n/p log n 应该就可以了～

我觉得lz应该让v_JULY_v做一下三道题，然后他就可以写出一个系列的文章教程了：）

题目看得云里雾里。
什么样的子串算一个句子
程序的目标只需要找到一对相同的就可以了吗
几十万单词平均一个单词算10个字节也只占几M的内存按句子的字符串全部进行HASH不就完了吗

胡乱写了一下，没算法，没优化，也不知道用多少内存        void foo()
        {
            FileStream fs = File.Open(@"C:\The Three Musketeers(三剑客).txt", FileMode.Open, FileAccess.Read);
            StreamReader sr = new StreamReader(fs);
            int ch, lastch = 0;
            string[] aryText;
            string orgText;
            List<Sentence> wordarray = new List<Sentence>();
            Dictionary<string, List<Sentence>> dict = new Dictionary<string, List<Sentence>>();
            Dictionary<string, List<Sentence>> newdict;
            StringBuilder sb = new StringBuilder();
            StringBuilder orgsb = new StringBuilder();
            int words = 0, chars = 0, startchars = 0;
            while ((ch = sr.Read()) != -1)
            {
                chars++;
                if ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z'))
                {
                    sb.Append(((char)ch).ToString().ToLower());
                }
                else
                {
                    if ((lastch >= 'a' && lastch <= 'z') || (lastch >= 'A' && lastch <= 'Z'))
                    {
                        sb.Append(" ");
                        Sentence sentence;
                        sentence.start = words;
                        sentence.pos = startchars;
                        wordarray.Add(sentence);
                        words++;
                    }
                    startchars = chars;
                }
                orgsb.Append((char)ch);
                lastch = ch;
            }            fs.Close();
            orgText = orgsb.ToString();
            aryText = sb.ToString().Split(' ');
            for (int i = 0; i < aryText.Length - 1; i++)
            {
                string key = aryText[i] + " " + aryText[i + 1];
                if (dict.ContainsKey(key))
                    dict[key].Add(wordarray[i]);
                else
                    (dict[key] = new List<Sentence>()).Add(wordarray[i]);
            }
            int sentencelen = 1;
            while (true)
            {
                newdict = new Dictionary<string, List<Sentence>>();
                sentencelen++;
                foreach (var sentence in dict.Keys)
                {
                    if (dict[sentence].Count > 1)
                    {
                        foreach (Sentence s in dict[sentence])
                        {
                            if (s.start + sentencelen < aryText.Length)
                            {
                                string key = sentence + " " + aryText[s.start + sentencelen];
                                if (newdict.ContainsKey(key))
                                    newdict[key].Add(s);
                                else
                                    (newdict[key] = new List<Sentence>()).Add(s);
                            }
                        }
                    }
                }
                if (newdict.Count == 0)
                    break;
                dict = newdict;
            }
            Console.WriteLine("最长重复句子是： \"{0}\".", dict.Keys.ToList()[0]);
            dict.Values.ToList().ForEach(x => { Console.WriteLine("原文位于({1})：{0}...", orgText.Substring(x[0].pos, 100), x[0].pos); });        }
        struct Sentence
        {
            public int start;
            public int pos;
        }
输出：最长重复句子是： "read dec it is by more order and for the good of the state that the bearer of this has done what he has done richelieu and".
原文位于(912706)：read:Dec. 3, 1627
It is by more order and for the good of the state that the
bearer of this ha...
原文位于(939717)：read:Dec. 3, 1627
It is by more order and for the good of the state that the
bearer of this ha...

设文本包含n个单词，m个字母
以单词为单位，把文章所有的后缀放入一个数组(存放指针就好)，
对这个数组排序，每每相邻的记录最长前缀空间复杂度O(n), 时间复杂度n*m*log(n) (字符串比较是一个个字母比的！不是nlogn！)

这个可以用类似字典树，不过节点是词的编号，因为比较的是句子序列。
10万个句子大概200兆内存，如果只存词的引用指针的话。
参考http://www.cnblogs.com/dullwolf/archive/2011/04/15/2017479.html速度肯定是快的，而且可以做Like查找。

后缀数组nlogn应该可以。排序和查找多核应该没问题～所以 n/p log n 应该就可以了～

大概我说的太简单，就比如 how.are! you的话，直接处理为how are you，这样的话其实也不用太复杂的方法，统一处理之后排序就行了，如果还需要匹配are!you.ok中的are you，那么就需要把how are you,are you,you都加进去，are you ok，you ok,ok也都加进去。

这是一个求最长重复子串的问题！
想了几天，终于有点思路！下面说说我的想法。
假设已知道一个符串的最长重复子串长度为N，再读入一个字符后，该字符串的最长重复子串长度只有两种情况N或N+1；而且只N+1的情况只可能是倒数N+1个字符构成的子串。
如：abcab   最大重复子串长度为2，再读入一字符c,字符串变为abcabc,最大重复子串长度为3，子串为abc;
该结论的证明我还没完善好，改天给大家贴出来；
下面是算法的C语言实现：为了简化编程，我们暂且假设字符串已读到内存中：#include <stdio.h>int main(void)
{
     char array[]="abcabc";     int length=0;/*记录当前已求得的最长重复子串长度*/
     char *temp;/*用于指向读入字符后可能构成最长重复子串长度为length+1的子串的首字母*/     char *p,*q;/*p指向刚读入的字符*/     int flag=1;/*如果读入字符后可能构成最长重复子串长度为length+1的子串，flag=1*/     int i;
     p=array;
     while(*p!='\0')
     {
           temp=p-length;           for(q=array;q<temp;q++)/*从头开始比较长度为length+1的所有子串*/
           {
                flag=1;
                for(i=0;i<=length;i++)
                {
                     if(*(q+i)!=*(temp+i))
                     {
                          flag=0;
                          break;
                     }
                }
                if(flag)
                {
                      length++;
                      q=array;/*重新比较，也可以直接退出循环，主要是为了保险*/
                }
           }
           p++;

     }
      printf("%d\n",length);/*只打印出最长重复子串的长度*/
      return 0;
}补充：1。中间的字符串比较用一个函数会更好；
     2。可以记录下最长重复子串的位置，输出子串
看到楼上说用后缀数组，可能更好，正在学习中

调试易

算法擂台——查找在一本书中重复出现的一句话

解决方案 »