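First convert all (or some) of the Word files to plain text, then segment the text into words, build an index, and search through that index. The key parts are word segmentation and index building; file conversion is not the main problem. Do not try to do this with any simple text-matching method. What you want is full-text search, which matches by word, not by byte. Word segmentation is the key to search accuracy and the index is the key to search speed. You will of course also need a good file storage scheme so that the index stays efficient when it is created, read, and updated.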
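To make the segment, index, search pipeline concrete, here is a minimal sketch of an in-memory inverted index in Python. Everything in it is an assumption for illustration: the names `tokenize`, `build_index`, and `search`, the sample documents, and above all the tokenizer, which only splits on runs of ASCII letters and digits. Chinese text would need a real word segmenter in its place, and a real system would store the index on disk rather than in a dict.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Placeholder segmenter (assumption): lowercase runs of ASCII
    # letters and digits. Swap in a proper word segmenter for Chinese.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # Inverted index: word -> set of ids of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    # Return the documents that contain every word of the query.
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical documents, standing in for text converted from Word files.
docs = {
    "a.txt": "Full text search matches whole words",
    "b.txt": "Concatenate the text files first",
    "c.txt": "The cat sat on the index",
}
index = build_index(docs)
```

Searching for "cat" with this index finds only c.txt, while a byte-level substring match would also hit "Concatenate" in b.txt. That is the difference between matching by word and matching by byte. Each lookup touches only the sets for the query words, not the document text, which is where the speed comes from.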
They are probably hidden in some academic publications, but see Microsoft researcher Stephen Robertson's list of publications at http://research.microsoft.com/users/robertson/
He developed some ranking algorithms for Microsoft.

Here are some additional worthy reads:

An Algorithm for Full Text Indexing
http://citeseer.nj.nec.com/529554.html

Charming Python: Developing a full-text indexer in Python
http://www-106.ibm.com/developerworks/xml/library/l-pyind.html

Search engine basics
http://www-106.ibm.com/developerworks/library/searchengine.html
The efficiency is too low :(