First impressions and questions after using Lucene 2.0

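I have spent the past few days reading about Lucene. Searching online, I found that many articles simply repeat one another. A few are well written, but most of their examples target Lucene 1.4.3. I am new to Lucene, so here are a few modest notes and some questions.
For Lucene's basic features, see:
<a href="http://www.yyhweb.com/Article.htm?cId=2&fId=3&aId=28">A First Look at Lucene</a>
<a href="http://www.yyhweb.com/Article.htm?cId=2&fId=3&aId=46">An Introduction to Basic Lucene Usage</a>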
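The Chinese word-segmentation plugin IK_CAnalyzer is well regarded online. Its latest version is currently 1.4, and it is built on the Lucene 2.0 API. The example below uses Lucene 2.0 together with IK_CAnalyzer.
Overview: the example is a simple use of Lucene. It indexes and searches three basic attributes of an "article": id, title, and content.
ArticleBiz.getForumArt("10"); returns a list of articles.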
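The article attributes are read from this list to build the index.

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
// Package name assumed from the IK_CAnalyzer 1.4 distribution; adjust if yours differs.
import org.mira.lucene.analysis.MIK_CAnalyzer;

public class IndexAndSearch {
    public static void main(String[] args) throws IOException {
        // RAMDirectory directory = new RAMDirectory(); // keep the index in memory
        String directory = "C:/index";                   // keep the index on disk
        if (!new File(directory).exists()) {
            new File(directory).mkdirs();
        }
        // The IK_CAnalyzer tokenizer
        MIK_CAnalyzer mkAnalyzer = new MIK_CAnalyzer();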
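        try {
            // Fetch the article list; the database access behind it is omitted here.
            List alist = ArticleBiz.getForumArt("10");
            // Create an IndexWriter, which adds each Document object to the index.
            // The third argument (true) creates a new index, replacing any existing one.
            IndexWriter writer = new IndexWriter(directory, mkAnalyzer, true);
            long s = System.currentTimeMillis(); // start time
            // Document that holds one index record
            Document doc = null;
            // Index every article in the list and add the result to the writer above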
            for (int i = 0; i < alist.size(); i++) {
                doc = new Document();
                Article curArt = (Article) alist.get(i);
                System.out.println("index of :" + i + " the id is:" + curArt.getId() + " the title :" + curArt.getTitle());
                // Index id, title and content. id: stored, not tokenized.
                // title: stored and tokenized. content: tokenized but not stored.
                doc.add(new Field("id", String.valueOf(curArt.getId()), Field.Store.YES, Field.Index.UN_TOKENIZED));
                doc.add(new Field("title", curArt.getTitle(), Field.Store.YES, Field.Index.TOKENIZED));
                doc.add(new Field("content", curArt.getContent(), Field.Store.NO, Field.Index.TOKENIZED));
                writer.addDocument(doc); // add to the writer
            }
            writer.optimize(); // optimize, then close
            writer.close();
            System.out.println("the process in:" + (System.currentTimeMillis() - s) + " ms");
        } catch (IOException e) {
            System.out.println(e);
        }
        // Find articles whose title contains "视频" or whose content contains "投票"
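        Query query1 = new TermQuery(new Term("title", "视频"));
        Query query2 = new TermQuery(new Term("content", "投票"));
        BooleanQuery query = new BooleanQuery();
        // Occur.SHOULD on both clauses gives OR semantics.
        query.add(query1, BooleanClause.Occur.SHOULD);
        query.add(query2, BooleanClause.Occur.SHOULD);
        // To search a single field, no BooleanQuery is needed to combine two queries.
        // For example, to find only titles containing "视频":
        // Query query = new TermQuery(new Term("title", "视频"));
        // Create an IndexSearcher for searching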
        IndexSearcher indexsearch = new IndexSearcher(directory);
        // Run the query through indexsearch.
        Hits hits2 = indexsearch.search(query);
        System.out.println("begin search:length:" + hits2.length());
        Document doc2 = null;
        int id = 0;
        String title = "";
        // Loop over the results and print them.
        for (int i = 0; i < hits2.length(); i++) {
            doc2 = hits2.doc(i);
            id = Integer.parseInt(doc2.get("id"));
            title = doc2.get("title");
            System.out.println("Result is:" + id + " and the title is:" + title);
        }
        indexsearch.close(); // release the underlying index reader
        System.out.println("end search");
    }
}

Many examples online use Lucene 1.4.3. In the newer version, calls such as doc.add(new Field("content",curArt.getContent(),Field.Store.NO,Field.Index.TOKENIZED)); differ considerably from the old API.
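A Field has two configurable properties: storage and indexing. The storage property controls whether the Field's value is stored; the indexing property controls whether the Field is indexed. That may sound trivial, but combining the two correctly matters.
Field.Index     Field.Store   Description
TOKENIZED       YES           Tokenized, indexed, and stored
TOKENIZED       NO            Tokenized and indexed, but not stored
NO              YES           Cannot be searched; it only accompanies the searched content, e.g. a URL
UN_TOKENIZED    YES/NO        Not tokenized; matched as a whole, so searching for part of the value finds nothing
NO              NO            Not a valid combination
To search on a Field, Field.Index must be set to TOKENIZED or UN_TOKENIZED. TOKENIZED runs the Field's content through the analyzer; UN_TOKENIZED does not, so the Field matches only when the query term equals the entire value.
If Field.Store is NO, the field's value cannot be read back from the index in search results; it comes back as null.
The above is only my own limited understanding; corrections are welcome.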
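The difference between tokenized and untokenized indexing described above can be illustrated with a toy inverted index. This is only a sketch in plain Java, not Lucene code: the class FieldIndexSketch and its methods are hypothetical names, and a lowercase whitespace split stands in for a real analyzer.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class FieldIndexSketch {
    // term -> ids of the documents that contain that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Index the value as separate terms (stand-in for Field.Index.TOKENIZED). */
    public void addTokenized(int docId, String value) {
        for (String token : value.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                add(token, docId);
            }
        }
    }

    /** Index the whole value as a single term (stand-in for Field.Index.UN_TOKENIZED). */
    public void addUntokenized(int docId, String value) {
        add(value, docId);
    }

    private void add(String term, int docId) {
        postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
    }

    /** Exact term lookup, comparable in spirit to a TermQuery. */
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term, Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        FieldIndexSketch tokenized = new FieldIndexSketch();
        tokenized.addTokenized(1, "Lucene in Action");

        FieldIndexSketch whole = new FieldIndexSketch();
        whole.addUntokenized(1, "Lucene in Action");

        System.out.println(tokenized.search("lucene"));        // [1]
        System.out.println(whole.search("lucene"));            // []
        System.out.println(whole.search("Lucene in Action"));  // [1]
    }
}
```

In the untokenized case only the complete value is a term, which is why a partial search finds nothing.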
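I also have the following questions and would appreciate any pointers.
1. Building the index for a large amount of data is bound to be time-consuming. A forum search certainly cannot work like the example above, because the system could not cope with rebuilding the index on every user search.
If the third argument of IndexWriter writer = new IndexWriter(directory,mkAnalyzer,true); is set to false, how are new articles then added to the existing index?
2. When a search returns too many results, they need to be paged. Fetching everything at once is clearly not sensible, but Lucene has nothing like a database's LIMIT clause. How can a large result set be paged?
3. Deleting the index: as the data grows, so does the index. When necessary, a custom method could be called to delete it. Should the third argument be set to false when creating the IndexWriter? Would the next run then call the delete method and remove all of the previous index data?
I hope someone knowledgeable can advise. Thanks!
Source: http://www.yyhweb.com/Article.htm?cId=2&fId=3&aId=50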

Solutions »

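2. When a search returns too many results, they need to be paged. Fetching everything at once is clearly not sensible, but Lucene has nothing like a database's LIMIT clause. How can a large result set be paged?
--------------------------------
Paging is controlled here:
// Loop over the results and print them.
// startRow and endRow are the first row of the page and the row just past its end.
for (int i = startRow; i < endRow && i < hits2.length(); i++) {
    doc2 = hits2.doc(i);
    id = Integer.parseInt(doc2.get("id"));
    title = doc2.get("title");
    System.out.println("Result is:" + id + " and the title is:" + title);
}
Lucene-based search: http://www.ruansou.com/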
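The bounds used in the loop above can be derived from a page number and a page size. The following is a minimal sketch in plain Java, with no Lucene dependency; PageWindow and its methods are hypothetical helper names, and totalHits stands in for hits2.length().

```java
public class PageWindow {
    /** Index of the first hit on the given 1-based page. */
    public static int start(int page, int pageSize) {
        return (page - 1) * pageSize;
    }

    /** Index one past the last hit on the page, clamped to the total number of hits. */
    public static int end(int page, int pageSize, int totalHits) {
        return Math.min(page * pageSize, totalHits);
    }

    /** Number of pages needed to show all hits. */
    public static int pageCount(int pageSize, int totalHits) {
        return (totalHits + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        int totalHits = 25; // stands in for hits2.length()
        int pageSize = 10;
        int page = 3;
        System.out.println("pages: " + pageCount(pageSize, totalHits));
        for (int i = start(page, pageSize); i < end(page, pageSize, totalHits); i++) {
            // In the reply's loop this is where hits2.doc(i) would be read.
            System.out.println("would read hit " + i);
        }
    }
}
```

With 25 hits and 10 per page, page 3 covers hits 20 through 24, so only the documents for the requested page are loaded.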
Reply to: iwlk
If the index is rebuilt only once a night, doesn't that mean new content cannot be searched right away?
Keywords: lucene.net, search, sorting, memory surge, out of memory, IndexSearcher, TopDocs, weight

/** Creates a searcher searching the index in the named directory. */
public IndexSearcher(String path) throws IOException {
  this(IndexReader.open(path), true);
}

/** Creates a searcher searching the index in the provided directory. */
public IndexSearcher(Directory directory) throws IOException {
  this(IndexReader.open(directory), true);
}

/** Creates a searcher searching the provided index. */
public IndexSearcher(IndexReader r) {
  this(r, false);
}

private IndexSearcher(IndexReader r, boolean closeReader) {
  reader = r;
  this.closeReader = closeReader;
}

Many people using Lucene have probably run into this situation. When the index is very large (over 10 GB) and searches create IndexSearcher objects with the first two constructors, every new IndexSearcher opens its own index object (in effect, another connection to the same index), and each of those takes up system resources, mainly memory. With many users hitting the system, memory use climbs steadily until it ends in a "java heap space" error, or an out-of-memory error on .NET. This problem can be solved as follows. The ultimate solution:
Contact: [email protected] , [email protected]
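The reply does not spell out its fix. One common remedy for this symptom, which may or may not be what the poster had in mind, is to open the searcher once and share it across requests instead of constructing one per search. The sketch below shows that idea in plain Java; SharedInstance and Factory are hypothetical names, and a StringBuilder stands in for the IndexSearcher so the example runs without Lucene.

```java
public class SharedInstance<T> {
    /** Hypothetical factory; create() may throw, as opening an index can. */
    public interface Factory<T> {
        T create() throws Exception;
    }

    private final Factory<T> factory;
    private T instance;
    private int created; // how many times the factory was actually invoked

    public SharedInstance(Factory<T> factory) {
        this.factory = factory;
    }

    /** Returns the shared object, creating it on first use only. */
    public synchronized T get() throws Exception {
        if (instance == null) {
            instance = factory.create();
            created++;
        }
        return instance;
    }

    public synchronized int createdCount() {
        return created;
    }

    public static void main(String[] args) throws Exception {
        // With Lucene on the classpath the factory could be, for example:
        //   () -> new IndexSearcher("C:/index")
        SharedInstance<StringBuilder> shared =
                new SharedInstance<>(() -> new StringBuilder("searcher"));
        Object first = shared.get();
        Object second = shared.get();
        System.out.println(first == second);       // true
        System.out.println(shared.createdCount()); // 1
    }
}
```

A shared searcher keeps seeing the index as it was when it was opened, so an application that updates its index would also need to replace the shared instance after each update.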