Implementing a Web Crawler for a Local Area Network

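I am building a search engine for a local area network (LAN). How should I go about writing a crawler for the LAN? I plan to implement it in Java. I looked at Heritrix, but it seems to be a crawler for the public web.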

Solutions »

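Heritrix is a ready-made crawler, so you can simply use it as it is.
Adjusting the settings in robott.xml should be enough.
Start crawling from the LAN's start page, for example:
http://www.xxx.xxx//index.hmtl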
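import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

public class Spider implements Runnable {
    private static final Logger logger = Logger.getLogger(Spider.class.getName());

    private final List<String> urls;                // URLs waiting to be crawled
    private final Map<String, Boolean> indexedURLs; // URLs that have already been crawled
    private final List<Thread> threadList;          // worker threads started by go()
    private final int threads;                      // number of worker threads

    public static void main(String[] argv) throws Exception {
        if (argv.length == 0) {
            System.out.println("Missing required argument: [Site URL]");
            return;
        }
        Spider spider = new Spider(argv[0]);
        spider.go();
    }

    public Spider(String strURL) {
        if (strURL == null || strURL.isEmpty())
            throw new IllegalArgumentException("Missing required argument: -u [start url]");
        urls = new ArrayList<>();
        threads = 10;
        urls.add(strURL);
        threadList = new ArrayList<>();
        indexedURLs = new HashMap<>();

        if (threads < 1)
            throw new IllegalArgumentException("Invalid number of threads: " +
                threads);
    }

    public void go() throws Exception {
        // start the worker threads; each one pulls URLs from the shared list
        long start = System.currentTimeMillis();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(this, "Spider " + (i + 1));
            t.start();
            threadList.add(t);
        }
        // wait for every worker to finish
        while (!threadList.isEmpty()) {
            Thread child = threadList.remove(0);
            child.join();
        }
        long elapsed = System.currentTimeMillis() - start;
        logger.info("Crawl finished in " + elapsed + " ms");
    }

    public void run() {
        String url;
        try {
            while ((url = dequeueURL()) != null) {
                indexURL(url);
            }
        } catch (Exception e) {
            logger.info(e.getMessage());
        }
    }

    // Checks whether the URL list still holds a URL that has not been parsed;
    // if so, returns it so that the calling thread can continue with it.
    // The original listing is cut off at this point. The two methods below are
    // a minimal assumed completion so that the class compiles.
    private synchronized String dequeueURL() {
        while (!urls.isEmpty()) {
            String url = urls.remove(0);
            if (indexedURLs.put(url, Boolean.TRUE) == null)
                return url; // not seen before
        }
        return null; // nothing left to crawl
    }

    // Placeholder (assumption): a real version would fetch the page, index its
    // content, and add newly found links to urls.
    private void indexURL(String url) throws Exception {
        logger.info("Indexing " + url);
    }
}
Source: http://www.diybl.com/course/webjsh/osgl/200798/71185.html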
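The listing above never shows what the indexURL step called from run() does with a page. As a rough sketch only, one piece of it could be pulling the links out of the fetched HTML and keeping those that stay on the starting host, so the crawl does not leave the LAN site. The class name LinkExtractor and the method extractLinks are hypothetical and are not part of the posted code. The regular expression is only an approximation of HTML parsing.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the posted Spider class.
class LinkExtractor {
    // Matches href="..." or href='...' in any letter case. A regex is only a
    // rough stand-in for an HTML parser.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns absolute links found in html that point to the same host as base.
    static List<String> extractLinks(String html, URI base) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            int hash = href.indexOf('#');      // drop any #fragment part
            if (hash >= 0) href = href.substring(0, hash);
            if (href.isEmpty()) continue;

            URI resolved;
            try {
                resolved = base.resolve(href); // make relative links absolute
            } catch (IllegalArgumentException e) {
                continue;                      // skip hrefs that are not valid URIs
            }
            // Stay on the starting host so the crawl remains inside the site.
            if (base.getHost() != null
                    && base.getHost().equalsIgnoreCase(resolved.getHost())) {
                links.add(resolved.toString());
            }
        }
        return links;
    }
}
```

A caller would pass the page body and the page's own URL, for example LinkExtractor.extractLinks(html, URI.create(pageUrl)), and append the returned links to the list of URLs still to be crawled.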
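To the first reply: how should robott.xml be modified? I have only just started and do not know much yet, so I would appreciate guidance as I go. Thank you!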
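I want to use Lucene to build a search engine for the LAN. Where should I start? So far I only know that it should be divided into several parts, such as the crawler, the indexer, the searcher, and the user interface, but how should each part actually be built? It needs to be able to search web pages on the LAN as well as text files, Word, Excel, PowerPoint, PDF, and similar files.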
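To make the indexer and searcher parts mentioned above a little more concrete, here is a toy in-memory inverted index in plain Java. It is not Lucene and does not use Lucene's API; it only illustrates the idea that the indexer maps terms to documents and the searcher looks those terms up. The class name TinyIndex and its methods are hypothetical. It assumes the text has already been extracted from each file (Word, Excel, PowerPoint, and PDF files need format-specific parsers, which are not shown), and its tokenizer is naive and would not segment Chinese text properly.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index, shown only to illustrate the indexer/searcher split.
class TinyIndex {
    // term -> sorted set of document identifiers (URLs or file paths)
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Splits text on anything that is not a letter or digit.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    // Indexer: record every term of the extracted text under docId.
    void add(String docId, String text) {
        for (String term : tokenize(text)) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Searcher: return the documents that contain every query term (AND).
    Set<String> search(String query) {
        Set<String> result = null;
        for (String term : tokenize(query)) {
            if (term.isEmpty()) continue;
            Set<String> docs =
                postings.getOrDefault(term, Collections.<String>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs); // copy so the index is not modified
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? Collections.<String>emptySet() : result;
    }
}
```

In a real system Lucene would take over both roles, and the user interface would only need to pass the query string to the searcher and display the returned document identifiers.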
What protocol do the URLs use when crawling a LAN?
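The thread does not answer this, so the following is only a sketch under assumptions: pages served by a web server on the LAN are addressed with http or https URLs just as on the public web, and files on a local or mounted drive can be addressed with file URLs. A crawler could branch on the URL scheme to decide how to fetch each entry. The class name SchemeRouter and the Source categories are hypothetical; other schemes, such as smb for Windows shares, are treated as unsupported here because the Java standard library has no handler for them.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper that decides how a crawler should fetch a URL.
class SchemeRouter {
    enum Source { WEB_PAGE, LOCAL_OR_MOUNTED_FILE, UNSUPPORTED }

    static Source classify(String url) {
        String scheme;
        try {
            scheme = URI.create(url).getScheme();
        } catch (IllegalArgumentException e) {
            return Source.UNSUPPORTED; // not a syntactically valid URI
        }
        if (scheme == null) {
            return Source.UNSUPPORTED; // relative reference without a scheme
        }
        switch (scheme.toLowerCase(Locale.ROOT)) {
            case "http":
            case "https":
                return Source.WEB_PAGE;              // fetch over HTTP
            case "file":
                return Source.LOCAL_OR_MOUNTED_FILE; // read from the file system
            default:
                return Source.UNSUPPORTED;           // e.g. smb, ftp
        }
    }
}
```

With this split, intranet web pages can go through the same fetching code as any web crawler, while shared documents are read directly as files.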