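I have been working through the book on Lucene 2.0 + Heritrix, and while following Chapter 10 to configure Heritrix in MyEclipse I ran into two problems.
Problem 1: After searching through a lot of material online I finally got the configuration working, but when I set up the processor chain for a crawl I cannot choose the crawl options. For example, p. 312 of the book says to select Mirror as the Writer, but on that page I only get the default Arc setting and cannot pick anything else. I searched for a long time. There is plenty of information about configuring Heritrix in MyEclipse, but it is scattered and lacks detail, and many questions have been asked without ever being answered. This exact question has been asked before too, with almost no answers.
Problem 2: While I set up the processor chain, MyEclipse prints some error messages, yet pages are still crawled. I do not understand why.
My configuration steps, project setup and error messages follow. Note: the project's absolute path is under D:\share\JavaProjects\hibernate, i.e. D:\share\JavaProjects\hibernate\Heritrix
The guide "Configuring Heritrix in Eclipse" is written for Heritrix 1.14.1. I set up 1.14.1 first and hit these problems, then removed it and set up 1.10.1, and got the same problems.
1. Steps for configuring Heritrix in MyEclipse
1. Create a new project in Eclipse (project name: Heritrix);
2. Download the two Heritrix 1.14.1 .zip files and extract them to a temporary directory (heritrix-1.14.1-src.zip and heritrix-1.14.1.zip);
3. From \src\java in the first zip's extracted directory, copy the three folders (packages) org, com and st into the project's src;
4. From \src\conf in the first zip's extracted directory, copy all folders and files into the project root (some of these files may not be needed);
5. From \src\resources in the first zip's extracted directory, copy all folders and files into the project root (some of these files may not be needed);
6. From the second zip's extracted directory, copy the webapps directory into the project root;
7. Copy the project's Heritrix.properties file into the \src directory (found to be necessary in practice);
8. Right-click the project -> Build Path -> Add External Archives, go to the lib folder of either extracted directory, and select all the jars in it;
9. Copy the profiles folder from \src\conf\ in the first zip's extracted directory to the root of the drive that holds the Eclipse workspace;
10. Configure Heritrix.properties. There are four main settings, marked in red in the original listing: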
##############################################################################
# HERITRIX PROPERTIES
##############################################################################

# Properties with prefixes 'heritrix.', 'org.archive.', or 'system.' prefix
# get copied into System.properties on startup so available via
# System.getProperties. (For 'system.' properties, that prefix is stripped.
# (See Heritrix.loadProperties()).

# Version is filled in by the maven.xml pregoal. It copies here the project
# currentVersion property.
heritrix.version = 1.14.1

# Location of the heritrix jobs directory.
heritrix.jobsdir = jobs

# Default commandline startup values.
# Below values are used if unspecified on the command line.
heritrix.cmdline.admin = admin:admin
heritrix.cmdline.port = 8088
heritrix.cmdline.run = false
heritrix.cmdline.nowui = false
heritrix.cmdline.order =
heritrix.cmdline.jmxserver = false
heritrix.cmdline.jmxserver.port = 8081

##############################################################################
# LOGGING
##############################################################################

11. Start Heritrix in Eclipse: find the main class Heritrix.java in the org.archive.crawler package under the project's src, right-click it -> Run As -> Java Application, and Heritrix starts.
After startup, the Eclipse console shows:
08:32:15.468 EVENT Starting Jetty/4.2.23
08:32:15.734 WARN!! Delete existing temp dir C:\DOCUME~1\ycf\LOCALS~1\Temp\Jetty_127_0_0_1_8088__ for WebApplicationContext[/,jar:file:/E:/projects/eclipse_workspace/Heritrix1.14.1/webapps/admin.war!/]
08:32:16.171 EVENT Started WebApplicationContext[/,Heritrix Console]
08:32:16.609 EVENT Started SocketListener on 127.0.0.1:8088
08:32:16.609 EVENT Started org.mortbay.jetty.Server@137c60d
Heritrix version: 1.14.1
All of the copying above was done directly on the project inside Eclipse, which keeps changes to Eclipse's project configuration files to a minimum.
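For anyone unsure what the header comment in the properties file means in practice, here is a minimal sketch of the prefix rule it describes: keys starting with 'heritrix.' or 'org.archive.' are copied as they are, and keys starting with 'system.' are copied with that prefix stripped. This is not the code of Heritrix.loadProperties(); the class name, the method name copyPrefixed and the key system.example.flag are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper, NOT Heritrix's own implementation.
final class HeritrixPropertiesSketch {

    /** Applies the prefix rule described in the properties file header. */
    static Properties copyPrefixed(Properties source) {
        Properties copied = new Properties();
        for (String key : source.stringPropertyNames()) {
            String value = source.getProperty(key).trim();
            if (key.startsWith("system.")) {
                // 'system.' keys lose their prefix when copied.
                copied.setProperty(key.substring("system.".length()), value);
            } else if (key.startsWith("heritrix.") || key.startsWith("org.archive.")) {
                copied.setProperty(key, value);
            }
            // Any other key is ignored.
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        // A few lines from the file above, plus one hypothetical 'system.' key.
        String text = "heritrix.version = 1.14.1\n"
                + "heritrix.jobsdir = jobs\n"
                + "heritrix.cmdline.admin = admin:admin\n"
                + "heritrix.cmdline.port = 8088\n"
                + "system.example.flag = true\n";
        Properties loaded = new Properties();
        loaded.load(new StringReader(text));

        Properties copied = copyPrefixed(loaded);
        for (String key : copied.stringPropertyNames()) {
            System.out.println(key + " = " + copied.getProperty(key));
        }
    }
}
```

Printing the copied keys like this is a quick way to confirm that the port and admin login you edited are the ones actually being read.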
03:33:36.296 EVENT Started WebApplicationContext[/,Heritrix Console]
03:33:37.250 EVENT Started SocketListener on 127.0.0.1:8088
03:33:37.250 EVENT Started org.mortbay.jetty.Server@14a9972
Heritrix version: 1.10.1
Error: "null"
Fatal error: "Could not compile stylesheet"
05/21/2009 03:41:30 +0000 SEVERE org.archive.crawler.writer.ARCWriterProcessor getMetadataBody Failed transform javax.xml.transform.TransformerConfigurationException: Could not compile stylesheet
05/21/2009 03:41:30 +0000 SEVERE org.archive.io.arc.ARCWriter getMetadataLength Unsupported metadata type: null
05/21/2009 03:41:39 +0000 SEVERE org.archive.io.arc.ARCWriter getMetadataLength Unsupported metadata type: null
05/21/2009 03:41:39 +0000 SEVERE org.archive.io.arc.ARCWriter getMetadataLength Unsupported metadata type: null
05/21/2009 03:41:39 +0000 SEVERE org.archive.io.arc.ARCWriter getMetadataLength Unsupported metadata type: null
05/21/2009 03:41:39 +0000 SEVERE org.archive.io.arc.ARCWriter getMetadataLength Unsupported metadata type: null
05/21/2009 03:43:33 +0000 SEVERE org.archive.crawler.fetcher.FetchDNS storeDNSRecord Failed store of DNS Record for dns:esf.focus.cn
java.io.FileNotFoundException: D:\share\JavaProjects\hibernate\Heritrix\jobs\default-20090521033538062\scratch\tt32http.ris (The requested operation cannot be performed on a file with a user-mapped section open.)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
at java.io.FileOutputStream.<init>(FileOutputStream.java:70)
at org.archive.io.RecordingOutputStream.open(RecordingOutputStream.java:183)
at org.archive.io.RecordingOutputStream.open(RecordingOutputStream.java:151)
at org.archive.io.RecordingInputStream.open(RecordingInputStream.java:92)
at org.archive.util.HttpRecorder.inputWrap(HttpRecorder.java:148)
at org.archive.crawler.fetcher.FetchDNS.recordDNS(FetchDNS.java:245)
at org.archive.crawler.fetcher.FetchDNS.storeDNSRecord(FetchDNS.java:191)
at org.archive.crawler.fetcher.FetchDNS.innerProcess(FetchDNS.java:145)
at org.archive.crawler.framework.Processor.process(Processor.java:103)
at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:304)
at org.archive.crawler.framework.ToeThread.run(ToeThread.java:153)
05/21/2009 03:44:08 +0000 SEVERE org.archive.crawler.fetcher.FetchDNS storeDNSRecord Failed store of DNS Record for dns:tjimg.focus.cn
java.io.FileNotFoundException: D:\share\JavaProjects\hibernate\Heritrix\jobs\default-20090521033538062\scratch\tt32http.ris (The requested operation cannot be performed on a file with a user-mapped section open.)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
at java.io.FileOutputStream.<init>(FileOutputStream.java:70)
at org.archive.io.RecordingOutputStream.open(RecordingOutputStream.java:183)
at org.archive.io.RecordingOutputStream.open(RecordingOutputStream.java:151)
at org.archive.io.RecordingInputStream.open(RecordingInputStream.java:92)
at org.archive.util.HttpRecorder.inputWrap(HttpRecorder.java:148)
at org.archive.crawler.fetcher.FetchDNS.recordDNS(FetchDNS.java:245)
at org.archive.crawler.fetcher.FetchDNS.storeDNSRecord(FetchDNS.java:191)
at org.archive.crawler.fetcher.FetchDNS.innerProcess(FetchDNS.java:145)
at org.archive.crawler.framework.Processor.process(Processor.java:103)
at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:304)
at org.archive.crawler.framework.ToeThread.run(ToeThread.java:153)
05/21/2009 03:44:09 +0000 SEVERE org.archive.crawler.fetcher.FetchDNS storeDNSRecord Failed store of DNS Record for dns:mail.sohu.net
java.io.IOException: RIS already open for ToeThread #32: dns:mail.sohu.net
at org.archive.io.RecordingInputStream.open(RecordingInputStream.java:88)
at org.archive.util.HttpRecorder.inputWrap(HttpRecorder.java:148)
at org.archive.crawler.fetcher.FetchDNS.recordDNS(FetchDNS.java:245)
at org.archive.crawler.fetcher.FetchDNS.storeDNSRecord(FetchDNS.java:191)
at org.archive.crawler.fetcher.FetchDNS.innerProcess(FetchDNS.java:145)
at org.archive.crawler.framework.Processor.process(Processor.java:103)
at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:304)
at org.archive.crawler.framework.ToeThread.run(ToeThread.java:153)
05/21/2009 03:44:10 +0000 SEVERE org.archive.crawler.fetcher.FetchDNS storeDNSRecord Failed store of DNS Record for dns:key.go2map.com
java.io.IOException: RIS already open for ToeThread #32: dns:key.go2map.com
at org.archive.io.RecordingInputStream.open(RecordingInputStream.java:88)
at org.archive.util.HttpRecorder.inputWrap(HttpRecorder.java:148)
at org.archive.crawler.fetcher.FetchDNS.recordDNS(FetchDNS.java:245)
at org.archive.crawler.fetcher.FetchDNS.storeDNSRecord(FetchDNS.java:191)
at org.archive.crawler.fetcher.FetchDNS.innerProcess(FetchDNS.java:145)
at org.archive.crawler.framework.Processor.process(Processor.java:103)
at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:304)
at org.archive.crawler.framework.ToeThread.run(ToeThread.java:153)
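Check that it shows
Current selection: org.archive.crawler.scope.BroadScope and not something else.
So if you really know the answer, please explain it in detail. If you don't, that's fine too, just don't pretend you do.
Also: "when I set up the processor chain for a crawl I cannot choose the crawl options. For example, p. 312 of the book says to select Mirror as the Writer, but on that page I only get the default Arc setting and cannot pick anything else." What is going on here? Can anyone explain?
2. The second problem comes from your configuration. Edit order.xml, or use the UI, and change <integer name="recorder-out-buffer-bytes">4096</integer> and <integer name="recorder-in-buffer-bytes">65536</integer>, increasing both values.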
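If you would rather script the order.xml change than edit it by hand, a rough sketch using the JDK's built-in DOM and identity transformer could look like this. The two setting names are the ones quoted in the reply above. The <controller> wrapper is only a trimmed stand-in for a real order.xml, the new values 16384 and 262144 are arbitrary examples of "bigger", and the class and method names are made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Heritrix.
final class RecorderBufferSketch {

    /** Returns the XML with every <integer name="..."> of the given name set to value. */
    static String setInteger(String xml, String name, int value) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        NodeList integers = doc.getElementsByTagName("integer");
        for (int i = 0; i < integers.getLength(); i++) {
            Element element = (Element) integers.item(i);
            if (name.equals(element.getAttribute("name"))) {
                element.setTextContent(Integer.toString(value));
            }
        }

        // Serialize the modified DOM back to text with an identity transform.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Trimmed stand-in for the relevant part of order.xml (assumption).
        String xml = "<controller>"
                + "<integer name=\"recorder-out-buffer-bytes\">4096</integer>"
                + "<integer name=\"recorder-in-buffer-bytes\">65536</integer>"
                + "</controller>";

        // Example values only; pick sizes that suit your crawl.
        String updated = setInteger(xml, "recorder-out-buffer-bytes", 16384);
        updated = setInteger(updated, "recorder-in-buffer-bytes", 262144);
        System.out.println(updated);
    }
}
```

Changing the same two fields through the web UI has the same effect. Back up the real order.xml before rewriting it with a script.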
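1) Select Crawl Scope: Crawl Scope sets the range within which links are crawled. For example, BroadScope means the crawl range is unrestricted, while HostScope limits the crawl to the current host. Here we choose org.archive.crawler.scope.BroadScope and click the Change button on the right to save the setting.
6) Select Writers: this sets the form in which the crawled data is written to disk. One option is the compressed format (Arc), the other is mirror mode (Mirror). Here we choose the simple and intuitive mirror mode: org.archive.crawler.writer.MirrorWriterProcessor.
Use the "Add" button to add org.archive.crawler.writer.MirrorWriterProcessor.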
http://www.ibm.com/developerworks/cn/opensource/os-cn-heritrix/images/image013.jpg