Java WordPress scraper is now officially open source; it supports data scraping from over 80% of websites.

Well, I have finally decided to open-source this project for the benefit of WordPress users everywhere.
Project URL: http://code.google.com/p/jwp/
My blog: http://www.ij2ee.com/
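To use this project to scrape articles without trouble, you must be able to analyze web page source code. You must also know how to use the following tools:
Firebug or IE Developer Tools
Regular expressions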
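Usage:
1. Write a class that extends CommonParser and implements two methods:
    getTargetConF: the element that contains the article body (for example a div or p element). It is implemented with htmlparser's HasAttributeFilter("id","context"), which looks for a specific attribute such as class, id, or name. The matched element must be unique on the page.
    getTargetDivF: the element that contains the article list. It is also implemented with htmlparser's HasAttributeFilter("id","context"), which looks for a specific attribute such as class, id, or name. The matched element must be unique on the page.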
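
Since both filters need an attribute value that occurs only once on the page, it can help to check that before writing the parser class. The following is only a rough sketch using the Java standard library, not htmlparser and not part of the jwp project; the class name UniqueAttributeCheck and the method countMatches are made up for illustration, and the regular expression only handles simple, quoted attribute values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of jwp): checks how often attr="value" occurs in a page. */
class UniqueAttributeCheck {

    /** Counts opening tags in html that carry attr="value" (double or single quoted). */
    static int countMatches(String html, String attr, String value) {
        Pattern p = Pattern.compile(
                "<[a-zA-Z][^>]*\\s" + Pattern.quote(attr)
                + "\\s*=\\s*([\"'])" + Pattern.quote(value) + "\\1[^>]*>");
        Matcher m = p.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Sample markup, invented for this example.
        String page = "<div id=\"nav\"><a class=\"x\">home</a></div>"
                + "<div id=\"context\"><p>Article body</p></div>";
        int hits = countMatches(page, "id", "context");
        // A count other than 1 means the attribute is a poor choice for the filter.
        System.out.println("id=context occurs " + hits + " time(s)");
    }
}
```

If the count is not 1, pick a different attribute (class, name, and so on) in Firebug before wiring it into HasAttributeFilter.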

2. Next, configure the relevant class properties in WPMover2\src\com\wpmover2\spring\core.xml:
<?xml version="1.0" encoding="UTF-8"?>
<beans
xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:p="http://www.springframework.org/schema/p"
xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.5.xsd">
<bean id="commonP" class="com.wpmover2.parser.CommonParser" abstract="true"></bean>
<bean id="ij2ee" class="com.wpmover2.all.ij2ee.IJ2eeParser" parent="commonP">

<property name="targetDomain" value="http://www.alixixi.com"/>

<property name="targetUrl" value="http://www.alixixi.com/program/c/php_{0}.shtml"></property>

<property name="MAX_PAGE" value="140"></property>

<property name="start_page" value="135"></property>

<property name="ENCODING" value="gb2312"></property>

<property name="artUrlMatch" value="http://www.alixixi.com/program/a/\d+.shtml"/>

<property name="mydomain" value="http://www.ij2ee.com/"></property>

<property name="targetPlaceTxt" value="_PHP教程_编程技术"></property>


<property name="useHtml" value="0"></property>


<property name="isTest" value="0"></property>

<property name="user" value=""/>

<property name="pwd" value=""/>

<property name="keyword" value="java"/>
</bean>
</beans>
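
To make the targetUrl, start_page, MAX_PAGE, and artUrlMatch properties above more concrete, here is a small sketch of how they might fit together. This is a guess based on the property names, not the project's actual code: it assumes that {0} is replaced with the page number, that pages run from start_page to MAX_PAGE inclusive, and that artUrlMatch is a Java regular expression that must match the whole link. The class ConfigSketch and its methods are hypothetical.

```java
import java.text.MessageFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration of the config values; not taken from the jwp source. */
class ConfigSketch {

    /** Expands the {0} placeholder for each page from startPage to maxPage (inclusive, assumed). */
    static List<String> listPageUrls(String targetUrl, int startPage, int maxPage) {
        List<String> urls = new ArrayList<>();
        for (int page = startPage; page <= maxPage; page++) {
            // Pass the number as a String so MessageFormat does not add
            // locale-specific grouping separators to large page numbers.
            urls.add(MessageFormat.format(targetUrl, String.valueOf(page)));
        }
        return urls;
    }

    /** Returns true if the whole link matches the artUrlMatch pattern (full match assumed). */
    static boolean isArticleUrl(String artUrlMatch, String url) {
        return Pattern.matches(artUrlMatch, url);
    }

    public static void main(String[] args) {
        // Values copied from the core.xml shown above.
        String targetUrl = "http://www.alixixi.com/program/c/php_{0}.shtml";
        String artUrlMatch = "http://www.alixixi.com/program/a/\\d+.shtml";

        for (String listPage : listPageUrls(targetUrl, 135, 140)) {
            System.out.println(listPage);
        }
        // The article URL below is an invented example.
        System.out.println(isArticleUrl(artUrlMatch, "http://www.alixixi.com/program/a/12345.shtml"));
    }
}
```

Note that the dots in artUrlMatch are not escaped, so each one matches any character. That is harmless for this site, but a stricter pattern would write them as \. instead.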


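Heh, scrapers are nothing new. Most of them cost money; tools such as 狂人 and 火车头 are all paid software. Just treat mine as something for playing around with Java.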