How to scrape data on sold items from Taobao

How a PHP scraper that really works is put together!!
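I needed to write a simple PHP scraper, so as usual I went looking for tutorials online and tried to copy them. It turned out the tutorials were all half right at best, and not one of them actually worked. After several days of head-scratching I finally understood what is going on. I am writing it down here, and corrections from experts are welcome.

The idea behind a scraper is simple: open a page (usually a list page), collect the addresses of all the links on it, then open the links one by one and look for whatever we are interested in. When we find it, we store it in the database or process it some other way. Here is a very simple example.

First pick the page to scrape, usually a list page. The target here is http://www.zhanu.com/article/11/index.htm. It is a list page, and our goal is to scrape every article listed on it.

With the list page chosen, the first step is to open it and pull its content into our program. This is usually done with fopen or file_get_contents; fopen is used here. How do we open it? Easy:

$source = fopen("http://www.zhanu.com/article/11/index.htm", "r");

At this point the content is already within reach of our program. Note that $source is a resource, not text we can work with, so we use fread to read the content into a variable, and that gives us real, editable text. For example:

$content = fread($source, 99999);

The number is the maximum number of bytes to read. Be aware that on a network stream a single fread call can return fewer bytes than requested, so for anything but a small page, read in a loop until feof($source) returns true, or use stream_get_contents($source). If you write $content to a text file with file_put_contents, you can see that it is simply the source of the web page.

Once we have the page source, we need to pick out the article links in it, and this is where regular expressions come in [a recommended regular expression tutorial: http://www.zhanu.com/article/7/all/545.1.htm]. Looking at the source, we can see that every article link has this form:

<div class="in_arttitle"><a href="http://www.zhanu.com/article/10/all/273.1.htm">　　将数据库连接代码封装在函数里，在需要读取时调用..</a>

So we can write the regular expression:

$count = preg_match_all("/<div class=\"in_arttitle\"><a\shref=\"(.+?)\">(.+?)<\/a>/", $content, $art_list);

The array element $art_list[1][$s] holds the link address of an article, and $art_list[2][$s] holds that article's title. At this point we are halfway there. Next, a for loop opens each link in turn, and the content is extracted the same way the title was.

All of this is much the same as the tutorials I found online. But when it comes to the for loop, those tutorials fall apart; I never found one that explained it properly. At first I used JS to help with the looping. Let me show it with the actual code. At first I did it like this: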
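for ($i = 0; $i < 20; $i++) {
    // the part that scrapes the content goes here (omitted)
}
After one page has been scraped, the next one obviously has to be scraped too, but opening further links with fopen no longer worked; the requests failed, and JS did not help either. In the end I found that this line is what is needed:

echo "<META HTTP-EQUIV=REFRESH CONTENT='0;URL=aa.php?id=1'>";

Here aa.php is the file name of our own script, and the number after id is what lets us loop and scrape multiple pages. This is the key to making the loop really work.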
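The refresh trick can be sketched as follows: each request scrapes one page, then prints a refresh tag that sends the browser to the same script with the next id. This is only a minimal sketch; the helper name next_page_tag and the page count of 20 are assumptions for illustration, and the scraping step itself is left out.

```php
<?php
// Hypothetical helper: builds the refresh tag that sends the browser on
// to the next page id, or returns '' once the last page has been handled.
// $script is assumed to be a plain file name such as 'aa.php'.
function next_page_tag($script, $id, $lastId) {
    if ($id >= $lastId) {
        return '';
    }
    return "<META HTTP-EQUIV=REFRESH CONTENT='0;URL=" . $script . "?id=" . ($id + 1) . "'>";
}

// Read the current page id from the query string (defaults to 1).
$id = isset($_GET['id']) ? max(1, (int) $_GET['id']) : 1;

// ... scrape page number $id here ...

// Hand over to the next request; 20 is an assumed page count.
echo next_page_tag('aa.php', $id, 20);
```

Because the tag is empty once $id reaches the last page, the chain of requests stops on its own.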
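My head hurts a bit and this came out somewhat messy, so bear with it. To an expert this is probably no big deal, but for a beginner like me it is a real help.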

Solutions »


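A PHP scraper

The simplest way to think about a scraper is: fetch the page source, analyze the source, extract the parts you need, write them to the database.

PHP is actually not a great choice for writing a scraper, because it does not support multithreading, and scraping without multithreading is painful. You can, however, use frames and the like to have several pages scraped at the same time, which speeds things up. I will not discuss multithreading here; I will only show how to do simple scraping with PHP.

First pick the target: http://cn.jokes.yahoo.com/jok/index.html
This is Yahoo's jokes section, and I will use it for the walkthrough.

Analyzing the page first, we can see that the links look like this:

<img src="http://cn.yimg.com/i/cn/px_ar.gif" width=5 height=12 border=0 hspace=5><a href="http://cn.jokes.yahoo.com/07-07-/55/27lot.html" class=list target=_blank><big>头发与智慧</big></a>

Expressed as a regular expression, with the slashes and dots escaped, that is:

/hspace=5><a href="http:\/\/cn\.jokes\.yahoo\.com\/(.*)\.html" class=list target=_blank>/isU

Write the PHP code: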
// URL of the list page to scrape
$url = "http://cn.jokes.yahoo.com/jok/index.html";
// Fetch the page source
$r = file_get_contents($url);
// Pattern that matches the joke links (slashes and dots escaped)
$preg = '/hspace=5><a href="http:\/\/cn\.jokes\.yahoo\.com\/(.*)\.html" class=list target=_blank>/isU';
// Run the regex search
preg_match_all($preg, $r, $title);
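With the code above, $title[1][num] holds the path part of each link. Next, analyzing the content page gives this pattern for the content:

/<div id="newscontent">(.*)<\/div>/isU

Continuing the code: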
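// Count the matched links
$count = count($title[1]);
// Scrape each content page in turn
for ($i = 0; $i < $count; $i++) {
   // Build the content page URL
   $jurl = "http://cn.jokes.yahoo.com/" . $title[1][$i] . ".html";
   // Fetch the content page source
   $c = file_get_contents($jurl);
   // Pattern that matches the content block
   $p = '/<div id="newscontent">(.*)<\/div>/isU';
   // Run the regex match and skip pages where nothing matches
   if (!preg_match($p, $c, $content)) {
      continue;
   }
   // Print the captured link path (the list pattern does not capture the title text)
   echo $title[1][$i] . "\n";
   // Print the content; capture group 1 holds the text inside the div
   echo $content[1];
}
With that, a simple scraping tool is written; other features only need further refinement. The complete code: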
<?php
// URL of the list page to scrape
$url = "http://cn.jokes.yahoo.com/jok/index.html";
// Fetch the page source
$r = file_get_contents($url);
// Pattern that matches the joke links (slashes and dots escaped)
$preg = '/hspace=5><a href="http:\/\/cn\.jokes\.yahoo\.com\/(.*)\.html" class=list target=_blank>/isU';
// Run the regex search
preg_match_all($preg, $r, $title);
// Count the matched links
$count = count($title[1]);
// Scrape each content page in turn
for ($i = 0; $i < $count; $i++) {
   // Build the content page URL
   $jurl = "http://cn.jokes.yahoo.com/" . $title[1][$i] . ".html";
   // Fetch the content page source
   $c = file_get_contents($jurl);
   // Pattern that matches the content block
   $p = '/<div id="newscontent">(.*)<\/div>/isU';
   // Run the regex match and skip pages where nothing matches
   if (!preg_match($p, $c, $content)) {
      continue;
   }
   // Print the captured link path (the list pattern does not capture the title text)
   echo $title[1][$i] . "\n";
   // Print the content; capture group 1 holds the text inside the div
   echo $content[1];
}
?>
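The two patterns can be checked without any network access by running them against literal HTML. The sketch below does that; the list markup is copied from the link format quoted earlier, while the helper names extract_paths and extract_content and the content page string are made up for illustration. It uses # as the pattern delimiter, which is one way to avoid escaping the slashes in the URL.

```php
<?php
// Hypothetical helper: returns the path part of every joke link in $html.
function extract_paths($html) {
    // '#' as the delimiter means the slashes in the URL need no escaping.
    $preg = '#hspace=5><a href="http://cn\.jokes\.yahoo\.com/(.*)\.html" class=list target=_blank>#isU';
    preg_match_all($preg, $html, $m);
    return $m[1];
}

// Hypothetical helper: returns the text inside <div id="newscontent">,
// or null when the page has no such block.
function extract_content($html) {
    if (preg_match('#<div id="newscontent">(.*)</div>#isU', $html, $m)) {
        return $m[1];
    }
    return null;
}

// List markup in the format quoted in the post.
$list = '<img src="http://cn.yimg.com/i/cn/px_ar.gif" width=5 height=12 border=0 hspace=5>'
      . '<a href="http://cn.jokes.yahoo.com/07-07-/55/27lot.html" class=list target=_blank><big>头发与智慧</big></a>';
// Made-up content page for illustration.
$page = '<div id="newscontent">sample joke text</div>';

print_r(extract_paths($list));
var_dump(extract_content($page));
```

Keeping the matching in small functions like these makes it easy to adjust a pattern when the target page's markup changes.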
I have been doing some PHP scraping lately; here are a few useful functions.
// Get the URL of the current script
function get_php_url(){
        if(!empty($_SERVER["REQUEST_URI"])){
                $scriptName = $_SERVER["REQUEST_URI"];
                $nowurl = $scriptName;
        }else{
                $scriptName = $_SERVER["PHP_SELF"];
                if(empty($_SERVER["QUERY_STRING"])) $nowurl = $scriptName;
                else $nowurl = $scriptName."?".$_SERVER["QUERY_STRING"];
        }
        return $nowurl;
}
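// Convert full-width digits to half-width digits
function GetAlabNum($fnum){
        $nums = array("０","１","２","３","４","５","６","７","８","９");
        $fnums = "0123456789";
        for($i=0;$i<=9;$i++) $fnum = str_replace($nums[$i],$fnums[$i],$fnum);
        // Drop everything except digits and dots, plus leading zeros
        // (preg_replace stands in for ereg_replace, which was removed in PHP 7)
        $fnum = preg_replace("/[^0-9.]|^0+/","",$fnum);
        if($fnum=="") $fnum=0;
        return $fnum;
}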
// Convert plain text to HTML: escape angle brackets and turn line breaks into <br/>
function Text2Html($txt){
        $txt = str_replace("  ","　",$txt);
        $txt = str_replace("<","&lt;",$txt);
        $txt = str_replace(">","&gt;",$txt);
        $txt = preg_replace("/[\r\n]{1,}/isU","<br/>\r\n",$txt);
        return $txt;
}
// Neutralize HTML tags by escaping the angle brackets
function ClearHtml($str){
        $str = str_replace('<','&lt;',$str);
        $str = str_replace('>','&gt;',$str);
        return $str;
}
// Convert root-relative paths to absolute URLs
function relative_to_absolute($content, $feed_url) {
    // Capture the scheme, e.g. "http://"
    preg_match('/(http|https|ftp):\/\//', $feed_url, $protocol);
    // Strip the scheme, then everything from the first slash on, leaving the host
    $server_url = preg_replace("/(http|https|ftp|news):\/\//", "", $feed_url);
    $server_url = preg_replace("/\/.*/", "", $server_url);
    if ($server_url == '') {
        return $content;
    }
    if (isset($protocol[0])) {
        // Prefix href="/..." and src="/..." with the scheme and host
        $new_content = preg_replace('/href="\//', 'href="'.$protocol[0].$server_url.'/', $content);
        $new_content = preg_replace('/src="\//', 'src="'.$protocol[0].$server_url.'/', $new_content);
    } else {
        $new_content = $content;
    }
    return $new_content;
}
// Get all links
function get_all_url($code){
        preg_match_all('/<a\s+href=["|\']?([^>"\' ]+)["|\']?\s*[^>]*>([^>]+)<\/a>/i',$code,$arr);
        return array('name'=>$arr[2],'url'=>$arr[1]);
}
// Get the content between two markers
function get_tag_data($str, $start, $end){
        if ( $start == '' || $end == '' ){
               return;
        }
        $str = explode($start, $str);
        // Nothing to return when the start marker is absent
        if (!isset($str[1])) {
               return;
        }
        $str = explode($end, $str[1]);
        return $str[0];
}
// Convert each row of an HTML table into a CSV-style string
function get_tr_array($table) {
        $table = preg_replace("'<td[^>]*?>'si",'"',$table);
        $table = str_replace("</td>",'",',$table);
        $table = str_replace("</tr>","{tr}",$table);
        // Strip HTML tags
        $table = preg_replace("'<[/!]*?[^<>]*?>'si","",$table);
        // Strip whitespace
        $table = preg_replace("'([\r\n])[\s]+'","",$table);
        $table = str_replace(" ","",$table);
        $table = str_replace("&nbsp;","",$table);
        $table = explode(",{tr}",$table);
        array_pop($table);
        return $table;
}
// Convert every row and column of an HTML table into an array (for scraping table data)
function get_td_array($table) {
        $table = preg_replace("'<table[^>]*?>'si","",$table);
        $table = preg_replace("'<tr[^>]*?>'si","",$table);
        $table = preg_replace("'<td[^>]*?>'si","",$table);
        $table = str_replace("</tr>","{tr}",$table);
        $table = str_replace("</td>","{td}",$table);
        // Strip HTML tags
        $table = preg_replace("'<[/!]*?[^<>]*?>'si","",$table);
        // Strip whitespace
        $table = preg_replace("'([\r\n])[\s]+'","",$table);
        $table = str_replace(" ","",$table);
        $table = str_replace("&nbsp;","",$table);

        $table = explode('{tr}', $table);
        array_pop($table);
        // Start from an empty array so a table without rows still returns an array
        $td_array = array();
        foreach ($table as $key=>$tr) {
                $td = explode('{td}', $tr);
                array_pop($td);
                $td_array[] = $td;
        }
        return $td_array;
}
// Return all the words in a string; $distinct=true removes duplicates
function split_en_str($str,$distinct=true) {
        preg_match_all('/([a-zA-Z]+)/',$str,$match);
        if ($distinct == true) {
                $match[1] = array_unique($match[1]);
        }
        sort($match[1]);
        return $match[1];
}
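The table helpers above work by regex substitution. As an alternative, not the approach used in the functions above, the same rows and cells can be read with PHP's bundled DOM extension, which copes better with attributes and nesting. A minimal sketch, assuming the dom extension is enabled; the function name table_to_array is made up for illustration, and character encoding handling is left out.

```php
<?php
// Hypothetical helper: returns the table as an array of rows,
// each row being an array of trimmed cell texts.
function table_to_array($html) {
    // Collect parser warnings internally instead of printing them;
    // real-world HTML is rarely fully valid.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $rows = array();
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = array();
        foreach ($tr->childNodes as $node) {
            // Only direct td/th children count as cells; text nodes are skipped.
            if ($node->nodeName === 'td' || $node->nodeName === 'th') {
                $cells[] = trim($node->textContent);
            }
        }
        $rows[] = $cells;
    }
    return $rows;
}

$sample = '<table><tr><td>a</td><td> b </td></tr><tr><td>c</td><td>d</td></tr></table>';
print_r(table_to_array($sample));
```

Unlike the regex version, this keeps spaces inside a cell and only trims the ends, which may or may not be what you want for your data.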
I gave it a try. Taobao requires authentication: file_get_contents('http://trade.taobao.com/trade/itemlist/list_sold_items.htm') is the address I need to scrape for shipping, and even after I log in, scraping that page just redirects to the verification page.
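The request made by file_get_contents carries none of the browser's login cookies, so the server sees a visitor who is not logged in. One way to confirm that a redirect happened is to look at the response headers, which PHP's HTTP wrapper places in $http_response_header after the call. A minimal sketch; the helper name find_redirect and the sample header lines are made up for illustration.

```php
<?php
// Hypothetical helper: given response header lines in the format PHP puts
// into $http_response_header, returns the last Location target, or null
// when no redirect occurred.
function find_redirect(array $headers) {
    $target = null;
    foreach ($headers as $line) {
        // Header names are case-insensitive, hence stripos.
        if (stripos($line, 'Location:') === 0) {
            $target = trim(substr($line, strlen('Location:')));
        }
    }
    return $target;
}

// Made-up header lines for illustration; after a real file_get_contents()
// call over HTTP, pass $http_response_header instead.
$sample = array(
    'HTTP/1.1 302 Found',
    'Location: https://login.example.com/?redirect=list_sold_items',
    'HTTP/1.1 200 OK',
    'Content-Type: text/html',
);
echo find_redirect($sample), "\n";
```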
This is not something any scraping system can do. Go to the Taobao Open Platform and look for the relevant TOP API 2.0 interface.