Regular expressions

http://onlinebooks.library.upenn.edu/webbin/book/search?author=&amode=words&title=history&tmode=words
I want to extract the links on this page and the black text that follows each one.
Thanks.

Solutions

Which part counts as the link? And what do you mean by the black text?
Use the parse_url function.
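For reference, here is a minimal sketch of what parse_url returns for the search URL in the question. Note that parse_url only splits a single URL into its components; it does not find links inside an HTML page. The variable names are illustrative.

```php
<?php
// Split the search URL from the question into its components.
$url = 'http://onlinebooks.library.upenn.edu/webbin/book/search?author=&amode=words&title=history&tmode=words';
$parts = parse_url($url);

echo $parts['scheme'], "\n"; // http
echo $parts['host'], "\n";   // onlinebooks.library.upenn.edu
echo $parts['path'], "\n";   // /webbin/book/search

// parse_str() decodes the query string into an associative array.
parse_str($parts['query'], $query);
print_r($query);
```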
http://www.scls.lib.wi.us/mcm/taylor/index.html is the link.
by Theodore Asa Taylor (illustrated HTML at McMillan Memorial Library) is the black text.
$c = file_get_contents('http://onlinebooks.library.upenn.edu/webbin/book/search?author=&amode=words&title=history&tmode=words');
$arr = array();
// The /e modifier was deprecated in PHP 5.5 and removed in PHP 7, so use preg_replace_callback.
preg_replace_callback("/<a\s+href\s*=\s*['\"]([^'\">]*)['\"][^>]*><cite>([^>]*)<\/cite><\/a>\s*,([^<]*)/is", 'foo', $c);
// Callback: $m[1] is the url, $m[2] the title, $m[3] the author text.
function foo($m)
{
    global $arr;
    $arr[] = array('url' => $m[1], 'title' => $m[2], 'author' => $m[3]);
    return $m[0]; // return the match unchanged; we only collect the captures
}
echo "<pre>";
print_r($arr);
// Drops the query string from $url; preg_replace returns the result, so assign it.
$url = preg_replace('/^(https?:\/\/)?([\w\.]+)(\/[^\s\?]+)*(\?[^\s]*)/i','\\1\\2\\3', $url);
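To make the effect of this pattern concrete, here is a small sketch that applies it to the search URL from the question. The explode() alternative is my own addition, not something proposed in the thread. Note that the host part `[\w\.]+` does not accept hyphens or port numbers, so such URLs would not match and would come back unchanged.

```php
<?php
$url = 'http://onlinebooks.library.upenn.edu/webbin/book/search?author=&amode=words&title=history&tmode=words';

// Keep scheme (\1), host (\2) and path (\3); drop the query string (group 4).
$stripped = preg_replace('/^(https?:\/\/)?([\w\.]+)(\/[^\s\?]+)*(\?[^\s]*)/i', '\\1\\2\\3', $url);
echo $stripped, "\n";

// A simpler alternative (assumption: everything after the first "?" can go).
$simple = explode('?', $url, 2)[0];
echo $simple, "\n";
```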
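To the poster in reply #4: could you split the url and author fields apart? That is, write two regular expressions, one for the url and one for the author.
I don't need the title.
If the original poster really knows this stuff, let me ask a question: how can a single page contain two different encodings, for example GB and UTF-8? I mean one page, not two separately encoded pages.
For example, if 中国 is GB-encoded and 共产党 is UTF-8-encoded, how can I get the page to display 中国共产党 without garbled characters?
Everyone, I need two regexes, one for the url and one for the author.
Come on! Fetch the content, then use preg_match_all to capture the matches into an array.
Then just create two separate arrays, put \\1 (url) into one and \\3 (author) into the other. Why split it into two regexes?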
// $m[1] collects every url, $m[2] every title, $m[3] every author line.
preg_match_all("/<a\s+href\s*=\s*['\"]([^'\">]*)['\"][^>]*><cite>([^>]*)<\/cite><\/a>\s*,([^<]*)/i",$c,$m);
echo "<pre>";
echo "url===================== ";
print_r($m[1]);
echo "author===================== ";
print_r($m[3]);
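The two-array idea can be tried without fetching the live page. The sketch below runs the same kind of pattern over a hypothetical two-entry sample: the first entry reuses the link and author line quoted earlier in the thread, while the titles and the second entry are invented for illustration, and the real page markup may differ.

```php
<?php
// Hypothetical sample markup; titles and the second entry are made up.
$c = '<li><a href="http://www.scls.lib.wi.us/mcm/taylor/index.html"><cite>Example Title One</cite></a>, by Theodore Asa Taylor (illustrated HTML at McMillan Memorial Library)</li>'
   . '<li><a href="http://example.org/two.html"><cite>Example Title Two</cite></a>, by Example Author (HTML at example.org)</li>';

// One regex, three capture groups: url, title, author.
preg_match_all('/<a\s+href\s*=\s*[\'"]([^\'">]*)[\'"][^>]*><cite>([^<]*)<\/cite><\/a>\s*,([^<]*)/i', $c, $m);

$urls    = $m[1];                    // all captured hrefs
$authors = array_map('trim', $m[3]); // all captured author lines, in the same order as $urls

print_r($urls);
print_r($authors);
```

Because both arrays come from the same match list, `$urls[$i]` and `$authors[$i]` always belong to the same entry.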
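Do you really insist on two regexes? Then just call preg_match_all twice with the same regex.
preg_replace_callback is equivalent to preg_replace with the e modifier. The e modifier was deprecated in PHP 5.5 and removed in PHP 7, so preg_replace_callback is the one to use in current code.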
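A short sketch of that equivalence, using a made-up input string: the callback receives the match array and its return value replaces the matched text, which is what the expression string did under the e modifier.

```php
<?php
// Hypothetical input, not taken from the real page.
$html = '<cite>a history of example</cite>, by Some Author';

// Old style, no longer available in PHP 7+:
// preg_replace('/<cite>([^<]*)<\/cite>/ie', "'<cite>'.strtoupper('\\1').'</cite>'", $html);

// Same transformation with a callback: upper-case the text inside <cite>.
$out = preg_replace_callback(
    '/<cite>([^<]*)<\/cite>/i',
    function ($m) {
        // $m[0] is the whole match, $m[1] the first capture group.
        return '<cite>' . strtoupper($m[1]) . '</cite>';
    },
    $html
);

echo $out, "\n";
```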
preg_match_all("/<a\s+href\s*=\s*[\'|\"]([^\'\"\>]*)[\'|\"|][^>]*><cite>([^>]*)<\/cite><\/a>\s*,([^<]*)/i",$c,$m);
echo "<pre/>";
echo "url===================== ";
print_r($m[1]);
echo "author===================== ";
print_r($m[3]);
The output is empty.
Did you forget to copy the file_get_contents line?
...
I tested it locally and it returns results.
$c = file_get_contents('http://onlinebooks.library.upenn.edu/webbin/book/search?author=&amode=words&title=history&tmode=words');
preg_match_all("/<a\s+href\s*=\s*['\"]([^'\">]*)['\"][^>]*><cite>([^>]*)<\/cite><\/a>\s*,([^<]*)/i",$c,$m);
echo "<pre>";
echo "url===================== ";
print_r($m[1]);
echo "author===================== ";
print_r($m[3]);
What is missing? Inspect the HTML of the page you are scraping with Firebug yourself; the markup is mostly quite regular.
$c = file_get_contents('http://onlinebooks.library.upenn.edu/webbin/book/search?author=&amode=words&title=history&tmode=words');
// Looking closely at the page you gave, some titles have no url,
// so the opening <a ...> tag is made optional here.
preg_match_all("/(<a\s+href\s*=\s*['\"]([^'\">]*)['\"][^>]*>)?<cite>([^>]*)<\/cite>([^<]*<\/a>,?\s*.[^<]*|[^<]*)/is",$c,$m);
echo "<pre>";
echo "url===================== ";
print_r($m[2]);
echo "author===================== ";
// Note: for linked titles, $m[4] begins with the closing </a> tag and the comma.
print_r($m[4]);
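The optional-link case can be checked on a small sample. The sketch below uses a variation of the pattern (my own, not the poster's) in which the link is an optional non-capturing wrapper, so the author group never includes the closing tag. The sample markup is hypothetical.

```php
<?php
// Hypothetical sample: one title with a link, one without.
$c = '<li><a href="http://example.org/one.html"><cite>Linked Title</cite></a>, by Author A</li>'
   . '<li><cite>Unlinked Title</cite>, by Author B</li>';

// Group 1: url (empty string when there is no link), group 2: title, group 3: author.
$pattern = '/(?:<a\s+href\s*=\s*[\'"]([^\'">]*)[\'"][^>]*>)?<cite>([^<]*)<\/cite>(?:<\/a>)?\s*,?\s*([^<]*)/i';
preg_match_all($pattern, $c, $m);

print_r($m[1]); // urls
print_r($m[3]); // authors
```

Entries without a link still appear in the result, with an empty string in the url array, so the url and author arrays stay aligned.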