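Help! How do I extract a table block with a regex?

An HTML file contains several tables, and I want to extract the one that contains a certain keyword, but I can't get it to work. Could someone take a look? The HTML string to process is as follows: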
------------------------------
$sTmp = '
<body bgcolor="#FFFFFF" link="#0000FF" vlink="#0000FF" alink="#FF0000" leftmargin="0" topmargin="0">

<table border="0" align="center" cellpadding="0" cellspacing="0">
<tr>
<td valign="top"> <table border="0" cellpadding="0" cellspacing="0" background="../images/default05_ent_26.gif">
<tr>
<td ><IMG SRC="../images/default05_ent_06.gif" border="0"></td>
<td><div align="center"><a href="../ent.htm" class="whitec11"><strong>首页</strong></a>
<font color="#FFFFFF">|</font> <a href="../zhi_nan/ZN_default.htm" class="whitec11"><strong>招生指南</strong></a>
<font color="#FFFFFF">|</font> <a href="../zhuan_ye/ZSZY_default.htm" class="whitec11"><strong>招生专业</strong></a>
<font color="#FFFFFF">|</font> <a href="../fu_dao_ban/FDB_default.htm" class="whitec11"><strong>辅导班</strong></a>
<font color="#FFFFFF">|</font><a href="../xing_xi/XX_default.htm" class="whitec11"><strong>
最新信息</strong></a></div></td>
<td ><IMG SRC="../images/default05_ent_08.gif" ></td>
</tr>
</table>
<table border="0" align="center" cellpadding="2" cellspacing="0">
<tr>
<td><div align="right">您现在的位置：<a href="http://www.sjtuce.net" target="_blank">上海交通大学继续教育学院成人教育部</a>
<span class="bluearrow">>></span> 成人高考辅导班 <span class="bluearrow">>></span>
<a href="#">招生专业</a></div></td>
</tr>
</table>
<table border="0" align="center" cellpadding="2" cellspacing="0">
<tr>
<td ><strong><img src="../zmages/arrow3.jpg" >
</strong></td>
';
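-----------------------------------
I want to extract the content of the table that contains the keyword “招生指南”. The extracted content should not include the <table> and </table> tags themselves.
I worked on this for a long time and came up with the pattern below, but it still extracts several tables at once.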
$sRules ='/<(div|table)[^<]*>{1}([\s\S]+招生指南[\s\S]+)(<\/\\1>){1}/i';
preg_match_all($sRules, $sTmp, $aResult, PREG_PATTERN_ORDER);
print_r($aResult);
With a regex, I want to get a fragment like this, but I can't get the pattern right:
<table border="0" cellpadding="0" cellspacing="0" background="../images/default05_ent_26.gif">
<tr>
<td ><IMG SRC="../images/default05_ent_06.gif" border="0"></td>
<td><div align="center"><a href="../ent.htm" class="whitec11"><strong>首页</strong></a>
<font color="#FFFFFF">|</font> <a href="../zhi_nan/ZN_default.htm" class="whitec11"><strong>招生指南</strong></a>
<font color="#FFFFFF">|</font> <a href="../zhuan_ye/ZSZY_default.htm" class="whitec11"><strong>招生专业</strong></a>
<font color="#FFFFFF">|</font> <a href="../fu_dao_ban/FDB_default.htm" class="whitec11"><strong>辅导班</strong></a>
<font color="#FFFFFF">|</font><a href="../xing_xi/XX_default.htm" class="whitec11"><strong>
最新信息</strong></a></div></td>
<td ><IMG SRC="../images/default05_ent_08.gif" ></td>
</tr>
</table>
or a block like this:
<div align="center"><a href="../ent.htm" class="whitec11"><strong>首页</strong></a>
<font color="#FFFFFF">|</font> <a href="../zhi_nan/ZN_default.htm" class="whitec11"><strong>招生指南</strong></a>
<font color="#FFFFFF">|</font> <a href="../zhuan_ye/ZSZY_default.htm" class="whitec11"><strong>招生专业</strong></a>
<font color="#FFFFFF">|</font> <a href="../fu_dao_ban/FDB_default.htm" class="whitec11"><strong>辅导班</strong></a>
<font color="#FFFFFF">|</font><a href="../xing_xi/XX_default.htm" class="whitec11"><strong>
最新信息</strong></a></div>
After extracting the block, I still need to process it further.

Solutions »


DOM? I'm working in PHP. How would I use DOM there? Isn't DOM a Windows component?
// Capture the text of every <strong> that follows a class="whitec11" link
$preg = '/class="whitec11">[^<]*<strong>([^<]+)<\/strong>/';
preg_match_all($preg, $sTmp, $a);
print_r($a);
Give this a try.
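Right, I know that, but the problem is that this string is not fixed; it could take any form. The region containing the keyword “招生指南” might not even be a table: it could be wrapped as <div>...招生指南....</div>. Can a regex extract the smallest table region closest to the keyword “招生指南”? There may be nested tables inside.
The suggestion above doesn't work; sorry, I left out a detail. The format of the source string to be filtered is not fixed. The only fixed thing is that the wrapping tag is either table or div, and when the keyword “首页” appears inside such a region, that whole table or div should be extracted. The table or div may be nested. What I want is the content of the table or div region closest to the keyword.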
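On the DOM question: PHP ships its own DOM extension (DOMDocument, DOMXPath), so a parser can do the "closest enclosing table or div" lookup without a regex. Below is a minimal sketch, not a tested solution for every page. The function name innermostBlock is made up for illustration, the input is assumed to be UTF-8, and the `<?xml encoding="UTF-8">` prefix is a commonly used hint for loadHTML() rather than anything from this thread.

```php
<?php
// Hypothetical helper: returns the nearest enclosing <table> or <div> element
// of the first text node that contains $keyword, or null if none is found.
function innermostBlock(string $html, string $keyword): ?DOMElement
{
    // Collect parser warnings internally; real-world HTML is rarely well formed.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Assumption: $html is UTF-8. The XML prolog hints that encoding to libxml.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()') as $text) {
        if (strpos($text->nodeValue, $keyword) === false) {
            continue;
        }
        // Walk upwards until the first table or div ancestor is reached.
        for ($node = $text->parentNode; $node !== null; $node = $node->parentNode) {
            if ($node instanceof DOMElement
                && in_array($node->nodeName, ['table', 'div'], true)) {
                return $node;
            }
        }
    }
    return null;
}
```

With the strings from this thread, something like `$el = innermostBlock($sTmp, '首页');` followed by `echo $el->ownerDocument->saveHTML($el);` should print the block including its own tags. Because the parser tracks nesting itself, nested tables are not a special case here.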
preg_match('/<table.*?>(.*?招生指南.*?)<\/table>/sm', $sTmp, $arr);
echo $arr[1];
That doesn't work either: the extracted result still starts with two <table> tags. Try it with this source string:
$sTmp = '
<table border="0" align="center" cellpadding="2" cellspacing="0">
<tr>  <td>您现在的位置</td> </tr>
</table>
<table cellspacing="0">
<tr>
<td valign="top"> <table border="0" cellpadding="0" cellspacing="0" background="../images/default05_ent_26.gif">
<tr>
<td><a href="../ent.htm" class="whitec11"><strong>首页</strong></a>
<a href="../xing_xi/XX_default.htm" class="whitec11"><strong>
最新信息</strong></a></div></td>
</tr>
</table>
<table border="0" align="center" cellpadding="2" cellspacing="0">
<tr>
<td><div align="right">您现在的位置：</td>
</tr>
</table>
  <table border="0" align="center" cellpadding="2" cellspacing="0">
<tr>
<td ><strong><img src="../zmages/arrow3.jpg" >
</strong></td>
';
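I don't know how to write a regex that says a certain string must not appear. I thought '<table[^>]*?>.*?<\/table>'
could extract the table part, but when there are tables both before and after, it grabs everything from the first <table to the last </table>,
so the result contains several tables. I also tested this: '/ <table.*?>[^(<table)]+?(.*?首页.*?) <\/table>/sm'
but it had no effect. I want no further “<table” to appear after the opening <table, which would guarantee that the smallest <table> region is extracted. So how do I write a regex that forbids a given string from appearing? Also, a question for the poster on floor 8: what does the trailing "/sm" mean?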
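A note on "forbidding a string": [^(<table)] is a character class, so it only excludes the single characters (, <, t, a, b, l, e and ), not the sequence "<table". One common way to exclude a sequence is a negative lookahead checked before every consumed character. The sketch below is only an illustration under assumptions: the helper name innermostByRegex is invented, and it assumes no nested table or div follows the keyword inside the wanted block and that the keyword does not sit after an already closed sibling block.

```php
<?php
// Hypothetical helper: returns the inner content (without the wrapping tags)
// of the smallest <table> or <div> that contains $keyword, or null.
function innermostByRegex(string $html, string $keyword): ?string
{
    $kw = preg_quote($keyword, '~');

    // Consume one character, but only if no new <table or <div starts here.
    $step = '(?:(?!<(?:table|div)\b).)';

    // \1 makes the closing tag match the opening one (table or div).
    $pattern = '~<(table|div)\b[^>]*>(' . $step . '*?' . $kw . $step . '*?)</\1>~is';

    return preg_match($pattern, $html, $m) === 1 ? $m[2] : null;
}
```

An outer table cannot match, because the engine would have to step over the inner "<table" to reach the keyword, and the lookahead refuses that. For anything beyond simple pages, an HTML parser is likely to be more robust than this pattern.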
Remove the spaces; the forum inserts spaces automatically.
preg_match('/<table.*?>(.*?招生指南.*?)<\/table>/sm', $sTmp, $arr);
echo $arr[1];
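You need to describe the problem more clearly. You said something about not wanting the table tags... I really thought you didn't want this <table>.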
preg_match('/(<table.*?>.*?招生指南.*?<\/table>)/sm', $sTmp, $arr);
echo $arr[1];
The one above is wrong; use this:
preg_match('/(<table(?:.(?<!<table))*招生指南.*?<\/table>)/sm', $sTmp, $arr);
echo $arr[1];
$sTmp = '
<table border="0">
<tr>  <td></td> </tr>
</table>
<table cellspacing="0">
<tr>
<td valign="top"> <table border="0">
<tr>
<td><a href="../ent.htm" class="whitec11"><strong>首页</strong></a></td>
</tr>
</table>
<table border="0">
<tr>
<td><div align="right">您现在的位置：</td>
</tr>
</table>
  <table cellspacing="0">
<tr>
<td ><strong><img src="../zmages/arrow3.jpg" >
</strong></td>
';
$sRules ='/(<table.*?>.*?首页.*?<\/table>)/sm';
preg_match_all($sRules, $sTmp, $aResult, PREG_PATTERN_ORDER);
print_r($aResult[1]);
Sorry, I didn't explain myself clearly. The code above is yours, and this is what it extracts:
<table border="0" align="center" cellpadding="2" cellspacing="0">
<tr>  <td>您现在的位置</td> </tr>
</table>
<table cellspacing="0">
<tr>
<td valign="top"> <table border="0" cellpadding="0" cellspacing="0" background="../images/default05_ent_26.gif">
<tr>
<td><a href="../ent.htm" class="whitec11"><strong>首页</strong></a>
<a href="../xing_xi/XX_default.htm" class="whitec11"><strong>
最新信息</strong></a></div></td>
</tr>
</table>
The result I want is:
<table border="0" cellpadding="0" cellspacing="0" background="../images/default05_ent_26.gif">
<tr>
<td><a href="../ent.htm" class="whitec11"><strong>首页</strong></a>
<a href="../xing_xi/XX_default.htm" class="whitec11"><strong>
最新信息</strong></a></div></td>
</tr>
</table>
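That is, I want the smallest <table>....</table> region.
I keep wondering how to rule out another <table > tag appearing after the opening <table >, so that only the smallest <table>....</table> region is extracted.
The one above is wrong; use this: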
preg_match('/(<table(?:.(?<!<table))*招生指南.*?<\/table>)/sm', $sTmp, $arr);
echo $arr[1];
http://cn.php.net/manual/en/reference.pcre.pattern.modifiers.php
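m (PCRE_MULTILINE) makes ^ and $ match at the start and end of every line instead of only at the start and end of the whole string; s (PCRE_DOTALL) makes the . match every character, including newlines.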
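A quick way to see what the two modifiers do is to run the same pattern with and without them. This is just a small demonstration with a made-up two-line string and variable names, not part of the replies above.

```php
<?php
$text = "first\nsecond";

// Without s, "." does not match the newline between the two words.
$dotDefault = preg_match('/first.second/', $text);
// With s (PCRE_DOTALL), "." matches the newline as well.
$dotAll = preg_match('/first.second/s', $text);

// Without m, ^ only matches at the very start of the string.
$anchorDefault = preg_match('/^second$/', $text);
// With m (PCRE_MULTILINE), ^ and $ also match at line boundaries.
$anchorMultiline = preg_match('/^second$/m', $text);

var_dump($dotDefault, $dotAll, $anchorDefault, $anchorMultiline);
```

In the patterns from this thread only s matters, since none of them uses ^ or $; the m could be dropped without changing the result.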
