Extracting article content

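Suppose you are given an arbitrary link. It is reachable, of course, and it points to an article. What kind of algorithm can extract the title and the body?
For example: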
http://www.boraid.com/darticle3/list.asp?id=132806
http://sc.stock.cnfol.com/100421/123,1325,7576061,00.shtml
http://www.openvoip.cn/Html_Data/2010/04/21/Content_32510.html
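The article link may come from any website, chosen at random.
The page layout, however, is roughly like that of the URLs above. Pagination can be ignored.
The extracted title and body do not have to match the original 100%, as long as the result is still readable. For a single known page this is easy: write a regex against its template and extract.
But the URLs are arbitrary. My first idea was to find the stretch of the page that holds the most text, then take the tag at the same level as that text, or its parent tag, extract it, and analyze it further. In other words, work from the HTML source. That is the rough approach I could come up with. For the title, it looks like I can simply read the title tag. Any ideas are welcome.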
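
The "find where the text is densest" idea can be sketched roughly as follows. This is a simplified variant, not a definitive implementation: instead of walking the tag tree it scores lines of the tag-stripped source and keeps the run of long lines with the most text. The names `ArticleExtractor`, `ExtractTitle`, `ExtractBody`, and the thresholds `minLineLength` and `maxGap` are assumptions made for illustration and would need tuning against real pages.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class ArticleExtractor
{
    // Title guess: the text inside the <title> element, trimmed.
    public static string ExtractTitle(string html)
    {
        Match m = Regex.Match(html, @"<title[^>]*>(.*?)</title>",
                              RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : string.Empty;
    }

    // Body guess: the run of text-heavy lines that contains the most text.
    // minLineLength and maxGap are hypothetical tuning knobs.
    public static string ExtractBody(string html, int minLineLength = 20, int maxGap = 2)
    {
        // Drop scripts, styles and comments so their contents are not counted as text.
        string cleaned = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1>|<!--.*?-->", "",
                                       RegexOptions.IgnoreCase | RegexOptions.Singleline);
        // Strip the remaining tags, then look at the text line by line.
        string text = Regex.Replace(cleaned, @"<[^>]+>", "");
        string[] lines = text.Split('\n')
                             .Select(l => WebUtility.HtmlDecode(l).Trim())
                             .ToArray();

        int bestStart = 0, bestEnd = -1, bestScore = 0;
        int start = -1, lastLong = -1, score = 0;
        for (int i = 0; i <= lines.Length; i++)
        {
            bool isLong = i < lines.Length && lines[i].Length >= minLineLength;
            if (isLong)
            {
                if (start < 0) { start = i; score = 0; }
                score += lines[i].Length;
                lastLong = i;
            }
            else if (start >= 0 && (i == lines.Length || i - lastLong > maxGap))
            {
                // The current run has ended; keep it if it holds more text than the best so far.
                if (score > bestScore) { bestScore = score; bestStart = start; bestEnd = lastLong; }
                start = -1;
            }
        }
        return string.Join("\n", lines.Skip(bestStart)
                                      .Take(bestEnd - bestStart + 1)
                                      .Where(l => l.Length > 0));
    }
}
```

Short navigation and footer lines fall below `minLineLength` and are skipped, while a few short lines inside the article (up to `maxGap` in a row) do not split the run. Pages that put the whole body on one source line, or that carry long comment threads, would need a tag-level analysis instead.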

Solutions »

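Untested, but it should work. Taking http://www.boraid.com/darticle3/list.asp?id=132806 as an example:
// First read the whole page into a string
string Html = Job.Spider.App_Code.Common.GetWebPage("http://www.boraid.com/darticle3/list.asp?id=132806", "UTF-8"); // write the page-fetching helper yourself
// The lazy quantifier stops at the first </h1>
string title = Regex.Match(Html, @"(?<=<h1>)(.+?)(?=</h1>)").Value;
Then use a different regex for each URL.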
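
One way to organise "a different regex for each URL" is a lookup table keyed by host name. The sketch below is only an illustration under assumptions: the `SiteRules` class and the fallback to the `<title>` element are made up for this example, and only the `www.boraid.com` pattern is taken from the reply above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SiteRules
{
    // Host -> title pattern. Only the boraid.com rule comes from this thread;
    // further hosts would be added after inspecting their templates.
    static readonly Dictionary<string, string> TitlePatterns =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "www.boraid.com", @"(?<=<h1>)(.+?)(?=</h1>)" },
        };

    // Hypothetical fallback for hosts without a rule: the <title> element.
    const string DefaultPattern = @"(?<=<title>)[\s\S]*?(?=</title>)";

    public static string ExtractTitle(string url, string html)
    {
        string host = new Uri(url).Host;
        string pattern;
        if (!TitlePatterns.TryGetValue(host, out pattern))
        {
            pattern = DefaultPattern;
        }
        return Regex.Match(html, pattern, RegexOptions.IgnoreCase).Value.Trim();
    }
}
```

Keeping the patterns in data rather than in code means a new site only needs a new table entry, but every site still has to be inspected by hand, which is why this does not scale to arbitrary links.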
Fetch the content with HttpWebRequest or WebClient, then extract what you need with a regex,
e.g. @"(?i)(?<=<title>)[\s\S]*?(?=</title>)"
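
A minimal sketch of that fetch-then-match flow, assuming WebClient is available (newer .NET versions mark it obsolete in favour of HttpClient). `PageFetcher.GetWebPage` is a hypothetical stand-in for the `GetWebPage` helper used elsewhere in this thread, whose real implementation is not shown.

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class PageFetcher
{
    // Hypothetical helper: download a page and decode it with the given encoding.
    public static string GetWebPage(string url, string encodingName)
    {
        using (var client = new WebClient())
        {
            client.Encoding = Encoding.GetEncoding(encodingName);
            return client.DownloadString(url);
        }
    }

    // Pull the text between <title> and </title>, case-insensitively.
    public static string GetTitle(string html)
    {
        return Regex.Match(html, @"(?i)(?<=<title>)[\s\S]*?(?=</title>)").Value.Trim();
    }
}
```

Typical use would be `PageFetcher.GetTitle(PageFetcher.GetWebPage(url, "UTF-8"))`. The encoding name has to match the page, and many Chinese sites of that period served GB2312 rather than UTF-8.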
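If you are taking arbitrary links and then extracting the content, it is not that simple; have a look at scraping software such as 火车头. If the address is fixed,
here is part of the code I wrote earlier for scraping job postings:
using System;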
using System.Data;
using System.Configuration;
using System.Text.RegularExpressions;
using System.Collections;
using System.IO;
using System.Text;
using System.Threading;
using System.Data.SqlClient;
using System.Collections.Generic;
namespace Saongroup.Job.Spider.App_Code
{
    public static class Spider
    {
        public static int NewsCount { get; set; }
        public static int CountNum { get; set; }
        public static int Page { get; set; }
        public static int CountPage { get; set; }
        #region = Analysis =
        public static void Start()
        {
            if (Page == 0)
            {
                Page = 1;
            }
            string Url = "http://www.boraid.com/darticle3/index.asp?classid=9&Nclassid=48"; // marketing news list
            #region = Initialization =
            for (int t = 1; t < Page + 1; t++)
            {
                string Html = Saongroup.Job.Spider.App_Code.Common.GetWebPage(Url, "UTF-8");
                CountPage = Convert.ToInt32(Regex.Match(Html, @"(?<=</font>/)(.+)(?=</strong>页)").Value); // total number of pages
                CountNum = 20 * CountPage; // 20 news items per page
                Console.WriteLine("Total items to process: {0}...", CountNum);
                Console.WriteLine("Total pages to process: {0}...", CountPage);
                Console.WriteLine("Processing page {0}...", Page);
                Thread.Sleep(500);
                HtmlHandler(Html, Page);
            }
            #endregion
        }
        #endregion
        #region = Crawl =
        public static void HtmlHandler(string src, int p)
        {
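            #region = Collect the news links =
            // Match the relative link of each article, e.g. list.asp?id=132806.
            // '.' and '?' must be escaped, otherwise the pattern never matches the literal URL.
            Regex rgx = new Regex("(?<=<a href=\")list\\.asp\\?id=\\d+");
            Thread.Sleep(1000);
            #endregion
            foreach (Match m in rgx.Matches(src))
            {
                Console.Clear();
                Saongroup.Job.Spider.App_Code.Spider.NewsCount += 1;
                Console.WriteLine("Formatting {0}...", m.Value);
                Thread.Sleep(50);
            }
            Console.Clear();
            Console.WriteLine("===========================Analyzing news===========================");
            Thread.Sleep(1000);
            foreach (Match m in rgx.Matches(src))
            {
                // Links on the list page are relative, so prefix the article directory.
                string NewsLink = "http://www.boraid.com/darticle3/" + m.Value;
                #region = Fetch the news item =
                Console.WriteLine("Opening {0}", NewsLink);
                string Html = Saongroup.Job.Spider.App_Code.Common.GetWebPage(NewsLink, "UTF-8");
                Thread.Sleep(50);
                // Lazy quantifiers stop at the first closing delimiter; Singleline lets '.' span line breaks.
                string title = Regex.Match(Html, @"(?<=<h1>)(.+?)(?=</h1>)").Value;
                string content = Regex.Match(Html, @"(?<=<font name=)(.+?)(?=<table width=""570"")", RegexOptions.Singleline).Value;
                #endregion
                Console.WriteLine(title);
                // Or store the extracted content in the database
            }
            #region = Page finished: print summary =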
            Console.Clear();
            Console.WriteLine("News items processed: {0}", Saongroup.Job.Spider.App_Code.Spider.NewsCount);
            Console.WriteLine("Data processing finished!");
            #endregion
            #region = All pages finished =
            if (p < CountPage)
            {
                p = p + 1;
                Page = p;
            }
            else
            {
                Saongroup.Job.Spider.App_Code.LastProcess.Process();
                Console.WriteLine("Crawl finished! {0} items processed in total", CountNum);
            }
            #endregion
        }
        #endregion
    }
}
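
The list page in the code above links to articles with relative URLs such as `list.asp?id=...`, so each match has to be turned into an absolute address before it can be fetched. A small sketch of doing that with `System.Uri` instead of string concatenation follows; the `LinkExtractor` class is hypothetical, and the assumed link markup (`<a href="list.asp?id=N">`) is inferred from the regex and example URL in this thread rather than checked against the live site.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Return the absolute URL of every article link on a list page, without duplicates.
    public static List<string> GetArticleLinks(string listPageUrl, string html)
    {
        var baseUri = new Uri(listPageUrl);
        var links = new List<string>();
        var seen = new HashSet<string>();
        // Assumed markup: <a href="list.asp?id=12345">
        foreach (Match m in Regex.Matches(html, @"<a\s+href=""(list\.asp\?id=\d+)""",
                                          RegexOptions.IgnoreCase))
        {
            // Resolve the relative link against the list page's own address.
            string absolute = new Uri(baseUri, m.Groups[1].Value).AbsoluteUri;
            if (seen.Add(absolute))
            {
                links.Add(absolute);
            }
        }
        return links;
    }
}
```

Resolving against the list page URL keeps the crawler correct if the site moves its articles to another directory, and the `HashSet` avoids fetching the same article twice when a page links to it more than once.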