Using C# regular expressions to extract the ID, title, authors, and content of each article on http://arxiv.org/list/astro-ph/new and store them in an array

Last edited by u010509224 on 2013-05-03 16:49:24

Solutions »


I found a ready-made example you can refer to.
Extracting information from a web page with C# regular expressions: http://blog.csdn.net/aa466564931/article/details/6575683
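Try this pattern and look at the third capture group (the doubled quotes are written for a C# verbatim string):
(?i)<div[^>]*?class=(['""]?)list-title\1[^>]*?>[\s\S]*?<span[^>]*?class=(['""]?)descriptor\2[^>]*?>[\s\S]*?</span>([\s\S]*?)\s*?</div>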
I am writing this in C# and have already saved the page source into a string. Now that the match succeeds, can the article content be extracted as well?
To extract the article title:  (?is)(?<=<span[^>]*>Title:</span>).*?(?=</div>)
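The source is quite long and contains several dozen articles; you can see it by right-clicking http://arxiv.org/list/astro-ph/new and choosing "View source". I want to save the fields listed above for every article into an array and then store them in a database.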
// Requires: using System.IO; using System.Linq; using System.Text; using System.Text.RegularExpressions;
string tempStr = File.ReadAllText(@"C:\Documents and Settings\Administrator\桌面\Test.txt", Encoding.GetEncoding("GB2312")); // read the saved page source
// Unnamed groups 1-3 capture the optional quote around each class attribute;
// .NET numbers unnamed groups before named ones, so \1, \2, \3 refer to them.
string pattern = @"(?i)<span[^>]*?class=(['""]?)list-identifier\1[^>]*?><a[^>]*?>[^<>\d]*?(?<ID>[\d\.]+)\s*?</a>";
pattern += @"[\s\S]*?<div[^>]*?class=(['""]?)list-title\2[^>]*?>[\s\S]*?<span[^>]*?>[\s\S]*?</span>(?<Title>[\s\S]*?)\s*?</div>";
pattern += @"[\s\S]*?<div[^>]*?class=(['""]?)list-authors\3[^>]*?>[\s\S]*?<span[^>]*?>[\s\S]*?</span>(?:\s*?<a[^>]*?>(?<Authors>[^<>]*?)</a>[^<>]*)+\s*?</div>";
pattern += @"[\s\S]*?<p>\s*?(?<Content>[\s\S]*?)\s*?</p>";
foreach (Match m in Regex.Matches(tempStr, pattern))
{
    // One match per article
    string ID = m.Groups["ID"].Value; // e.g. 1305.0262
    string Title = m.Groups["Title"].Value; // e.g. On-sky characterisation of the VISTA NB118 narrow-band filters at 1.19  micron
    // The Authors group is captured once per <a> element; join the captures with '|'
    string Authors = string.Join("|", m.Groups["Authors"].Captures.Cast<Capture>().Select(a => a.Value));
    /* Example result, authors separated by '|':
     * B. Milvang-Jensen|W. Freudling|J. Zabl|J. P. U. Fynbo|P. Moller|K. K. Nilsson|H. Joy McCracken|J. Hjorth|O. Le Fevre|L. Tasca|J. S. Dunlop|D. Sobral
     */
    string Content = m.Groups["Content"].Value; // the abstract text inside <p>...</p>
}
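The loop above only reads each field into local variables, while the original question asks for an array. A minimal sketch of one way to collect the matches is shown here; the `Article` class and `ArticleCollector.Collect` helper are hypothetical names introduced for illustration and do not appear in the thread. It assumes the pattern passed in defines the same ID, Title, Authors, and Content groups.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical container type; the thread does not define one.
public sealed class Article
{
    public string Id { get; set; }
    public string Title { get; set; }
    public string[] Authors { get; set; }
    public string Content { get; set; }
}

public static class ArticleCollector
{
    // Runs a pattern that defines the ID, Title, Authors, and Content groups
    // and returns one Article per match.
    public static Article[] Collect(string html, string pattern)
    {
        return Regex.Matches(html, pattern)
            .Cast<Match>()
            .Select(m => new Article
            {
                Id = m.Groups["ID"].Value,
                Title = m.Groups["Title"].Value.Trim(),
                // Authors is captured once per <a> element, so read every capture.
                Authors = m.Groups["Authors"].Captures
                    .Cast<Capture>()
                    .Select(c => c.Value.Trim())
                    .ToArray(),
                Content = m.Groups["Content"].Value.Trim()
            })
            .ToArray();
    }
}
```

Calling `ArticleCollector.Collect(tempStr, pattern)` with the string and pattern from the snippet above would return one element per article, which can then be looped over when writing to the database.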
Is that test.txt the saved source of the web page? Thank you very much.
Yes. To scrape a page you first fetch its source and then analyze it.
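Regarding your comment:
"Isn't working with XML from a string simple? Use the XmlDocument.LoadXml method."
Perhaps because the string contained more than just the page source (there was some other content in front of it),
the load did not succeed, so I gave up on DOM extraction and switched to regular expressions.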
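A minimal sketch of the situation described above: `XmlDocument.LoadXml` throws an `XmlException` when the string does not start with well-formed XML, so any leading non-markup text has to be dropped first. The `XmlLoader.TryLoad` helper is a hypothetical name, and the approach assumes the unwanted content contains no `<` character. Real HTML pages are often not well-formed XML, so this may still fail on the arXiv page, which is one reason to fall back to regular expressions.

```csharp
using System;
using System.Xml;

public static class XmlLoader
{
    // Discards everything before the first '<' and tries to parse the rest as XML.
    // Returns null when there is no markup or the markup is not well-formed.
    public static XmlDocument TryLoad(string text)
    {
        int start = text.IndexOf('<');
        if (start < 0)
        {
            return null;
        }

        var doc = new XmlDocument();
        try
        {
            doc.LoadXml(text.Substring(start));
            return doc;
        }
        catch (XmlException)
        {
            // Not well-formed XML (unclosed tags, undefined entities, and so on).
            return null;
        }
    }
}
```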
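I have two more questions, if you don't mind.
1. The article's category (subject) also needs to be extracted. Only the first listed category is needed, because the UI has to support searching by category. Here is a screenshot of the relevant part of the source.
2. The page contains more than 70 articles. Right now a little over 20 of them can be scraped and stored in the database, but the rest fail. Below are screenshots of the errors reported when the program runs: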
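Each of those ORA-XXXXX errors can be searched for directly; there is plenty of help available for them.
The problem is most likely still in your SQL statement. Watch out for single quotes and similar characters in the values.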
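If the INSERT statement is built by concatenating titles and abstracts into the SQL text, any value containing a single quote will break it. One common way around this is to bind the values as parameters. The sketch below uses only the provider-neutral `System.Data` interfaces; the table name, column names, and the `ArticleSql` helper are assumptions made for illustration, since the thread does not show the actual SQL or schema.

```csharp
using System.Data;

public static class ArticleSql
{
    // Hypothetical table and column names, with Oracle-style ":name" bind variables.
    public const string InsertSql =
        "INSERT INTO articles (id, title, authors, content) " +
        "VALUES (:id, :title, :authors, :content)";

    // Builds a parameterized INSERT; the values never become part of the SQL text,
    // so quotes inside a title or abstract cannot break the statement.
    public static IDbCommand BuildInsert(
        IDbConnection conn, string id, string title, string authors, string content)
    {
        IDbCommand cmd = conn.CreateCommand();
        cmd.CommandText = InsertSql;
        // Parameters are added in the same order as they appear in the SQL,
        // so the command also works with providers that bind by position.
        Add(cmd, "id", id);
        Add(cmd, "title", title);
        Add(cmd, "authors", authors);
        Add(cmd, "content", content);
        return cmd;
    }

    private static void Add(IDbCommand cmd, string name, object value)
    {
        IDbDataParameter p = cmd.CreateParameter();
        p.ParameterName = name;
        p.Value = value;
        cmd.Parameters.Add(p);
    }
}
```

With an open connection from whichever Oracle provider is in use, calling `ExecuteNonQuery()` on the returned command would perform the insert.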
With the subject added:
// Requires: using System.IO; using System.Linq; using System.Text; using System.Text.RegularExpressions;
string tempStr = File.ReadAllText(@"C:\Users\myx\Desktop\Test.txt", Encoding.GetEncoding("GB2312")); // read the saved page source
string pattern = @"(?i)<span[^>]*?class=(['""]?)list-identifier\1[^>]*?><a[^>]*?>[^<>\d]*?(?<ID>[\d\.]+)\s*?</a>";
pattern += @"[\s\S]*?<div[^>]*?class=(['""]?)list-title\2[^>]*?>[\s\S]*?<span[^>]*?>[\s\S]*?</span>(?<Title>[\s\S]*?)\s*?</div>";
pattern += @"[\s\S]*?<div[^>]*?class=(['""]?)list-authors\3[^>]*?>[\s\S]*?<span[^>]*?>[\s\S]*?</span>(?:\s*?<a[^>]*?>(?<Authors>[^<>]*?)</a>[^<>]*)+\s*?</div>";
// New: take the primary subject from the list-subjects block
pattern += @"[\s\S]*?<div[^>]*?class=(['""]?)list-subjects\4[^>]*?>[\s\S]*?<span[^>]*?class=(['""]?)primary-subject\5[^>]*?>(?<Subject>[\s\S]*?)</span>[\s\S]*?</div>";
pattern += @"[\s\S]*?<p>\s*?(?<Content>[\s\S]*?)\s*?</p>";
foreach (Match m in Regex.Matches(tempStr, pattern))
{
    // One match per article
    string ID = m.Groups["ID"].Value; // e.g. 1305.0262
    string Title = m.Groups["Title"].Value; // e.g. On-sky characterisation of the VISTA NB118 narrow-band filters at 1.19  micron
    // The Authors group is captured once per <a> element; join the captures with '|'
    string Authors = string.Join("|", m.Groups["Authors"].Captures.Cast<Capture>().Select(a => a.Value));
    /* Example result, authors separated by '|':
     * B. Milvang-Jensen|W. Freudling|J. Zabl|J. P. U. Fynbo|P. Moller|K. K. Nilsson|H. Joy McCracken|J. Hjorth|O. Le Fevre|L. Tasca|J. S. Dunlop|D. Sobral
     */
    string Subject = m.Groups["Subject"].Value; // e.g. Instrumentation and Methods for Astrophysics (astro-ph.IM)
    string Content = m.Groups["Content"].Value; // the abstract text inside <p>...</p>
}
