This post was last edited by sosoben on 2013-08-05 09:22:09

Solutions »

  1.   

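    // Requires: using System.IO; using System.Text; using System.Text.RegularExpressions;
    // Read the text file (GB2312-encoded)
    string tempStr = File.ReadAllText(@"C:\Users\myx\Desktop\Test.txt", Encoding.GetEncoding("GB2312"));
    // Group 3 captures the title attribute of the first ocrx_word span
    string pattern = @"<span[^>]*?class=(['""]?)ocrx_word\1[^>]*?title=(['""]?)([^'""]+)\2[^>]*?>";
    string result = Regex.Match(tempStr, pattern).Groups[3].Value; // bbox 28 27 76 52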
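
Once the title value has been captured, the four coordinates can be split out of it. The following is only a minimal sketch: `BboxParser` and `Parse` are hypothetical names, and it assumes the captured value has exactly the form `bbox x0 y0 x1 y1`, as in the word spans shown in this thread. Titles that carry extra fields would need more handling.

```csharp
using System;
using System.Globalization;
using System.Linq;

static class BboxParser
{
    // Hypothetical helper: turns "bbox 28 27 76 52" into { 28, 27, 76, 52 }.
    // Assumes the keyword "bbox" followed by exactly four integers.
    public static int[] Parse(string title)
    {
        string[] parts = title.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length != 5 || parts[0] != "bbox")
        {
            throw new FormatException("Expected 'bbox x0 y0 x1 y1', got: " + title);
        }
        // Skip the "bbox" keyword and convert the remaining four fields.
        return parts.Skip(1)
                    .Select(p => int.Parse(p, CultureInfo.InvariantCulture))
                    .ToArray();
    }
}
```

For example, `BboxParser.Parse(result)` with the value from the snippet above gives the array 28, 27, 76, 52.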
      

  2.   

    Use this: http://download.csdn.net/detail/zhuankeshumo/5865413
      

  3.   


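    Hello, and thank you for your answer. I can now basically get the bbox 28 27 76 52 data. However, looking at the regex (which I don't understand well), it doesn't seem to look for the 200;
    it just matches the first ocrx_word. So as soon as the 200 is not the first word, the result is wrong. How should I modify it?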
      

  4.   

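    // Read the text file (GB2312-encoded)
    string tempStr = File.ReadAllText(@"C:\Users\myx\Desktop\Test.txt", Encoding.GetEncoding("GB2312"));
    // Same pattern, but anchored on the word text: only the span whose content is exactly 200 matches
    string pattern = @"<span[^>]*?class=(['""]?)ocrx_word\1[^>]*?title=(['""]?)([^'""]+)\2[^>]*?>200</span>";
    string result = Regex.Match(tempStr, pattern).Groups[3].Value; // title of the span containing 200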
      

  5.   


    string s = @"
        <?xml version=""1.0"" encoding=""UTF-8""?>
        <!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"">
        <html xmlns=""http://www.w3.org/1999/xhtml"" xml:lang=""en"" lang=""en"">
            <head>
                <title></title>
                <meta http-equiv=""Content-Type"" content=""text/html; charset=utf-8"" />
                <meta name='ocr-system' content='tesseract 3.02' />
                <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
            </head>
            <body>
                <div class='ocr_page' id='page_1' title='image ""tamron.new.exp4.TIF""; bbox 0 0 547 70; ppageno 0'>
                    <div class='ocr_carea' id='block_1_1' title=""bbox 28 27 509 54"">
                        <p class='ocr_par' dir='ltr' id='par_1' title=""bbox 28 27 509 54"">
                        <span class='ocr_line' id='line_1' title=""bbox 28 27 509 54"">
                            <span class='ocrx_word' id='word_1' title=""bbox 28 27 76 52"">200</span> 
                            <span class='ocrx_word' id='word_2' title=""bbox 108 27 200 52"">135100</span> 
                            <span class='ocrx_word' id='word_3' title=""bbox 228 27 260 52"">70</span>
                        </span>
                        </p>
                    </div>
                </div>
            </body>
        </html>
    ";
    Regex regex = new Regex(@"\<span(?:\s+(?<kv>\w+=(?:'[^']+'|""[^""]+"")))+\>200\</span\>");
    Match match = regex.Match(s);
    if (match.Success)
    {
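        // Each capture of "kv" is one attribute of the matched span, e.g. title="bbox 28 27 76 52"
        foreach (Capture capture in match.Groups["kv"].Captures)
        {
            // Check the attribute name itself, not any occurrence of "title" inside a value
            if (capture.Value.StartsWith("title=", StringComparison.Ordinal))
            {
                // Prints the attribute value, including its surrounding quotes
                Console.WriteLine(capture.Value.Substring(capture.Value.IndexOf("=") + 1));
            }
        }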
    }
    Console.ReadKey();
      

  6.   


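    Hello, sometimes an <em> appears in front of my 200 and sometimes it doesn't. How should I modify the pattern so that it handles both cases reliably?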
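
One possible way to handle this, as a sketch rather than a definitive answer: let the pattern from reply 4 accept optional inline tags around the word. `HocrWordLookup` and `FindTitle` are hypothetical names. The code assumes the only wrappers are attribute-less `<em>` and `<strong>` tags, which is a guess about the OCR output, not something confirmed in this thread. `Regex.Escape` is used so the search word is taken literally.

```csharp
using System;
using System.Text.RegularExpressions;

static class HocrWordLookup
{
    // Hypothetical helper, not part of the thread's code.
    // Returns the title attribute of the first ocrx_word span whose text is `word`,
    // with or without <em>/<strong> around it, or null when no such span exists.
    public static string FindTitle(string hocr, string word)
    {
        string pattern =
            @"<span[^>]*?class=(['""]?)ocrx_word\1[^>]*?title=(['""]?)([^'""]+)\2[^>]*?>"
            + @"\s*(?:<(?:em|strong)>\s*)*"          // optional opening inline tags
            + Regex.Escape(word)                     // the word itself, matched literally
            + @"\s*(?:</(?:em|strong)>\s*)*</span>"; // optional closing inline tags
        Match m = Regex.Match(hocr, pattern);
        return m.Success ? m.Groups[3].Value : null;
    }
}
```

With the file contents from reply 4, the call would be `HocrWordLookup.FindTitle(tempStr, "200")`. If the wrapper tags can carry attributes or other tag names appear, the two tag groups would have to be widened accordingly.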