Using iText to search PDF content and find a keyword at a specified location

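How can I use iText to search, or simply read, the content of a PDF, and find a keyword at a specified location? For example:
finding the two characters “金融” (finance) on the first page of a PDF file. Can iText do this? If so, please advise, ideally with an example. There are plenty of examples online of writing PDF files with iText, but none of reading them. Thanks.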

Solutions »


Some code I wrote a while ago. It replaces a placeholder string on the first page and on the last two pages of a PDF with the document's total page count.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import com.lowagie.text.Document;
import com.lowagie.text.DocumentException;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfReader;
import com.lowagie.text.pdf.PdfStamper;
import com.lowagie.text.pdf.PdfWriter;

public class ggg
{
    static String PAGESTRING = "#$%@*";

    public void creatpdf(String filepath,int num)
    {
        if(num<1999){
            num=1999;
        }
        // Create a Document object
        Document document = new Document();
        try
        {
            PdfWriter.getInstance(document, new FileOutputStream(filepath));
            // Add some metadata to the PDF document
            document.addTitle("Hello World example");
            document.addAuthor("Bruno Lowagie");
            document.addSubject("This example explains how to add metadata.");
            document.addKeywords("iText, Hello World, step 3, metadata");
            document.addCreator("My program using iText");
            // Open the document so that content can be written to it
            document.open();
            for (int i = 0; i < num; i++)
            {
                if (i == 2 || i == num-2)
                {
                    Paragraph hhh = new Paragraph(PAGESTRING);
                    document.add(hhh);
                }
                Paragraph hhh = new Paragraph("Hello World!===== " + i);
                document.add(hhh);
            }
        }
        catch (DocumentException de)
        {
            System.err.println(de.getMessage());
        }
        catch (IOException ioe)
        {
            System.err.println(ioe.getMessage());
        }
        // Close the open document
        document.close();
    }

    public void editpdf(String sourFilePath, String destFilePath) throws IOException
    {
        PdfReader reader = new PdfReader(sourFilePath);
        try
        {
            int p = reader.getNumberOfPages();
            String pageNum = String.valueOf(p);
            // Pad the page count with spaces to the placeholder's length.
            if (pageNum.length() < PAGESTRING.length())
            {
                pageNum = (pageNum + "      ").substring(0, PAGESTRING.length());
            }
            // Check the first page and the last two pages. ISO-8859-1 maps every
            // byte to exactly one char, so the content survives the round trip
            // regardless of the platform's default charset.
            int[] pages = { 1, p - 1, p };
            for (int i = 0; i < pages.length; i++)
            {
                if (pages[i] < 1)
                {
                    continue; // a one-page document has no page p - 1
                }
                String s = new String(reader.getPageContent(pages[i]), "ISO-8859-1");
                int pos = s.indexOf(PAGESTRING);
                if (pos != -1)
                {
                    String ss = s.substring(0, pos) + pageNum
                            + s.substring(pos + PAGESTRING.length());
                    reader.setPageContent(pages[i], ss.getBytes("ISO-8859-1"));
                }
            }
            PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(destFilePath));
            stamper.close();
        }
        catch (DocumentException de)
        {
            System.err.println(de.getMessage());
        }
        catch (IOException ioe)
        {
            System.err.println(ioe.getMessage());
        }
    }

    private void renameAndDelPdf(String soureFilePath, String destFilename)
    {
        File f = new File(soureFilePath);
        File df = new File(destFilename);
        if (f.exists() && f.isFile() && df.exists() && df.isFile())
        {
            if (f.delete())
            {
                if (!df.renameTo(new File(soureFilePath)))
                {
                    System.err.println("file rename error");
                }
            }
            else
            {
                System.err.println("file delete error");
            }
        }
        else
        {
            System.err.println("file does not exist or is not a regular file");
        }
    }

    public static void main(String[] args) throws IOException, DocumentException
    {
        ggg m = new ggg();

        long t=System.currentTimeMillis();
        m.creatpdf("HelloWorld-old.pdf",200000);
        long n=System.currentTimeMillis();
        System.out.println(n-t);

        t=System.currentTimeMillis();
        m.editpdf("HelloWorld-old.pdf", "HelloWorld-new.pdf");
        n=System.currentTimeMillis();
        System.out.println(n-t);

        t=System.currentTimeMillis();
        m.renameAndDelPdf("HelloWorld-old.pdf", "HelloWorld-new.pdf");
        n=System.currentTimeMillis();
        System.out.println(n-t);
    }
}
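The core of editpdf above is a fixed-width text substitution: the placeholder is swapped for the page count, padded with spaces to the placeholder's width. That step can be tried on its own without iText. The following is only a minimal sketch; the class name PlaceholderDemo, the method fillPlaceholder, and the sample content string are made up for illustration.

```java
// Hypothetical helper, not part of iText: the padding-and-replace step in isolation.
class PlaceholderDemo {
    static final String PLACEHOLDER = "#$%@*";

    // Replaces the first occurrence of PLACEHOLDER with the page count,
    // right-padded with spaces to the placeholder's width.
    static String fillPlaceholder(String content, int pageCount) {
        int pos = content.indexOf(PLACEHOLDER);
        if (pos == -1) {
            return content; // nothing to replace
        }
        StringBuilder value = new StringBuilder(String.valueOf(pageCount));
        while (value.length() < PLACEHOLDER.length()) {
            value.append(' ');
        }
        return content.substring(0, pos) + value
                + content.substring(pos + PLACEHOLDER.length());
    }

    public static void main(String[] args) {
        // Invented sample resembling a page content stream.
        String stream = "BT /F1 12 Tf (#$%@*)Tj ET";
        System.out.println(fillPlaceholder(stream, 1999));
    }
}
```

A page count with more digits than the placeholder is inserted unpadded, which matches the behaviour of the code above.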
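Sorry, I just got back.
I'll try the method from reply #2 first, and if it works I'll award the points. One question, though:
this method probably doesn't work for Chinese, does it? Does it also need iText's Far East (Asian) font package?
If so, how would that package be used to read Chinese text?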
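I tried the method from reply #2:
getPageContent(1)
But what comes back is a lot of cryptic English text, the same result I got in my earliest attempts. Here is an excerpt: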
BT
/F2 1 Tf
23.998 0 0 23.998 230.9807 479.0661 Tm
0 0 0 rg
/GS1 gs
-0.0001 Tc
0 Tw
[(Console)-251.1(G)2.9(uide)]Tj
....... (the rest is omitted). It is all a jumble that I can't make sense of. Can anyone explain what this means?
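What getPageContent returns is the page's raw content stream, a sequence of PDF drawing operators rather than plain text. In the excerpt, BT begins a text object, Tf selects font /F2, Tm sets the text matrix (scale and position), rg sets the fill colour, gs applies a graphics state, Tc and Tw set character and word spacing, and the last line shows text from an array that mixes strings in parentheses with kerning numbers (in the PDF syntax this array form is normally the TJ operator). A rough sketch of pulling the readable strings out of such a fragment follows. ContentStreamText and its methods are hypothetical, the -200 threshold for treating an adjustment as a space is an assumed heuristic, and escapes such as \n or octal codes and hex strings are not decoded.

```java
// Hypothetical helper for illustration; it is not an iText API.
class ContentStreamText {
    // Assumed heuristic: an adjustment below this value (thousandths of a
    // text-space unit) is wide enough to count as a gap between words.
    static final double SPACE_THRESHOLD = -200.0;

    // Concatenates the literal strings "(...)" found in a content stream fragment.
    static String extractText(String stream) {
        StringBuilder out = new StringBuilder();
        StringBuilder number = new StringBuilder();
        boolean inArray = false;
        int i = 0;
        while (i < stream.length()) {
            char c = stream.charAt(i);
            if (c == '(') {
                flushGap(number, out);
                i = readLiteral(stream, i + 1, out);
                continue;
            }
            if (c == '[') {
                inArray = true;
            } else if (c == ']') {
                flushGap(number, out);
                inArray = false;
            } else if (inArray && (Character.isDigit(c) || c == '-' || c == '.')) {
                number.append(c); // part of a kerning adjustment
            } else {
                flushGap(number, out);
            }
            i++;
        }
        return out.toString();
    }

    // Emits a space if the pending adjustment is a large negative number.
    private static void flushGap(StringBuilder number, StringBuilder out) {
        if (number.length() > 0) {
            try {
                if (Double.parseDouble(number.toString()) < SPACE_THRESHOLD) {
                    out.append(' ');
                }
            } catch (NumberFormatException ignored) {
                // not a number after all; skip it
            }
            number.setLength(0);
        }
    }

    // Reads up to the matching ')' and returns the index just after it.
    // Simplified: a backslash just keeps the next character as-is.
    private static int readLiteral(String s, int start, StringBuilder out) {
        int depth = 1;
        int i = start;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                out.append(s.charAt(i + 1));
                i += 2;
                continue;
            }
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    return i + 1;
                }
            }
            out.append(c);
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(extractText("[(Console)-251.1(G)2.9(uide)]TJ"));
    }
}
```

Run on the last line of the excerpt, this yields the text Console Guide, with the -251.1 adjustment standing in for the space.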
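After quite a lot of testing:
the PDF content returned by the getPageContent method
seems to handle only English text, and it follows certain rules.
The English characters from the PDF appear inside (), but Chinese text cannot be read this way. Does anyone know another way to do this? I need to be able to read Chinese, as in the example from my first post: given a PDF document that contains the keyword “金融”, read out those two characters. Thanks, everyone.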
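One likely reason is that Chinese text is usually not stored as readable characters in parentheses. It is written as multi-byte character codes, often as a hex string such as <91D1878D>, and what those codes mean depends on the font's encoding (its CMap). The sketch below assumes a UCS-2 CMap such as UniGB-UCS2-H, where each code is a big-endian UTF-16 code unit. With other encodings (for example Identity-H, where the codes are glyph indices) this decoding does not apply. HexStringDecoder and decodeUcs2Hex are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; assumes the font uses a UCS-2 CMap.
class HexStringDecoder {
    // Decodes a PDF hex string such as "<91D1878D>" as UTF-16BE text.
    static String decodeUcs2Hex(String hex) {
        String digits = hex.replaceAll("[<>\\s]", "");
        if (digits.length() % 4 != 0) {
            throw new IllegalArgumentException("expected whole 2-byte codes: " + hex);
        }
        byte[] bytes = new byte[digits.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(digits.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String text = decodeUcs2Hex("<91D1878D>");
        // \u91D1\u878D is the keyword from the question, written with escapes
        // so the source file's own encoding does not matter.
        System.out.println(text.contains("\u91D1\u878D"));
    }
}
```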
But Chinese text cannot be read this way
=================
I'm not sure whether getBytes could help here.
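On getBytes: called without an argument it uses the platform's default charset, which rarely matches the encoding used inside the PDF. Passing an explicit charset gives predictable bytes that can then be searched for in the raw page bytes. The sketch below is a guess at how that might look, under two assumptions: the font uses a UCS-2 CMap (so the keyword appears as UTF-16BE bytes), and the string is stored as raw bytes rather than as hex digits or with escaped bytes. KeywordBytesSearch, its methods, and the sample page bytes are invented for illustration. In the thread's code the page bytes would come from reader.getPageContent(1).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; it is not an iText API.
class KeywordBytesSearch {
    // Naive byte-array search; returns the first match index or -1.
    static int indexOf(byte[] haystack, byte[] needle) {
        if (needle.length == 0) {
            return 0;
        }
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }

    // True if the keyword's UTF-16BE bytes occur anywhere in the page bytes.
    static boolean containsKeyword(byte[] pageBytes, String keyword) {
        return indexOf(pageBytes, keyword.getBytes(StandardCharsets.UTF_16BE)) >= 0;
    }

    // Builds invented page bytes: ASCII operators around a UTF-16BE string.
    static byte[] samplePage() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] head = "BT (".getBytes(StandardCharsets.ISO_8859_1);
        byte[] text = "\u91D1\u878D".getBytes(StandardCharsets.UTF_16BE);
        byte[] tail = ")Tj ET".getBytes(StandardCharsets.ISO_8859_1);
        out.write(head, 0, head.length);
        out.write(text, 0, text.length);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(containsKeyword(samplePage(), "\u91D1\u878D"));
    }
}
```

A byte-level match like this can also hit by accident across the boundary of two neighbouring characters, so it is better treated as a quick check than as a reliable search.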