CInternetSession: getting web page content (content generated by running a script)

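I had synchronous code for this many years ago, but it has all been converted to asynchronous since then, so I can't paste it; I'll just describe it.
Using the Win32 API directly is much more convenient:
Step 1: call InternetOpen to get a session handle.
Step 2: call InternetOpenUrl.
Step 3: call InternetReadFile.
See MSDN for the exact definitions.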

This article was contributed by Asif Rasheed.

WinInet is a high-level interface to the more complicated underlying Internet protocols (including HTTP, FTP, and Gopher). WinInet allows your application to act as an HTTP, FTP, or Gopher client without having to understand or, more importantly, keep up with the ever-evolving protocol standards. If you use WinInet in your applications, when standards change you can let WinInet worry about the changes while your interface to the protocol remains the same. WinInet can be used to write product-ordering systems, stock tickers/analyzers, online banking systems, FTP clients, your own Internet browser, and so on.

Before WinInet, adding Internet communications to Windows-based applications required expertise in sockets and protocol specifications. Even simple communications required considerable development time. WinInet lets you quickly and easily add Internet communications to your applications.

MFC also implements some classes that use these APIs. These classes are distributed across different hierarchies. I developed a small class for this purpose which has only two methods. By adding this class to a project and calling one method, you can easily download the web page at a given URL.

The class has two methods:
  CString GetWebPage(const CString& Url);
  void SetErrorMessage(CString s);

GetWebPage accepts the URL (it must be complete, e.g. http://www.codeguru.com) and returns the requested page.

SetErrorMessage sets the default error message. When an error occurs for any reason, GetWebPage returns this message. I am still working on the class; in the future, an actual error message will be returned alongside the default one.
/*
//------------------------------------------------------------------------------------------------------------------
// WebWorld.h: interface for the CWebWorld class.
//------------------------------------------------------------------------------------------------------------------
*/
#include "wininet.h"

class CWebWorld
{
public:
  void SetErrorMessage(CString s);
  CString GetWebPage(const CString& Url);
  CWebWorld();
  virtual ~CWebWorld();

private:
  CString m_ErrorMessage;
  HINTERNET m_Session;
};
/*
//------------------------------------------------------------------------------------------------------------------
// WebWorld.cpp: implementation of the CWebWorld class.
//------------------------------------------------------------------------------------------------------------------
*/
#include "stdafx.h"
#include "WebThief.h"
#include "WebWorld.h"  // class declaration; needed unless WebThief.h already pulls it in

#ifdef _DEBUG
#undef THIS_FILE
static char THIS_FILE[]=__FILE__;
#define new DEBUG_NEW
#endif

#define AGENT_NAME "CodeguruBrowser1.0"

//////////////////////////////////////////////////////////////////////
// Construction/Destruction
//////////////////////////////////////////////////////////////////////

CWebWorld::CWebWorld()
{
  // Initialize the Win32 Internet functions
  m_Session = ::InternetOpen(AGENT_NAME,
    INTERNET_OPEN_TYPE_PRECONFIG, // Use registry settings.
    NULL, // Proxy name. NULL indicates use default.
    NULL, // List of local servers. NULL indicates default.
    0);
  // m_Session is NULL on failure; GetLastError() gives the reason.
}

CWebWorld::~CWebWorld()
{
  // Closing the session
  if (m_Session)
    ::InternetCloseHandle(m_Session);
}

// Note: the (const char *) cast below assumes an ANSI (MBCS) build.
CString CWebWorld::GetWebPage(const CString& Url)
{
  HINTERNET hHttpFile;
  char szSizeBuffer[32];
  DWORD dwLengthSizeBuffer = sizeof(szSizeBuffer);
  DWORD dwFileSize;
  DWORD dwBytesRead = 0;
  CString Contents;

  // Setting default error message
  Contents = m_ErrorMessage;

  // Opening the URL and getting a handle for the HTTP file
  hHttpFile = ::InternetOpenUrl(m_Session, (const char *) Url, NULL, 0, 0, 0);
  if (hHttpFile)
  {
    // Getting the size of the HTTP file
    BOOL bQuery = ::HttpQueryInfo(hHttpFile, HTTP_QUERY_CONTENT_LENGTH, szSizeBuffer, &dwLengthSizeBuffer, NULL);
    if (bQuery)
    {
      // Allocating the memory space for the HTTP file contents
      dwFileSize = atol(szSizeBuffer);
      LPSTR szContents = Contents.GetBuffer(dwFileSize);

      // Read the HTTP file
      BOOL bRead = ::InternetReadFile(hHttpFile, szContents, dwFileSize, &dwBytesRead);

      // Hand the buffer back to the CString with the length actually read;
      // if the read failed, fall back to the default error message.
      Contents.ReleaseBuffer(bRead ? (int) dwBytesRead : 0);
      if (!bRead)
        Contents = m_ErrorMessage;
    }
    ::InternetCloseHandle(hHttpFile); // Close the connection.
  }
  return Contents;
}

void CWebWorld::SetErrorMessage(CString s)
{
  m_ErrorMessage = s;
}
The following shows how to use the class above:
  CWebWorld a;
  CString PageContent;
  a.SetErrorMessage("There is some error in getting web page ... ");
  PageContent = a.GetWebPage(m_url);
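
One limitation of the class above: it sizes its buffer from the Content-Length header and reads once, so a response without that header, or one that arrives in several pieces, would not come back complete. A minimal sketch of the usual alternative is to keep calling InternetReadFile until it succeeds with zero bytes read. The names ReadAll, FetchUrl and the agent string "ExampleAgent/1.0" are made up for illustration and are not part of the article; the WinInet part only compiles on Windows.

```cpp
#include <cstddef>
#include <string>

// Calls readChunk(buf, cap, &got) repeatedly and appends the data to out.
// readChunk mirrors the shape of InternetReadFile: it returns false on error,
// and reports got == 0 once there is no more data.
template <class ReadFn>
bool ReadAll(ReadFn readChunk, std::string& out)
{
    char buf[4096];
    for (;;) {
        unsigned long got = 0;
        if (!readChunk(buf, sizeof(buf), &got)) return false; // read error
        if (got == 0) return true;                            // end of data
        out.append(buf, got);
    }
}

#ifdef _WIN32
#include <windows.h>
#include <wininet.h>
#pragma comment(lib, "wininet.lib")

// Hypothetical helper: InternetOpen -> InternetOpenUrl -> InternetReadFile loop.
bool FetchUrl(const char* url, std::string& out)
{
    HINTERNET hSession = ::InternetOpenA("ExampleAgent/1.0",
        INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);
    if (!hSession) return false;

    bool ok = false;
    HINTERNET hUrl = ::InternetOpenUrlA(hSession, url, NULL, 0, 0, 0);
    if (hUrl) {
        ok = ReadAll([hUrl](char* buf, size_t cap, unsigned long* got) {
            DWORD n = 0;
            BOOL r = ::InternetReadFile(hUrl, buf, (DWORD)cap, &n);
            *got = n;
            return r != FALSE;
        }, out);
        ::InternetCloseHandle(hUrl);
    }
    ::InternetCloseHandle(hSession);
    return ok;
}
#endif
```

Keeping the loop separate from the WinInet calls is only a convenience here: the loop can be exercised with a fake reader, without a network connection.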

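To: joycheney. Thanks for the reply. However, the method you describe gives the same result as MFC's CInternetSession::OpenURL. The question is how to get a text file (.txt) rather than the page source, equivalent to using the "Save As .txt" menu in the IE browser.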

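Sorry, I didn't read the question carefully. I haven't done anything like this myself, but my guess is that since you want IE's behavior, you will probably need the IWebBrowser2 interface, because it drives an actual Internet Explorer instance. You could look in that direction. Here is a brief introduction to using IWebBrowser2.
Creating the IWebBrowser2 object:
   IWebBrowser2* pWebBrowser = NULL;
   HRESULT hr = CoCreateInstance (CLSID_InternetExplorer, NULL, CLSCTX_SERVER, IID_IWebBrowser2, (LPVOID*)&pWebBrowser);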

   if (SUCCEEDED (hr) && (pWebBrowser != NULL))
   {
      m_pWebBrowser = pWebBrowser;
      m_pWebBrowser->put_Visible (VARIANT_TRUE); // Set this to VARIANT_FALSE to keep the window hidden.
      return true;
   }
   else
   {
      if (pWebBrowser)
         pWebBrowser->Release ();
      return false;
   }
Loading HTML:
   HRESULT hr;
   IDispatch* pHtmlDocDispatch = NULL;
   IHTMLDocument2 * pHtmlDoc = NULL;

   // Retrieve the document object.
   hr = pWebBrowser->get_Document (&pHtmlDocDispatch);
   if (SUCCEEDED (hr) && (pHtmlDocDispatch != NULL))
   {
      hr = pHtmlDocDispatch->QueryInterface (IID_IHTMLDocument2,  (void**)&pHtmlDoc);
      if (SUCCEEDED (hr) && (pHtmlDoc != NULL))
      {
         IHTMLElement * pBodyElem = NULL;

         hr = pHtmlDoc->get_body(&pBodyElem);
         if (SUCCEEDED (hr) && (pBodyElem != NULL))
         {
            CString sInnerHTML = "the HTML you want to load into the body";
            BSTR bstrInnerHTML = sInnerHTML.AllocSysString ();
            pBodyElem->put_innerHTML(bstrInnerHTML);
            SysFreeString (bstrInnerHTML);
            pBodyElem->Release();
         }
         pHtmlDoc->Release();
      }
      pHtmlDocDispatch->Release ();
   }
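Finally, use ExecWB to run an IE command, such as Save As:
   CComVariant vPath(L"c:\\c.htm"); // ExecWB takes VARIANT* arguments, not string literals
   pWebBrowser->ExecWB(OLECMDID_SAVEAS, OLECMDEXECOPT_DONTPROMPTUSER, &vPath, NULL);
Then read the contents of the c:\c.htm file. Saving in .htm format can be done this way; I haven't tried saving in .txt format, so try that yourself.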

You can refer to this thread I replied to; it seems close to what you are after. Get the data and then save it with CFile: http://topic.csdn.net/u/20080214/20/0c9541e3-fd1a-488f-9bcd-c3d05fbb51dd.html

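Regular expressions are handy for producing the text file you describe. In VC you can reference VBScript to get its regex support; there are write-ups of this online.
Below is VB code that extracts the text with regular expressions, for your reference in VC:
Private Function HtmToTXT(ByVal s As String)
    Set ObjRegExp = New RegExp
    ObjRegExp.IgnoreCase = True 'convert using regular expressions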
    ObjRegExp.Global = True
    ObjRegExp.Pattern = ""
     s = ObjRegExp.Replace(s, "")
    ObjRegExp.Pattern = "<(style)[^<]*>[^<]*<\/\1>" 'style sheets
    s = ObjRegExp.Replace(s, "")
    '''//-------------------
    ObjRegExp.Pattern = "<(select)[^<]*>[^<]*<\/\1>" 'select tags
    s = ObjRegExp.Replace(s, "")
    '//--------------------------------------
    ObjRegExp.Pattern = "<(script)[^<]*>[\s\S]*?<\/\1>" 'scripts
    s = ObjRegExp.Replace(s, "")
    ObjRegExp.Pattern = "<br[^>]*>" 'br
    s = ObjRegExp.Replace(s, vbCrLf)
    ObjRegExp.Pattern = "<(title)[^<]*>[^<]*<\/\1>" 'title
    s = ObjRegExp.Replace(s, "")
    ObjRegExp.Pattern = "<[^<]{1,100}>" 'all html
    s = ObjRegExp.Replace(s, "")
    ObjRegExp.Pattern = "&[^;]{2,4};"
    s = ObjRegExp.Replace(s, " ")

    ObjRegExp.Pattern = "\n[ \f\r\t\v]*"
    s = ObjRegExp.Replace(s, vbCrLf)
    ObjRegExp.Pattern = "[\n\x0a\x0d]+"
    s = ObjRegExp.Replace(s, vbCrLf)
    HtmToTXT = s
End Function
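
For anyone who would rather stay in C++ than reference VBScript, roughly the same idea can be sketched with std::regex (C++11). This is a simplified port under assumptions, not a line-by-line translation of the VB above: HtmlToText is a made-up name, several patterns are merged or slightly changed, and regex-based tag stripping is only an approximation of what a real HTML parser does.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: strips markup from an HTML string and returns plain text.
std::string HtmlToText(std::string s)
{
    using std::regex;
    const auto flags = regex::ECMAScript | regex::icase;

    // Drop script and style blocks together with their contents.
    s = std::regex_replace(s, regex(R"(<(script|style)[^>]*>[\s\S]*?</\1>)", flags), "");
    // Turn <br> into a line break.
    s = std::regex_replace(s, regex(R"(<br[^>]*>)", flags), "\n");
    // Remove every remaining tag.
    s = std::regex_replace(s, regex(R"(<[^>]+>)", flags), "");
    // Replace short character entities such as &nbsp; with a space.
    s = std::regex_replace(s, regex(R"(&[^;\s]{2,6};)", flags), " ");
    // Collapse runs of line breaks (and the indentation after them) into one newline.
    s = std::regex_replace(s, regex(R"((?:\r?\n[ \t]*)+)"), "\n");
    return s;
}
```

A call such as HtmlToText(pageSource) could then be followed by writing the result to a .txt file. For very large pages a regex approach may be slow, so treat this as a starting point only.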

For some web pages, the content cannot be retrieved through the IWebBrowser2 interface!
Are there any other methods?

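Download the WalkAll sample code from Microsoft's website and take a look. It is an HTML document parser with no UI that can download a page and parse it into a DOM, and it gives you an IHTMLDocument2 interface pointer. With that pointer you can call
IHTMLDocument2::get_body() to obtain an IHTMLElement pointer, then call get_innerText to get a string; that string is the content you would get when saving as TXT.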

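I have read everyone's replies carefully, but my foundation is too weak and it will take me a long time to digest them.

To: joycheney
    I'm embarrassed to say that I am still reading up on how the parameters of the ExecWB method are used and learning the VARIANT structure.

To: greatws
    As you said, it is quite interesting, but I don't yet know how to adapt it for my own use.

To: ruo_gu
    Thanks for your enthusiasm, but first, I don't know VB, and second, I don't know regular expressions (I have only heard the name).

To: jameshooo
    Your method feels like the one I need most, but I haven't even got Microsoft's sample to run yet (it still fails after installing and setting up the IE 5.01 library and header files;
    I suspect an SDK is needed as well?).
    Also, I don't yet understand what "DOM" is.

My goal:
    On a certain web page, one piece of content (text) is produced by running a script; what I need is to run that script and get the resulting text.
    I looked at the page source and found that it uses the innerHTML property. Looking it up on MSDN, I learned that innerHTML belongs to DHTML and related topics. What a headache! Please don't laugh at me; I only learned a little programming (C) many years ago, recently read a bit about VC, and, not knowing my limits, wanted to put together a program to use.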

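To: joycheney. I used the code you posted. After creating the IWebBrowser2 interface, I call the Navigate2 function. However, when loading the HTML, the call hr = pWebBrowser->get_Document (&pHtmlDocDispatch);
always returns a negative value. Why is that? Many thanks!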

Finally got it working:
call IHTMLDocument2::get_body(),
then call IHTMLElement::get_innerText().
Thanks, everyone!
