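Dear experts: I am planning to write a program that fetches web page content for later analysis. Downloading the page with CInternetSession::OpenURL gives me the HTML source.
How can I get the document that a browser produces when you save the page as text (.txt)? I only need the text, not images or other resources.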
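Using the Win32 API directly is much more convenient:
Step 1: call InternetOpen to get a session handle (InternetConnect is not needed when you go through InternetOpenUrl).
Step 2: call InternetOpenUrl.
Step 3: call InternetReadFile.
Look up the exact definitions in MSDN.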
WinINet allows your application to act as an HTTP, FTP, or Gopher client without having to understand or, more importantly, keep up with the ever-evolving protocol standards. If you use WinINet in your applications, when standards change you can let WinINet worry about the changes while your interface to the protocol remains the same. WinINet can be used to write product-ordering systems, stock tickers/analyzers, online banking systems, FTP clients, your own Internet browser, and so on. Before WinINet, adding Internet communications to Windows-based applications required expertise in sockets and protocol specifications. Even simple communications required considerable development time. WinINet lets you quickly and easily add Internet communications to your applications.

MFC also implements some classes that use these APIs. These classes are distributed in different hierarchies. I developed a small class with only two methods. By adding this class to a project and calling one method, you can easily download the web page at a given URL.

The class has two methods:
CString GetWebPage(const CString& url);
void SetErrorMessage(CString s);

GetWebPage accepts the URL (it must be complete, i.e., http://www.codeguru.com) and returns the requested page. SetErrorMessage sets the default error message. If an error occurs for any reason, GetWebPage returns this message. I am working on it, and in the future an actual error message will be reported in addition to the default one.
/*
//------------------------------------------------------------------------------------------------------------------
// WebWorld.h: interface for the CWebWorld class.
//------------------------------------------------------------------------------------------------------------------
*/
#include "wininet.h"

class CWebWorld
{
public:
    void SetErrorMessage(CString s);
    CString GetWebPage(const CString& url);
    CWebWorld();
    virtual ~CWebWorld();

private:
    CString m_ErrorMessage;
    HINTERNET m_Session;
};
/*
//------------------------------------------------------------------------------------------------------------------
// WebWorld.cpp: implementation of the CWebWorld class.
//------------------------------------------------------------------------------------------------------------------
*/
#include "stdafx.h"
#include "WebThief.h"

#ifdef _DEBUG
#undef THIS_FILE
static char THIS_FILE[] = __FILE__;
#define new DEBUG_NEW
#endif

#define AGENT_NAME "CodeguruBrowser1.0"

//////////////////////////////////////////////////////////////////////
// Construction/Destruction
//////////////////////////////////////////////////////////////////////

CWebWorld::CWebWorld()
{
    // Initialize the Win32 Internet functions.
    m_Session = ::InternetOpen(AGENT_NAME,
        INTERNET_OPEN_TYPE_PRECONFIG, // Use registry settings.
        NULL,                         // Proxy name. NULL indicates use default.
        NULL,                         // List of local servers. NULL indicates default.
        0);
}

CWebWorld::~CWebWorld()
{
    // Close the session if it was opened.
    if (m_Session)
        ::InternetCloseHandle(m_Session);
}

CString CWebWorld::GetWebPage(const CString& url)
{
    // Start with the default error message; it is returned if anything fails.
    CString contents = m_ErrorMessage;

    // Open the URL and get a handle for the HTTP file.
    HINTERNET hHttpFile = ::InternetOpenUrl(m_Session, (LPCTSTR)url, NULL, 0, 0, 0);
    if (hHttpFile == NULL)
        return contents; // Connection failed.

    // Get the size of the HTTP file. The query fails if the server sends no Content-Length.
    char szSizeBuffer[32];
    DWORD dwLengthSizeBuffer = sizeof(szSizeBuffer);
    BOOL bQuery = ::HttpQueryInfo(hHttpFile, HTTP_QUERY_CONTENT_LENGTH,
        szSizeBuffer, &dwLengthSizeBuffer, NULL);
    if (bQuery)
    {
        // Allocate memory for the HTTP file contents (ANSI build assumed, as in the original).
        DWORD dwFileSize = atol(szSizeBuffer);
        CString page;
        LPSTR szContents = page.GetBuffer(dwFileSize);

        // InternetReadFile may return fewer bytes than requested, so read until done.
        DWORD dwTotal = 0;
        DWORD dwBytesRead = 0;
        BOOL bRead = TRUE;
        while (dwTotal < dwFileSize &&
               (bRead = ::InternetReadFile(hHttpFile, szContents + dwTotal,
                                           dwFileSize - dwTotal, &dwBytesRead)) &&
               dwBytesRead > 0)
        {
            dwTotal += dwBytesRead;
        }
        page.ReleaseBuffer((int)dwTotal); // Every GetBuffer needs a matching ReleaseBuffer.
        if (bRead)
            contents = page;
    }

    ::InternetCloseHandle(hHttpFile); // Close the connection on every path.
    return contents;
}

void CWebWorld::SetErrorMessage(CString s)
{
    m_ErrorMessage = s;
}
The following shows how the class is used:
CWebWorld a;
CString pageContent;
a.SetErrorMessage("There is some error in getting web page ... ");
pageContent = a.GetWebPage(m_url);
Create the IWebBrowser2 instance:
IWebBrowser2* pWebBrowser = NULL;
HRESULT hr = CoCreateInstance(CLSID_InternetExplorer, NULL, CLSCTX_SERVER, IID_IWebBrowser2, (LPVOID*)&pWebBrowser);
if (SUCCEEDED(hr) && (pWebBrowser != NULL))
{
    m_pWebBrowser = pWebBrowser;
    m_pWebBrowser->put_Visible(VARIANT_TRUE); // Set this to VARIANT_FALSE to keep the window hidden.
    return true;
}
else
{
    if (pWebBrowser)
        pWebBrowser->Release();
    return false;
}
Load the HTML:
HRESULT hr;
IDispatch* pHtmlDocDispatch = NULL;
IHTMLDocument2* pHtmlDoc = NULL;

// Retrieve the document object.
hr = pWebBrowser->get_Document(&pHtmlDocDispatch);
if (SUCCEEDED(hr) && (pHtmlDocDispatch != NULL))
{
    hr = pHtmlDocDispatch->QueryInterface(IID_IHTMLDocument2, (void**)&pHtmlDoc);
    if (SUCCEEDED(hr) && (pHtmlDoc != NULL))
    {
        IHTMLElement* pBodyElem = NULL;
        hr = pHtmlDoc->get_body(&pBodyElem);
        if (SUCCEEDED(hr) && (pBodyElem != NULL))
        {
            CString sInnerHTML = "the HTML you want to load into the body";
            BSTR bstrInnerHTML = sInnerHTML.AllocSysString();
            pBodyElem->put_innerHTML(bstrInnerHTML);
            SysFreeString(bstrInnerHTML);
            pBodyElem->Release();
        }
        pHtmlDoc->Release();
    }
    pHtmlDocDispatch->Release();
}
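Finally, use ExecWB to run an IE command, such as Save As. ExecWB takes its input and output arguments as VARIANT pointers, and backslashes in the path must be escaped:
CComVariant vPath(L"c:\\c.htm"); // CComVariant comes from <atlbase.h>
pWebBrowser->ExecWB(OLECMDID_SAVEAS, OLECMDEXECOPT_DONTPROMPTUSER, &vPath, NULL);
Then read the contents of c:\c.htm. Saving in HTM format works this way; I have not tried saving as TXT, so try that yourself.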
Below is VB code that extracts the text with regular expressions, for your reference when doing the same in VC:
Private Function HtmToTXT(ByVal s As String) As String
    Dim ObjRegExp As RegExp
    Set ObjRegExp = New RegExp
    ObjRegExp.IgnoreCase = True 'convert with regular expressions
    ObjRegExp.Global = True
    ObjRegExp.Pattern = "<!--[\s\S]*?-->" 'comments
    s = ObjRegExp.Replace(s, "")
    ObjRegExp.Pattern = "<(style)[^<]*>[^<]*<\/\1>" 'style sheets
    s = ObjRegExp.Replace(s, "")
    ObjRegExp.Pattern = "<(select)[^<]*>[^<]*<\/\1>" 'select tags
    s = ObjRegExp.Replace(s, "")
    ObjRegExp.Pattern = "<(script)[^<]*>[\s\S]*?<\/\1>" 'scripts
    s = ObjRegExp.Replace(s, "")
    ObjRegExp.Pattern = "<br[^>]*>" 'br
    s = ObjRegExp.Replace(s, vbCrLf)
    ObjRegExp.Pattern = "<(title)[^<]*>[^<]*<\/\1>" 'title
    s = ObjRegExp.Replace(s, "")
    ObjRegExp.Pattern = "<[^<]{1,100}>" 'all remaining HTML tags
    s = ObjRegExp.Replace(s, "")
    ObjRegExp.Pattern = "&[^;]{2,4};" 'entities
    s = ObjRegExp.Replace(s, " ")
    ObjRegExp.Pattern = "\n[ \f\r\t\v]*"
    s = ObjRegExp.Replace(s, vbCrLf)
    ObjRegExp.Pattern = "[\n\x0a\x0d]+"
    s = ObjRegExp.Replace(s, vbCrLf)
    HtmToTXT = s
End Function
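For anyone who wants the same idea in C++ rather than VB, here is a rough sketch using std::regex. It is not a line-by-line port: HtmlToText is a made-up name, the style/script/title/select rules are merged into one pattern, and the entity pattern is narrowed to letters and digits. It also assumes a compiler with C++11 <regex>, which VC6 does not have. Regex stripping is only an approximation of what a browser saves as text, and std::regex may be slow or run into recursion limits on very large pages.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: strips markup from an HTML string with regular expressions.
std::string HtmlToText(std::string s)
{
    const auto flags = std::regex::ECMAScript | std::regex::icase;

    struct Rule { const char* pattern; const char* replacement; };
    static const Rule rules[] = {
        { "<!--[\\s\\S]*?-->", "" },                                   // comments
        { "<(style|script|title|select)[^>]*>[\\s\\S]*?</\\1>", "" },  // elements whose content is not page text
        { "<br[^>]*>", "\r\n" },                                       // <br> becomes a line break
        { "<[^>]*>", "" },                                             // every remaining tag
        { "&#?[a-z0-9]{2,8};", " " },                                  // entities become a space
        { "[\\r\\n]\\s*", "\r\n" },                                    // collapse blank lines and leading whitespace
    };

    // Apply the rules in order; each pass works on the output of the previous one.
    for (const Rule& r : rules)
        s = std::regex_replace(s, std::regex(r.pattern, flags), r.replacement);
    return s;
}
```

You would call it on the downloaded source, for example HtmlToText(std::string(pageContent)) in an ANSI build. Text that a script writes into the page at run time will not be in the downloaded source, so it will not show up here.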
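Are there any other approaches?
IHTMLDocument2::get_body() gives you an IHTMLElement pointer. Call get_innerText on it and you get a string, which is the content you would get when saving the page as TXT.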
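A bit embarrassed to say this: I am still reading up on how the parameters of ExecWB are used and learning the VARIANT structure.
To greatws: as you said, it is quite interesting, but I do not yet know how to adapt it for my own use.
To ruo_gu: thank you for your kindness, but I know neither VB nor regular expressions (I have only heard the name).
To jameshooo: your method feels like exactly what I need, but I have not even managed to run Microsoft's sample yet (it still fails after installing and configuring the IE 5.01 library and header files; I suspect the SDK is also needed?).
Also, I do not yet understand what "DOM" is. My goal: on a certain web page, one piece of content (text) is produced by running a script. What I need is to run that script and get the resulting text.
Looking at the page source, I found that it uses the innerHTML property. After checking MSDN I learned that innerHTML belongs to DHTML and related things. My head hurts! Please do not laugh at me: I only learned a little programming (C) some years ago, recently read a bit about VC, and am now recklessly trying to put together a program to use.
It always returns a negative value. Why is that? Thanks!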
Finally got it working:
call IHTMLDocument2::get_body(),
then call IHTMLElement::get_innerText().
Thanks, everyone!