A Fragment of a Collector Class from MySpider (C#): Downloading Web Pages
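搜遍青岛 (www.qdseek.com), a site run by 明晨网络 (MingchenNet.com), is a regional vertical search engine for job listings. To gather its data it needs a collector, in other words a web spider. Because crawling runs continuously, the collector is built as a WinForm application. Several general-purpose collectors are available online; the best of them is probably 火车头, which is developed in C#. For www.qdseek.com, however, a general-purpose tool cannot guarantee accurate data extraction, so 明晨网络 developed its own spider, MySpider. MySpider was first written in VB6.0, later moved to VB.NET 2005, and is now written in C# 2008. The code below shows how to fetch a text resource over HTTP with .NET: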
using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;

namespace MySpider
{
    class Spider
    {
        #region GetHttpText: fetch a text resource over HTTP
        /// <summary>
        /// Fetches a text resource over HTTP and returns its source.
        /// </summary>
        /// <param name="Url">URL to request</param>
        /// <param name="Method">HTTP method, GET or POST</param>
        /// <param name="PostData">Data to submit when the method is POST</param>
        /// <param name="Charset">Character encoding used to decode the response</param>
        /// <param name="UserAgent">User-Agent string</param>
        /// <param name="Referer">Referer URL</param>
        /// <param name="cookie">Cookie container; may be null</param>
        /// <returns>Source text of the resource, or an empty string on failure</returns>
        public string GetHttpText(
            string Url,
            string Method,
            string PostData,
            string Charset,
            string UserAgent,
            string Referer,
            CookieContainer cookie)
        {
            try
            {
                if (string.IsNullOrEmpty(Url)) { return ""; }
                // Default encoding is GB2312
                if (string.IsNullOrEmpty(Charset)) { Charset = "GB2312"; }
                // Default User-Agent
                if (string.IsNullOrEmpty(UserAgent)) { UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"; }
                // Default Referer is the requested URL itself
                if (string.IsNullOrEmpty(Referer)) { Referer = Url; }
                // Normalize the method to upper case (GET or POST);
                // a null or empty method falls back to GET instead of throwing
                Method = string.IsNullOrEmpty(Method) ? "GET" : Method.ToUpper();
                // Create the HTTP request
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create(Url);
                // Attach the cookie container
                request.CookieContainer = cookie;
                // Set the HTTP method
                request.Method = Method;
                // Set the User-Agent
                request.UserAgent = UserAgent;
                // Set the Referer
                request.Referer = Referer;
                // For POST, write the form data into the request body
                if (Method == "POST")
                {
                    // A null PostData is sent as an empty body
                    byte[] postBytes = Encoding.UTF8.GetBytes(PostData ?? "");
                    // Content type of the submitted data
                    request.ContentType = "application/x-www-form-urlencoded";
                    // Length of the submitted data
                    request.ContentLength = postBytes.Length;
                    // Write the submitted data
                    using (Stream reqStream = request.GetRequestStream())
                    {
                        reqStream.Write(postBytes, 0, postBytes.Length);
                    }
                }
                // Read the whole response body with the requested encoding;
                // the using blocks release the connection even if reading fails
                using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
                using (Stream stream = response.GetResponseStream())
                using (StreamReader reader = new StreamReader(stream, Encoding.GetEncoding(Charset)))
                {
                    return reader.ReadToEnd();
                }
            }
            // Any failure (bad URL, network error, unknown charset) yields an empty string
            catch { return ""; }
        }
        #endregion

        #region GetHttpText overload without cookies
        /// <summary>
        /// Fetches a text resource over HTTP without a cookie container.
        /// </summary>
        /// <param name="Url">URL to request</param>
        /// <param name="Method">HTTP method, GET or POST</param>
        /// <param name="PostData">Data to submit when the method is POST</param>
        /// <param name="Charset">Character encoding used to decode the response</param>
        /// <param name="UserAgent">User-Agent string</param>
        /// <param name="Referer">Referer URL</param>
        /// <returns>Source text of the resource, or an empty string on failure</returns>
        public string GetHttpText(
            string Url,
            string Method,
            string PostData,
            string Charset,
            string UserAgent,
            string Referer)
        {
            // Delegate to the full overload with no cookie container
            return GetHttpText(Url, Method, PostData, Charset, UserAgent, Referer, null);
        }
        #endregion
    }
}
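GetHttpText sends PostData with the content type application/x-www-form-urlencoded and encodes it as UTF-8, but it does not escape the data itself, so the caller has to pass an already encoded string. The following is a minimal sketch of one way to build such a string. The class PostDataBuilder and its Build method are hypothetical names introduced for illustration and are not part of MySpider; the sketch uses Uri.EscapeDataString, which percent-encodes text as UTF-8 and writes a space as %20.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

namespace MySpider
{
    // Hypothetical helper, not part of the original MySpider code.
    public static class PostDataBuilder
    {
        // Joins name/value pairs into a body such as "q=a%20b&page=1".
        public static string Build(IEnumerable<KeyValuePair<string, string>> fields)
        {
            StringBuilder sb = new StringBuilder();
            foreach (KeyValuePair<string, string> field in fields)
            {
                // Separate pairs with '&'
                if (sb.Length > 0) { sb.Append('&'); }
                // Percent-encode both name and value; null is treated as empty
                sb.Append(Uri.EscapeDataString(field.Key ?? ""));
                sb.Append('=');
                sb.Append(Uri.EscapeDataString(field.Value ?? ""));
            }
            return sb.ToString();
        }
    }

    public static class PostDataDemo
    {
        public static void Main()
        {
            // A list keeps the fields in the order they were added
            List<KeyValuePair<string, string>> fields = new List<KeyValuePair<string, string>>();
            fields.Add(new KeyValuePair<string, string>("q", "a b"));
            fields.Add(new KeyValuePair<string, string>("page", "1"));

            Console.WriteLine(PostDataBuilder.Build(fields)); // prints "q=a%20b&page=1"
        }
    }
}
```

The resulting string can be passed as the PostData argument of GetHttpText with Method set to "POST". Passing the same CookieContainer instance to successive calls should let a session started by one request carry over to the next.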
Source: 明晨网络 (original article), "A Fragment of a Collector Class from MySpider (C#): Downloading Web Pages", http://www.mingchennet.com/tec/code/dotnet/5.htm