Python urllib2 Notes (Web Scraping)
Source: http://my.oschina.net/v5871314/blog/612742
0. A Simple Example
Python's urllib2 library makes it easy to fetch web pages. The following code fetches the Baidu homepage and prints it:
# -*- coding: utf-8 -*-
import urllib2

response = urllib2.urlopen("http://www.baidu.com")
print response.read()
Code analysis
Let's first look at the signature of urllib2.urlopen():

urllib2.urlopen(url[, data][, timeout[, cafile[, capath[, cadefault[, context]]]]])

Open the URL url, which can be either a string or a Request object.
i) url is the URL to request, either a string or a Request object
ii) data is the data to submit; it must be encoded with urllib.urlencode() first
iii) timeout sets the timeout in seconds
1. Submitting Data
A) POST request
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
response = urllib2.urlopen(url, formal_post_data)
print response.read()
Output: (screenshot omitted)
B) GET request
# -*- coding: utf-8 -*-
import urllib
import urllib2

get_data = {'key1': 'value1', 'key2': 'value2'}
formal_get_data = urllib.urlencode(get_data)
url = 'http://httpbin.org/get' + '?' + formal_get_data
response = urllib2.urlopen(url)
print response.read()
Output: (screenshot omitted)
2. The Request Object
Note that the first argument of urllib2.urlopen() can also be a Request object; introducing Request makes it more convenient to package up the request data.
Its prototype:

urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
# set headers
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, formal_post_data, headers)
response = urllib2.urlopen(request)
# assume the response body is encoded in UTF-8
content = response.read().decode('utf-8')
print content
Output: (screenshot omitted)
Helper methods of Request
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
# set headers
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, formal_post_data, headers)

print u'the request method (POST or GET)'
method = request.get_method()
print 'get_method===>' + method

print u'the data to be submitted'
data = request.get_data()
print 'request.get_data()===>', data

print u'the URL given as an argument'
full_url = request.get_full_url()
print 'request.get_full_url()===>', full_url

print u'the scheme of the request'
request_type = request.get_type()
print 'request.get_type()===>', request_type

print u'the host of the request'
host = request.get_host()
print 'request.get_host()===>', host

print u'the selector - the part of the URL that is sent to the server'
selector = request.get_selector()
print 'request.get_selector()===>', selector

print u'the request headers'
header_items = request.header_items()
print 'request.header_items()===>', header_items

## get_header(header_name, default=None) returns the named header
## Request.add_header(key, val) adds a header
## Request.has_header(header) checks whether the instance has the given header
## Request.has_data() checks whether POST data is present
Output: (screenshot omitted)
3. The Response Object
The response object returned by urllib2.urlopen() has the following methods:
geturl() - returns the URL of the resource retrieved, commonly used to determine whether a redirect was followed
info() - returns the meta-information of the page, such as the headers, in the form of a mimetools.Message instance
getcode() - returns the HTTP status code of the response
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
# set headers
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, formal_post_data, headers)
response = urllib2.urlopen(request)

print u'the real URL (after any redirects)'
print response.geturl()
print u'the HTTP status code'
print response.getcode()
print u'the meta-information of the page'
print response.info()
Output: (screenshot omitted)
4. Commonly Used Code
# -*- coding: utf-8 -*-
import urllib
import urllib2

url = 'http://httpbin.org/post'
post_data = {'key1': 'value1', 'key2': 'value2'}
formal_post_data = urllib.urlencode(post_data)
# set headers
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib2.Request(url, formal_post_data, headers)
response = urllib2.urlopen(request)
# assume the response body is encoded in UTF-8
content = response.read().decode('utf-8')
print content