In Python 2.x there were two built-in modules, urllib and urllib2; in Python 3.x they were merged into a single urllib library. urllib ships with the standard library and provides a set of functions for working with URLs.
## The crawler request module

What urllib gives you is the ability to issue all kinds of HTTP requests from a program. If you want to impersonate a browser for a particular task, you have to disguise your requests as browser requests: capture the requests a real browser sends, then copy the relevant request headers. The User-Agent header is the one that identifies the browser.
### Versions

- Python 2: urllib, urllib2
- Python 3: urllib, requests (covered in the next post)
## Common methods

### urllib.request module

This module makes it very easy to fetch the contents of a URL: it sends a GET request to the given page and returns the HTTP response.
#### urllib.request.urlopen("URL")

- Purpose: send a request to a site and get the response back
- `bytes = res.read()`
- `string = res.read().decode("utf-8")`
- `encode()`: converts str to bytes
- `decode()`: converts bytes to str
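A minimal sketch of a plain `urlopen()` call; http://httpbin.org/get is used here only as a stand-in test URL:

```py
import urllib.request

# Send a GET request and read the response body
res = urllib.request.urlopen("http://httpbin.org/get")
html = res.read().decode("utf-8")  # bytes -> str
print(html)
```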
- Overriding the User-Agent
  - urlopen() does not support overriding the User-Agent
  - Request() does support it:

```py
urllib.request.Request(url, headers={"User-Agent": ""})
```

#### urllib.request.Request(url, headers=dict)

The User-Agent is step one in the fight between crawlers and anti-crawler defenses: every request you send must carry a User-Agent.
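A quick way to see the difference is to ask a test endpoint to echo back the User-Agent it received; httpbin.org/user-agent is used here purely as a convenient echo service:

```py
import urllib.request

url = "http://httpbin.org/user-agent"  # echoes back the User-Agent it received

# Plain urlopen(): the server sees the default "Python-urllib/3.x"
print(urllib.request.urlopen(url).read().decode("utf-8"))

# Request with a custom header: the server sees a browser-like UA
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
print(urllib.request.urlopen(req).read().decode("utf-8"))
```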
**Usage flow**

```py
req = urllib.request.Request(url, headers=...)  # 1. create the request object
res = urllib.request.urlopen(req)               # 2. get the response object
html = res.read().decode("utf-8")               # 3. read the response content
```
**Methods of the response object `res`**

1. `res.read()`: read the response body
2. `res.getcode()`: get the HTTP status code
   - 200: success
   - 4XX: client-side error (bad request, page not found, ...)
   - 5XX: server-side error
3. `res.geturl()`: return the URL the data actually came from (guards against redirect surprises)
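A small sketch exercising those methods, again against an example URL:

```py
import urllib.request

res = urllib.request.urlopen("http://httpbin.org/get")  # example URL
print(res.getcode())   # 200 on success
print(res.geturl())    # URL the data actually came from, after any redirects
print(res.read().decode("utf-8")[:100])  # first 100 characters of the body
```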
#### urllib.request.Request(url, data=data, headers=dict)

- `data`: form data must be submitted as bytes, not str. To build it (see the sketch below):
  1. define the form data as a dict: `data`
  2. `urllib.parse.urlencode(data)`
  3. `.encode("utf-8")` to convert the result to bytes
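A minimal sketch of the three steps; http://httpbin.org/post and the form field are assumptions used only as a test target:

```py
import urllib.request
import urllib.parse

form = {"name": "value"}               # 1. form data as a dict
data = urllib.parse.urlencode(form)    # 2. -> "name=value"
data = data.encode("utf-8")            # 3. str -> bytes

# Passing data= makes urlopen() send a POST request
req = urllib.request.Request("http://httpbin.org/post", data=data,
                             headers={"User-Agent": "Mozilla/5.0"})
res = urllib.request.urlopen(req)
print(res.read().decode("utf-8"))
```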
### urllib.parse module

#### 1. urllib.parse.urlencode({dict})

For example:

```py
wd = {"wd": "摩擦科技"}
s = urllib.parse.urlencode(wd)
# s == "wd=%E6%91%A9%E6%93%A6%E7%A7%91%E6%8A%80"
```
#### 2. urllib.parse.quote("string")

For example:

```py
s1 = "摩擦科技"
s2 = urllib.parse.quote(s1)
# s2 == "%E6%91%A9%E6%93%A6%E7%A7%91%E6%8A%80"
```
**Write to a local file**

```py
with open("filename.txt", "w", encoding="gb18030") as f:
    f.write(string)
```
#### 3. urllib.parse.unquote("%e8%d3%f8…")

Decodes a percent-encoded string back into the original text (the inverse of quote()).
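A quick round-trip sketch:

```py
import urllib.parse

s = urllib.parse.quote("摩擦科技")
print(s)                        # "%E6%91%A9%E6%93%A6%E7%A7%91%E6%8A%80"
print(urllib.parse.unquote(s))  # "摩擦科技"
```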
## Exercises

### 1. Scraping Baidu Tieba
```py
import urllib.request
import urllib.parse
import random
import time

# Pool of User-Agent headers to rotate through
h_list = [
    {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60"},
    {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0"},
    {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2"},
    {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71"},
]

baseurl = "http://tieba.baidu.com/f?"
name = input("Enter the tieba name: ")
begin = int(input("Enter the start page: "))
end = int(input("Enter the end page: "))

# Percent-encode the keyword: {"kw": name} -> "kw=%XX%XX..."
kw = urllib.parse.urlencode({"kw": name})

for page in range(begin, end + 1):
    pn = (page - 1) * 50                 # Tieba paginates 50 posts per page
    url = baseurl + kw + "&pn=" + str(pn)
    headers = random.choice(h_list)      # pick a random User-Agent
    req = urllib.request.Request(url, headers=headers)
    res = urllib.request.urlopen(req)
    html = res.read().decode("utf-8")
    time.sleep(0.5)                      # throttle requests
    filename = "page_%d.html" % page
    with open(filename, "w", encoding="gb18030") as f:
        f.write(html)
    print("Page %d saved" % page)
```
### 2. Scraping Baidu Tieba with GET (class-based)
```py
import urllib.request
import urllib.parse

class BaiduSpider:
    def __init__(self):
        self.headers = {"User-Agent": "Mozilla/5.0"}
        self.baseurl = "http://tieba.baidu.com/f?"

    def getPage(self, url):
        """Fetch one page and return it as a decoded string."""
        req = urllib.request.Request(url, headers=self.headers)
        res = urllib.request.urlopen(req)
        html = res.read().decode("utf-8")
        return html

    def writePage(self, filename, html):
        """Save one page to a local file."""
        with open(filename, "w", encoding="utf-8") as f:
            f.write(html)

    def workOn(self):
        name = input("Enter the tieba name: ")
        begin = int(input("Enter the start page: "))
        end = int(input("Enter the end page: "))
        kw = urllib.parse.urlencode({"kw": name})
        for page in range(begin, end + 1):
            pn = (page - 1) * 50
            url = self.baseurl + kw + "&pn=" + str(pn)
            html = self.getPage(url)
            filename = "page_%d.html" % page
            self.writePage(filename, html)
            print("Page %d scraped" % page)

if __name__ == "__main__":
    spider = BaiduSpider()
    spider.workOn()
```
### 3. Youdao Translate via POST
```py
import urllib.request
import urllib.parse
import json

key = input("Enter the text to translate: ")

# Form fields copied from a recorded request of the Youdao web client;
# salt/sign are fixed request-signing values taken from that capture.
data = {
    "i": key,
    "from": "AUTO",
    "to": "AUTO",
    "smartresult": "dict",
    "client": "fanyideskweb",
    "salt": "1543043553775",
    "sign": "802536ba0b13500261edf93830d299a0",
    "doctype": "json",
    "version": "2.1",
    "keyfrom": "fanyi.web",
    "action": "FY_BY_REALTIME",
    "typoResult": "false",
}

# Form data must be bytes: dict -> query string -> utf-8 bytes
data = urllib.parse.urlencode(data)
data = data.encode("utf-8")

url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
headers = {"User-Agent": "Mozilla/5.0"}
req = urllib.request.Request(url, data=data, headers=headers)
res = urllib.request.urlopen(req)
html = res.read().decode("utf-8")

# The response is JSON; pull out the translated text
r_dict = json.loads(html)
r = r_dict["translateResult"][0][0]["tgt"]
print(r)
```