Web Scraping: The requests Module

Posted on 2018-08-08 in Tutorials

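Python's built-in urllib module can access network resources, but it is cumbersome to use and lacks many practical high-level features.
requests is a third-party Python library that makes working with URL resources much more convenient.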

Basic usage

res = requests.get(url, params=params, headers=headers)

Request: params is URL-encoded automatically and appended to url as the query string.
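To see the encoding and joining described above without sending anything over the network, a request can be prepared and its final URL inspected. This is a minimal sketch; the wd and pn parameter names and their values are made up for illustration.

```python
import requests

# Hypothetical query parameters, chosen only for illustration.
params = {"wd": "爬虫", "pn": 10}

# Build the request without sending it, then look at the final URL.
prepared = requests.Request("GET", "http://httpbin.org/get", params=params).prepare()
final_url = prepared.url
print(final_url)  # the non-ASCII value appears percent-encoded in the query string
```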

Parameters:
url: the request URL
params: the query parameters
headers: the request headers
timeout: the timeout in seconds
proxies: the proxy settings (see Appendix 1)
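requests does not apply a timeout unless one is passed, so a request to an unresponsive server can hang. The sketch below wraps requests.get in a hypothetical helper, fetch_text, that returns None on a timeout, connection failure, or error status; the helper name and the 5-second default are assumptions, not part of requests.

```python
import requests


def fetch_text(url, params=None, headers=None, timeout=5):
    """Return the response body as text, or None if the request fails."""
    try:
        res = requests.get(url, params=params, headers=headers, timeout=timeout)
        res.raise_for_status()  # treat 4xx/5xx status codes as failures
    except requests.exceptions.RequestException as exc:
        # Timeout, ConnectionError and HTTPError all derive from RequestException.
        print(f"request failed: {exc!r}")
        return None
    return res.text
```

A call such as fetch_text("http://httpbin.org/get", timeout=3) would then give either the page text or None.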

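Attributes of the response object res
1. encoding: the encoding used to decode the response; it can be set, e.g. res.encoding = "utf-8"
2. text: the response body as a string
3. content: the response body as bytes
4. status_code: the HTTP status code
5. url: the URL that actually returned the data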
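The attributes listed above can be gathered into a small summary for logging or debugging. summarize_response is a hypothetical helper written for this sketch, not part of requests.

```python
import requests


def summarize_response(res, encoding="utf-8"):
    """Collect the commonly used attributes of a requests.Response."""
    res.encoding = encoding  # controls how res.text decodes the raw bytes
    return {
        "status_code": res.status_code,
        "url": res.url,
        "encoding": res.encoding,
        "text_length": len(res.text),        # characters after decoding
        "content_length": len(res.content),  # raw bytes
    }
```

For text containing non-ASCII characters, text_length and content_length will usually differ, which is a quick reminder that text and content are not interchangeable.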

Saving unstructured data

html = res.content
with open("xxx","wb") as f:
    f.write(html)
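When saving a response to disk, it can help to check the status code first so that an error page is not written in place of the file. A minimal sketch, where save_response is a hypothetical helper name:

```python
import requests


def save_response(res, path):
    """Write the raw bytes of a response to path and return the byte count."""
    res.raise_for_status()       # raises requests.exceptions.HTTPError on 4xx/5xx
    with open(path, "wb") as f:  # binary mode, because res.content is bytes
        return f.write(res.content)
```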

res = requests.post(url, data=data, headers=headers)

data: the form data as a dictionary; requests encodes it for you, so no manual encoding or conversion is needed.
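To confirm that a dictionary passed as data is form-encoded automatically, the request can be prepared without being sent. A sketch, with made-up form fields:

```python
import requests

# Hypothetical form fields, for illustration only.
data = {"i": "hello world", "doctype": "json"}

prepared = requests.Request("POST", "http://httpbin.org/post", data=data).prepare()
print(prepared.headers["Content-Type"])  # the form-encoded content type
print(prepared.body)                     # the encoded request body
```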

Advanced guide

Another convenience of requests is that certain response types, such as JSON, can be read directly:
r = requests.get(*******)
r.json()
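r.json() raises an exception when the body is not valid JSON, for instance when a site returns an HTML error page. A defensive sketch, where json_or_none is a hypothetical helper:

```python
import requests


def json_or_none(res):
    """Return the decoded JSON body, or None if the body is not valid JSON."""
    try:
        return res.json()
    except ValueError:
        # The JSON decode errors raised by res.json() are ValueError subclasses.
        return None
```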
By default, requests encodes POST data as application/x-www-form-urlencoded. To send JSON instead, pass it through the json parameter:
params = {'key': 'value'}
r = requests.post(url, json=params)  # serialized to JSON internally
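The effect of json= shows up in the prepared request: the body is a JSON document and the Content-Type header is set accordingly. A sketch that inspects it without sending anything:

```python
import json
import requests

params = {"key": "value"}
prepared = requests.Request("POST", "http://httpbin.org/post", json=params).prepare()

print(prepared.headers["Content-Type"])  # application/json
print(json.loads(prepared.body))         # the body decodes back to the original dict
```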
Similarly, uploading a file requires a more complex encoding format, but requests reduces it to the files parameter:
upload_files = {'file': open('report.xls', 'rb')}
r = requests.post(url, files=upload_files)
When reading the file, be sure to open it in binary mode with 'rb', so that the length of the bytes read matches the length of the file.
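A file object passed to files stays open until it is closed explicitly, so a with block is a tidy way to scope it. A sketch, where upload_file is a hypothetical helper and the form field name file is taken from the example above:

```python
import requests


def upload_file(url, path, timeout=10):
    """POST the file at path as multipart/form-data and return the response."""
    with open(path, "rb") as f:  # binary mode, closed automatically afterwards
        return requests.post(url, files={"file": f}, timeout=timeout)
```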
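Besides making the response body easy to get, requests also makes the rest of the HTTP response easy to inspect. For example, to read the response headers: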
>>> r.headers
{'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Content-Encoding': 'gzip', ...}
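r.headers behaves like a dictionary whose keys are case-insensitive, so a header can be looked up however it is spelled. The sketch below builds such a mapping by hand with requests.structures.CaseInsensitiveDict, so it runs without a network call:

```python
from requests.structures import CaseInsensitiveDict
from requests.utils import get_encoding_from_headers

# Stand-in for r.headers, using values from the example above.
headers = CaseInsensitiveDict({
    "Content-Type": "text/html; charset=utf-8",
    "Content-Encoding": "gzip",
})

print(headers["content-type"])              # same entry as headers["Content-Type"]
print(headers.get("X-Missing", "not set"))  # .get() avoids a KeyError
print(get_encoding_from_headers(headers))   # charset parsed from Content-Type
```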

Example: Youdao Translate

import requests
import json

# Read the text to translate
key = input("Enter the text to translate: ")
# 1. Define the form data as a dictionary (F12 -> Form Data)
data = {
        "i":key,
        "from":"AUTO",
        "to":"AUTO",
        "smartresult":"dict",
        "client":"fanyideskweb",
        "salt":"15437179202229",
        "sign":"b5fee8d2268e22191d3e03ea884d5666",
        "doctype":"json",
        "version":"2.1",
        "keyfrom":"fanyi.web",
        "action":"FY_BY_REALTIME",
        "typoResult":"false"
    }

# Send the request and get the response
# url is the POST address captured from the network traffic, with the _o removed from translate_o
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
headers = {"User-Agent":"Mozilla/5.0"}
# Use the post method of requests; the data parameter takes the form data as a dictionary
res = requests.post(url,data=data,headers=headers)
res.encoding = "utf-8"
html = res.text


# Parse the JSON string into a Python dictionary
r_dict = json.loads(html)
r = r_dict["translateResult"][0][0]["tgt"]
print(r)
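The indexing r_dict["translateResult"][0][0]["tgt"] reads only the first segment. If a response contains several segments, they can be joined instead. The sketch below assumes the nested list-of-lists shape implied by that indexing; the sample payload is invented for illustration and is not a captured response.

```python
def extract_translation(r_dict):
    """Join the "tgt" field of every segment in translateResult."""
    parts = []
    for paragraph in r_dict.get("translateResult", []):
        for segment in paragraph:
            parts.append(segment.get("tgt", ""))
    return "".join(parts)


# Invented payload with the same nesting as the indexing used above.
sample = {"translateResult": [[{"src": "你好", "tgt": "hello"}]]}
print(extract_translation(sample))
```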

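Appendix 1

Checking your public IP

Search for "IP" on Baidu
Request URL: http://httpbin.org/get

Proxy parameter: proxies -> dictionary

Websites that provide proxy IPs

Ordinary proxy: dictionary
proxies = {"protocol": "protocol://IP address:port"}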

proxies = {"http":"http://183.129.207.82:11328"}
res = requests.get(url, proxies=proxies, headers=headers)

Private proxy: dictionary
proxies = {"protocol": "protocol://username:password@IP:port"}
proxies = {"http":"http://309435365:szayclhp@116.255.162.107:16816"}
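If the username or password contains characters such as @ or :, they need to be percent-encoded before being placed in the proxy URL. Below is a sketch of a hypothetical helper, build_proxies, that assembles the dictionary in the format shown above; the host, port, and credentials in the usage line are placeholders.

```python
from urllib.parse import quote


def build_proxies(host, port, user=None, password=None, scheme="http"):
    """Return a proxies dict in the {"protocol": "protocol://..."} form."""
    auth = ""
    if user is not None and password is not None:
        # safe="" also encodes "/" so nothing in the credentials can break the URL.
        auth = f"{quote(user, safe='')}:{quote(password, safe='')}@"
    return {scheme: f"{scheme}://{auth}{host}:{port}"}


# Placeholder values only.
print(build_proxies("127.0.0.1", 8080, "user", "p@ss:word"))
```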

Examples

Ordinary proxy

import requests
import random

url = "http://httpbin.org/get"
#url = "http://www.baidu.com/"
headers = {"User-Agent":"Mozilla/5.0"}

# Proxy pool: fill in each entry as {"protocol": "protocol://IP:port"}
proxyList = [
        {"":""},
        {"":""},
        {"":""},
    ]
proxies = random.choice(proxyList)

res = requests.get(url,proxies=proxies,
                   headers=headers)
res.encoding = "utf-8"
html = res.text
print(html)
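A proxy in the pool can fail at any time, so a random pick is often combined with a retry. The sketch below tries the proxies in random order until one succeeds. fetch_via_pool is a hypothetical helper, and the pool passed to it must contain real proxy entries.

```python
import random
import requests


def fetch_via_pool(url, proxy_pool, headers=None, timeout=5):
    """Try each proxy in random order; return the first successful response."""
    candidates = list(proxy_pool)
    random.shuffle(candidates)  # shuffle a copy, leave the caller's pool untouched
    for proxies in candidates:
        try:
            res = requests.get(url, proxies=proxies, headers=headers, timeout=timeout)
            res.raise_for_status()
            return res
        except requests.exceptions.RequestException:
            continue  # this proxy failed, try the next one
    return None  # every proxy in the pool failed
```

With the names from the example above, the call would look like res = fetch_via_pool(url, proxyList, headers=headers), followed by a check for None before reading res.text.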

Private proxy

import requests

#url = "http://httpbin.org/get"
url = "http://www.baidu.com/"
headers = {"User-Agent":"Mozilla/5.0"}
proxies = {"http":"http://309435365:szayclhp@116.255.162.107:16816"}

res = requests.get(url,proxies=proxies,
                   headers=headers)
res.encoding = "utf-8"
print(res.text)