Web Scraping: BeautifulSoup

Posted on 2018-08-08 in Tutorials

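BeautifulSoup is a flexible, convenient library for parsing web pages, and it supports several parsers.
With it, you can extract information from a page without writing regular expressions.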

Concepts

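A parsing library for HTML and XML. It hands the actual parsing to a backend parser such as lxml, which must be installed separately; Python's built-in html.parser can be used instead.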

Parsing speed, roughly from fastest to slowest:
regular expressions > XPath > beautifulsoup4

python -m pip install beautifulsoup4

or

conda install beautifulsoup4

from bs4 import BeautifulSoup

# html is the document to parse
# 'lxml' names the parser to use
soup = BeautifulSoup(html, 'lxml')
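A minimal sketch of building the soup object and reading from it. The `html` string is made-up sample data, and the sketch assumes Python's built-in `html.parser` so that it runs without lxml; passing `'lxml'` works the same way once lxml is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# 'html.parser' ships with Python; pass 'lxml' instead if lxml is installed.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string    # text inside the <title> tag
first_p_text = soup.p.get_text()  # text inside the first <p> tag
print(title_text, first_p_text)
```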

rObj = soup.find('div', attrs={"id": "aaa"})
rList = soup.find_all('div', attrs={})
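The difference between the two calls can be sketched as follows, assuming a small made-up document and the built-in `html.parser`: `find` returns the first matching tag, or `None` when nothing matches, while `find_all` returns a list of every match.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<div id="aaa">first</div>
<div id="bbb">second</div>
"""
soup = BeautifulSoup(html, "html.parser")

r_obj = soup.find("div", attrs={"id": "aaa"})    # first matching tag
r_list = soup.find_all("div")                    # list of all matching tags
missing = soup.find("div", attrs={"id": "zzz"})  # no match, so None

print(r_obj.get_text())
print(len(r_list))
print(missing)
```

Because `find` can return `None`, it is worth checking the result before calling methods on it.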

soup.find_all()

r_list = soup.find_all(attribute_name="attribute_value")
r_list = soup.find_all(class_="test")

If the attribute name is class, write class_ instead, because class is a reserved keyword in Python.
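A short sketch of keyword-argument filtering, assuming made-up sample markup and the built-in `html.parser`. Note that `class_="test"` also matches a tag that carries several classes, as long as one of them is `test`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = (
    '<p class="test">a</p>'
    '<p class="test other">b</p>'
    '<p class="misc" id="main">c</p>'
)
soup = BeautifulSoup(html, "html.parser")

# class is a Python keyword, so the filter is spelled class_.
by_class = soup.find_all(class_="test")
class_texts = [tag.get_text() for tag in by_class]

# Ordinary attribute names can be passed directly as keyword arguments.
by_id = soup.find_all(id="main")
id_texts = [tag.get_text() for tag in by_id]

print(class_texts, id_texts)
```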

r_list = soup.find_all("tag_name", attrs={"name": "value"})

r_list = soup.find_all("div", attrs={"class": "test"})
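As an illustration, the `attrs` dictionary and the keyword form select the same tags. The dictionary form is also the way to filter on attribute names that are not valid Python identifiers, such as `data-role`. The markup below is made up, and the sketch assumes the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = (
    '<div class="test" data-role="card">x</div>'
    '<div class="test">y</div>'
    '<span class="test">z</span>'
)
soup = BeautifulSoup(html, "html.parser")

# Tag name plus an attrs dictionary: only <div> tags with class "test".
by_dict = soup.find_all("div", attrs={"class": "test"})

# The same query written with the class_ keyword argument.
by_kwarg = soup.find_all("div", class_="test")

# data-role cannot be written as a keyword argument, so use attrs.
by_data = soup.find_all("div", attrs={"data-role": "card"})

print([tag.get_text() for tag in by_dict])
print([tag.get_text() for tag in by_data])
```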