1 BeautifulSoup Overview
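BeautifulSoup is a flexible and convenient library for parsing web pages. It is efficient and supports several parsers, so information can be extracted from a page without writing regular expressions.

Installation

pip3 install beautifulsoup4

Parsers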
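| Parser | Usage | Advantages | Disadvantages |
|---|---|---|---|
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires the lxml C library |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires the lxml C library |
| html5lib | BeautifulSoup(markup, "html5lib") | Most tolerant; parses the document the way a browser does; produces HTML5 | Slow; needs the separate html5lib package (pure Python, no C extension) |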
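Since the parsers in the table differ in whether they need an extra install, one possible way to choose between them at run time is sketched below. The helper name make_soup and the preference order are assumptions made for illustration; the sketch relies on bs4 raising FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed.

    Hypothetical helper: returns a (soup, parser_name) pair.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is available")


# Unclosed <p> and <b> tags: both parsers repair them while building the tree.
soup, used = make_soup("<p>Hello<b>world")
print(used)           # name of the parser that was actually used
print(soup.b.string)  # text inside the repaired <b> tag
```

Listing html.parser last means the call still succeeds on a machine where lxml has not been installed.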
2 Basic Usage

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # pass the parser name: lxml
print(soup.prettify())    # pretty-print the document; missing closing tags are completed
print(soup.title.string)  # the text inside the <title> tag
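To make the automatic completion concrete, the following minimal sketch feeds deliberately truncated markup to the parser and checks that the closing tags appear in the prettified output. The markup and variable names are made up for illustration, and html.parser is used only so that the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Deliberately truncated markup: </p>, </body> and </html> are missing.
broken = "<html><head><title>Demo</title></head><body><p class='story'>Once upon a time"

soup = BeautifulSoup(broken, "html.parser")
pretty = soup.prettify()
print(pretty)

# The parsed tree is complete even though the input was not.
closed = all(tag in pretty for tag in ("</p>", "</body>", "</html>"))
print(closed)
print(soup.title.string)
```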
3 Tag Selectors

Selecting elements

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # pass the parser name: lxml
print(soup.title)        # select the <title> tag
print(type(soup.title))  # check its type: bs4.element.Tag
print(soup.head)

Getting the name
Use .name to get the name of a tag.

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # pass the parser name: lxml
print(soup.title.name)

Getting attributes
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # pass the parser name: lxml
print(soup.p.attrs["name"])  # value of the name attribute of the first <p> tag
print(soup.p["name"])        # a shorter, more direct way to write the same thing

Getting the content
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # pass the parser name: lxml
print(soup.p.string)  # text of the first <p> tag

Nested selection
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # pass the parser name: lxml
print(soup.head.title.string)  # selections can be chained: head, then title, then its text

Children and descendants
.contents

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # pass the parser name: lxml
print(soup.p.contents)  # direct children of the tag, returned as a list

Output (the attribute order inside the printed tags can vary between bs4 versions):

['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']
.children

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # pass the parser name: lxml
print(soup.p.children)  # an iterator over the tag's direct children
for i, child in enumerate(soup.p.children):  # i is the index, child is the node
    print(i, child)

Item 2 prints as empty because it is the line break between two adjacent tags.
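The empty item arises because whitespace between tags is itself a child node. A minimal sketch of two ways to skip such nodes follows; the shortened markup and the variable names are illustrative assumptions, and html.parser is used only so that the sketch runs without lxml.

```python
from bs4 import BeautifulSoup, Tag

html = """
<p class="story">
    Once upon a time
    <a href="http://example.com/elsie" id="link1">Elsie</a>
    <a href="http://example.com/lacie" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: keep only element children. Text nodes are not Tag instances.
tag_children = [child for child in soup.p.children if isinstance(child, Tag)]

# Option 2: keep text nodes too, but drop those that contain only whitespace.
non_blank = [child for child in soup.p.children
             if isinstance(child, Tag) or child.strip()]

print([tag["id"] for tag in tag_children])
print(len(non_blank))
```

Option 1 suits code that only cares about elements; option 2 keeps meaningful text such as "Once upon a time".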
.descendants

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # pass the parser name: lxml
print(soup.p.descendants)  # an iterator over all descendants of the tag
for i, child in enumerate(soup.p.descendants):  # i is the index, child is the node
    print(i, child)

Parent and ancestor nodes
.parent

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # pass the parser name: lxml
print(soup.a.parent)  # the tag's parent; here it prints the <p> tag that contains the first <a>
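Printing a parent dumps all of its markup, which can be hard to read. As a rough sketch, the name and attributes of the parent can be inspected instead; the one-line markup and variable names here are illustrative assumptions, and html.parser is used so that no extra install is needed.

```python
from bs4 import BeautifulSoup

html = ('<body><p class="story">Names: '
        '<a href="http://example.com/elsie" id="link1"><span>Elsie</span></a>'
        '</p></body>')
soup = BeautifulSoup(html, "html.parser")

span = soup.span
parent_name = span.parent.name              # tag that directly contains <span>
parent_id = span.parent["id"]               # attributes of the parent are available as usual
grandparent_name = span.parent.parent.name  # .parent can be chained to climb further

print(parent_name, parent_id, grandparent_name)
```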
.parents

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # pass the parser name: lxml
print(list(enumerate(soup.a.parents)))  # all ancestors of the tag, nearest first

Output, abridged here because each entry prints the complete markup of that ancestor:

(0, the <p class="story"> element that contains the link)
(1, the <body> element)
(2, the <html> element)
(3, the whole document, that is, the BeautifulSoup object itself, which prints the same markup as <html>)

Sibling nodes
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # pass the parser name: lxml
print(list(enumerate(soup.a.next_siblings)))      # siblings that come after the tag
print(list(enumerate(soup.a.previous_siblings)))  # siblings that come before the tag

Output:

[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, '\n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
[(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]

4 Standard Selectors
find_all(name, attrs, recursive, text, **kwargs) searches the document by tag name, attributes, or text content.
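The recursive parameter in the signature above controls whether the search covers all descendants or only direct children. The sketch below illustrates the difference on a shortened version of the panel markup; it also uses limit, a find_all() parameter that is not part of the signature quoted above. Variable names are illustrative, and html.parser is used so that the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
</ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
panel = soup.div

all_li = panel.find_all("li")                      # default: search every descendant
direct_li = panel.find_all("li", recursive=False)  # only direct children of <div>: none are <li>
direct_ul = panel.find_all("ul", recursive=False)  # both <ul> tags are direct children
first_two = panel.find_all("li", limit=2)          # stop after two matches

print(len(all_li), len(direct_li), len(direct_ul), len(first_two))
```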
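name

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
print(soup.find_all("ul"))           # find every <ul> tag, including its contents
print(type(soup.find_all("ul")[0]))  # type of each result: bs4.element.Tag

Searching for child tags inside each result: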
html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
for ul in soup.find_all("ul"):
    print(ul.find_all("li"))  # each result is a Tag, so it can be searched again

attrs
Search for elements by their attributes.

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
print(soup.find_all(attrs={"id": "list-1"}))  # pass a dict of the attributes to match
print(soup.find_all(attrs={"name": "elements"}))

Searching with special keyword arguments
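from bs4 import BeautifulSoup

# html is the same document as in the attrs example above
soup = BeautifulSoup(html, "lxml")
print(soup.find_all(id="list-1"))       # id can be passed directly as a keyword argument
print(soup.find_all(class_="element"))  # class is a Python keyword, so use class_ instead

text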
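Select by text content.

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
print(soup.find_all(text="Foo"))  # finds text equal to "Foo"; the results are strings, not tags

The text argument is handy for checking whether some content is present, but less convenient for locating elements, because it returns strings rather than tags. Beautiful Soup 4.4.0 and later also accept this argument under the name string.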
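Because a text search returns strings, getting back to the enclosing tags takes one more step. The sketch below shows two possible approaches on a shortened version of the markup; the variable names are illustrative, html.parser is used so that the sketch runs without lxml, and the string keyword assumes Beautiful Soup 4.4.0 or later (older versions use text).

```python
from bs4 import BeautifulSoup

html = """
<ul id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
<ul id="list-2">
<li class="element">Foo</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# A string-only search returns NavigableString objects, not tags.
strings = soup.find_all(string="Foo")
kind = type(strings[0]).__name__

# Approach 1: step up from each matched string to the tag that contains it.
owners = [(s.parent.name, s.parent.parent["id"]) for s in strings]

# Approach 2: combine a tag name with string= so that tags are returned directly.
li_tags = soup.find_all("li", string="Foo")

print(kind)
print(owners)
print(li_tags)
```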
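Other methods

find
find() is used exactly like find_all(), but it returns only the first match.

find_parents() and find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent.

find_next_siblings() and find_next_sibling()
The first returns all following siblings; the second returns the first following sibling.

find_previous_siblings() and find_previous_sibling()
The first returns all preceding siblings…

find_all_next() and find_next()
The first returns all matching nodes after the current node; the second returns the first matching node after it.

find_all_previous() and find_previous()
These work the same way for nodes before the current one.

5 CSS Selectors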
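Pass a CSS selector directly to select() to make a selection.

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
print(soup.select(".panel .panel-heading"))  # . selects by class; separate nested selectors with a space
print(soup.select("ul li"))                  # <li> tags inside <ul> tags
print(soup.select("#list-2 .element"))       # # selects by id: elements with class "element" inside the tag whose id is list-2
print(type(soup.select("ul")[0]))            # type of each result: bs4.element.Tag

Nested selection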
html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
for ul in soup.select("ul"):
    print(ul.select("li"))  # select() can be called again on each result

Getting attributes
html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
for ul in soup.select("ul"):
    print(ul["id"])        # square brackets read an attribute
    print(ul.attrs["id"])  # equivalent, through the attrs dict

Getting the content
html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
for li in soup.select("li"):
    print(li.get_text())  # text content of each <li>

6 Summary
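Prefer the lxml parser, and use html.parser when necessary.
Tag selectors offer only weak filtering, but they are fast.
Use find() to match a single result and find_all() to match several.
If you are comfortable with CSS selectors, use select().
Remember the common ways of getting attribute values and text.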
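As a compact recap of these points, the following sketch puts find(), find_all(), select(), attribute access and get_text() side by side. The markup, the href values and the variable names are made up for illustration, and html.parser is used so that the sketch runs even where lxml is not installed.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel">
<ul class="list" id="list-1">
<li class="element"><a href="/foo">Foo</a></li>
<li class="element"><a href="/bar">Bar</a></li>
</ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                   # single result
all_links = soup.find_all("a")                # list of results
css_links = soup.select("#list-1 .element a")  # same tags, found with a CSS selector

hrefs = [a["href"] for a in all_links]        # attribute values
texts = [a.get_text() for a in css_links]     # text values
missing = first_link.get("title")             # .get() gives None for a missing attribute

print(hrefs, texts, missing)
```

Square brackets raise KeyError when an attribute is absent, so .get() is the safer choice for optional attributes.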