gpt4 book ai didi

python beautiful soup库入门安装教程

转载 作者:qq735679552 更新时间:2022-09-27 22:32:09 30 4
gpt4 key购买 nike

CFSDN坚持开源创造价值,我们致力于搭建一个资源共享平台,让每一个IT人在这里找到属于你的精彩世界.

这篇CFSDN的博客文章python beautiful soup库入门安装教程由作者收集整理,如果你对这篇文章有兴趣,记得点赞哟.

beautiful soup库的安装

?
1
pip install beautifulsoup4

beautiful soup库的理解

beautiful soup库是解析、遍历、维护“标签树”的功能库 。

beautiful soup库的引用

?
1
2
from bs4 import BeautifulSoup
import bs4

BeautifulSoup类

BeautifulSoup对应一个HTML/XML文档的全部内容 。

回顾demo.html

?
1
2
3
4
5
import requests
 
r = requests.get( "http://python123.io/ws/demo.html" )
demo = r.text
print (demo)
?
1
2
3
4
5
6
< html >< head >< title >This is a python demo page</ title ></ head >
< body >
< p class = "title" >< b >The demo python introduces several python courses.</ b ></ p >
< p class = "course" >Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
< a href = "http://www.icourse163.org/course/BIT-268001" class = "py1" id = "link1" >Basic Python</ a > and < a href = "http://www.icourse163.org/course/BIT-1001870001" class = "py2" id = "link2" >Advanced Python</ a >.</ p >
</ body ></ html >

Tag标签

  。

基本元素 说明
Tag 标签,最基本的信息组织单元,分别用<>和</>标明开头和结尾

  。

?
1
2
3
4
5
6
7
8
import requests
from bs4 import BeautifulSoup
r = requests.get( "http://python123.io/ws/demo.html" )
demo = r.text
soup = BeautifulSoup(demo, "html.parser" )
print (soup.title)
tag = soup.a
print (tag)
?
1
2
< title >This is a python demo page</ title >
< a  href = "http://www.icourse163.org/course/BIT-268001" >Basic Python</ a >

任何存在于HTML语法中的标签都可以用soup.访问获得。当HTML文档中存在多个相同对应内容时,soup.返回第一个 。

Tag的name

  。

基本元素 说明
Name 标签的名字,

… 。

的名字是'p',格式:.name

  。

?
1
2
3
4
5
6
7
8
import requests
from bs4 import BeautifulSoup
r = requests.get( "http://python123.io/ws/demo.html" )
demo = r.text
soup = BeautifulSoup(demo, "html.parser" )
print (soup.a.name)
print (soup.a.parent.name)
print (soup.a.parent.parent.name)
?
1
2
3
a
p  
body

Tag的attrs(属性)

  。

基本元素 说明
Attributes 标签的属性,字典形式组织,格式:.attrs

  。

?
1
2
3
4
5
6
7
8
9
10
11
import requests
from bs4 import BeautifulSoup
r = requests.get( "http://python123.io/ws/demo.html" )
demo = r.text
soup = BeautifulSoup(demo, "html.parser" )
tag = soup.a
print (tag.attrs)
print (tag.attrs[ 'class' ])
print (tag.attrs[ 'href' ])
print ( type (tag.attrs))
print ( type (tag))
?
1
2
3
4
5
{ 'href' : 'http://www.icourse163.org/course/BIT-268001' , 'class' : [ 'py1' ], 'id' : 'link1' }
[ 'py1' ]
http: / / www.icourse163.org / course / BIT - 268001
< class 'dict' >
< class 'bs4.element.Tag' >

Tag的NavigableString

Tag的NavigableString 。

  。

基本元素 说明
NavigableString 标签内非属性字符串,<>…</>中字符串,格式:.string

  。

Tag的Comment 。

  。

基本元素 说明
Comment 标签内字符串的注释部分,一种特殊的Comment类型

  。

?
1
2
3
4
5
6
7
import requests
from bs4 import BeautifulSoup
newsoup = BeautifulSoup( "<b><!--This is a comment--></b><p>This is not a comment</p>" , "html.parser" )
print (newsoup.b.string)
print ( type (newsoup.b.string))
print (newsoup.p.string)
print ( type (newsoup.p.string))
?
1
2
3
4
This is a comment
< class 'bs4.element.Comment' >
This is not a comment
< class 'bs4.element.NavigableString' >

HTML基本格式

标签树的下行遍历

  。

属性 说明
.contents 子节点的列表,将所有儿子结点存入列表
.children 子节点的迭代类型,与.contents类似,用于循环遍历儿子结点
.descendents 子孙节点的迭代类型,包含所有子孙节点,用于循环遍历

  。

BeautifulSoup类型是标签树的根节点 。

?
1
2
3
4
5
6
7
8
9
10
11
import requests
from bs4 import BeautifulSoup
r = requests.get( "http://python123.io/ws/demo.html" )
demo = r.text
 
soup = BeautifulSoup(demo, "html.parser" )
print (soup.head)
print (soup.head.contents)
print (soup.body.contents)
print ( len (soup.body.contents))
print (soup.body.contents[ 1 ])
?
1
2
3
4
5
6
7
8
<head><title>This is a python demo page< / title>< / head>
[<title>This is a python demo page< / title>]
[ '\n' , <p ><b>The demo python introduces several python courses.< / b>< / p>, '\n' , <p >Python
is a wonderful general - purpose programming language. You can learn Python from novice to professional by tracking the
following courses:
<a  href = "http://www.icourse163.org/course/BIT-268001" >Basic Python< / a> and <a  href = "http://www.icourse163.org/course/BIT-1001870001" >Advanced Python< / a>.< / p>, '\n' ]
5
<p ><b>The demo python introduces several python courses.< / b>< / p>
?
1
2
3
4
for child in soup.body.children:
     print (child)  #遍历儿子结点
for child in soup.body.descendants:
     print (child) #遍历子孙节点

标签树的上行遍历

  。

属性 说明
.parent 节点的父亲标签
.parents 节点先辈标签的迭代类型,用于循环遍历先辈节点

  。

?
1
2
3
4
5
6
7
8
import requests
from bs4 import BeautifulSoup
r = requests.get( "http://python123.io/ws/demo.html" )
demo = r.text
 
soup = BeautifulSoup(demo, "html.parser" )
print (soup.title.parent)
print (soup.html.parent)
?
1
2
3
4
5
6
7
< head >< title >This is a python demo page</ title ></ head >
< html >< head >< title >This is a python demo page</ title ></ head >
< body >
< p >< b >The demo python introduces several python courses.</ b ></ p >
< p >Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
< a  href = "http://www.icourse163.org/course/BIT-268001" >Basic Python</ a > and < a  href = "http://www.icourse163.org/course/BIT-1001870001" >Advanced Python</ a >.</ p >
</ body ></ html >
?
1
2
3
4
5
6
7
8
9
10
11
import requests
from bs4 import BeautifulSoup
r = requests.get( "http://python123.io/ws/demo.html" )
demo = r.text
 
soup = BeautifulSoup(demo, "html.parser" )
for parent in soup.a.parents:
     if parent is None :
         print (parent)
     else :
         print (parent.name)
?
1
2
3
4
p
body     
html     
[document]

标签的平行遍历

属性 说明
.next_sibling 返回按照HTML文本顺序的下一个平行节点标签
.previous.sibling 返回按照HTML文本顺序的上一个平行节点标签
.next_siblings 迭代类型,返回按照HTML文本顺序的后续所有平行节点标签
.previous.siblings 迭代类型,返回按照HTML文本顺序的前续所有平行节点标签
?
1
2
3
4
5
6
7
8
9
10
11
12
13
import requests
from bs4 import BeautifulSoup
r = requests.get( "http://python123.io/ws/demo.html" )
demo = r.text
 
soup = BeautifulSoup(demo, "html.parser" )
print (soup.a.next_sibling)
print (soup.a.next_sibling.next_sibling)
 
print (soup.a.previous_sibling)
print (soup.a.previous_sibling.previous_sibling)
 
print (soup.a.parent)
?
1
2
3
4
5
6
7
and
<a class = "py2" href = "http://www.icourse163.org/course/BIT-1001870001" id = "link2" >Advanced Python< / a>
Python is a wonderful general - purpose programming language. You can learn Python from novice to professional by tracking the following courses:
 
None
<p class = "course" >Python is a wonderful general - purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class = "py1" href = "http://www.icourse163.org/course/BIT-268001" id = "link1" >Basic Python< / a> and <a class = "py2" href = "http://www.icourse163.org/course/BIT-1001870001" id = "link2" >Advanced Python< / a>.< / p>
?
1
2
3
4
for sibling in soup.a.next_sibling:
     print (sibling)  #遍历后续节点
for sibling in soup.a.previous_sibling:
     print (sibling)  #遍历前续节点

python beautiful soup库入门安装教程

bs库的prettify()方法

?
1
2
3
4
5
6
7
import requests
from bs4 import BeautifulSoup
r = requests.get( "http://python123.io/ws/demo.html" )
demo = r.text
 
soup = BeautifulSoup(demo, "html.parser" )
print (soup.prettify())
?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
< html >
  < head >
   < title >
    This is a python demo page
   </ title >
  </ head >
  < body >
   < p class = "title" >
    < b >
     The demo python introduces several python courses.
    </ b >
   </ p >
   < p class = "course" >
    Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
     Basic Python
    </ a >
    and
    < a class = "py2" href = "http://www.icourse163.org/course/BIT-1001870001" id = "link2" >
     Advanced Python
    </ a >
    .
   </ p >
  </ body >
</ html >

.prettify()为HTML文本<>及其内容增加更加'\n' .prettify()可用于标签,方法:.prettify() 。

bs4库的编码

bs4库将任何HTML输入都变成utf-8编码 python 3.x默认支持编码是utf-8,解析无障碍 。

?
1
2
3
4
5
6
7
import requests
from bs4 import BeautifulSoup
 
soup = BeautifulSoup( "<p>中文</p>" , "html.parser" )
print (soup.p.string)
 
print (soup.p.prettify())
?
1
2
3
4
5
中文
 
<p> 
  中文
< / p>

到此这篇关于python beautiful soup库入门安装教程的文章就介绍到这了,更多相关python beautiful soup库入门内容请搜索我以前的文章或继续浏览下面的相关文章希望大家以后多多支持我! 。

原文链接:https://blog.csdn.net/weixin_46530492/article/details/119960182 。

最后此篇关于python beautiful soup库入门安装教程的文章就讲到这里了,如果你想了解更多关于python beautiful soup库入门安装教程的内容请搜索CFSDN的文章或继续浏览相关文章,希望大家以后支持我的博客! 。

30 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com