python - How to add an 'href contains' condition in BeautifulSoup

Reposted. Author: 太空宇宙. Updated: 2023-11-04 01:50:31

I am trying to extract links from a web page. My code currently returns every link on the page, but I only need the links that contain watch?v=
import ssl
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Input from user
#url = input('Enter Youtube Video Url- ')
#url = 'https://www.youtube.com/watch?v=MxnkDj8PIxQ'
url = 'https://www.youtube.com/feed/trending'

# Make the website believe the request comes from a Mozilla browser
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
# Pass ctx so the SSL settings above actually take effect
webpage = urlopen(req, context=ctx).read()

# Create a BeautifulSoup object from the HTML page for easy data extraction
soup = BeautifulSoup(webpage, 'html.parser')
html = soup.prettify('utf-8')
for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])

My output:

Found the URL: /watch?v=EJe3xxkzj5Y
Found the URL: /watch?v=Thf60JU8E98
Found the URL: /watch?v=Thf60JU8E98
Found the URL: /user/adityamusic
Found the URL: /channel/Muzik

My expected output should contain only the watch?v= links:

Found the URL: /watch?v=EJe3xxkzj5Y
Found the URL: /watch?v=Thf60JU8E98

Best answer

You don't need a regular expression. You can use the CSS selector below.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = 'https://www.youtube.com/feed/trending'

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

soup = BeautifulSoup(webpage, 'html.parser')
html = soup.prettify('utf-8')
# Select only the <a> tags whose href begins with /watch?v=
for a in soup.select('a[href^="/watch?v="]'):
    print("Found the URL:", a['href'])

Output:

Found the URL: /watch?v=NEAWC9eK1Ts
Found the URL: /watch?v=NEAWC9eK1Ts
Found the URL: /watch?v=xOGtIKE1Us8
Found the URL: /watch?v=xOGtIKE1Us8
Found the URL: /watch?v=i23NEQEFpgQ
Found the URL: /watch?v=i23NEQEFpgQ
Found the URL: /watch?v=cMqkXu4iQcU
Found the URL: /watch?v=cMqkXu4iQcU
Found the URL: /watch?v=vtiRzuH7miI
Found the URL: /watch?v=vtiRzuH7miI
Found the URL: /watch?v=28HABZJ358g
Found the URL: /watch?v=28HABZJ358g
Found the URL: /watch?v=lrzMFW2glIU
Found the URL: /watch?v=lrzMFW2glIU
Found the URL: /watch?v=nLCvijAhVLY
Found the URL: /watch?v=nLCvijAhVLY
Found the URL: /watch?v=VZiVePJCpZI
Found the URL: /watch?v=VZiVePJCpZI
Found the URL: /watch?v=gEBolPQc_EA
Found the URL: /watch?v=gEBolPQc_EA
Found the URL: /watch?v=ho_Mafw9UAk
Found the URL: /watch?v=ho_Mafw9UAk
Found the URL: /watch?v=bwOS7fxjS9E
Found the URL: /watch?v=bwOS7fxjS9E
Found the URL: /watch?v=mGD1RBhtJNg
Found the URL: /watch?v=mGD1RBhtJNg
Found the URL: /watch?v=84sHN6_MyMo
Found the URL: /watch?v=84sHN6_MyMo
Found the URL: /watch?v=waXb8QGdEYQ
Found the URL: /watch?v=waXb8QGdEYQ
Found the URL: /watch?v=kRAPxo59EbU
Found the URL: /watch?v=kRAPxo59EbU
Found the URL: /watch?v=hzmbCSHcSts
Found the URL: /watch?v=hzmbCSHcSts
Found the URL: /watch?v=AByj4Do85QM
Found the URL: /watch?v=AByj4Do85QM
Found the URL: /watch?v=s7u58Wd2H_Q
Found the URL: /watch?v=s7u58Wd2H_Q
Found the URL: /watch?v=dY2OeY5QEC4
Found the URL: /watch?v=dY2OeY5QEC4
Found the URL: /watch?v=V4XLiNRxoVM
Found the URL: /watch?v=V4XLiNRxoVM
Found the URL: /watch?v=6GlFZRXBQyg
Found the URL: /watch?v=6GlFZRXBQyg
Found the URL: /watch?v=OA-APVqZXYA
Found the URL: /watch?v=OA-APVqZXYA
Found the URL: /watch?v=6Kr9REM0JYQ
Found the URL: /watch?v=6Kr9REM0JYQ
Found the URL: /watch?v=sd5iLfPt0-o
Found the URL: /watch?v=sd5iLfPt0-o
Found the URL: /watch?v=nfcAHfDuNzw
Found the URL: /watch?v=nfcAHfDuNzw
Found the URL: /watch?v=FLTOiQ8gXp4
Found the URL: /watch?v=FLTOiQ8gXp4
Found the URL: /watch?v=ZOGxOQxXjdo
Found the URL: /watch?v=ZOGxOQxXjdo
Found the URL: /watch?v=Geyg_F5pfHE
Found the URL: /watch?v=Geyg_F5pfHE
Found the URL: /watch?v=4Kv_Gkz4wPc
Found the URL: /watch?v=4Kv_Gkz4wPc
Found the URL: /watch?v=FbtdKI_0Y5s
Found the URL: /watch?v=FbtdKI_0Y5s
Found the URL: /watch?v=fhMma6QzR3E
Found the URL: /watch?v=fhMma6QzR3E
Found the URL: /watch?v=NQEzIrC6bCs
Found the URL: /watch?v=NQEzIrC6bCs
Found the URL: /watch?v=nNhYqLbsAGk
Found the URL: /watch?v=nNhYqLbsAGk
Found the URL: /watch?v=iaQMT9Y3saM
Found the URL: /watch?v=iaQMT9Y3saM
Found the URL: /watch?v=v7Hu-14z-zQ
Found the URL: /watch?v=v7Hu-14z-zQ
Found the URL: /watch?v=RDb1MGsyY5I
Found the URL: /watch?v=RDb1MGsyY5I
Found the URL: /watch?v=KQetemT1sWc
Found the URL: /watch?v=KQetemT1sWc
Found the URL: /watch?v=ALimx-H8C6s
Found the URL: /watch?v=ALimx-H8C6s
Found the URL: /watch?v=3aUj5ilB0jw
Found the URL: /watch?v=3aUj5ilB0jw
Found the URL: /watch?v=eFBI8E1W6Vo
Found the URL: /watch?v=eFBI8E1W6Vo
Found the URL: /watch?v=iXtUX2kx6io
Found the URL: /watch?v=iXtUX2kx6io
Found the URL: /watch?v=BNgmYFwUjjw
Found the URL: /watch?v=BNgmYFwUjjw
Found the URL: /watch?v=XHmRJroAjrE
Found the URL: /watch?v=XHmRJroAjrE
Found the URL: /watch?v=XRiUNPf-_-4
Found the URL: /watch?v=XRiUNPf-_-4
Found the URL: /watch?v=uc-_KXfHcXQ
Found the URL: /watch?v=uc-_KXfHcXQ
Found the URL: /watch?v=BK7ojj5H72A
Found the URL: /watch?v=BK7ojj5H72A
Found the URL: /watch?v=Yv72aYbOEB0
Found the URL: /watch?v=Yv72aYbOEB0
Found the URL: /watch?v=il94Ke4E28s
Found the URL: /watch?v=il94Ke4E28s
Found the URL: /watch?v=aDZxEYmcCGo
Found the URL: /watch?v=aDZxEYmcCGo
Found the URL: /watch?v=T8ADlJtr4a0
Found the URL: /watch?v=T8ADlJtr4a0
Found the URL: /watch?v=d1010B3sKNQ
Found the URL: /watch?v=d1010B3sKNQ
Found the URL: /watch?v=PllHgkC3yPs
Found the URL: /watch?v=PllHgkC3yPs
Found the URL: /watch?v=1ei355BrtVo
Found the URL: /watch?v=1ei355BrtVo
Found the URL: /watch?v=ZywVlyogLYM
Found the URL: /watch?v=ZywVlyogLYM
Found the URL: /watch?v=1JLUn2DFW4w
Found the URL: /watch?v=1JLUn2DFW4w
Found the URL: /watch?v=aDrVrz76z1A
Found the URL: /watch?v=aDrVrz76z1A
Found the URL: /watch?v=syNaiMVEbJo
Found the URL: /watch?v=syNaiMVEbJo
Found the URL: /watch?v=avqRA3rmvrk
Found the URL: /watch?v=avqRA3rmvrk
Found the URL: /watch?v=II5UsqP2JAk
Found the URL: /watch?v=II5UsqP2JAk
Found the URL: /watch?v=-_ou2tKKA3U
Found the URL: /watch?v=-_ou2tKKA3U
Found the URL: /watch?v=_p_7yerGQq8
Found the URL: /watch?v=_p_7yerGQq8
Found the URL: /watch?v=bwzLiQZDw2I
Found the URL: /watch?v=bwzLiQZDw2I
Found the URL: /watch?v=ltNm4MdykBE
Found the URL: /watch?v=ltNm4MdykBE
Found the URL: /watch?v=UIL9CiUDHp0
Found the URL: /watch?v=UIL9CiUDHp0
Found the URL: /watch?v=t0_HF7tkGdA
and so on...
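
The output above lists most links twice, whereas the expected output in the question shows each link once. Below is a minimal sketch of one way to drop the repeats while keeping the original order, assuming the href values have already been collected into a list. The helper name unique_hrefs and the short sample list are illustrative and not part of the original answer.

```python
def unique_hrefs(hrefs):
    """Return the hrefs with repeats removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(hrefs))


# Sample values copied from the start of the output above
sample = [
    '/watch?v=NEAWC9eK1Ts',
    '/watch?v=NEAWC9eK1Ts',
    '/watch?v=xOGtIKE1Us8',
    '/watch?v=xOGtIKE1Us8',
]

for href in unique_hrefs(sample):
    print("Found the URL:", href)
```

With a real page, the list could be built as [a['href'] for a in soup.select('a[href^="/watch?v="]')] before passing it to the helper.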

To get the first 10 records:

for a in soup.select('a[href^="/watch?v="]')[:10]:
    print("Found the URL:", a['href'])

If you want the last 10 records:

for a in soup.select('a[href^="/watch?v="]')[-10:]:
    print("Found the URL:", a['href'])
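
The question title asks for an 'href contains' condition, while the selector in the answer uses ^=, which matches hrefs that start with the given text. The CSS operator *= matches the text anywhere in the attribute value. The sketch below compares the two on a small inline document; the markup in SAMPLE_HTML is made up for illustration (in particular the absolute link) and does not come from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page
SAMPLE_HTML = """
<a href="/watch?v=EJe3xxkzj5Y">relative video link</a>
<a href="https://www.youtube.com/watch?v=Thf60JU8E98">absolute video link</a>
<a href="/user/adityamusic">user page</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# ^= keeps hrefs that START WITH /watch?v=
starts_with = [a['href'] for a in soup.select('a[href^="/watch?v="]')]

# *= keeps hrefs that CONTAIN watch?v= anywhere
contains = [a['href'] for a in soup.select('a[href*="watch?v="]')]

print("starts with:", starts_with)
print("contains:", contains)
```

On the trending page in the question the hrefs are relative, so both selectors should pick up the same links; *= only makes a difference when the text can appear later in the href, as in the absolute link here.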

Regarding "python - How to add an 'href contains' condition in BeautifulSoup", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58146077/
