python - Scraping articles with Python 3.4, BeautifulSoup, and Requests

Reposted. Author: 太空宇宙. Updated: 2023-11-03 16:39:20
I want to scrape this website:

https://xueqiu.com/yaodewang

I want to collect all of this user's articles. I used BeautifulSoup and Requests like this:

import requests
from bs4 import BeautifulSoup
url = 'https://xueqiu.com/yaodewang'
header = {'user-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36'}
r = requests.get(url,headers = header).content
soup = BeautifulSoup(r,'lxml')
artile = soup.find_all('ul',{'class':'status-list'})
print(artile)

The result is empty! It returns:

 []

So I tried these selectors instead:

# art = soup.find_all('div',{'class':'allStatuses no-head'})
# art = soup.find_all('div',{'class':'status_bd'})
# art = soup.find_all('div',{'class':'status_content container active tab-pane'})

However, these returned text that is not what I am looking for. I want content like this: [image]

I need your help. Thank you very much!

Best answer

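The data you want is not actually located inside an element with the status-list class. If you inspect the page source, you will find an empty container: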

<div class="status_bd">
<div id="statusLists" class="allStatuses no-head"></div>
</div>
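
One way to confirm this for yourself is to probe the static markup for any text inside the statusLists container. The following is a minimal sketch that uses only the standard library's html.parser so it runs without extra packages. `STATIC_HTML` and `ContainerProbe` are illustrative names, not part of the original answer, and the sample markup is copied from the snippet above rather than fetched from the site.

```python
from html.parser import HTMLParser

# Sample markup copied from the snippet above (not fetched from the live site).
STATIC_HTML = """
<div class="status_bd">
<div id="statusLists" class="allStatuses no-head"></div>
</div>
"""


class ContainerProbe(HTMLParser):
    """Collects any text found inside the element with id="statusLists".

    Assumption: elements nested inside the container all have closing tags
    (void tags such as <br> would throw off the depth counter).
    """

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while the parser is inside the target element
        self.found = False  # True once the target element has been seen
        self.text = []      # non-blank text chunks seen inside the target

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == "statusLists":
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.text.append(data.strip())


probe = ContainerProbe()
probe.feed(STATIC_HTML)
print(probe.found, probe.text)
```

If the container is found but the collected text is empty, the articles are being filled in by JavaScript after the page loads, so no selector applied to the static HTML will reach them.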

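Instead, the statuses live inside a script element. You need to locate that element, extract the object assigned in it, load it from JSON into a Python dictionary, and then pull out the fields you need: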

import json
import re
import requests
from bs4 import BeautifulSoup

url = 'https://xueqiu.com/yaodewang'
headers = {
    'user-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36'
}
r = requests.get(url, headers=headers).content
soup = BeautifulSoup(r, 'lxml')

# Match the JavaScript assignment and capture the object literal after it.
pattern = re.compile(r"SNB\.data\.statuses = ({.*?});", re.MULTILINE | re.DOTALL)
# Find the script element whose text contains that assignment.
script = soup.find("script", text=pattern)

# Parse the captured object as JSON and print each status description.
data = json.loads(pattern.search(script.text).group(1))
for item in data["statuses"]:
    print(item["description"])

This prints:

The best advice: Remember common courtesy and act toward others as you want them to act toward you.
Lighten up! It&#39;s the weekend. we&#39;re just having a little fun! Industrial Bank is expected to rise,next week...
...
点.点.点... 点到这个,学位、学历、成绩单翻译一下要50块、100块的...
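
The printed descriptions still contain HTML entities such as `&#39;`. As a minimal sketch, assuming the page keeps embedding the data in the same `SNB.data.statuses = {...};` form, the extraction step can be wrapped in a small function that also decodes those entities and returns an empty list when the assignment is missing. `extract_descriptions` and `SAMPLE_PAGE` are hypothetical names, and the sample page source is made up for illustration. The sketch runs the regular expression directly on the page text, so it needs only the standard library.

```python
import html
import json
import re

# Same pattern as in the answer: capture the object assigned to SNB.data.statuses.
PATTERN = re.compile(r"SNB\.data\.statuses = ({.*?});", re.DOTALL)

# Made-up page source for illustration; the real page is much larger.
SAMPLE_PAGE = """
<script>
SNB.data.statuses = {"statuses": [
  {"description": "Lighten up! It&#39;s the weekend."},
  {"description": "Second post"}
]};
</script>
"""


def extract_descriptions(page_source):
    """Return the entity-decoded description of every embedded status.

    Returns an empty list when the SNB.data.statuses assignment is absent.
    """
    match = PATTERN.search(page_source)
    if match is None:
        return []
    data = json.loads(match.group(1))
    # html.unescape turns entities such as &#39; back into plain characters.
    return [html.unescape(item["description"]) for item in data.get("statuses", [])]


print(extract_descriptions(SAMPLE_PAGE))
```

With the code from the answer, the same function could be called as `extract_descriptions(r.decode('utf-8'))`, assuming the response is UTF-8 encoded. Because the pattern is non-greedy, it stops at the first `};` it sees, so a description that itself contained `};` would truncate the match and make `json.loads` fail.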

Regarding python - Scraping articles with Python 3.4, BeautifulSoup, and Requests, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36962292/
