
python - Why can't this website be scraped with bs4?

Reposted. Author: 行者123. Last updated: 2023-12-04 09:32:25

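I'm a beginner learning web scraping, and for some reason I can't scrape this site. When I inspect it in Chrome the code looks fine, but when I read it with BeautifulSoup there is nothing left to scrape. The soup mentions "Google Analytics", and I really don't know what that is.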

Best answer

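The site's content is loaded via JavaScript, so it is not in the HTML that BeautifulSoup receives. You can still use the requests module to fetch the individual chapters. The chapter URLs have the form https://detroitbecometext.github.io/assets/html/chapterXY.html (example).
For example, this script: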

import re
import requests
from bs4 import BeautifulSoup


url = 'https://detroitbecometext.github.io/chapters'
asset_url = 'https://detroitbecometext.github.io/assets/html/'

# The page content is loaded by JavaScript, so first locate the main.* bundle it references.
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
main_js = requests.get('https://detroitbecometext.github.io/' + soup.select_one('script[src^="main."]')['src']).text

# The bundle names each chapter file; fetch them one by one and print their text.
for ch in re.findall(r'(chapter[\d.]+\.html?)', main_js):
    soup = BeautifulSoup(requests.get(asset_url + ch).content, 'html.parser')
    print(soup.get_text())
    print('-' * 80)
prints all the text from every chapter:
...


Out of the elevator

SWAT: Negotiator on site. Repeat, negotiator on site.
Caroline Phillips: No, stop... I... I... I can't leave her. Oh, oh please, please, you gotta save my little girl... Wait... you're
sending an android?
SWAT: Alright, ma'am. We need to go.
Caroline Phillips: You can't...you can't do that! You W- Why aren't you sending a real
person? Don't let that thing near her! Keep that thing away from my daughter! KEEP IT AWAY!


...
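
The step that does the real work in the script is the regular expression that pulls chapter file names out of the JavaScript bundle. Below is a minimal sketch of that step on its own, so it can be tried without any network access. The function name `chapter_urls`, the de-duplication, and the `sample_js` string are illustrative assumptions and not part of the original answer; the sample string is made up and does not reproduce the site's real main.js.

```python
import re

# Base URL for chapter files, as used in the script above.
ASSET_URL = 'https://detroitbecometext.github.io/assets/html/'

# Same pattern as in the script above: "chapter", digits and dots, then ".htm" or ".html".
CHAPTER_RE = re.compile(r'chapter[\d.]+\.html?')


def chapter_urls(js_text, asset_url=ASSET_URL):
    """Return chapter URLs referenced in js_text, in first-seen order, without duplicates."""
    seen = []
    for name in CHAPTER_RE.findall(js_text):
        # De-duplication is an addition here; the original script fetches every match.
        if name not in seen:
            seen.append(name)
    return [asset_url + name for name in seen]


# Made-up stand-in for the bundle text (hypothetical content).
sample_js = 'load("chapter01.html"); load("chapter02.html"); load("chapter01.html")'
for url in chapter_urls(sample_js):
    print(url)
```

Each returned URL can then be passed to `requests.get` and parsed with BeautifulSoup exactly as in the loop above.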

Regarding "python - Why can't this website be scraped with bs4?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62777498/
