
python - BeautifulSoup: how to show the inside of a div that won't show?

Reposted. Author: 太空宇宙. Updated: 2023-11-04 11:07:29

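I'm new to BeautifulSoup and have run into a problem I don't understand. The question may well have been answered already, but none of the answers I found helped in this case.

I need to access the inside of a div to retrieve a website's glossary entries, but with BeautifulSoup the inside of that div doesn't seem to "show" at all. Could you help me?

Here is the HTML from the website: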

<!DOCTYPE html>
<html lang="en-US" style="margin-top: 0px !important;">
<head>...</head>
<body>
<header>...</header>
<section id="glossary" class="search-off">
<dl class="title">
<dt>Glossary</dt>
</dl>
<div class="content">
<aside id="glossary-aside">
<div></div>
<ul></ul>
</aside>
<div id="glossary-list" class="list">
<dl data-id="2103">...</dl>
<dl data-id="1105">
<dt>ABV (Alcohol by volume)</dt>
<dd>
<p style="margin-bottom: 0cm; text-align: justify;"><span style="font-family: Arial Cyr,sans-serif;"><span style="font-size: x-small;"><span style="font-size: small;"><span style="font-size: medium;">Alcohol by volume (ABV) is the measure of an alcoholic beverage’s alcohol content. Wines may have alcohol content from 4% ABV to 18% ABV; however, wines’ typical alcohol content ranges from 12.5% to 14.5% ABV. You can find a particular wine’s alcohol content by checking the label.</span></span></span></span><span style="font-size: medium;">&nbsp;</span></p>
</dd>
</dl>
<dl data-id="1106">...</dl>
<dl data-id="1213">...</dl>
<dl data-id="2490">...</dl>
<dl data-id="11705">...</dl>
<dl data-id="1782">...</dl>
</div>
<div id="glossary-single" class="list">...</div>
</div>
<div class="s_content">
<div id="glossary-s_list" class="list"></div>
</div>
</section>
<footer></footer>
</body>
</html>

I need to access the different <dl> tags inside <div id="glossary-list" class="list">.

My code is as follows:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://winevibe.com/glossary")
soup = BeautifulSoup(response.text, "lxml")
ct = soup.find("div", {"id": "glossary-list"}).findAll("dl")

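I have tried various approaches, including getting the descendants and the children, but all I get is an empty list.

If I try ct = soup.find("div", {"id":"glossary-list"}) and print it, I get: <div class="list" id="glossary-list"></div>. It looks to me as though the inside of the div is somehow blocked, but I'm not sure.

Does anyone know how to access it?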

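Best answer

The first solution is based on my research into where the data is actually loaded from. The glossary entries are not in the page's initial HTML: they are fetched via XHR from a different URL and then rendered by JavaScript, which is why BeautifulSoup sees an empty div. You can request that URL directly: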

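import requests
import json

# The body appears to be JSON-encoded twice: .json() yields a string,
# which json.loads() then turns into the actual list of entries.
r = requests.get('http://winevibe.com/wp-json/glossary/key/?l=en').json()
hoks = json.loads(r)
for item in hoks:
    print(item['key'])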
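
The two decoding steps above (`.json()` followed by `json.loads`) suggest that the endpoint returns a JSON string whose content is itself JSON. Below is a minimal offline sketch of that situation, assuming this is indeed how the response is encoded. The sample payload and the `decode_payload` helper are made up for illustration and are not part of the answer or of any library.

```python
import json


def decode_payload(decoded):
    """Return parsed data whether or not the payload was JSON-encoded twice.

    `decoded` is what `response.json()` returned: either the data itself
    or a string that still has to be parsed once more.
    """
    if isinstance(decoded, str):
        return json.loads(decoded)
    return decoded


# Hypothetical sample shaped like the answer's data (items with 'key' and 'id').
items = [{"key": "ABV (Alcohol by volume)", "id": 1105}]
double_encoded_body = json.dumps(json.dumps(items))  # JSON inside a JSON string

first_pass = json.loads(double_encoded_body)  # what response.json() would give
terms = [item["key"] for item in decode_payload(first_pass)]
print(terms)
```

Checking the type before the second parse keeps the script working if the site ever starts returning plain JSON.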

Second solution (render the page in a real browser with Selenium):

from selenium import webdriver
from bs4 import BeautifulSoup
import time

browser = webdriver.Firefox()
url = 'http://winevibe.com/glossary/'
browser.get(url)
time.sleep(20)  # wait 20 seconds for the site to load.
html = browser.page_source  # HTML after JavaScript has filled in the list
soup = BeautifulSoup(html, features='html.parser')
for item in soup.findAll('div', attrs={'id': 'glossary-list'}):
    for dt in item.findAll('dt'):
        print(dt.text)

You can use browser.close() to close the browser when you are done.
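
A fixed 20-second sleep either wastes time or, on a slow connection, may still be too short. One alternative is to poll until the glossary entries actually appear. The sketch below shows the idea with a small standard-library helper; `poll_until` and `fake_probe` are hypothetical names introduced here, and the fake probe merely stands in for a query against the live browser.

```python
import time


def poll_until(probe, timeout=20.0, interval=0.5):
    """Call `probe` repeatedly until it returns a truthy value or time runs out.

    Returns the first truthy result; raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"nothing found within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a page whose list is only filled in on the third check.
attempts = {"count": 0}


def fake_probe():
    attempts["count"] += 1
    return ["ABV (Alcohol by volume)"] if attempts["count"] >= 3 else []


print(poll_until(fake_probe, timeout=5, interval=0.01))
```

With the Selenium session above, the probe could be `lambda: browser.find_elements(By.CSS_SELECTOR, "#glossary-list dt")`, with `By` imported from `selenium.webdriver.common.by`, in place of the `time.sleep(20)` call. Selenium also ships its own `WebDriverWait` helper for the same purpose.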


Here is the final code, covering everything the asker requested in chat:

import requests
import json

# Fetch the list of glossary terms (key) together with their ids.
r = requests.get('http://winevibe.com/wp-json/glossary/key/?l=en').json()
data = json.loads(r)
result = [(item['key'], item['id']) for item in data]
text = []
for item in result:
    try:
        # Fetch the definition text for each term by its id.
        r = requests.get(
            f"http://winevibe.com/wp-json/glossary/text/?id={item[1]}").json()
        data = json.loads(r)
        print(f"Getting Text For: {item[0]}")
        text.append(data[0]['text'])
    except KeyboardInterrupt:
        print('Good Bye')
        break

# Write one "term, definition" line per entry.
with open('result.txt', 'w+') as f:
    for a, b in zip(result, text):
        lines = ', '.join([a[0], b.replace('\n', '')]) + '\n'
        f.write(lines)
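
The script above separates term and definition with `', '`, which becomes ambiguous as soon as a definition itself contains a comma. A possible variation, not part of the original answer, is to let the standard `csv` module handle the quoting. The `write_glossary` helper and the sample entry below are made up for illustration.

```python
import csv
import io


def write_glossary(rows, handle):
    """Write (term, definition) pairs as CSV, one entry per row."""
    writer = csv.writer(handle)
    writer.writerow(["term", "definition"])
    for term, definition in rows:
        # Collapse runs of whitespace (including newlines) into single
        # spaces so that each entry stays on one line.
        writer.writerow([term, " ".join(definition.split())])


# Made-up sample entry; the real rows would pair each term with its text.
sample = [
    ("ABV (Alcohol by volume)",
     "Wines range from 4% to 18%,\ntypically 12.5% to 14.5%."),
]
buffer = io.StringIO()
write_glossary(sample, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of an in-memory buffer, open it with `newline=''`, as the `csv` module documentation recommends.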

Regarding "python - BeautifulSoup: how to show the inside of a div that won't show?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59090591/
