
python - How do I use BeautifulSoup to extract text from a <span> nested in an <li>, where the <li> is nested in a <ul>?
    Reposted. Author: 行者123. Updated: 2023-11-27 22:53:25

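    I want to extract the items of the "Here's what's new" section from this page, starting from "In the coming weeks" and ending with "general enhancements".

    Inspecting the markup, I see a <span> nested under an <li>, which is in turn nested under <ul id="GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B">. I have been trying to extract it with Python 3 and BeautifulSoup for the past few days, to no avail. The code I tried is pasted below.

    Would someone be kind enough to point me in the right direction?

    #1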

    from urllib.request import urlopen # open URLs 
    from bs4 import BeautifulSoup # BS

    import sys # sys.exit()

    page_url = 'https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS'

    try:
        page = urlopen(page_url)
    except:
        sys.exit("No internet connection. Program exiting...")

    soup = BeautifulSoup(page, 'html.parser')

    try:
        for ultag in soup.find_all('ul', {'id': 'GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B'}):
            print(ultag.text)
            for spantag in ultag.find_all('span'):
                print(spantag)
    except:
        print("Couldn't get What's new :(")

    #2

    from urllib.request import urlopen # open URLs 
    from bs4 import BeautifulSoup # BS

    import sys # sys.exit()

    page_url = 'https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS'

    try:
        page = urlopen(page_url)
    except:
        sys.exit("No internet connection. Program exiting...")

    soup = BeautifulSoup(page, 'html.parser')

    uls = []
    for ul in uls:
        for ul in soup.findAll('ul', {'id': 'GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B'}):
            if soup.find('ul'):
                break
            uls.append(ul)
            print(uls)
            for li in uls:
                print(li.text)

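    Ideally, the code should return:

    In the coming weeks, you will be able to read items that you own with a single click from the “Before You Go” dialog.

    Performance improvements, bug fixes, and other general enhancements.

    But neither attempt gives me anything. It looks like the ul with that ID cannot be found, yet if you print(soup) everything looks fine: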

    <ul id="GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B">
    <li>
    <span class="a-list-item"><span><strong>Read Now</strong></span>: In the coming weeks, you will be able to read items that you own with a single click from the “Before You Go” dialog.</span></li>

    <li>
    <span class="a-list-item">Performance improvements, bug fixes, and other general enhancements.<br></li>


    </ul>

    Best answer

    With bs4 4.7.1+, you can use :contains and :has to isolate the relevant elements:

    import requests
    from bs4 import BeautifulSoup as bs

    r = requests.get('https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS')
    soup = bs(r.content, 'lxml')
    text = [i.text.strip() for i in soup.select('p:has(strong:contains("Here’s what’s new:")), p:has(strong:contains("Here’s what’s new:")) + p + ul li')]
    print(text)
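
To try the list-item part of that selector without hitting the network, here is a minimal sketch against a hand-written fragment. The `HTML` string is an assumption that mirrors the structure the selector implies (a `<p>` holding a `<strong>` heading, then another `<p>`, then the `<ul>`); it is not the real Amazon markup. The sketch also uses `:-soup-contains`, the name Soup Sieve 2.1+ gives to `:contains` (the older spelling is deprecated), and `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: mirrors the structure the selector assumes,
# not the actual Amazon page.
HTML = """
<div>
  <p><strong>Here's what's new:</strong></p>
  <p>Version notes</p>
  <ul>
    <li><span class="a-list-item"><span><strong>Read Now</strong></span>: In the coming weeks, you will be able to read items that you own with a single click.</span></li>
    <li><span class="a-list-item">Performance improvements, bug fixes, and other general enhancements.</span></li>
  </ul>
  <p><strong>Other heading</strong></p>
  <p>Unrelated</p>
  <ul><li>Should not match</li></ul>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# A <p> that has a <strong> descendant whose text contains "what's new".
heading = 'p:has(strong:-soup-contains("what\'s new"))'

# heading, then the <p> right after it, then the <ul> right after that.
items = [li.get_text(strip=True) for li in soup.select(f"{heading} + p + ul li")]
print(items)
```

Only the list that follows the matching heading is returned; the second list is skipped because its heading text does not match.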


    At the moment, you could also drop :contains:

    text = [i.text.strip() for i in soup.select('p:has(strong), p:has(strong) + p + ul li')]
    print(text)
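
Dropping the text filter only works while the page has a single `<p><strong>` heading followed by a paragraph and a list. The sketch below compares the two list-item selectors on a hypothetical fragment with two such headings; the fragment and its heading texts are made up for illustration, and `:-soup-contains` is the Soup Sieve 2.1+ spelling of `:contains`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two "<p><strong>" headings, each followed by
# a paragraph and a list. Not the real Amazon markup.
HTML = """
<div>
  <p><strong>Here's what's new:</strong></p>
  <p>Intro</p>
  <ul><li>Read Now</li><li>General enhancements</li></ul>
  <p><strong>Known issues:</strong></p>
  <p>Intro</p>
  <ul><li>Sync delay</li></ul>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Without the text filter, every matching heading contributes its list.
loose = [li.get_text(strip=True) for li in soup.select("p:has(strong) + p + ul li")]

# With the text filter, only the list under the wanted heading is kept.
strict = [
    li.get_text(strip=True)
    for li in soup.select('p:has(strong:-soup-contains("what\'s new")) + p + ul li')
]

print(loose)   # ['Read Now', 'General enhancements', 'Sync delay']
print(strict)  # ['Read Now', 'General enhancements']
```

If the page layout may change, keeping the text filter is the safer choice.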

    + is the CSS adjacent sibling combinator. Read more here. Quoting:

    Adjacent sibling combinator

    The + combinator selects adjacent siblings. This means that the second element directly follows the first, and both share the same parent.

    Syntax: A + B

    Example: h2 + p will match all <p> elements that directly follow an <h2>.
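
The quoted `h2 + p` example can be checked directly with `select`. This is a small sketch on a made-up fragment; the `h2 ~ p` line is added only for contrast and uses the general sibling combinator, which the quote above does not cover.

```python
from bs4 import BeautifulSoup

# Made-up fragment: only "first" sits immediately after an <h2>.
HTML = """
<h2>Title</h2>
<p>first</p>
<p>second</p>
<h2>Other</h2>
<div>spacer</div>
<p>third</p>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Adjacent sibling: the <p> must directly follow an <h2>.
adjacent = [p.get_text() for p in soup.select("h2 + p")]

# General sibling (for contrast): any later <p> sharing the same parent.
general = [p.get_text() for p in soup.select("h2 ~ p")]

print(adjacent)  # ['first']
print(general)   # ['first', 'second', 'third']
```

Whitespace between tags does not count as a sibling here, so the line breaks in the fragment do not affect the match.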

    Regarding "python - How do I use BeautifulSoup to extract text from a <span> nested in an <li>, where the <li> is nested in a <ul>?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57725818/
