python - Parsing long HTML with BeautifulSoup fails, output is only half parsed

Reposted. Author: 行者123. Updated: 2023-12-01 05:04:18

I use the following script to parse the fund prices of a particular fund:

import pandas as pd
from bs4 import BeautifulSoup
from ghost import Ghost
ghost = Ghost()
page,resources = ghost.open('http://bank.hangseng.com/1/PA_1_1_P1/ComSvlet_MiniSite_eng_gif?app=eINVCFundDetailsOV&pri_fund_code=U44217')
page,resources = ghost.evaluate("agree()", expect_loading=True)
page,resources = ghost.evaluate("MM_changeview('eINVCFundPriceDividend')", expect_loading=True)
# ghost.capture_to("hangseng.png")
soup = BeautifulSoup(page.content)
soup

The first half of the soup output is fine, but after that the tags all turn uppercase and BeautifulSoup can no longer parse them, as shown below:

<td class="LightGrey" valign="TOP"><font class="CONTENT">22-07-2014</font></td><td class="LightGrey" valign="TOP"><font class="CONTENT">10.95000</font></td><td class="LightGrey" valign="TOP"><font class="CONTENT">11.39000</font></td><td class="LightGrey" valign="TOP"><font class="CONTENT">10.95000</font></td>
</tr>
T R V A L I G N = " t o p " a l i g n = " c e n t e r " &gt;
T D C L A S S = " L i g h t G r e y " V A L I G N = " T O P " &gt; F O N T C L A S S = " C O N T E N T " &gt; 2 1 - 0 7 - 2 0 1 4 / F O N T &gt; / T D &gt; T D C L A S S = " L i g h t G r e y " V A L I G N = " T O P " &gt; F O N T C L A S S = " C O N T E N T " &gt; 1 0 . 9 6 0 0 0 / F O N T &gt; / T D &gt; T D C L A S S = " L i g h t G r e y " V A L I G N = " T O P " &gt; F O N T C L A S S = " C O N T E N T " &gt; 1 1 . 4 0 0 0 0 / F O N T &gt; / T D &gt; T D C L A S S = " L i g h t G r e y " V A L I G N = " T O P " &gt; F O N T C L A S S = " C O N T E N T " &gt; 1 0 . 9 6 0 0 0 / F O N T &gt; / T D &gt;
/ T R &gt;

You can see that the output turns into garbage after the row dated 22-07-2014.

What is going on?

Best answer

I found the solution in the question "spaced output beautifulsoup":

page.content
soup = BeautifulSoup(page.content,'html.parser')

Now it works perfectly.
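
For illustration, here is a minimal sketch of the same fix applied to a small inline HTML string rather than the live page. The `SAMPLE_HTML` string and the `parse_price_rows` helper are hypothetical names made up for this example; the cell values are copied from the output shown in the question. The only part taken from the answer is passing `'html.parser'` explicitly as the second argument to `BeautifulSoup`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two table rows modelled on the output in the question.
SAMPLE_HTML = """
<table>
<tr valign="top" align="center">
<td class="LightGrey" valign="TOP"><font class="CONTENT">22-07-2014</font></td>
<td class="LightGrey" valign="TOP"><font class="CONTENT">10.95000</font></td>
<td class="LightGrey" valign="TOP"><font class="CONTENT">11.39000</font></td>
<td class="LightGrey" valign="TOP"><font class="CONTENT">10.95000</font></td>
</tr>
<TR VALIGN="top" align="center">
<TD CLASS="LightGrey" VALIGN="TOP"><FONT CLASS="CONTENT">21-07-2014</FONT></TD>
<TD CLASS="LightGrey" VALIGN="TOP"><FONT CLASS="CONTENT">10.96000</FONT></TD>
<TD CLASS="LightGrey" VALIGN="TOP"><FONT CLASS="CONTENT">11.40000</FONT></TD>
<TD CLASS="LightGrey" VALIGN="TOP"><FONT CLASS="CONTENT">10.96000</FONT></TD>
</TR>
</table>
"""


def parse_price_rows(html):
    """Return the text of every <font class="CONTENT"> cell, grouped by row."""
    # Name the parser explicitly instead of letting BeautifulSoup pick one.
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [font.get_text(strip=True)
                 for font in tr.find_all("font", class_="CONTENT")]
        if cells:  # skip rows without price cells
            rows.append(cells)
    return rows


rows = parse_price_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

When no parser is named, BeautifulSoup chooses whichever installed parser it ranks highest, so the same script can behave differently from one machine to another. Naming the parser makes the behavior reproducible.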

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/25374725/
