
python - How to extract specific text from HTML using BeautifulSoup in Python?

Reposted. Author: 行者123. Updated: 2023-12-01 00:15:41

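I am trying to use BeautifulSoup in Python to extract some text from HTML (more specifically, from an MDX dictionary file). The code runs without errors, but I don't get what I need. My code is as follows: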

from bs4 import BeautifulSoup
from lxml import etree

html = '''
abandon <link href="LM5style_vanilla.css" rel="stylesheet" type="text/css" /><link href="LM5style.css" rel="stylesheet" type="text/css" /><link href="LM5style_switch.css" rel="stylesheet" type="text/css" /><link href="LM5style_show.css" rel="stylesheet" type="text/css" /><script src="jquery-3.2.1.min.js" charset="utf-8" type="text/javascript" language="javascript"></script><script src="LM5Switch.js" charset="utf-8" type="text/javascript" language="javascript"></script><span class="lm5ppbody"><div class="entry_content"><h1 class="pagetitle" pagetype="0">abandon</h1><div class="dictionary"><div class="wordfams"><span class="LDOCE5pp_sensefold foldsign_fold"><span class="asset_intro">Word family</span><span class="foldsign"><span class="foldblank"> </span><span class="foldsignbar1"></span><span class="foldsignbar2"></span></span></span><span class="LDOCE_word_family" style="display:none;"> <span class="pos">noun</span> <span class="w" title="abandonment">abandonment</span> <span class="pos">adjective</span> <a class="crossRef w" href="bword://abandoned" title="abandoned">abandoned</a> <span class="pos">verb</span> <span class="w" title="abandon">abandon</span> </span></div><!-- End of DIV wordfams--><span class="dictentry"><span class="dictionary_intro span"><span class="lm5ppMenu"><span id="lm5ppMenu_logo"> </span><span class="lm5ppMenu_title"><span class="en_title">Longman Dictionary of Contemporary English 5++</span><span class="cn_title"><span class="cn_txt_menu">朗文当代英语 5++</span></span></span><span class="lm5ppMenu_title mini"><span class="en_title">LDOCE 5++</span><span class="cn_title"><span class="cn_txt_menu">朗文 5++</span></span></span></span></span><span class="dictlink"><a name="abandon__entry_0__a"></a><span class="ldoceEntry Entry" id="abandon__entry_0"><span class="frequent Head"><span class="HWD">a<span class="HYP"><span class="HYP">·</span></span>ban<span class="HYP"><span class="HYP">·</span></span>don</span><span class="HOMNUM">1</span><a 
class="PronCodes" href="sound://media/english/ameProns/abandon1.mp3"><span class="neutral span"> /</span><span class="PRON">əˈbændən</span><span class="neutral span">/</span></a> <span class="tooltip LEVEL" title="Core vocabulary: Medium-frequency"> ●●○</span> <span class="FREQ" title="Top 3000 written words">W3</span> <span class="AC" title="Academic Word list">AWL</span><span class="lm5pp_POS"> verb</span><span class="GRAM"><span class="neutral span"> [</span>transitive<span class="neutral span">]</span></span><a class="speaker brefile fa fa-volume-up" data-src-mp3="/media/english/breProns/abandon_v0205.mp3" href="sound://media/english/breProns/abandon_v0205.mp3" title="Play British pronunciation of abandon"> </a><a class="speaker amefile fa fa-volume-up" data-src-mp3="/media/english/ameProns/abandon1.mp3" href="sound://media/english/ameProns/abandon1.mp3" title="Play American pronunciation of abandon"> </a></span><a name="abandon__1__a"></a><span class="newline Sense" id="abandon__1"><span class="LDOCE5pp_sensefold"><span class="sensenum span">1</span><span class="foldsign"><span class="foldblank"> </span><span class="foldsignbar1"></span><span class="foldsignbar2"></span></span></span> <span class="ACTIV">LEAVE A RELATIONSHIP</span><span class="DEF LDOCE_switch_lang switch_siblings">to leave someone, especially someone you are <a class="defRef" href="bword://responsible" title="responsible">responsible</a> for</span><span class="DEF LDOCE_switch_lang switch_siblings"> <span class="cn_txt"> 抛弃,遗弃〔某人〕</span></span><span class="RELATEDWD"><span class="neutral span"> → </span><a href="bword://abandoned"> abandoned</a></span><span class="EXAMPLE"><a class="speaker exafile fa fa-volume-up" href="sound://media/english/exaProns/p008-000963493.mp3" title="Play Example"> </a><span class="english LDOCE_switch_lang switch_children">How could she abandon her own child?<span class="cn_txt"> 她怎么能抛弃自己的孩子呢?</span></span></span></span><a name="abandon__2__a"></a><span 
class="newline Sense" id="abandon__2"><span class="LDOCE5pp_sensefold"><span class="sensenum span">2</span><span class="foldsign"><span class="foldblank"> </span><span class="foldsignbar1"></span><span class="foldsignbar2"></span></span></span> <span class="ACTIV">LEAVE A PLACE</span><span class="DEF LDOCE_switch_lang switch_siblings">to go away from a place, <a class="defRef" href="bword://vehicle" title="vehicle">vehicle</a> etc permanently, especially because the situation makes it <a class="defRef" href="bword://impossible" title="impossible">impossible</a> for you to stay</span><span class="DEF LDOCE_switch_lang switch_siblings"> <span class="cn_txt"> 离弃,逃离〔某地方、交通工具等〕</span></span><span class="SYN"> <span class="synopp span">SYN</span><a href="bword://leave"> leave</a></span><span class="RELATEDWD"><span class="neutral span">, → </span><a href="bword://abandoned"> abandoned</a></span><span class="EXAMPLE"><a class="speaker exafile fa fa-volume-up" href="sound://media/english/exaProns/p008-000963497.mp3" title="Play Example"> </a><span class="english LDOCE_switch_lang switch_children">We had to abandon the car and walk the rest of the way.<span class="cn_txt"> 我们只好弃车,步行走完剩下的路。</span></span></span><span class="EXAMPLE"><a class="speaker exafile fa fa-volume-up" href="sound://media/english/exaProns/p008-000963498.mp3" title="Play Example"> </a><span class="english LDOCE_switch_lang switch_children">Fearing further attacks, most of the population had abandoned the city.<span class="cn_txt"> 因为害怕还要受到袭击,大多数市民已逃离该市。</span></span></span></span><a name="abandon__3__a"></a><span class="newline Sense" id="abandon__3"><span class="LDOCE5pp_sensefold"><span class="sensenum span">3</span><span class="foldsign"><span class="foldblank"> </span><span class="foldsignbar1"></span><span class="foldsignbar2"></span></span></span> <span class="ACTIV">STOP DOING something</span><span class="DEF LDOCE_switch_lang switch_siblings">to stop doing something because there are too many 
problems and it is impossible to continue</span><span class="DEF LDOCE_switch_lang switch_siblings"> <span class="cn_txt"> 放弃,中止</span></span><span class="EXAMPLE"><a class="speaker exafile fa fa-volume-up" href="sound://media/english/exaProns/p008-000963502.mp3" title="Play Example"> </a><span class="english LDOCE_switch_lang switch_children">The game had to be abandoned due to bad weather.<span class="cn_txt"> 由于天气不好,比赛不得不中止。</span></span></span><span class="EXAMPLE"><a class="speaker exafile fa fa-volume-up" href="sound://media/english/exaProns/p008-001732862.mp3" title="Play Example"> </a><span class="english LDOCE_switch_lang switch_children">They <span class="COLLOINEXA">abandoned</span> their <span class="COLLOINEXA">attempt</span> to recapture the castle.<span class="cn_txt"> 他们放弃了夺回城堡的努力。</span></span></span><span class="EXAMPLE"><a class="speaker exafile fa fa-volume-up" href="sound://media/english/exaProns/p008-001776706.mp3" title="Play Example"> </a><span class="english LDOCE_switch_lang switch_children">Because of the fog they <span class="COLLOINEXA">abandoned</span> their <span class="COLLOINEXA"<span>someone, </span><span>you </span></div></div>\n</span>\n
'''
soup = BeautifulSoup(html, 'lxml')
context = soup.find_all(class_="english LDOCE_switch_lang switch_children")
print(context)

# This is what it prints: [<span class="english LDOCE_switch_lang switch_children">How could she abandon her own child?<span class="cn_txt"> 她怎么能抛弃自己的孩子呢?</span></span>, <span class="english LDOCE_switch_lang switch_children">We had to abandon the car and walk the rest of the way.<span class="cn_txt"> 我们只好弃车,步行走完剩下的路。</span></span>, <span class="english LDOCE_switch_lang switch_children">Fearing further attacks, most of the population had abandoned the city.<span class="cn_txt"> 因为害怕还要受到袭击,大多数市民已逃离该市。</span></span>,

What I need is all of the English and Chinese example sentences, like this:

How could she abandon her own child?
她怎么能抛弃自己的孩子呢?

I have been trying for several days. Please help me. Thank you very much!

Best answer

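I hope I have understood your question correctly. If you want to extract the English phrases and their Chinese equivalents, you can use this example (I don't know Chinese, so I can't verify whether the output is correct):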

from bs4 import BeautifulSoup

html = '''
abandon <link href="LM5style_vanilla.css" rel="stylesheet" type="text/css" /><link href="LM5style.css" rel="stylesheet" type="text/css" /><link href="LM5style_switch.css" rel="stylesheet" type="text/css" /><link href="LM5style_show.css" rel="stylesheet" type="text/css" /><script src="jquery-3.2.1.min.js" charset="utf-8" type="text/javascript" language="javascript"></script><script src="LM5Switch.js" charset="utf-8" type="text/javascript" language="javascript"></script><span class="lm5ppbody"><div class="entry_content"><h1 class="pagetitle" pagetype="0">abandon</h1><div class="dictionary"><div class="wordfams"><span class="LDOCE5pp_sensefold foldsign_fold"><span class="asset_intro">Word family</span><span class="foldsign"><span class="foldblank"> </span><span class="foldsignbar1"></span><span class="foldsignbar2"></span></span></span><span class="LDOCE_word_family" style="display:none;"> <span class="pos">noun</span> <span class="w" title="abandonment">abandonment</span> <span class="pos">adjective</span> <a class="crossRef w" href="bword://abandoned" title="abandoned">abandoned</a> <span class="pos">verb</span> <span class="w" title="abandon">abandon</span> </span></div><!-- End of DIV wordfams--><span class="dictentry"><span class="dictionary_intro span"><span class="lm5ppMenu"><span id="lm5ppMenu_logo"> </span><span class="lm5ppMenu_title"><span class="en_title">Longman Dictionary of Contemporary English 5++</span><span class="cn_title"><span class="cn_txt_menu">朗文当代英语 5++</span></span></span><span class="lm5ppMenu_title mini"><span class="en_title">LDOCE 5++</span><span class="cn_title"><span class="cn_txt_menu">朗文 5++</span></span></span></span></span><span class="dictlink"><a name="abandon__entry_0__a"></a><span class="ldoceEntry Entry" id="abandon__entry_0"><span class="frequent Head"><span class="HWD">a<span class="HYP"><span class="HYP">·</span></span>ban<span class="HYP"><span class="HYP">·</span></span>don</span><span class="HOMNUM">1</span><a 
class="PronCodes" href="sound://media/english/ameProns/abandon1.mp3"><span class="neutral span"> /</span><span class="PRON">əˈbændən</span><span class="neutral span">/</span></a> <span class="tooltip LEVEL" title="Core vocabulary: Medium-frequency"> ●●○</span> <span class="FREQ" title="Top 3000 written words">W3</span> <span class="AC" title="Academic Word list">AWL</span><span class="lm5pp_POS"> verb</span><span class="GRAM"><span class="neutral span"> [</span>transitive<span class="neutral span">]</span></span><a class="speaker brefile fa fa-volume-up" data-src-mp3="/media/english/breProns/abandon_v0205.mp3" href="sound://media/english/breProns/abandon_v0205.mp3" title="Play British pronunciation of abandon"> </a><a class="speaker amefile fa fa-volume-up" data-src-mp3="/media/english/ameProns/abandon1.mp3" href="sound://media/english/ameProns/abandon1.mp3" title="Play American pronunciation of abandon"> </a></span><a name="abandon__1__a"></a><span class="newline Sense" id="abandon__1"><span class="LDOCE5pp_sensefold"><span class="sensenum span">1</span><span class="foldsign"><span class="foldblank"> </span><span class="foldsignbar1"></span><span class="foldsignbar2"></span></span></span> <span class="ACTIV">LEAVE A RELATIONSHIP</span><span class="DEF LDOCE_switch_lang switch_siblings">to leave someone, especially someone you are <a class="defRef" href="bword://responsible" title="responsible">responsible</a> for</span><span class="DEF LDOCE_switch_lang switch_siblings"> <span class="cn_txt"> 抛弃,遗弃〔某人〕</span></span><span class="RELATEDWD"><span class="neutral span"> → </span><a href="bword://abandoned"> abandoned</a></span><span class="EXAMPLE"><a class="speaker exafile fa fa-volume-up" href="sound://media/english/exaProns/p008-000963493.mp3" title="Play Example"> </a><span class="english LDOCE_switch_lang switch_children">How could she abandon her own child?<span class="cn_txt"> 她怎么能抛弃自己的孩子呢?</span></span></span></span><a name="abandon__2__a"></a><span 
class="newline Sense" id="abandon__2"><span class="LDOCE5pp_sensefold"><span class="sensenum span">2</span><span class="foldsign"><span class="foldblank"> </span><span class="foldsignbar1"></span><span class="foldsignbar2"></span></span></span> <span class="ACTIV">LEAVE A PLACE</span><span class="DEF LDOCE_switch_lang switch_siblings">to go away from a place, <a class="defRef" href="bword://vehicle" title="vehicle">vehicle</a> etc permanently, especially because the situation makes it <a class="defRef" href="bword://impossible" title="impossible">impossible</a> for you to stay</span><span class="DEF LDOCE_switch_lang switch_siblings"> <span class="cn_txt"> 离弃,逃离〔某地方、交通工具等〕</span></span><span class="SYN"> <span class="synopp span">SYN</span><a href="bword://leave"> leave</a></span><span class="RELATEDWD"><span class="neutral span">, → </span><a href="bword://abandoned"> abandoned</a></span><span class="EXAMPLE"><a class="speaker exafile fa fa-volume-up" href="sound://media/english/exaProns/p008-000963497.mp3" title="Play Example"> </a><span class="english LDOCE_switch_lang switch_children">We had to abandon the car and walk the rest of the way.<span class="cn_txt"> 我们只好弃车,步行走完剩下的路。</span></span></span><span class="EXAMPLE"><a class="speaker exafile fa fa-volume-up" href="sound://media/english/exaProns/p008-000963498.mp3" title="Play Example"> </a><span class="english LDOCE_switch_lang switch_children">Fearing further attacks, most of the population had abandoned the city.<span class="cn_txt"> 因为害怕还要受到袭击,大多数市民已逃离该市。</span></span></span></span><a name="abandon__3__a"></a><span class="newline Sense" id="abandon__3"><span class="LDOCE5pp_sensefold"><span class="sensenum span">3</span><span class="foldsign"><span class="foldblank"> </span><span class="foldsignbar1"></span><span class="foldsignbar2"></span></span></span> <span class="ACTIV">STOP DOING something</span><span class="DEF LDOCE_switch_lang switch_siblings">to stop doing something because there are too many 
problems and it is impossible to continue</span><span class="DEF LDOCE_switch_lang switch_siblings"> <span class="cn_txt"> 放弃,中止</span></span><span class="EXAMPLE"><a class="speaker exafile fa fa-volume-up" href="sound://media/english/exaProns/p008-000963502.mp3" title="Play Example"> </a><span class="english LDOCE_switch_lang switch_children">The game had to be abandoned due to bad weather.<span class="cn_txt"> 由于天气不好,比赛不得不中止。</span></span></span><span class="EXAMPLE"><a class="speaker exafile fa fa-volume-up" href="sound://media/english/exaProns/p008-001732862.mp3" title="Play Example"> </a><span class="english LDOCE_switch_lang switch_children">They <span class="COLLOINEXA">abandoned</span> their <span class="COLLOINEXA">attempt</span> to recapture the castle.<span class="cn_txt"> 他们放弃了夺回城堡的努力。</span></span></span><span class="EXAMPLE"><a class="speaker exafile fa fa-volume-up" href="sound://media/english/exaProns/p008-001776706.mp3" title="Play Example"> </a><span class="english LDOCE_switch_lang switch_children">Because of the fog they <span class="COLLOINEXA">abandoned</span> their <span class="COLLOINEXA"<span>someone, </span><span>you </span></div></div>\n</span>\n
'''
soup = BeautifulSoup(html, 'lxml')

print('{:^80} {:^80}'.format('English', 'Chinese'))
print('-' * 160)
for english in soup.select('.english:has(.cn_txt)'):
    # Grab the Chinese translation, then remove it from the tree so that
    # only the English text is left in the parent span.
    cn_txt = english.select_one('.cn_txt').get_text(strip=True)
    english.select_one('.cn_txt').extract()
    eng_txt = english.get_text(separator=' ', strip=True)

    print('{:<80} {:<80}'.format(eng_txt, cn_txt))

Prints:

                                    English                                                                          Chinese                                     
----------------------------------------------------------------------------------------------------------------------------------------------------------------
How could she abandon her own child? 她怎么能抛弃自己的孩子呢?
We had to abandon the car and walk the rest of the way. 我们只好弃车,步行走完剩下的路。
Fearing further attacks, most of the population had abandoned the city. 因为害怕还要受到袭击,大多数市民已逃离该市。
The game had to be abandoned due to bad weather. 由于天气不好,比赛不得不中止。
They abandoned their attempt to recapture the castle. 他们放弃了夺回城堡的努力。
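
If you would rather get the sentences back as data instead of printing them, and leave the parsed tree untouched, something along these lines might work. This is only a sketch under a few assumptions: `sample_html` is a shortened, hypothetical stand-in for the full entry, `extract_pairs` is a helper name made up for illustration, and the built-in `html.parser` is used instead of `lxml` so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample that mirrors the structure of the entry above.
sample_html = '''
<span class="EXAMPLE"><span class="english LDOCE_switch_lang switch_children">How could she abandon her own child?<span class="cn_txt"> 她怎么能抛弃自己的孩子呢?</span></span></span>
<span class="EXAMPLE"><span class="english LDOCE_switch_lang switch_children">They <span class="COLLOINEXA">abandoned</span> their <span class="COLLOINEXA">attempt</span> to recapture the castle.<span class="cn_txt"> 他们放弃了夺回城堡的努力。</span></span></span>
'''


def extract_pairs(html):
    """Return a list of (english, chinese) tuples without modifying the tree."""
    soup = BeautifulSoup(html, 'html.parser')
    pairs = []
    # class_='english' matches any tag whose class list contains "english".
    for english in soup.find_all(class_='english'):
        cn = english.find(class_='cn_txt')
        if cn is None:
            continue  # skip examples that have no translation
        cn_text = cn.get_text(strip=True)
        # Keep only the text nodes that are not inside the .cn_txt span.
        parts = [
            s for s in english.find_all(string=True)
            if not any(parent is cn for parent in s.parents)
        ]
        # Join the pieces and collapse runs of whitespace into single spaces.
        eng_text = ' '.join(''.join(parts).split())
        pairs.append((eng_text, cn_text))
    return pairs


for eng, cn in extract_pairs(sample_html):
    print(eng)
    print(cn)
```

Because nothing is removed with `extract()`, the same `soup` could still be queried afterwards, for example to pull out the definitions as well.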

Regarding "python - How to extract specific text from HTML using BeautifulSoup in Python?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59348483/
