
python - How to extract text from all paragraphs on a website that contain a specific string

Reposted. Author: 行者123. Updated: 2023-11-30 21:53:18

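I've run into a problem with a site. I want to extract the proverbs in my local language, together with their meanings, in tabular form.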

import requests
from bs4 import BeautifulSoup

res2 = requests.get('https://steemit.com/nigeria/@leopantro/50-yoruba-proverbs-and-idioms')
soup2 = BeautifulSoup(res2.content,'html')

Yoruba = []
English = []
for ol in soup2.findAll('ol'):
    proverb = ol.find('li')
    Yoruba.append(proverb.text)

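I managed to extract the local-language proverbs into a list. I would also like to extract every sentence that starts with the string Meaning: into another list, for example: ['Your status in life dictates your attitude towards your peers', 'Behave in a mature manner so avoid bad reputation.', etc.]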

Best answer

This script scrapes the proverbs, translations, and meanings and builds a pandas DataFrame from them. The list of meanings is in data['Meaning']:

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get('https://steemit.com/nigeria/@leopantro/50-yoruba-proverbs-and-idioms')
soup = BeautifulSoup(res.content,'html.parser')

data = {'Yoruba':[], 'Translation':[], 'Meaning':[]}
# Each proverb is an <ol>; the two <p> elements right after it hold the translation and the meaning.
for yoruba, translation, meaning in zip(soup.select('ol'), soup.select('ol + p'), soup.select('ol + p + p')):
    data['Yoruba'].append(yoruba.get_text(strip=True))
    data['Translation'].append(re.sub(r'Translation:\s*', '', translation.get_text(strip=True)))
    data['Meaning'].append(re.sub(r'Meaning:\s*', '', meaning.get_text(strip=True)))

# print(data['Meaning']) # <-- your meanings list

df = pd.DataFrame(data)
print(df)

Prints:

    Yoruba                                             Translation                                        Meaning
0   Ile oba t'o jo, ewa lo busi                        When a king's palace burns down, the re-built ...  Necessity is mother of invention, creativity i...
1   Gbogbo alangba lo d'anu dele, a ko mo eyi t'in...  All lizards lie flat on their stomach and it i...  Everyone looks the same on the outside but eve...
2   Ile la ti n ko eso re ode                          Charity begins at Home                             A man cannot give what he does not have good o...
3   A pę ko to jęun, ki ję ibaję                       The person that eat late, will not eat spoiled...  It is more profitable to exercise patience whi...
4   Eewu bę loko Longę, Longę fun ara rę eewu ni       There is danger at Longę's farm (Longę is a na...  You should be extremely careful of situations ...
5   Bi Ēēgun nla ba ni ohùn o ri gontò, gontò na a...  If a big masquerade claims it doesn't see the ...  If an important man does not respect those les...
6   Kò sí ęni tí ó ma gùn ęşin tí kò ní ju ìpàkó. ...  No one rides a horse without moving his head, ...  Your status in life dictates your attitude tow...
7   Bí abá so òkò sójà ará ilé eni ní bá;              He who throws a stone in the market will hit h...  Be careful what you do unto others it may retu...
8   Agba ki wa loja, ki ori omo titun o wo.            Do not go crazy, do not let the new baby look.     Behave in a mature manner so avoid bad reputat...
9   Adìẹ funfun kò mọ ara rẹ̀lágbà                      The white chicken does not realize its age         Respect yourself
10  Ọbẹ̀ kìí gbé inú àgbà mì                            The soup does not move round in an elder’s belly   You should be able to keep secrets

... and so on
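
To see why the selectors in the script pick up the right paragraphs, here is a small offline sketch. It assumes the page lays out each proverb as an `<ol>` followed directly by a "Translation:" paragraph and then a "Meaning:" paragraph; the `SAMPLE_HTML` string is a hypothetical stand-in for the real page, built from two rows of the output above. It also shows a second option, closer to the wording of the question: keep every `<p>` whose text starts with `Meaning:`, regardless of where it sits.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup imitating the assumed page layout (not the real page source).
SAMPLE_HTML = """
<p>Introductory text that should be ignored.</p>
<ol><li>Adìẹ funfun kò mọ ara rẹ̀lágbà</li></ol>
<p>Translation: The white chicken does not realize its age</p>
<p>Meaning: Respect yourself</p>
<ol start="2"><li>Ọbẹ̀ kìí gbé inú àgbà mì</li></ol>
<p>Translation: The soup does not move round in an elder’s belly</p>
<p>Meaning: You should be able to keep secrets</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Option 1: positional. 'ol + p + p' matches a <p> that immediately follows
# a <p> that immediately follows an <ol> (CSS adjacent-sibling combinator).
meanings_by_position = [
    re.sub(r"^Meaning:\s*", "", p.get_text(strip=True))
    for p in soup.select("ol + p + p")
]

# Option 2: by content. Keep any <p> whose text starts with the prefix,
# which does not depend on the paragraph's position in the page.
PREFIX = "Meaning:"
meanings_by_prefix = []
for p in soup.find_all("p"):
    text = p.get_text(strip=True)
    if text.startswith(PREFIX):
        meanings_by_prefix.append(text[len(PREFIX):].strip())

print(meanings_by_position)
print(meanings_by_prefix)
```

The positional selectors keep the three columns aligned row by row, which is what `zip` in the script relies on. The prefix filter may be the more forgiving choice if some proverbs lack a translation paragraph, but then the lists are no longer guaranteed to line up.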

Regarding "python - How to extract text from all paragraphs on a website that contain a specific string", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59708109/
