
python - How to get plain text out of Wikipedia

Reposted. Author: 行者123. Updated: 2023-12-03 00:56:17

I'd like to write a script that gets only the description section of a Wikipedia article. That is, when I say

/wiki bla bla bla

it will go to the Wikipedia page for bla bla bla, get the following, and return it to the chatroom:

"Bla Bla Bla" is the name of a song made by Gigi D'Agostino. He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"

How can I do this?

Best answer

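Here are a few different possible approaches; use whichever works for you. All of the code examples below use requests for HTTP requests to the API; if you have pip, you can install it with pip install requests. They also all use the MediaWiki API, and two of them (approaches 1 and 3) use the query endpoint; consult the documentation for each if you need more detail.

1. Get a plain text representation of either the entire page or a page "extract" straight from the API with the extracts prop

Note that this approach only works on MediaWiki sites that have the TextExtracts extension installed. This notably includes Wikipedia, but not some smaller MediaWiki sites such as http://www.wikia.com/

You want to hit a URL like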

https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Bla_Bla_Bla&prop=extracts&exintro&explaintext

Breaking that down, the URL contains the following parameters (documented at https://www.mediawiki.org/wiki/Extension:TextExtracts#query+extracts ):

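  • action=query, format=json, and titles=Bla_Bla_Bla are all standard MediaWiki API parameters
  • prop=extracts makes us use the TextExtracts extension
  • exintro limits the response to content before the first section heading
  • explaintext makes the extract in the response plain text instead of HTML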

Then parse the JSON response and pull out the extract:

>>> import requests
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'query',
...         'format': 'json',
...         'titles': 'Bla Bla Bla',
...         'prop': 'extracts',
...         'exintro': True,
...         'explaintext': True,
...     }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> print(page['extract'])
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.
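
The snippet above assumes the requested page exists. The following is a minimal sketch of a more defensive lookup, assuming the default JSON layout in which a nonexistent title comes back as a page entry with a missing key and no extract. The name extract_from_response is a hypothetical helper, not part of requests or the MediaWiki API:

```python
def extract_from_response(data):
    """Return the extract of the first existing page, or None if there is none."""
    pages = data.get('query', {}).get('pages', {})
    for page in pages.values():
        # Assumption: nonexistent pages carry a 'missing' key and no 'extract'.
        if 'missing' in page:
            continue
        return page.get('extract')
    return None


# Hand-written sample data in the same shape as the response used above.
found = {'query': {'pages': {'123': {'title': 'Bla Bla Bla',
                                     'extract': 'Some intro text.'}}}}
not_found = {'query': {'pages': {'-1': {'title': 'No such page',
                                        'missing': ''}}}}

print(extract_from_response(found))
print(extract_from_response(not_found))
```

A chat command handler could then reply with a "page not found" message whenever the helper returns None, rather than failing with a KeyError.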

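2. Get the full HTML of the page using the parse endpoint, parse it, and extract the first paragraph

MediaWiki has a parse endpoint that you can hit with a URL like https://en.wikipedia.org/w/api.php?action=parse&page=Bla_Bla_Bla to get the HTML of a page. You can then parse it with an HTML parser such as lxml (install it first with pip install lxml) to extract the first paragraph.

For example: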

>>> import requests
>>> from lxml import html
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'parse',
...         'page': 'Bla Bla Bla',
...         'format': 'json',
...     }
... ).json()
>>> raw_html = response['parse']['text']['*']
>>> document = html.document_fromstring(raw_html)
>>> first_p = document.xpath('//p')[0]
>>> intro_text = first_p.text_content()
>>> print(intro_text)
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.
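
Taking the very first p element can return an empty string if the rendered page happens to start with an empty paragraph. The sketch below picks the first paragraph that actually contains text. To keep it dependency-free it swaps lxml for the standard library's html.parser, so it is an illustration of the idea rather than the approach used above; FirstParagraph and first_nonempty_paragraph are hypothetical names, and the sample HTML is hand-written:

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collect the text of the first <p> that contains non-whitespace text."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.parts = []
        self.result = None

    def handle_starttag(self, tag, attrs):
        # Only start collecting while no paragraph has been accepted yet.
        if tag == 'p' and self.result is None:
            self.in_p = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == 'p' and self.in_p:
            self.in_p = False
            text = ''.join(self.parts).strip()
            if text:  # skip paragraphs that are empty or whitespace only
                self.result = text

    def handle_data(self, data):
        if self.in_p:
            self.parts.append(data)


def first_nonempty_paragraph(raw_html):
    parser = FirstParagraph()
    parser.feed(raw_html)
    parser.close()
    return parser.result


sample_html = (
    '<div><p class="mw-empty-elt">\n</p>'
    '<p><b>"Bla Bla Bla"</b> is a <a href="#">song</a>.</p>'
    '<p>Second paragraph.</p></div>'
)
print(first_nonempty_paragraph(sample_html))
```

If you stay with lxml, a comparable filter can probably be expressed directly in the XPath query, for instance as //p[normalize-space()], which matches only paragraphs whose text is not empty.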

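3. Parse the wikitext yourself

You can use the query API to get the page's wikitext, parse it with mwparserfromhell (install it first with pip install mwparserfromhell), and then reduce it to human-readable text with strip_code. At the time of writing, strip_code doesn't work perfectly (as the example below shows), but it will hopefully improve.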

>>> import requests
>>> import mwparserfromhell
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'query',
...         'format': 'json',
...         'titles': 'Bla Bla Bla',
...         'prop': 'revisions',
...         'rvprop': 'content',
...     }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> wikicode = page['revisions'][0]['*']
>>> parsed_wikicode = mwparserfromhell.parse(wikicode)
>>> print(parsed_wikicode.strip_code())
{{dablink|For Ke$ha's song, see Blah Blah Blah (song). For other uses, see Blah (disambiguation)}}

"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.

Background and writing
He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"''.

Music video
The song also featured a popular music video in the style of La Linea. The music video shows a man with a floating head and no arms walking toward what appears to be a shark that multiplies itself and can change direction. This style was also used in "The Riddle", another song by Gigi D'Agostino, originally from British singer Nik Kershaw.

Chart performance
Chart (1999-00)PeakpositionIreland (IRMA)Search for Irish peaks23

References

External links


Category:1999 singles
Category:Gigi D'Agostino songs
Category:1999 songs
Category:ZYX Music singles
Category:Songs written by Gigi D'Agostino
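
Since the output above covers the whole page and still contains a leftover template, some post-processing is needed to get only the introduction. The following is a heuristic sketch, not a feature of mwparserfromhell: it splits the stripped text on blank lines and returns the first block that is not a leftover {{...}} template. The function name first_prose_paragraph and the shortened sample text are illustrative assumptions:

```python
import re


def first_prose_paragraph(stripped_text):
    """Return the first blank-line-separated block that is not a template."""
    for block in re.split(r'\n\s*\n', stripped_text):
        block = block.strip()
        if not block:
            continue
        # Heuristic: skip leftovers such as {{dablink|...}}.
        if block.startswith('{{') and block.endswith('}}'):
            continue
        return block
    return ''


# Shortened, hand-written sample in the shape of the output above.
sample = (
    "{{dablink|For other uses, see Blah (disambiguation)}}\n"
    "\n"
    '"Bla Bla Bla" is the title of a song.\n'
    "\n"
    "Background and writing\n"
    "He described this song as a piece about people who talk and talk.\n"
)
print(first_prose_paragraph(sample))
```

This relies on the introduction being the first non-template block, which holds for the output above but is not guaranteed for every article.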

Regarding "python - How to get plain text out of Wikipedia", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/4452102/
