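I want to write a script that fetches only the description section of a Wikipedia page. That is, when I say
/wiki bla bla bla
it will go to the Wikipedia page for bla bla bla, fetch the following, and return it to the chatroom: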
"Bla Bla Bla" is the name of a song made by Gigi D'Agostino. He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"
How can I do this?
Best answer
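Here are a few different possible approaches; use whichever works for you. All of my code examples below use requests for HTTP requests to the API; if you have Pip, you can install requests with pip install requests. They also all use the MediaWiki API, and two of them use the query endpoint; follow those links if you want documentation.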
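Get a plain text representation of either the entire page or the page "extract" straight from the API with the extracts prop
Note that this approach only works on MediaWiki sites that have the TextExtracts extension. This notably includes Wikipedia, but not some smaller MediaWiki sites such as http://www.wikia.com/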
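You want to request https://en.wikipedia.org/w/api.php with the query parameters described below.
In detail, we have the following parameters (documented at https://www.mediawiki.org/wiki/Extension:TextExtracts#query+extracts ):
- action=query, format=json, and titles=Bla_Bla_Bla are all standard MediaWiki API parameters
- prop=extracts makes us use the TextExtracts extension
- exintro limits the response to the content before the first section heading
- explaintext makes the extract in the response plain text instead of HTML
Then parse the JSON response and pull out the extract: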
>>> import requests
>>> response = requests.get(
... 'https://en.wikipedia.org/w/api.php',
... params={
... 'action': 'query',
... 'format': 'json',
... 'titles': 'Bla Bla Bla',
... 'prop': 'extracts',
... 'exintro': True,
... 'explaintext': True,
... }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> print(page['extract'])
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.
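For a chat command such as /wiki bla bla bla, the request and the parsing step above could be wrapped in two small helpers. This is only a minimal sketch: the names build_extract_params and first_extract are hypothetical, and the sample response is a hand-written stand-in shaped like the API's default JSON output (the page ID 12345 is made up), so the parsing can be tried without a network call.

```python
def build_extract_params(title):
    """Return query parameters asking for the plain-text intro of one page."""
    return {
        'action': 'query',
        'format': 'json',
        'titles': title,
        'prop': 'extracts',
        'exintro': True,
        'explaintext': True,
    }


def first_extract(response_json):
    """Return the extract of the first page in a query response, or None."""
    pages = response_json.get('query', {}).get('pages', {})
    for page in pages.values():
        # A missing page has no 'extract' key, so .get() yields None for it.
        return page.get('extract')
    return None


# Hand-written sample shaped like the default JSON output (an assumption).
sample = {
    'query': {
        'pages': {
            '12345': {
                'title': 'Bla Bla Bla',
                'extract': '"Bla Bla Bla" is the title of a song.',
            }
        }
    }
}
print(first_extract(sample))
```

With a live connection, you would pass requests.get('https://en.wikipedia.org/w/api.php', params=build_extract_params('Bla Bla Bla')).json() to first_extract instead of the sample dictionary.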
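Get the full HTML of the page using the parse endpoint, parse it, and extract the first paragraph
MediaWiki has a parse endpoint that you can hit with a URL like https://en.wikipedia.org/w/api.php?action=parse&page=Bla_Bla_Bla to get the HTML of a page. You can then parse it with an HTML parser such as lxml (install it first with pip install lxml) to extract the first paragraph.
For example: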
>>> import requests
>>> from lxml import html
>>> response = requests.get(
... 'https://en.wikipedia.org/w/api.php',
... params={
... 'action': 'parse',
... 'page': 'Bla Bla Bla',
... 'format': 'json',
... }
... ).json()
>>> raw_html = response['parse']['text']['*']
>>> document = html.document_fromstring(raw_html)
>>> first_p = document.xpath('//p')[0]
>>> intro_text = first_p.text_content()
>>> print(intro_text)
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.
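Taking the very first p element can be fragile: if that element happens to be empty or contains only whitespace, intro_text comes back blank. A more defensive variant is sketched below. Note that it deliberately swaps lxml for the standard library's html.parser so that it runs without extra installs; the class FirstParagraphParser, the helper first_paragraph, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class FirstParagraphParser(HTMLParser):
    """Collect the text of the first <p> element that contains visible text."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.chunks = []
        self.result = None

    def handle_starttag(self, tag, attrs):
        # Only start collecting while no non-empty paragraph has been found.
        if tag == 'p' and self.result is None:
            self.in_paragraph = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == 'p' and self.in_paragraph:
            self.in_paragraph = False
            text = ''.join(self.chunks).strip()
            if text:
                self.result = text

    def handle_data(self, data):
        # Text inside nested tags such as <b> is collected as well.
        if self.in_paragraph:
            self.chunks.append(data)


def first_paragraph(raw_html):
    """Return the text of the first non-empty paragraph, or None."""
    parser = FirstParagraphParser()
    parser.feed(raw_html)
    parser.close()
    return parser.result


# Made-up HTML with an empty leading paragraph (an assumption for the demo).
sample_html = (
    '<div><p class="empty">\n</p>'
    '<p>"Bla Bla Bla" is a <b>song</b>.</p><p>More.</p></div>'
)
print(first_paragraph(sample_html))
```

In the lxml version above, the same idea amounts to looping over document.xpath('//p') and returning the first element whose text_content() is not blank.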
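Alternatively, you can use the query API to get the page's wikitext, parse it with mwparserfromhell (install it first with pip install mwparserfromhell), and then reduce it to human-readable text with strip_code. At the time of writing, strip_code does not work perfectly (as the example below shows), but it will hopefully improve.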
>>> import requests
>>> import mwparserfromhell
>>> response = requests.get(
... 'https://en.wikipedia.org/w/api.php',
... params={
... 'action': 'query',
... 'format': 'json',
... 'titles': 'Bla Bla Bla',
... 'prop': 'revisions',
... 'rvprop': 'content',
... }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> wikicode = page['revisions'][0]['*']
>>> parsed_wikicode = mwparserfromhell.parse(wikicode)
>>> print(parsed_wikicode.strip_code())
{{dablink|For Ke$ha's song, see Blah Blah Blah (song). For other uses, see Blah (disambiguation)}}
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.
Background and writing
He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"''.
Music video
The song also featured a popular music video in the style of La Linea. The music video shows a man with a floating head and no arms walking toward what appears to be a shark that multiplies itself and can change direction. This style was also used in "The Riddle", another song by Gigi D'Agostino, originally from British singer Nik Kershaw.
Chart performance
Chart (1999-00)PeakpositionIreland (IRMA)Search for Irish peaks23
References
External links
Category:1999 singles
Category:Gigi D'Agostino songs
Category:1999 songs
Category:ZYX Music singles
Category:Songs written by Gigi D'Agostino
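Because the strip_code output above still contains a leftover template, section headings, and category lines, some post-processing is needed if only the introduction is wanted. One rough heuristic, sketched here on the assumption that the stripped text looks like the output above, is to skip lines that still start with {{ and return the first remaining non-blank line. The function name intro_from_stripped is hypothetical, and the sample text is a shortened stand-in for the real output.

```python
def intro_from_stripped(text):
    """Return the first non-blank line that is not a leftover template."""
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and templates that strip_code left behind.
        if not line or line.startswith('{{'):
            continue
        return line
    return None


# Shortened stand-in for the strip_code output shown above (an assumption).
stripped = (
    '{{dablink|For other uses, see Blah (disambiguation)}}\n'
    '"Bla Bla Bla" is the title of a song.\n'
    '\n'
    'Background and writing\n'
)
print(intro_from_stripped(stripped))
```

This only works when the introduction is a single paragraph on one line; a multi-paragraph introduction would need a rule for where the first section heading begins.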
Regarding python - how to get plain text out of Wikipedia, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/4452102/