
python - Interested in searching only medicine-related terms in a Wikipedia XML dump

Reposted. Author: 行者123. Updated: 2023-12-01 04:46:10

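I want to define medical terms automatically. However, standard medical dictionaries and WordNet are not enough, so I downloaded the Wikipedia corpus to use instead. When I downloaded enwiki-latest-pages-articles.xml (which, by the way, starts with the word "Anarchism"; why not something like "AA"?), grep immediately failed because of the file size, so I started looking online. I found libraries that I think were written for this, such as Perl's MediaWiki::DumpFile (I do know some Perl, but I would prefer Python since that is what my scripts are written in), but it looks like most of them create or require some kind of database. I just want to match a word, albeit fuzzily, and grab the first few sentences of its introductory paragraph; for example, searching for "salmonella" would return: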

Salmonella /ˌsælməˈnɛlə/ is a genus of rod-shaped (bacillus) bacteria of the Enterobacteriaceae family. There are only two species of Salmonella, Salmonella bongori and Salmonella enterica, of which there are around six subspecies and innumerable serovars. The genus Escherichia, which includes the species E.coli belongs to the same family.Salmonellae are found worldwide in both cold-blooded and warm-blooded animals, and in the environment. They cause illnesses such as typhoid fever, paratyphoid fever, and food poisoning.[1].

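For my purposes (just using it as a glossary), are these scripts what I want? I find the documentation hard to understand without examples. For example, I would like to: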

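  1. Just to reduce the material to search, remove everything unrelated to medicine (I tried this with category filters, since Wikipedia allows exporting specific categories, but they did not work the way I wanted; for example, "Medicine" only returns about 20 pages, so I would prefer to process the XML file somehow).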

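  2. Allow my Python script to search the Wikipedia corpus quickly (for example, if I want to match CHOLERAE, I would like it to take me to the definition of Vibrio cholerae, just as the actual Wikipedia search function does, taking me straight to the top choice). I have already written a search engine of sorts that can do this, but it would be slow on a file this large (40 GB).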

Apologies in advance for what is probably a very naive question.

Best answer

Here is a way to query the Wikipedia database without downloading the whole thing.

import argparse

import requests

parser = argparse.ArgumentParser(description='Fetch Wikipedia extracts.')
parser.add_argument('word', help='word to define')
args = parser.parse_args()

proxies = {
    # See http://www.mediawiki.org/wiki/API:Main_page#API_etiquette
    # "http": "http://localhost:3128",
}

headers = {
    # http://www.mediawiki.org/wiki/API:Main_page#Identifying_your_client
    "User-Agent": "Definitions/1.0 (Contact rob@example.com for info.)"
}

params = {
    'action': 'query',
    'prop': 'extracts',
    'format': 'json',
    'exintro': 1,        # only the introductory section
    'explaintext': 1,    # plain text instead of HTML
    'generator': 'search',
    'gsrsearch': args.word,
    'gsrlimit': 1,       # top search hit only
    'continue': ''
}

r = requests.get('http://en.wikipedia.org/w/api.php',
                 params=params,
                 headers=headers,
                 proxies=proxies)
data = r.json()  # named `data` so it cannot be confused with the json module
if "query" in data:
    # Dict views are not indexable in Python 3, so take the first page via an iterator.
    page = next(iter(data["query"]["pages"].values()))
    print(page["extract"])
else:
    print("No definition.")
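
The response-handling step at the end of the script can also be factored into a small helper that is easy to check without network access. The following is a minimal sketch and not part of the original answer: `first_extract` and `first_sentences` are hypothetical names, the sample response is hand-built to imitate the nesting the script reads (`query`, then `pages`, then a page id, then `extract`), and the sentence split is a naive regular expression that will mis-split abbreviations such as "V. cholerae".

```python
import re

def first_extract(response):
    """Return the extract of the first page in a query response, or None."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        return page.get("extract")
    return None

def first_sentences(text, count=2):
    """Naively keep the first `count` sentences.

    Splits on '.', '!' or '?' followed by whitespace, so abbreviations
    are not handled correctly.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(parts[:count])

# Hand-built sample; the page id and the shortened extract are made up.
sample = {
    "query": {
        "pages": {
            "12345": {
                "title": "Salmonella",
                "extract": "Salmonella is a genus of bacteria. "
                           "It has two species. They cause illness.",
            }
        }
    }
}

print(first_sentences(first_extract(sample)))
print(first_extract({}))  # no "query" key, so there is nothing to return
```

Returning `None` instead of raising keeps the "No definition." branch of the script a simple truthiness check.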

Here are some results. Note that it still returns a result even when the word is misspelled.

$ python define.py CHOLERAE
Vibrio cholerae is a Gram-negative, comma-shaped bacterium. Some strains of V. cholerae cause the disease cholera. V. cholerae is a facultative anaerobic organism and has a flagellum at one cell pole. V. cholerae was first isolated as the cause of cholera by Italian anatomist Filippo Pacini in 1854, but his discovery was not widely known until Robert Koch, working independently 30 years later, publicized the knowledge and the means of fighting the disease.
$ python define.py salmonella
Salmonella /ˌsælməˈnɛlə/ is a genus of rod-shaped (bacillus) bacteria of the Enterobacteriaceae family. There are only two species of Salmonella, Salmonella bongori and Salmonella enterica, of which there are around six subspecies and innumerable serovars. The genus Escherichia, which includes the species E.coli belongs to the same family.
Salmonellae are found worldwide in both cold-blooded and warm-blooded animals, and in the environment. They cause illnesses such as typhoid fever, paratyphoid fever, and food poisoning.
$ python define.py salmanela
Salmonella /ˌsælməˈnɛlə/ is a genus of rod-shaped (bacillus) bacteria of the Enterobacteriaceae family. There are only two species of Salmonella, Salmonella bongori and Salmonella enterica, of which there are around six subspecies and innumerable serovars. The genus Escherichia, which includes the species E.coli belongs to the same family.
Salmonellae are found worldwide in both cold-blooded and warm-blooded animals, and in the environment. They cause illnesses such as typhoid fever, paratyphoid fever, and food poisoning.
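
If working offline against the downloaded dump is still preferred, the file can be streamed page by page instead of being grepped or loaded whole. The sketch below is an illustration under stated assumptions and not part of the original answer: it uses the standard library's `xml.etree.ElementTree.iterparse`, the helper names `local` and `iter_pages` are hypothetical, the tiny inline XML only imitates the `<page>`/`<title>`/`<text>` layout of a MediaWiki export, and the category names and the substring test on the wikitext are made up for the example.

```python
import io
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(source):
    """Yield (title, wikitext) for each <page>, discarding pages already seen."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end" or local(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        root.clear()  # drop finished pages so memory use stays bounded

# Made-up two-page sample in the general shape of a MediaWiki export.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Anarchism</title><revision>
    <text>Political philosophy. [[Category:Anarchism]]</text>
  </revision></page>
  <page><title>Salmonella</title><revision>
    <text>Genus of bacteria. [[Category:Medical terminology]]</text>
  </revision></page>
</mediawiki>"""

# Keep only pages whose wikitext carries a category link starting with "Medical".
medical = [
    title
    for title, text in iter_pages(io.BytesIO(SAMPLE.encode("utf-8")))
    if "[[Category:Medical" in text
]
print(medical)
```

For the real dump, `source` would be the path to the XML file or an open (possibly decompressing) file object; a single pass like this could write the matching pages to a much smaller file for later searching.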

Regarding "python - Interested in searching only medicine-related terms in a Wikipedia XML dump", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29357643/
