python - How to group Wikipedia categories in Python?


For each concept of my dataset I have stored the corresponding Wikipedia categories. For example, consider the following five concepts and their corresponding Wikipedia categories.


hypertriglyceridemia: ['Category:Lipid metabolism disorders', 'Category:Medical conditions related to obesity']
enzyme inhibitor: ['Category:Enzyme inhibitors', 'Category:Medicinal chemistry', 'Category:Metabolism']
bypass surgery: ['Category:Surgery stubs', 'Category:Surgical procedures and techniques']
perth: ['Category:1829 establishments in Australia', 'Category:Australian capital cities', 'Category:Metropolitan areas of Australia', 'Category:Perth, Western Australia', 'Category:Populated places established in 1829']
climate: ['Category:Climate', 'Category:Climatology', 'Category:Meteorological concepts']


As you can see, the first three concepts belong to the medical domain (whereas the remaining two terms are not medical terms).

More precisely, I want to divide my concepts into the medical and non-medical domains. However, it is very difficult to divide the concepts using the categories alone. For example, even though the two concepts enzyme inhibitor and bypass surgery are in the medical domain, their categories are very different from each other.

Therefore, I would like to know if there is a way to obtain the parent category of the categories (for example, the categories of enzyme inhibitor and bypass surgery belong to a medical parent category).

I am currently using pymediawiki and pywikibot. However, I am not restricted to those two libraries and am happy to have solutions using other libraries as well.
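To illustrate what I am after, here is a minimal sketch with pywikibot that goes one level up the category tree (the category name is only an example):

import pywikibot as pw

site = pw.Site('en', 'wikipedia')
cat = pw.Category(site, 'Category:Enzyme inhibitors')
# A category is itself a page, so we can ask for its parent categories
print([parent.title() for parent in cat.categories()])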

Edit:

As suggested by @IlmariKaronen, I also used the categories of categories, and the results I obtained are as follows (the small font near each category shows the categories of that category).

However, I still could not find a way to use these category details to decide whether a given term is medical or non-medical.

Moreover, as pointed out by @IlmariKaronen, using WikiProject details could be a potential path. However, it seems that the Medicine WikiProject does not have all the medical terms, so we would need to check other WikiProjects as well.
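For instance, here is a rough sketch of checking WikiProject membership via the article's talk-page templates (assuming the WikiProject banners are placed there, which is the usual convention):

import pywikibot as pw

site = pw.Site('en', 'wikipedia')
# The WikiProject banners conventionally live on the article's talk page
talk = pw.Page(site, 'Enzyme inhibitor').toggleTalkPage()
print(any('WikiProject Medicine' in template.title()
          for template in talk.templates()))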

Edit:
My current code for extracting the categories from a Wikipedia concept is as follows. It can be done with either pywikibot or pymediawiki, as shown below.


Using the library pymediawiki:

from mediawiki import MediaWiki

wikipedia = MediaWiki()
p = wikipedia.page('enzyme inhibitor')
print(p.categories)

Using the library pywikibot:

import pywikibot as pw

site = pw.Site('en', 'wikipedia')

print([
    cat.title()
    for cat in pw.Page(site, 'support-vector machine').categories()
    if 'hidden' not in cat.categoryinfo
])



The categories of categories can also be obtained in the same manner, as shown in @IlmariKaronen's answer.
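For reference, a small sketch of fetching that extra level with pywikibot (building on the snippet above):

import pywikibot as pw

site = pw.Site('en', 'wikipedia')
for cat in pw.Page(site, 'support-vector machine').categories():
    if 'hidden' in cat.categoryinfo:
        continue
    # Each category is itself a page, so its parent categories are one more call away
    print(cat.title(), [parent.title() for parent in cat.categories()])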

If you are looking for a longer list of concepts for testing, I have listed more examples below.

['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'newyork', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']


For a very long list, please check the link below: https://docs.google.com/document/d/1BYllMyDlw-Rb4uMh89VjLml2Bl9Y7oUlopM-Z4F6pN0/edit?usp=sharing

Note: I do not expect the solution to work 100% of the time (if the proposed algorithm is able to detect many of the medical concepts, that is enough for me).

I am happy to provide more details if needed.

Best Answer

Solution overview

Okay, I would approach the problem from multiple directions. There are some great suggestions here, and if I were you I would use an ensemble of those approaches (majority voting, predicting the label agreed upon by more than 50% of the classifiers in your binary case).

I am thinking about the following approaches:


Active learning (an example approach I provide below)
MediaWiki backlinks, provided as an answer by @TavoGC
SPARQL ancestor categories provided by @Stanislav Kralin and/or parent categories provided by @Meena Nagarajan as comments to your question (those two could be an ensemble on their own based on their differences, but for that you would have to contact both creators and compare their results).


This way, 2 out of 3 would have to agree that a certain concept is a medical one, which minimizes the chance of an error further.
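As a sketch, the voting itself is trivial once each approach yields a boolean vector per concept (the variable names in the usage comment are placeholders for the three approaches):

import numpy as np

def majority_vote(votes):
    # votes: (n_classifiers, n_concepts) boolean array, one row per approach
    votes = np.asarray(votes)
    return votes.sum(axis=0) > votes.shape[0] / 2

# majority_vote([active_learning_preds, backlinks_preds, sparql_preds])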

While we're at it, I would argue against the approach presented by @anand_v.singh in this answer, because:


the distance metric should not be Euclidean; cosine similarity is a much better metric (e.g. spaCy uses it), as it does not take the magnitude of the vectors into account (and it shouldn't, as that is how word2vec and GloVe were trained); a tiny sketch after this list illustrates the point
if I understand correctly, many artificial clusters would be created, while we only need two: a medical one and a non-medical one. Furthermore, the centroid of medicine is not centered on the medicine term itself. This poses additional problems, say the centroid is moved far away from medicine, and other words like, say, computer or human (or any other word which in your opinion does not fit into medicine) might get into the cluster.
it is really hard to evaluate the results; even more so, the matter is strictly subjective. Furthermore, word vectors are hard to visualize and understand (casting them into lower dimensions [2D/3D] using PCA/t-SNE or similar for so many words would give us totally nonsensical results [yes, I have tried it; PCA gets around 5% explained variance for your longer dataset, really, really low]).
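A tiny sketch of the first point (magnitude does not affect cosine similarity):

import numpy as np

def cosine_similarity(u, v):
    # Only the direction of the vectors matters, not their length
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 10 * a))  # 1.0, although the Euclidean distance is large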


Based on the problems highlighted above, I have come up with a solution using active learning, which is a rather forgotten approach to such problems.

Active learning approach

In this subset of machine learning, when we have a hard time coming up with an exact algorithm (say, what it means for a term to be part of the medical category), we ask a human "expert" (who doesn't actually have to be an expert) to provide some answers.

Encoding the knowledge

As pointed out by anand_v.singh, word vectors are one of the most promising options, and I will use them here as well (though differently, and IMO in a far cleaner fashion).

I am not going to repeat his points in my answer, so I will add my two cents:


Do not use contextualized word embeddings, even though they are the currently available state of the art (e.g. BERT).
Check how many of your concepts have no representation (e.g. are represented as a vector of zeros). It should be checked (and it is checked in my code; there will be further discussion when the time comes), and you may then use embeddings which have most of them present; a quick check sketch follows this list.
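A quick sketch of such a check (reusing the model and data file that appear in the code further below):

import json

import spacy

nlp = spacy.load("en_vectors_web_lg")
concepts = json.load(open("concepts_new.txt"))

# Concepts whose spaCy Doc has no vector at all
missing = [concept for concept in concepts if not nlp(concept).has_vector]
print(f"{len(missing)} out of {len(concepts)} concepts have no vector")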


Measuring similarity using spaCy

This class measures the similarity between medicine (encoded as spaCy's GloVe word vectors) and every other concept.

class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)


This code returns, for each concept, a number measuring how similar it is to the centroid. Furthermore, it records the indices of the concepts missing their representation. It can be called like this:

import json
import typing

import numpy as np
import spacy

nlp = spacy.load("en_vectors_web_lg")

centroid = nlp("medicine")

concepts = json.load(open("concepts_new.txt"))
concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
    concepts
)


You can substitute concepts_new.txt with a file holding your data.

Looking at spacy.load, notice that I used en_vectors_web_lg. It consists of 685,000 unique word vectors (which is a lot) and may work out of the box for your case. You have to download it separately after installing spaCy; more info is provided in the links above.

Additionally, you may want to use multiple centroid words, e.g. add words like disease or health and average their word vectors. I am not sure whether that would affect your case positively, though.
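If you wanted to try it, here is a minimal sketch of such an averaged centroid (the seed words are only an example):

import numpy as np
import spacy

nlp = spacy.load("en_vectors_web_lg")

seed_words = ["medicine", "disease", "health"]
# Average the seed words' vectors into a single centroid vector
centroid_vector = np.mean([nlp(word).vector for word in seed_words], axis=0)

def similarity_to_centroid(doc):
    # Cosine similarity computed by hand, as the averaged vector has no Doc object
    return float(
        np.dot(doc.vector, centroid_vector)
        / (np.linalg.norm(doc.vector) * np.linalg.norm(centroid_vector) + 1e-8)
    )

print(similarity_to_centroid(nlp("enzyme inhibitor")))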

Another possibility would be to use multiple centroids and calculate the similarity between each concept and several centroids. We could then have a few thresholds; this would likely remove some false positives, but might miss some terms one could consider similar to medicine. Furthermore, it would complicate the case much more, but if your results are unsatisfactory, you should consider the two options above (and only then; do not jump into this approach without prior thought).

Now we have a rough measure of each concept's similarity. But what does it mean that a certain concept has a positive similarity of 0.1 to medicine? Is it a concept one should classify as medical? Or maybe that is too far away already?

Asking the expert

To get a threshold (below which terms will be considered non-medical), the easiest way is to ask a human to classify some of the concepts for us (and that's what active learning is about). Yeah, I know it's a really simple form of active learning, but I would consider it such anyway.

I have written a class with an sklearn-like interface, asking the human to classify concepts until the optimal threshold (or the maximum number of iterations) is reached.

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1


The samples argument describes how many examples will be shown to the expert during each iteration (it is the maximum; fewer will be returned if samples were already asked for or there are not enough of them to show).
step represents the drop of the threshold in each iteration (we start at 1, meaning perfect similarity).
change_multiplier - if the expert answers that the concepts are not related (or mostly not related, as multiple of them are returned), the step is multiplied by this floating-point number. It is used to pinpoint the exact threshold between step changes at each iteration (a hypothetical trace follows this list).
Concepts are sorted based on their similarity (the more similar a concept is, the higher it ranks).
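To make the search dynamics concrete, here is a hypothetical trace (a sketch; the exact numbers depend on the expert's answers):

# step=0.05, change_multiplier=0.7
# threshold=0.95 -> expert: y -> threshold lowered to 0.90
# threshold=0.90 -> expert: n -> step shrinks to 0.035, threshold raised to 0.935
# threshold=0.935 -> expert: y -> threshold lowered back to 0.90
# ...the search narrows in on the boundary between medical and non-medical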


The function below asks the expert for their opinion and finds the optimal threshold based on their answers.

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"


An example question looks like this:

Are those concepts related to medicine?                                                      

0. anesthetic drug
1. child and adolescent psychiatry
2. tertiary care center
3. sex therapy
4. drug design
5. pain disorder
6. psychiatric rehabilitation
7. combined oral contraceptive
8. family practitioner committee
9. cancer family syndrome
10. social psychology
11. drug sale
12. blood system

[y]es / [n]o / [any]quit y


...and parsing the expert's answer:

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher, as the current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as the current threshold is already not related to medicine
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step > self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False


And finally the full code of ActiveLearner, which finds the optimal threshold of similarity based on the expert's answers:

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher, as the current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as the current threshold is already not related to medicine
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step > self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self


All in all, you would have to answer some questions manually, but in my opinion this approach is way more accurate.

Furthermore, you don't have to go through all of the samples, just a small subset of them. You may decide how many samples constitute a medical term (should it still be considered medical if 40 medical samples and 10 non-medical samples were shown?), which lets you fine-tune this approach to your preferences. If there is an outlier (say, 1 sample out of 50 is non-medical), I would consider the threshold to still be valid.

Once again: this approach should be mixed with others in order to minimize the chance of wrong classification.

Classifier

Once we obtain the threshold from the expert, classification is instantaneous; here is a simple class for it:

class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions


And for brevity, below is the final source code:

import json
import typing

import numpy as np
import spacy


class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)


class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher, as the current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as the current threshold is already not related to medicine
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step > self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self


class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions


if __name__ == "__main__":
    nlp = spacy.load("en_vectors_web_lg")

    centroid = nlp("medicine")

    concepts = json.load(open("concepts_new.txt"))
    concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
        concepts
    )

    learner = ActiveLearner(
        np.array(concepts), concepts_similarity, samples=20, max_steps=50
    ).fit()
    print(f"Found threshold {learner.threshold_}\n")

    classifier = Classifier(centroid, learner.threshold_)
    pipe = nlp.pipe(concepts, n_threads=-1, batch_size=4096)
    predictions = classifier.predict(pipe)
    print(
        "\n".join(
            f"{concept}: {label}"
            for concept, label in zip(concepts[20:40], predictions[20:40])
        )
    )


After answering some of the questions, with a threshold of 0.1 (everything within [-1, 0.1) is considered non-medical, while everything within [0.1, 1] is considered medical), I got the following results:

kartagener s syndrome: True
summer season: True
taq: False
atypical neuroleptic: True
anterior cingulate: False
acute respiratory distress syndrome: True
circularity: False
mutase: False
adrenergic blocking drug: True
systematic desensitization: True
the turning point: True
9l: False
pyridazine: False
bisoprolol: False
trq: False
propylhexedrine: False
type 18: True
darpp 32: False
rickettsia conorii: False
sport shoe: True


As you can see, this approach is far from perfect, so the last section describes possible improvements:

Possible improvements

As mentioned at the beginning, using my approach mixed with other answers would probably rule out ideas like sport shoe belonging to medicine, and the active learning approach would be the deciding vote in case of a draw between the two heuristics mentioned above.

We could create an active learning ensemble as well. Instead of one threshold, say 0.1, we would use multiple of them (either increasing or decreasing), say 0.1, 0.2, 0.3, 0.4, 0.5.

Say for sport shoe, for each threshold the respective True/False would look like this:

True True False False False

Making a majority vote, we would mark it non-medical by 3 votes to 2. Furthermore, a threshold that is too strict could be mitigated as well if the thresholds below it outvote it (if True/False would look like this: True True True False False).
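A minimal sketch of such a threshold ensemble (this helper is hypothetical, not part of the code above):

def ensemble_predict(concept_similarity, thresholds=(0.1, 0.2, 0.3, 0.4, 0.5)):
    # One vote per threshold; the concept is medical if most thresholds agree
    votes = [concept_similarity > threshold for threshold in thresholds]
    return sum(votes) > len(votes) / 2

print(ensemble_predict(0.25))  # votes True True False False False -> False (non-medical)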

The final possible improvement I came up with: in the code above I use the Doc vector, which is the mean of the word vectors creating the concept. Say one word is missing (a vector consisting of zeros); in that case it would be pushed further away from the medicine centroid. You may not want that (as some niche medical terms [abbreviations like gpv or others] might be missing their representation); in such a case you could average only those vectors which are different from zero.
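A sketch of that averaging (standard spaCy token attributes; falls back to a zero vector when nothing is present):

import numpy as np

def nonzero_mean_vector(doc):
    # Average only the tokens that actually have a vector, so that missing words
    # (zero vectors) do not drag the concept away from the medicine centroid
    vectors = [token.vector for token in doc if token.has_vector]
    if not vectors:
        return np.zeros(doc.vocab.vectors_length)
    return np.mean(vectors, axis=0)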

I know this post is quite long, so if you have any questions, post them below.

Regarding "python - How to group Wikipedia categories in Python?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54625493/
