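For each concept in my dataset I have stored the corresponding Wikipedia categories. For example, consider the following 5 concepts and their corresponding Wikipedia categories.
hypertriglyceridemia: ['Category:Lipid metabolism disorders', 'Category:Medical conditions related to obesity']
enzyme inhibitor: ['Category:Enzyme inhibitors', 'Category:Medicinal chemistry', 'Category:Metabolism']
bypass surgery: ['Category:Surgery stubs', 'Category:Surgical procedures and techniques']
perth: ['Category:1829 establishments in Australia', 'Category:Australian capital cities', 'Category:Metropolitan areas of Australia', 'Category:Perth, Western Australia', 'Category:Populated places established in 1829']
climate: ['Category:Climate', 'Category:Climatology', 'Category:Meteorological concepts']
As you can see, the first three concepts belong to the medical domain (whereas the remaining two are not medical terms).
More precisely, I want to divide my concepts into medical and non-medical ones. However, it is very difficult to do this using the categories alone. For example, even though the two concepts enzyme inhibitor and bypass surgery are both in the medical domain, their categories are very different from each other.
Therefore, I would like to know whether there is a way to obtain the parent category of the categories (for example, the categories of enzyme inhibitor and bypass surgery belong to the medical parent category).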
I am currently using pymediawiki and pywikibot. However, I am not restricted to these two libraries and am happy to use other libraries as well.
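EDIT
As suggested by @IlmariKaronen, I also used the categories of categories, and the results I got are as follows (the small font near each category is the categories of the category).
However, I still could not find a way to use these category details to decide whether a given term is medical or non-medical.
Moreover, as pointed out by @IlmariKaronen, using Wikiproject details is a potential option. However, it seems that the Medicine wikiproject does not cover all medical terms, so we would also need to check other wikiprojects.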
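One way to make use of the categories of categories is to keep climbing for a few levels and check whether a medical root category is reached. The following is only a minimal sketch of that idea: `category_ancestors`, `get_parents`, `TOY_PARENTS` and the toy edges are hypothetical names and made-up data, not part of either library. With pywikibot, `get_parents` could presumably wrap `pw.Category(site, title).categories()`.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set


def category_ancestors(
    start: Iterable[str],
    get_parents: Callable[[str], Iterable[str]],
    max_depth: int = 3,
) -> Set[str]:
    """Collect every category reachable from `start` within `max_depth` parent hops."""
    seen: Set[str] = set()
    queue = deque((category, 0) for category in start)
    while queue:
        category, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for parent in get_parents(category):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return seen


# Toy parent map standing in for Wikipedia; these edges are made up for illustration.
TOY_PARENTS: Dict[str, List[str]] = {
    "Category:Enzyme inhibitors": ["Category:Pharmacology"],
    "Category:Surgical procedures and techniques": ["Category:Surgery"],
    "Category:Pharmacology": ["Category:Medicine"],
    "Category:Surgery": ["Category:Medicine"],
    "Category:Climatology": ["Category:Earth sciences"],
}


def toy_get_parents(category: str) -> List[str]:
    return TOY_PARENTS.get(category, [])


enzyme = category_ancestors(["Category:Enzyme inhibitors"], toy_get_parents)
climate = category_ancestors(["Category:Climatology"], toy_get_parents)
print("Category:Medicine" in enzyme, "Category:Medicine" in climate)
```

The depth limit matters because the Wikipedia category graph is very broad near the top, so an unbounded climb would likely connect almost everything to almost everything.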
EDIT:
My current code for extracting categories from Wikipedia concepts is as follows. This can be done with either pywikibot or pymediawiki.
Using the library pymediawiki

from mediawiki import MediaWiki

wikipedia = MediaWiki()
p = wikipedia.page('enzyme inhibitor')
print(p.categories)

Using the library pywikibot

import pywikibot as pw

site = pw.Site('en', 'wikipedia')
print([
    cat.title()
    for cat in pw.Page(site, 'support-vector machine').categories()
    if 'hidden' not in cat.categoryinfo
])
['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'newyork', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']
Best answer
Solution overview
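Well, I would attack this problem from several directions. There are some great suggestions here, and if I were you I would use an ensemble of those approaches (majority voting: predict the label that more than 50% of the classifiers agree on, in your binary case).
I am thinking of the following approaches:
Active learning (the example approach I provide below)
MediaWiki backlinks, provided as an answer by @TavoGC
SPARQL ancestor categories, provided as a comment on your question by @Stanislav Kralin, and/or parent categories, provided by @Meena Nagarajan (those two could form an ensemble of their own based on their differences, but for that you would have to contact both creators and compare their results).
This way, two out of three would have to agree that a given concept is a medical one, which further minimizes the chance of error.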
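The two-out-of-three rule can be sketched as follows. The votes are made up for illustration, and `majority_vote` and `votes_per_concept` are hypothetical names; in practice each vote would come from one of the three methods listed above.

```python
from typing import Dict, List


def majority_vote(votes: List[bool]) -> bool:
    """Return True when more than half of the classifiers voted True."""
    return sum(votes) > len(votes) / 2


# Made-up votes from three methods, in the order:
# active learning, backlinks, ancestor categories.
votes_per_concept: Dict[str, List[bool]] = {
    "enzyme inhibitor": [True, True, False],
    "sport shoe": [True, False, False],
}

labels = {concept: majority_vote(votes) for concept, votes in votes_per_concept.items()}
print(labels)
```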
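While we are at it, I would argue against the approach presented by @ananand_v.singh in this answer, because:
The distance metric should not be Euclidean; cosine similarity is a much better metric (used by, e.g., spaCy) because it does not take the magnitude of the vectors into account (and it should not; that is how word2vec or GloVe were trained).
If I understood correctly, many artificial clusters would be created, while we only need two: medical and non-medical. Furthermore, the centroid of medicine is not centered on medicine itself. This poses additional problems: say the centroid moves far away from medicine, and other words such as computer or human (or any other word that in your opinion does not fit into medicine) might get into the cluster.
It is hard to evaluate the results; even more so, the matter is strictly subjective. Furthermore, word vectors are hard to visualize and understand (casting them into lower dimensions [2D/3D] using PCA/t-SNE/similar for so many words would give us totally nonsensical results [yes, I have tried it; PCA gets around 5% explained variance for your longer dataset, which is really, really low]).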
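The first point, that cosine similarity ignores vector magnitude while Euclidean distance does not, can be checked with a small NumPy sketch. The vectors here are arbitrary toy values, not real embeddings, and `cosine_similarity` is a helper defined only for this illustration.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; unaffected by their lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a  # same direction, ten times the magnitude

cosine = cosine_similarity(a, b)
euclidean = float(np.linalg.norm(a - b))

print(cosine)     # close to 1: the direction is identical
print(euclidean)  # large: the magnitudes differ a lot
```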
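Based on the problems highlighted above, I have come up with a solution using active learning, which is a pretty forgotten approach to such problems.
Active learning approach
In this subset of machine learning, when we have a hard time coming up with an exact algorithm (for example, what does it mean for a term to be part of the medical category), we ask a human "expert" (who does not actually have to be an expert) to provide some answers.
Knowledge encoding
As anand_v.singh pointed out, word vectors are one of the most promising approaches, and I will use them here as well (differently though, and IMO in a much cleaner and easier fashion).
I am not going to repeat his points in my answer, so I will add my two cents:
Do not use contextualized word embeddings, currently available as the state of the art (e.g. BERT).
Check how many of your concepts have no representation (e.g. are represented as a vector of zeros). It should be checked (and it is checked in my code; there will be further discussion when the time comes), and you may use the embedding which has most of them present.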
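Checking how many concepts lack a representation can be sketched with plain NumPy, assuming the concept vectors have already been stacked into a matrix with one row per concept. The matrix below is toy data and `missing_rows` is a hypothetical helper.

```python
import numpy as np


def missing_rows(vectors: np.ndarray) -> np.ndarray:
    """Indices of rows that are entirely zero, i.e. concepts with no embedding."""
    return np.nonzero(~vectors.any(axis=1))[0]


# Toy matrix: rows 1 and 3 stand for concepts the embedding does not know.
vectors = np.array([[0.1, 0.3], [0.0, 0.0], [0.2, -0.4], [0.0, 0.0]])
missing = missing_rows(vectors)
print(missing.tolist(), f"{len(missing) / len(vectors):.0%} missing")
```

Running the same check against several candidate embeddings gives a simple way to pick the one that covers the most concepts.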
Measuring similarity using spaCy
This class measures the similarity between medicine, encoded as spaCy's GloVe word vector, and every other concept.

class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        # Indices of concepts that have no vector representation
        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)

It can be called like this:

import json
import typing

import numpy as np
import spacy

nlp = spacy.load("en_vectors_web_lg")

centroid = nlp("medicine")

concepts = json.load(open("concepts_new.txt"))
concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
    concepts
)

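You can substitute your own data for new_concepts.json.
Note that spacy.load is given en_vectors_web_lg. It consists of 685,000 unique word vectors (which is a lot) and may work out of the box for your case. You have to download it separately after installing spaCy; more information is provided in the links above.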
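Additionally, you may want to use several centroid words, for example by adding words such as disease or health and averaging their word vectors. I am not sure whether that would have a positive effect in your case, though.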
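Averaging several centroid words can be sketched as below. The 3-dimensional vectors are made up so that the example runs without a spaCy model; with real embeddings each entry of `word_vectors` would be the vector of the corresponding word.

```python
import numpy as np

# Toy 3-d vectors standing in for real word embeddings (values are made up).
word_vectors = {
    "medicine": np.array([0.9, 0.1, 0.0]),
    "disease": np.array([0.7, 0.3, 0.0]),
    "health": np.array([0.8, 0.0, 0.2]),
}

# The combined centroid is the element-wise mean of the chosen words.
centroid = np.mean(
    [word_vectors[word] for word in ("medicine", "disease", "health")], axis=0
)
print(centroid)
```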
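[...] terms similar to medicine. Furthermore, it would complicate the case much more, but if your results are unsatisfactory you should consider the two options above (and only in that case; do not jump into this approach without prior thought).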
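I have written a class with an sklearn-like interface that asks a human to classify concepts until the optimal threshold (or the maximum number of iterations) is reached.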

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        max_steps: int,
        samples: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.max_steps: int = max_steps
        self.samples: int = samples
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

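The samples argument describes how many examples will be shown to the expert during each iteration (it is the maximum; fewer will be returned if samples were already asked for or there are not enough of them to show).
step represents the drop of the threshold in each iteration (we start at 1, meaning perfect similarity).
change_multiplier: if the expert answers that the concepts are not related (or mostly unrelated, as multiple of them are returned), step is multiplied by this floating-point number. It is used to pinpoint the exact threshold between step changes at each iteration.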
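How step and change_multiplier interact can be illustrated with a simplified, non-interactive sketch. This is not the ActiveLearner class itself: `search_threshold`, `simulated_expert`, the `min_step` stopping rule and the 0.12 cutoff are all assumptions made for the illustration, with a function standing in for the human expert.

```python
from typing import Callable


def search_threshold(
    looks_medical: Callable[[float], bool],
    step: float = 0.05,
    change_multiplier: float = 0.7,
    max_steps: int = 50,
    min_step: float = 1e-3,
) -> float:
    """Walk the threshold down from 1 and shrink the step after every "no"."""
    threshold = 1.0
    for _ in range(max_steps):
        if looks_medical(threshold):
            # Concepts at this similarity are still medical: try a lower threshold.
            threshold -= step
        else:
            # Went too low: shrink the step and move back up by the smaller amount.
            step *= change_multiplier
            threshold += step
        if step < min_step:
            break
    return threshold


# Stand-in for the human expert: pretend everything with similarity >= 0.12 is medical.
def simulated_expert(threshold: float) -> bool:
    return threshold >= 0.12


found = search_threshold(simulated_expert)
print(round(found, 3))
```

The threshold first descends in steps of 0.05, then oscillates around the simulated cutoff with ever smaller steps, which is the "pinpointing" role that change_multiplier plays.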

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

Are those concepts related to medicine?
0. anesthetic drug
1. child and adolescent psychiatry
2. tertiary care center
3. sex therapy
4. drug design
5. pain disorder
6. psychiatric rehabilitation
7. combined oral contraceptive
8. family practitioner committee
9. cancer family syndrome
10. social psychology
11. drug sale
12. blood system
[y]es / [n]o / [any]quit y

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as current threshold is not related to medicine already
            self._min_threshold = self.threshold_
            # Multiply the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step < self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

Below is the full code of ActiveLearner, which finds the optimal similarity threshold according to the expert's answers:

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as current threshold is not related to medicine already
            self._min_threshold = self.threshold_
            # Multiply the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step < self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                # Nothing is this similar to the centroid yet, so lower the threshold
                self.threshold_ -= self.step
        return self


class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions

The whole code:

import json
import typing

import numpy as np
import spacy


class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        # Indices of concepts that have no vector representation
        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)


class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as current threshold is not related to medicine already
            self._min_threshold = self.threshold_
            # Multiply the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step < self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                # Nothing is this similar to the centroid yet, so lower the threshold
                self.threshold_ -= self.step
        return self


class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions


if __name__ == "__main__":
    nlp = spacy.load("en_vectors_web_lg")

    centroid = nlp("medicine")

    concepts = json.load(open("concepts_new.txt"))
    concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
        concepts
    )

    learner = ActiveLearner(
        np.array(concepts), concepts_similarity, samples=20, max_steps=50
    ).fit()
    print(f"Found threshold {learner.threshold_}\n")

    classifier = Classifier(centroid, learner.threshold_)
    pipe = nlp.pipe(concepts, n_threads=-1, batch_size=4096)
    predictions = classifier.predict(pipe)
    print(
        "\n".join(
            f"{concept}: {label}"
            for concept, label in zip(concepts[20:40], predictions[20:40])
        )
    )

With a threshold of 0.1 (everything in [-1, 0.1) is considered non-medical, while everything in [0.1, 1] is considered medical), I got the following results:
kartagener s syndrome: True
summer season: True
taq: False
atypical neuroleptic: True
anterior cingulate: False
acute respiratory distress syndrome: True
circularity: False
mutase: False
adrenergic blocking drug: True
systematic desensitization: True
the turning point: True
9l: False
pyridazine: False
bisoprolol: False
trq: False
propylhexedrine: False
type 18: True
darpp 32: False
rickettsia conorii: False
sport shoe: True
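As mentioned at the beginning, mixing my approach with the other answers would probably rule out ideas such as sport shoe belonging to medicine, and the active learning approach would then act more as the deciding vote in case of a draw between the two heuristics mentioned above.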
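We could also build an ensemble out of the active learning approach itself. Instead of a single threshold we would use several of them, say 0.1, 0.2, 0.3, 0.4, 0.5.
Suppose sport shoe gets, for each threshold, its respective True/False like this: True True False False False. A majority vote would then mark it as non-medical. Furthermore, a threshold that is too strict would be mitigated as well if the thresholds below it outvote it (the case where True/False looks like this: True True True False False).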
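The threshold ensemble can be sketched as follows. The similarity values 0.25 and 0.35 are made up so that they reproduce the two vote patterns mentioned above, and `ensemble_label` is a hypothetical helper, not part of the code shown earlier.

```python
from typing import Sequence


def ensemble_label(similarity: float, thresholds: Sequence[float]) -> bool:
    """Majority vote over several thresholds: True means 'medical'."""
    votes = [similarity >= threshold for threshold in thresholds]
    return sum(votes) > len(votes) / 2


thresholds = (0.1, 0.2, 0.3, 0.4, 0.5)

# Votes: True True False False False, so the majority says non-medical.
print(ensemble_label(0.25, thresholds))
# Votes: True True True False False, so the majority says medical.
print(ensemble_label(0.35, thresholds))
```

An odd number of thresholds avoids ties in the vote.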
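One final possible improvement: the code above uses the Doc vector, which is the mean of the vectors of the words that make up the concept. Say one word is missing (a vector consisting of zeros); in that case the concept would be pushed further away from the medicine centroid. You may not want that (as some niche medical terms [abbreviations such as gpv, or others] might be missing their representation); in that case you could average only those vectors that are different from zero.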
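Averaging only the non-zero word vectors can be sketched with NumPy as below. The matrix is toy data with one row per word of a concept, and `mean_of_known_vectors` is a hypothetical helper; with spaCy, the rows could presumably be collected from `token.vector` for each token of a Doc.

```python
import numpy as np


def mean_of_known_vectors(word_vectors: np.ndarray) -> np.ndarray:
    """Average the rows of `word_vectors`, skipping rows that are all zeros."""
    known = word_vectors[word_vectors.any(axis=1)]
    if known.shape[0] == 0:
        # No word has a representation: fall back to a zero vector.
        return np.zeros(word_vectors.shape[1])
    return known.mean(axis=0)


# Toy concept made of three words; the second word has no representation.
vectors = np.array([[0.2, 0.4], [0.0, 0.0], [0.6, 0.0]])
plain_mean = vectors.mean(axis=0)  # diluted by the zero row
known_mean = mean_of_known_vectors(vectors)  # mean over the two known words only
print(plain_mean, known_mean)
```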
Regarding python - How to group Wikipedia categories in Python?, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/54625493/