- html - 出于某种原因,IE8 对我的 Sass 文件中继承的 html5 CSS 不友好?
- JMeter 在响应断言中使用 span 标签的问题
- html - 在 :hover and :active? 上具有不同效果的 CSS 动画
- html - 相对于居中的 html 内容固定的 CSS 重复背景?
对于数据集的每个概念,我都存储了相应的维基百科类别。例如,考虑以下5个概念及其对应的维基百科类别。
高甘油三酯血症:['Category:Lipid metabolism disorders', 'Category:Medical conditions related to obesity']
酶抑制剂:['Category:Enzyme inhibitors', 'Category:Medicinal chemistry', 'Category:Metabolism']
搭桥手术:['Category:Surgery stubs', 'Category:Surgical procedures and techniques']
珀斯:['Category:1829 establishments in Australia', 'Category:Australian capital cities', 'Category:Metropolitan areas of Australia', 'Category:Perth, Western Australia', 'Category:Populated places established in 1829']
气候:['Category:Climate', 'Category:Climatology', 'Category:Meteorological concepts']
如您所见,前三个概念属于医学领域(而其余两个术语不是医学术语)。
更确切地说,我想将我的概念分为医学和非医学领域。但是,仅使用类别来划分概念非常困难。例如,即使enzyme inhibitor
和bypass surgery
这两个概念在医学领域,它们的类别也非常不同。
因此,我想知道是否有一种方法可以获取类别的parent category
(例如,enzyme inhibitor
和bypass surgery
的类别属于medical
父类别)
我当前正在使用pymediawiki
和pywikibot
。但是,我不仅限于这两个库,并且很高兴也可以使用其他库来解决。
编辑
正如@IlmariKaronen所建议的,我也使用了categories of categories
,得到的结果如下(category
附近的小字体是categories of the category
)。
但是,我仍然找不到使用这些类别详细信息来确定给定术语是医学术语还是非医学术语的方法。
此外,正如@IlmariKaronen指出的,使用Wikiproject
细节可能是潜在的。但是,似乎Medicine
wikiproject似乎没有所有医学术语。因此,我们还需要检查其他wikiproject。
编辑:
我当前从Wikipedia概念中提取类别的代码如下。可以使用pywikibot
或pymediawiki
如下进行操作。
使用库pymediawiki
导入mediawiki为pw
p = wikipedia.page('enzyme inhibitor')
print(p.categories)
pywikibot
import pywikibot as pw
site = pw.Site('en', 'wikipedia')
print([
cat.title()
for cat in pw.Page(site, 'support-vector machine').categories()
if 'hidden' not in cat.categoryinfo
])
['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'newyork', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']
最佳答案
解决方案概述
好吧,我将从多个方向解决这个问题。这里有一些很好的建议,如果我是您,我将使用这些方法的组合(多数表决,预测标签,在您的二元案例中,超过50%的分类器都同意)。
我正在考虑以下方法:
主动学习(我下面提供的示例方法)
MediaWiki backlinks作为@TavoGC的答案提供
@Stanislav Kralin和/或parent categories提供的@Meena Nagarajan作为对您的问题的注释提供的SPARQL祖先类别(这两个类别可能会基于它们的差异而单独成为一个集合,但为此您必须联系两个创建者并比较他们的结果)。
这样,三分之二的人就必须同意某个概念是医学上的概念,这可以最大程度地减少错误的可能性。
当我们讨论它时,我会反对@ananand_v.singh在this answer中提出的方法,因为:
距离度量不应该是欧几里德式的,余弦相似性度量要好得多(例如,用spaCy使用),因为它不考虑向量的大小(并且不应该这样,它是对word2vec或GloVe进行训练的方式)
如果我理解正确,将会创建许多人工簇,而我们仅需要两个簇:医学和非医学簇。此外,药物的质心不以药物本身为中心。这带来了其他问题,比如说质心远离药物,并且其他词,例如computer
或human
(或您认为不适合医学的其他词)可能会进入群集。
很难评估结果,甚至更严格地说,这是主观的。此外,单词向量很难可视化和理解(对于许多单词,使用PCA / TSNE /类似物将它们投射到较低的尺寸[2D / 3D]中,会给我们带来完全无意义的结果[是的,我尝试这样做,PCA对于较长的数据集,大约有5%的解释方差,真的,真的很低])。
基于上面突出显示的问题,我提出了使用active learning的解决方案,这是解决此类问题的一种非常被遗忘的方法。
主动学习法
在机器学习的这一子集中,当我们很难提出确切的算法时(例如,一个术语成为medical
类别的一部分意味着什么),我们要求人类“专家”(实际上并不是必须是专家)以提供一些答案。
知识编码
正如anand_v.singh所指出的,词向量是最有前途的方法之一,我也将在这里使用它(尽管与IMO不同,它的使用也更加简洁)。
我不会在回答中重复他的观点,因此我将加两分钱:
请勿使用上下文化的词嵌入作为当前可用的最新技术水平(例如BERT)
检查您有多少个概念没有表示形式(例如,表示为零的向量)。应该选中它(并在我的代码中选中它,到时候再进行讨论),您可以使用其中包含大多数嵌入内容。
使用spaCy衡量相似度
此类用于度量编码为spaCy的GloVe单词向量的medicine
与其他所有概念之间的相似性。
class Similarity:
def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
# In our case it will be medicine
self.centroid = centroid
# spaCy's Language model (english), which will be used to return similarity to
# centroid of each concept
self.nlp = nlp
self.n_threads: int = n_threads
self.batch_size: int = batch_size
self.missing: typing.List[int] = []
def __call__(self, concepts):
concepts_similarity = []
# nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
for i, concept in enumerate(
self.nlp.pipe(
concepts, n_threads=self.n_threads, batch_size=self.batch_size
)
):
if concept.has_vector:
concepts_similarity.append(self.centroid.similarity(concept))
else:
# If document has no vector, it's assumed to be totally dissimilar to centroid
concepts_similarity.append(-1)
self.missing.append(i)
return np.array(concepts_similarity)
import json
import typing
import numpy as np
import spacy
nlp = spacy.load("en_vectors_web_lg")
centroid = nlp("medicine")
concepts = json.load(open("concepts_new.txt"))
concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
concepts
)
new_concepts.json
。
en_vectors_web_lg
。它由685.000个唯一的单词向量组成(很多),并且可能针对您的情况开箱即用。安装spaCy后,您必须单独下载它,以上链接中提供了更多信息。
disease
或
health
之类的单词,并将其单词向量平均。我不确定这是否会对您的案件产生积极影响。
medicine
相似的术语。此外,这会使情况变得更加复杂,但是如果您的结果不令人满意,则应考虑上述两个选项(并且只有在这些选择的情况下,不要事先考虑就不要采用这种方法)。
sklearn-like
接口编写了一个类,要求人类对概念进行分类,直到达到最佳阈值(或最大迭代次数)为止。
class ActiveLearner:
def __init__(
self,
concepts,
concepts_similarity,
max_steps: int,
samples: int,
step: float = 0.05,
change_multiplier: float = 0.7,
):
sorting_indices = np.argsort(-concepts_similarity)
self.concepts = concepts[sorting_indices]
self.concepts_similarity = concepts_similarity[sorting_indices]
self.max_steps: int = max_steps
self.samples: int = samples
self.step: float = step
self.change_multiplier: float = change_multiplier
# We don't have to ask experts for the same concepts
self._checked_concepts: typing.Set[int] = set()
# Minimum similarity between vectors is -1
self._min_threshold: float = -1
# Maximum similarity between vectors is 1
self._max_threshold: float = 1
# Let's start from the highest similarity to ensure minimum amount of steps
self.threshold_: float = 1
samples
参数描述了在每次迭代过程中将向专家显示多少示例(这是最大值,如果已经请求了样本或样本不足以显示样本,则返回的将更少)。
step
表示每次迭代中的阈值下降(我们从1开始表示完全相似)。
change_multiplier
-如果专家回答的概念不相关(或大部分不相关,则返回多个),则将步乘以该浮点数。它用于在每次迭代中确定
step
变化之间的准确阈值。
def _ask_expert(self, available_concepts_indices):
# Get random concepts (the ones above the threshold)
concepts_to_show = set(
np.random.choice(
available_concepts_indices, len(available_concepts_indices)
).tolist()
)
# Remove those already presented to an expert
concepts_to_show = concepts_to_show - self._checked_concepts
self._checked_concepts.update(concepts_to_show)
# Print message for an expert and concepts to be classified
if concepts_to_show:
print("\nAre those concepts related to medicine?\n")
print(
"\n".join(
f"{i}. {concept}"
for i, concept in enumerate(
self.concepts[list(concepts_to_show)[: self.samples]]
)
),
"\n",
)
return input("[y]es / [n]o / [any]quit ")
return "y"
Are those concepts related to medicine?
0. anesthetic drug
1. child and adolescent psychiatry
2. tertiary care center
3. sex therapy
4. drug design
5. pain disorder
6. psychiatric rehabilitation
7. combined oral contraceptive
8. family practitioner committee
9. cancer family syndrome
10. social psychology
11. drug sale
12. blood system
[y]es / [n]o / [any]quit y
# True - keep asking, False - stop the algorithm
def _parse_expert_decision(self, decision) -> bool:
if decision.lower() == "y":
# You can't go higher as current threshold is related to medicine
self._max_threshold = self.threshold_
if self.threshold_ - self.step < self._min_threshold:
return False
# Lower the threshold
self.threshold_ -= self.step
return True
if decision.lower() == "n":
# You can't got lower than this, as current threshold is not related to medicine already
self._min_threshold = self.threshold_
# Multiply threshold to pinpoint exact spot
self.step *= self.change_multiplier
if self.threshold_ + self.step < self._max_threshold:
return False
# Lower the threshold
self.threshold_ += self.step
return True
return False
ActiveLearner
的完整代码,它相应地为专家找到了最佳的相似阈值:
class ActiveLearner:
def __init__(
self,
concepts,
concepts_similarity,
samples: int,
max_steps: int,
step: float = 0.05,
change_multiplier: float = 0.7,
):
sorting_indices = np.argsort(-concepts_similarity)
self.concepts = concepts[sorting_indices]
self.concepts_similarity = concepts_similarity[sorting_indices]
self.samples: int = samples
self.max_steps: int = max_steps
self.step: float = step
self.change_multiplier: float = change_multiplier
# We don't have to ask experts for the same concepts
self._checked_concepts: typing.Set[int] = set()
# Minimum similarity between vectors is -1
self._min_threshold: float = -1
# Maximum similarity between vectors is 1
self._max_threshold: float = 1
# Let's start from the highest similarity to ensure minimum amount of steps
self.threshold_: float = 1
def _ask_expert(self, available_concepts_indices):
# Get random concepts (the ones above the threshold)
concepts_to_show = set(
np.random.choice(
available_concepts_indices, len(available_concepts_indices)
).tolist()
)
# Remove those already presented to an expert
concepts_to_show = concepts_to_show - self._checked_concepts
self._checked_concepts.update(concepts_to_show)
# Print message for an expert and concepts to be classified
if concepts_to_show:
print("\nAre those concepts related to medicine?\n")
print(
"\n".join(
f"{i}. {concept}"
for i, concept in enumerate(
self.concepts[list(concepts_to_show)[: self.samples]]
)
),
"\n",
)
return input("[y]es / [n]o / [any]quit ")
return "y"
# True - keep asking, False - stop the algorithm
def _parse_expert_decision(self, decision) -> bool:
if decision.lower() == "y":
# You can't go higher as current threshold is related to medicine
self._max_threshold = self.threshold_
if self.threshold_ - self.step < self._min_threshold:
return False
# Lower the threshold
self.threshold_ -= self.step
return True
if decision.lower() == "n":
# You can't got lower than this, as current threshold is not related to medicine already
self._min_threshold = self.threshold_
# Multiply threshold to pinpoint exact spot
self.step *= self.change_multiplier
if self.threshold_ + self.step < self._max_threshold:
return False
# Lower the threshold
self.threshold_ += self.step
return True
return False
def fit(self):
for _ in range(self.max_steps):
available_concepts_indices = np.nonzero(
self.concepts_similarity >= self.threshold_
)[0]
if available_concepts_indices.size != 0:
decision = self._ask_expert(available_concepts_indices)
if not self._parse_expert_decision(decision):
break
else:
self.threshold_ -= self.step
return self
class Classifier:
def __init__(self, centroid, threshold: float):
self.centroid = centroid
self.threshold: float = threshold
def predict(self, concepts_pipe):
predictions = []
for concept in concepts_pipe:
predictions.append(self.centroid.similarity(concept) > self.threshold)
return predictions
import json
import typing
import numpy as np
import spacy
class Similarity:
def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
# In our case it will be medicine
self.centroid = centroid
# spaCy's Language model (english), which will be used to return similarity to
# centroid of each concept
self.nlp = nlp
self.n_threads: int = n_threads
self.batch_size: int = batch_size
self.missing: typing.List[int] = []
def __call__(self, concepts):
concepts_similarity = []
# nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
for i, concept in enumerate(
self.nlp.pipe(
concepts, n_threads=self.n_threads, batch_size=self.batch_size
)
):
if concept.has_vector:
concepts_similarity.append(self.centroid.similarity(concept))
else:
# If document has no vector, it's assumed to be totally dissimilar to centroid
concepts_similarity.append(-1)
self.missing.append(i)
return np.array(concepts_similarity)
class ActiveLearner:
def __init__(
self,
concepts,
concepts_similarity,
samples: int,
max_steps: int,
step: float = 0.05,
change_multiplier: float = 0.7,
):
sorting_indices = np.argsort(-concepts_similarity)
self.concepts = concepts[sorting_indices]
self.concepts_similarity = concepts_similarity[sorting_indices]
self.samples: int = samples
self.max_steps: int = max_steps
self.step: float = step
self.change_multiplier: float = change_multiplier
# We don't have to ask experts for the same concepts
self._checked_concepts: typing.Set[int] = set()
# Minimum similarity between vectors is -1
self._min_threshold: float = -1
# Maximum similarity between vectors is 1
self._max_threshold: float = 1
# Let's start from the highest similarity to ensure minimum amount of steps
self.threshold_: float = 1
def _ask_expert(self, available_concepts_indices):
# Get random concepts (the ones above the threshold)
concepts_to_show = set(
np.random.choice(
available_concepts_indices, len(available_concepts_indices)
).tolist()
)
# Remove those already presented to an expert
concepts_to_show = concepts_to_show - self._checked_concepts
self._checked_concepts.update(concepts_to_show)
# Print message for an expert and concepts to be classified
if concepts_to_show:
print("\nAre those concepts related to medicine?\n")
print(
"\n".join(
f"{i}. {concept}"
for i, concept in enumerate(
self.concepts[list(concepts_to_show)[: self.samples]]
)
),
"\n",
)
return input("[y]es / [n]o / [any]quit ")
return "y"
# True - keep asking, False - stop the algorithm
def _parse_expert_decision(self, decision) -> bool:
if decision.lower() == "y":
# You can't go higher as current threshold is related to medicine
self._max_threshold = self.threshold_
if self.threshold_ - self.step < self._min_threshold:
return False
# Lower the threshold
self.threshold_ -= self.step
return True
if decision.lower() == "n":
# You can't got lower than this, as current threshold is not related to medicine already
self._min_threshold = self.threshold_
# Multiply threshold to pinpoint exact spot
self.step *= self.change_multiplier
if self.threshold_ + self.step < self._max_threshold:
return False
# Lower the threshold
self.threshold_ += self.step
return True
return False
def fit(self):
for _ in range(self.max_steps):
available_concepts_indices = np.nonzero(
self.concepts_similarity >= self.threshold_
)[0]
if available_concepts_indices.size != 0:
decision = self._ask_expert(available_concepts_indices)
if not self._parse_expert_decision(decision):
break
else:
self.threshold_ -= self.step
return self
class Classifier:
def __init__(self, centroid, threshold: float):
self.centroid = centroid
self.threshold: float = threshold
def predict(self, concepts_pipe):
predictions = []
for concept in concepts_pipe:
predictions.append(self.centroid.similarity(concept) > self.threshold)
return predictions
if __name__ == "__main__":
nlp = spacy.load("en_vectors_web_lg")
centroid = nlp("medicine")
concepts = json.load(open("concepts_new.txt"))
concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
concepts
)
learner = ActiveLearner(
np.array(concepts), concepts_similarity, samples=20, max_steps=50
).fit()
print(f"Found threshold {learner.threshold_}\n")
classifier = Classifier(centroid, learner.threshold_)
pipe = nlp.pipe(concepts, n_threads=-1, batch_size=4096)
predictions = classifier.predict(pipe)
print(
"\n".join(
f"{concept}: {label}"
for concept, label in zip(concepts[20:40], predictions[20:40])
)
)
[-1, 0.1)
之间的所有内容均被视为非医疗性质,而
[0.1, 1]
之间的所有内容均被视为医疗性质),我得到以下结果:
kartagener s syndrome: True
summer season: True
taq: False
atypical neuroleptic: True
anterior cingulate: False
acute respiratory distress syndrome: True
circularity: False
mutase: False
adrenergic blocking drug: True
systematic desensitization: True
the turning point: True
9l: False
pyridazine: False
bisoprolol: False
trq: False
propylhexedrine: False
type 18: True
darpp 32: False
rickettsia conorii: False
sport shoe: True
sport shoe
属于
medicine
的想法,而主动学习方法在上述两种启发式方法之间平局的情况下将更具决定性。
0.1, 0.2, 0.3, 0.4, 0.5
。
sport shoe
得到,对于每个阈值,它们分别是这样的
True/False
:
True True False False False
,
non-medical
。此外,如果阈值低于它,我也可以缓解过于严格的阈值(如果
True/False
看起来像这样:
True True True False False
)。
Doc
vector,这是单词vector创造此概念的意思。假设缺少一个单词(由零组成的矢量),在这种情况下,它将被推离
medicine
重心。您可能不希望这样做(因为某些小众医学术语[诸如
gpv
的缩写或其他缩写]可能会缺少它们的表示形式),在这种情况下,您只能平均那些与零不同的向量。
关于python - 如何在python中对Wikipedia类别进行分组?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/54625493/
我需要将文本放在 中在一个 Div 中,在另一个 Div 中,在另一个 Div 中。所以这是它的样子: #document Change PIN
奇怪的事情发生了。 我有一个基本的 html 代码。 html,头部, body 。(因为我收到了一些反对票,这里是完整的代码) 这是我的CSS: html { backgroun
我正在尝试将 Assets 中的一组图像加载到 UICollectionview 中存在的 ImageView 中,但每当我运行应用程序时它都会显示错误。而且也没有显示图像。 我在ViewDidLoa
我需要根据带参数的 perl 脚本的输出更改一些环境变量。在 tcsh 中,我可以使用别名命令来评估 perl 脚本的输出。 tcsh: alias setsdk 'eval `/localhome/
我使用 Windows 身份验证创建了一个新的 Blazor(服务器端)应用程序,并使用 IIS Express 运行它。它将显示一条消息“Hello Domain\User!”来自右上方的以下 Ra
这是我的方法 void login(Event event);我想知道 Kotlin 中应该如何 最佳答案 在 Kotlin 中通配符运算符是 * 。它指示编译器它是未知的,但一旦知道,就不会有其他类
看下面的代码 for story in book if story.title.length < 140 - var story
我正在尝试用 C 语言学习字符串处理。我写了一个程序,它存储了一些音乐轨道,并帮助用户检查他/她想到的歌曲是否存在于存储的轨道中。这是通过要求用户输入一串字符来完成的。然后程序使用 strstr()
我正在学习 sscanf 并遇到如下格式字符串: sscanf("%[^:]:%[^*=]%*[*=]%n",a,b,&c); 我理解 %[^:] 部分意味着扫描直到遇到 ':' 并将其分配给 a。:
def char_check(x,y): if (str(x) in y or x.find(y) > -1) or (str(y) in x or y.find(x) > -1):
我有一种情况,我想将文本文件中的现有行包含到一个新 block 中。 line 1 line 2 line in block line 3 line 4 应该变成 line 1 line 2 line
我有一个新项目,我正在尝试设置 Django 调试工具栏。首先,我尝试了快速设置,它只涉及将 'debug_toolbar' 添加到我的已安装应用程序列表中。有了这个,当我转到我的根 URL 时,调试
在 Matlab 中,如果我有一个函数 f,例如签名是 f(a,b,c),我可以创建一个只有一个变量 b 的函数,它将使用固定的 a=a1 和 c=c1 调用 f: g = @(b) f(a1, b,
我不明白为什么 ForEach 中的元素之间有多余的垂直间距在 VStack 里面在 ScrollView 里面使用 GeometryReader 时渲染自定义水平分隔线。 Scrol
我想知道,是否有关于何时使用 session 和 cookie 的指南或最佳实践? 什么应该和什么不应该存储在其中?谢谢! 最佳答案 这些文档很好地了解了 session cookie 的安全问题以及
我在 scipy/numpy 中有一个 Nx3 矩阵,我想用它制作一个 3 维条形图,其中 X 轴和 Y 轴由矩阵的第一列和第二列的值、高度确定每个条形的 是矩阵中的第三列,条形的数量由 N 确定。
假设我用两种不同的方式初始化信号量 sem_init(&randomsem,0,1) sem_init(&randomsem,0,0) 现在, sem_wait(&randomsem) 在这两种情况下
我怀疑该值如何存储在“WORD”中,因为 PStr 包含实际输出。? 既然Pstr中存储的是小写到大写的字母,那么在printf中如何将其给出为“WORD”。有人可以吗?解释一下? #include
我有一个 3x3 数组: var my_array = [[0,1,2], [3,4,5], [6,7,8]]; 并想获得它的第一个 2
我意识到您可以使用如下方式轻松检查焦点: var hasFocus = true; $(window).blur(function(){ hasFocus = false; }); $(win
我是一名优秀的程序员,十分优秀!