- mongodb - 在 MongoDB mapreduce 中,如何展平值对象?
- javascript - 对象传播与 Object.assign
- html - 输入类型 ="submit"Vs 按钮标签它们可以互换吗?
- sql - 使用 MongoDB 而不是 MS SQL Server 的优缺点
鉴于我的知识数据库中有以下内容:
1 0 6 20 0 0 6 20
1 0 3 6 0 0 3 6
1 0 15 45 0 0 15 45
1 0 17 44 0 0 17 44
1 0 2 5 0 0 2 5
我希望能够找到以下向量的最近邻:
1 0 5 16 0 0 5 16
根据距离度量。所以在这种情况下,给定一个特定的阈值,我应该发现列出的第一个向量是给定向量的近邻。目前,我的知识数据库的大小约为数百万,因此计算每个点的距离度量然后进行比较证明是昂贵的。是否有任何替代方法可以显着加快实现这一目标?
我对几乎任何方法都持开放态度,包括在 MySQL 中使用空间索引(除了我不完全确定如何解决这个问题)或某种散列(这很好,但我也不完全相信当然)。
最佳答案
在 Python 中(来自 www.comp.mq.edu.au/):
def count_different_values(k_v1s, k_v2s):
"""kv1s and kv2s should be dictionaries mapping keys to
values. count_different_values() returns the number of keys in
k_v1s and k_v2s that don't have the same value"""
ks = set(k_v1s.iterkeys()) | set(k_v2s.iterkeys())
return sum(1 for k in ks if k_v1s.get(k) != k_v2s.get(k))
def sum_square_diffs(x0s, x1s):
"""x1s and x2s should be equal-lengthed sequences of numbers.
sum_square_differences() returns the sum of the squared differences
of x1s and x2s."""
sum((pow(x1-x2,2) for x1,x2 in zip(x1s,x2s)))
def incr(x_c, x, inc=1):
"""increments the value associated with key x in dictionary x_c
by inc, or sets it to inc if key x is not in dictionary x_c."""
x_c[x] = x_c.get(x, 0) + inc
def count_items(xs, x_c=None):
"""returns a dictionary x_c whose keys are the items in xs, and
whose values are the number of times each item occurs in xs."""
if x_c == None:
x_c = {}
for x in xs:
incr(x_c, x)
return x_c
def second(xy):
"""returns the second element in a sequence"""
return xy[1]
def most_frequent(xs):
"""returns the most frequent item in xs"""
x_c = count_items(xs)
return sorted(x_c.iteritems(), key=second, reverse=True)[0][0]
class kNN_classifier:
"""This is a k-nearest-neighbour classifer."""
def __init__(self, train_data, k, distf):
self.train_data = train_data
self.k = min(k, len(train_data))
self.distf = distf
def classify(self, x):
Ns = sorted(self.train_data,
key=lambda xy: self.distf(xy[0], x))
return most_frequent((y for x,y in Ns[:self.k]))
def batch_classify(self, xs):
return [self.classify(x) for x in xs]
def train(train_data, k=1, distf=count_different_values):
"""Returns a kNN_classifer that contains the data, the number of
nearest neighbours k and the distance function"""
return kNN_classifier(train_data, k, distf)
也是 www.umanitoba.ca/的另一个实现
#!/usr/bin/env python
# This code is part of the Biopython distribution and governed by its
# license. Please see the LICENSE file that should have been included
# as part of this package.
"""
This module provides code for doing k-nearest-neighbors classification.
k Nearest Neighbors is a supervised learning algorithm that classifies
a new observation based the classes in its surrounding neighborhood.
Glossary:
distance The distance between two points in the feature space.
weight The importance given to each point for classification.
Classes:
kNN Holds information for a nearest neighbors classifier.
Functions:
train Train a new kNN classifier.
calculate Calculate the probabilities of each class, given an observation.
classify Classify an observation into a class.
Weighting Functions:
equal_weight Every example is given a weight of 1.
"""
import numpy
class kNN:
"""Holds information necessary to do nearest neighbors classification.
Members:
classes Set of the possible classes.
xs List of the neighbors.
ys List of the classes that the neighbors belong to.
k Number of neighbors to look at.
"""
def __init__(self):
"""kNN()"""
self.classes = set()
self.xs = []
self.ys = []
self.k = None
def equal_weight(x, y):
"""equal_weight(x, y) -> 1"""
# everything gets 1 vote
return 1
def train(xs, ys, k, typecode=None):
"""train(xs, ys, k) -> kNN
Train a k nearest neighbors classifier on a training set. xs is a
list of observations and ys is a list of the class assignments.
Thus, xs and ys should contain the same number of elements. k is
the number of neighbors that should be examined when doing the
classification.
"""
knn = kNN()
knn.classes = set(ys)
knn.xs = numpy.asarray(xs, typecode)
knn.ys = ys
knn.k = k
return knn
def calculate(knn, x, weight_fn=equal_weight, distance_fn=None):
"""calculate(knn, x[, weight_fn][, distance_fn]) -> weight dict
Calculate the probability for each class. knn is a kNN object. x
is the observed data. weight_fn is an optional function that
takes x and a training example, and returns a weight. distance_fn
is an optional function that takes two points and returns the
distance between them. If distance_fn is None (the default), the
Euclidean distance is used. Returns a dictionary of the class to
the weight given to the class.
"""
x = numpy.asarray(x)
order = [] # list of (distance, index)
if distance_fn:
for i in range(len(knn.xs)):
dist = distance_fn(x, knn.xs[i])
order.append((dist, i))
else:
# Default: Use a fast implementation of the Euclidean distance
temp = numpy.zeros(len(x))
# Predefining temp allows reuse of this array, making this
# function about twice as fast.
for i in range(len(knn.xs)):
temp[:] = x - knn.xs[i]
dist = numpy.sqrt(numpy.dot(temp,temp))
order.append((dist, i))
order.sort()
# first 'k' are the ones I want.
weights = {} # class -> number of votes
for k in knn.classes:
weights[k] = 0.0
for dist, i in order[:knn.k]:
klass = knn.ys[i]
weights[klass] = weights[klass] + weight_fn(x, knn.xs[i])
return weights
def classify(knn, x, weight_fn=equal_weight, distance_fn=None):
"""classify(knn, x[, weight_fn][, distance_fn]) -> class
Classify an observation into a class. If not specified, weight_fn will
give all neighbors equal weight. distance_fn is an optional function
that takes two points and returns the distance between them. If
distance_fn is None (the default), the Euclidean distance is used.
"""
weights = calculate(
knn, x, weight_fn=weight_fn, distance_fn=distance_fn)
most_class = None
most_weight = None
for klass, weight in weights.items():
if most_class is None or weight > most_weight:
most_class = klass
most_weight = weight
return most_class
关于python - 寻找给定向量的 k 最近邻?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/5684370/
题: 是否有一种简单的方法可以获取正在运行的应用程序中泄漏的资源类型列表? IOW 通过连接到应用程序? 我知道 memproof 可以做到,但它会减慢速度,以至于应用程序甚至无法持续一分钟。大多数任
正确地说下面的代码会将自定义日志发送到.net核心中的Docker容器的stdout和stderr吗? console.Writeline(...) console.error(..) 最佳答案 如果
我想将一个任务多次重复,放入 for 循环中。我必须将时间序列对象存储为 IExchangeItem , openDA 中的一个特殊类(数据同化软件)。 这是任务之一(有效): HashMap ite
我需要从文件中读取一个数组。该数组在文件中不是连续排序的,必须跳转“偏移”字节才能获得下一个元素。假设我读取一个非常大的文件,什么更有效率。 1) 使用增量相对位置。 2)使用绝对位置。 选项 1:
我有一个安装程序(使用 Advanced Installer 制作)。我有一个必须与之交互的应用程序,但我不知道如何找到该安装的 MSIHANDLE。我查看了 Microsoft 引用资料,但没有发现
我在替换正则表达式中的“joe.”等内容时遇到问题。这是代码 var objects = new Array("joe","sam"); code = "joe.id was here so was
我有 A 类。A 类负责管理 B 对象的生命周期,它包含 B 对象的容器,即 map。 ,每个 B 对象都包含 C 对象的容器,即 map .我有一个全局 A 对象用于整个应用程序。 我有以下问题:我
任何人都可以告诉我在哪里可以找到 freeImage.so 吗?我一直在努力寻找相同的东西但没有成功..任何帮助将不胜感激。我已经尝试将 freeimage.a 转换为 freeImage .so 并
在单元测试期间,我想将生成的 URL 与测试中定义的静态 URL 进行比较。对于此比较,最好有一个 TestCase.assertURLEqual 或类似的,它可以让您比较两个字符串格式的 URL,如
'find ./ -name *.jpg' 我正在尝试优化上述语句的“查找”命令。 在查找实现中处理“-name”谓词的方法。 static boolean pred__name __common (
请原谅我在这里的困惑,但我已经阅读了关于 python 中的 seek() 函数的文档(在不得不使用它之后),虽然它帮助了我,但我仍然对它的实际含义有点困惑,任何非常感谢您的解释,谢谢。 最佳答案 关
我在我正在使用的库中找到了这个语句。它应该检查集群中的当前节点是否是领导者。这是语句:(!(cluster.Leader?.IsRemote ?? true)) 为什么不直接使用 (cluster.L
我发现 JsonParser 在 javax.json.stream 中,但我不知道在哪里可以找到它。谁能帮帮我? https://docs.oracle.com/javaee/7/api/javax
关闭。这个问题需要更多focused .它目前不接受答案。 想改善这个问题吗?更新问题,使其仅关注一个问题 editing this post . 6年前关闭。 Improve this questi
如果 git 存储库中有新的更改可用,我有一个多分支管道作业设置为每分钟由 Jenkinsfile 构建。如果分支名称是某种格式,我有一个将工件部署到环境的步骤。我希望能够在每个分支的基础上配置环境,
关闭。这个问题不满足Stack Overflow guidelines .它目前不接受答案。 想改善这个问题吗?更新问题,使其成为 on-topic对于堆栈溢出。 6年前关闭。 Improve thi
我想我刚刚意识到当他们不让我使用 cfdump 时我的网络主机是多么的限制。这其实有点让我生气,真的,dump 有什么害处?无论如何,我的问题是是否有人编写了一个 cfdump 替代方案来剔除复杂类型
任务:我有多个资源需要在一个 HTTP 调用中更新。 要更新的资源类型、字段和值对于所有资源都是相同的。 示例:通过 ID 设置了一组汽车,需要将所有汽车的“状态”更新为“已售出”。 经典 RESTF
场景:表中有 2 列,数据如下例所示。对于“a”列的相同值,该表可能有多个行。 在示例中,考虑到“a”列,“1”有三行,“2”有一行。 示例表“t1”: |a|b ||1|1.1||1|1.2||1
我有一个数据框: Date Price 2021-01-01 29344.67 2021-01-02 32072.08 2021-01-03 33048.03 2021-01-04 32084.
我是一名优秀的程序员,十分优秀!