
python - Scipy negative distance? What?

Reposted. Author: 太空狗. Updated: 2023-10-29 21:27:36

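I have an input file containing floats with 4 decimal places:

i.e. 13359    0.0000    0.0000    0.0001    0.0001    0.0002    0.0003    0.0007    ...

(the first field is the ID). My class uses its loadVectorsFromFile method to multiply each value by 10000 and then convert it with int(). On top of that, I also loop through every vector to make sure it contains no negative values. Yet when I run _hclustering, I keep getting the error "Linkage 'Z' contains negative values".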

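I really think this is a bug, because:

  1. I checked my values,
  2. the values are nowhere near small or large enough to approach the limits of floating-point numbers, and
  3. the formula I used to derive the values in the file uses absolute values (my input is definitely correct).

Can someone enlighten me as to why I am seeing this odd error? What is causing this negative-distance error?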
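A side note for anyone reproducing this: cosine distance depends only on the direction of the vectors, so multiplying every component by the same positive factor leaves it unchanged apart from rounding. The pure-Python sketch below illustrates that. The helper name cosine_distance and the sample vectors are made up for the illustration and are not part of the code in the question.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) using ordinary floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Illustrative vectors in the same value range as the input file.
u = [0.0001, 0.0002, 0.0003, 0.0007]
w = [0.0007, 0.0003, 0.0002, 0.0001]

# Scaling both vectors by 10000 does not change the angle between them.
scaled_u = [x * 10000 for x in u]
scaled_w = [x * 10000 for x in w]

d_raw = cosine_distance(u, w)
d_scaled = cosine_distance(scaled_u, scaled_w)

# The distance of a vector to itself is mathematically 0, but the
# floating-point result is only guaranteed to be very close to 0;
# it can land a hair above or below.
d_self = cosine_distance(u, u)

print(d_raw, d_scaled, d_self)
```

The two distances agree to within rounding error, so inflating the values cannot by itself rule out a result that falls marginally below zero.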

=====

def loadVectorsFromFile(self, limit, loc, assertAllPositive=True, inflate=True):
    """Inflate to prevent "negative" distance, we use 4 decimal points, so *10000
    """
    vectors = {}
    self.winfo("Each vector is set to have %d limit in length" % limit)
    with open( loc ) as inf:
        for line in filter(None, inf.read().split('\n')):
            l = line.split('\t')
            if limit:
                scores = map(float, l[1:limit+1])
            else:
                scores = map(float, l[1:])

            if inflate:
                vectors[ l[0]] = map( lambda x: int(x*10000), scores) #int might save space
            else:
                vectors[ l[0]] = scores

    if assertAllPositive:
        #Assert that it has no negative value
        for dirID, l in vectors.iteritems():
            if reduce(operator.or_, map( lambda x: x < 0, l)):
                self.werror( "Vector %s has negative values!" % dirID)
    return vectors

def main( self, inputDir, outputDir, limit=0,
        inFname="data.vectors.all", mappingFname='all.id.features.group.intermediate'):
    """
    Loads vector from a file and start clustering
    INPUT
        vectors is { featureID: tfidfVector (list), }
    """
    IDFeatureDic = loadIdFeatureGroupDicFromIntermediate( pjoin(self.configDir, mappingFname))
    if not os.path.exists(outputDir):
        os.makedirs(outputDir)

    vectors = self.loadVectorsFromFile( limit, pjoin( inputDir, inFname))
    for threshold in map( lambda x:float(x)/30, range(20,30)):
        clusters = self._hclustering(threshold, vectors)
        if clusters:
            outputLoc = pjoin(outputDir, "threshold.%s.result" % str(threshold))
            with open(outputLoc, 'w') as outf:
                for clusterNo, cluster in clusters.iteritems():
                    outf.write('%s\n' % str(clusterNo))
                    for featureID in cluster:
                        feature, group = IDFeatureDic[featureID]
                        outline = "%s\t%s\n" % (feature, group)
                        outf.write(outline.encode('utf-8'))
                    outf.write("\n")
        else:
            continue

def _hclustering(self, threshold, vectors):
    """function which you should call to vary the threshold
    vectors: { featureID: [ tfidf scores, tfidf score, .. ]
    """
    clusters = defaultdict(list)
    if len(vectors) > 1:
        try:
            results = hierarchy.fclusterdata( vectors.values(), threshold, metric='cosine')
        except ValueError, e:
            self.werror("_hclustering: %s" % str(e))
            return False

        for i, featureID in enumerate( vectors.keys()):

Best answer

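This is caused by floating-point inaccuracy: some distances between your vectors come out not as 0 but as, for example, -0.000000000000000002. Use the clip() function to correct the problem (numpy.clip; older SciPy releases also exposed it as scipy.clip). If your distance matrix is dmatr, use numpy.clip(dmatr, 0, 1, dmatr) and you should be fine.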
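Note that hierarchy.fclusterdata computes the distances internally, so the code in the question never has a matrix to clip. One way to apply the advice is to run the steps separately: compute the condensed distances with pdist, clip them, then call linkage and fcluster. The following is only a sketch under stated assumptions. The function name cluster_with_clipped_distances and its parameters are hypothetical, the method and criterion defaults are chosen to mirror fclusterdata's documented defaults, and the upper clip bound of 1 assumes vectors with non-negative components (for general vectors cosine distance can reach 2).

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist


def cluster_with_clipped_distances(vectors, threshold,
                                   method="single",
                                   criterion="inconsistent"):
    """Cluster row vectors by cosine distance, clipping rounding noise.

    Hypothetical helper: a manual pdist -> clip -> linkage -> fcluster
    pipeline standing in for a single fclusterdata call.
    """
    X = np.asarray(vectors, dtype=float)

    # Condensed (1-D) vector of pairwise cosine distances.
    dists = pdist(X, metric="cosine")

    # Force tiny negative values such as -2e-18 back to 0, in place.
    # Upper bound 1.0 assumes non-negative vector components.
    np.clip(dists, 0.0, 1.0, out=dists)

    Z = hierarchy.linkage(dists, method=method)
    return hierarchy.fcluster(Z, threshold, criterion=criterion)


# Illustrative data: two pairs of parallel vectors.
sample = [[1, 0, 0], [2, 0, 0], [0, 1, 0], [0, 3, 0]]
labels = cluster_with_clipped_distances(sample, 0.5, criterion="distance")
print(labels)
```

In the question's _hclustering, a call like this could take the place of fclusterdata, with list(vectors.values()) passed as the first argument. Vectors that are entirely zero have an undefined cosine distance and would need to be filtered out first.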

Regarding python - Scipy negative distance? What?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/2590117/
