
python - Why do I get random results from seemingly non-random code with python sklearn?

Reposted. Author: 行者123. Updated: 2023-11-30 08:54:33

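I have updated the question based on the replies.

I have a list of strings named "str_tuple". I want to compute some similarity measure between the first element of the list and each of the remaining elements. I run the short snippet below.

What confuses me completely is that the result seems to be random every time I run the code, yet I cannot see where the snippet introduces any randomness.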

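Update:

It has been pointed out that TruncatedSVD() has a "random_state" parameter. Specifying "random_state" gives a fixed result (which is entirely true). However, if you change "random_state", the result changes. For other strings (e.g. str2), the result is the same no matter how you change "random_state". These strings come from the HOME_DEPOT Kaggle competition. I have a pd.Series containing thousands of such strings, and most of them behave like str2 and give non-random results (whatever "random_state" is set to). For some unknown reason, str1 is one of the examples whose result changes every time "random_state" changes. I am starting to think that some intrinsic characteristic of str1 makes the difference.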

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

# str1 yields random results
str1 = [u'l bracket', u'simpson strong tie 12 gaug angl', u'angl make joint stronger provid consist straight corner simpson strong tie offer wide varieti angl various size thick handl light duti job project structur connect need bent skew match project outdoor project moistur present use zmax zinc coat connector provid extra resist corros look "z" end model number .versatil connector various 90 connect home repair projectsstrong angl nail screw fasten alonehelp ensur joint consist straight strongdimensions: 3 in. xbi 3 in. xbi 1 0.5 in. made 12 gaug steelgalvan extra corros resistanceinstal 10 d common nail 9 xbi 1 0.5 in. strong drive sd screw', u'simpson strong-tie', u'', u'versatile connector for various 90\xe2\xb0 connections and home repair projects stronger than angled nailing or screw fastening alone help ensure joints are consistently straight and strong dimensions: 3 in. x 3 in. x 1-1/2 in. made from 12-gauge steel galvanized for extra corrosion resistance install with 10d common nails or #9 x 1-1/2 in. strong-drive sd screws']
# str2 yields non-random result
str2 = [u'angl bracket', u'simpson strong tie 12 gaug angl', u'angl make joint stronger provid consist straight corner simpson strong tie offer wide varieti angl various size thick handl light duti job project structur connect need bent skew match project outdoor project moistur present use zmax zinc coat connector provid extra resist corros look "z" end model number .versatil connector various 90 connect home repair projectsstrong angl nail screw fasten alonehelp ensur joint consist straight strongdimensions: 3 in. xbi 3 in. xbi 1 0.5 in. made 12 gaug steelgalvan extra corros resistanceinstal 10 d common nail 9 xbi 1 0.5 in. strong drive sd screw', u'simpson strong-tie', u'', u'versatile connector for various 90\xe2\xb0 connections and home repair projects stronger than angled nailing or screw fastening alone help ensure joints are consistently straight and strong dimensions: 3 in. x 3 in. x 1-1/2 in. made from 12-gauge steel galvanized for extra corrosion resistance install with 10d common nails or #9 x 1-1/2 in. strong-drive sd screws']

vectorizer = CountVectorizer(token_pattern=r"\d+\.\d+|\d+\/\d+|\b\w+\b")
# replacing str1 with str2 gives a non-random result regardless of random_state
cmat = vectorizer.fit_transform(str1).astype(float) # sparse matrix
cmat = TruncatedSVD(2).fit_transform(cmat) # dense numpy array
cmat = Normalizer().fit_transform(cmat) # dense numpy array
sim = np.dot(cmat, cmat.T)
sim[0,1:].tolist()

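Best answer

By default, TruncatedSVD uses a randomized algorithm. To get a reproducible result you therefore have to pass random_state, which seeds the random number generator (an int, or a numpy.random.RandomState instance).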

cmat = TruncatedSVD(n_components=2, random_state=42).fit_transform(cmat)

Docs

class sklearn.decomposition.TruncatedSVD(n_components=2, algorithm='randomized', n_iter=5, random_state=None, tol=0.0)
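
As a minimal sketch of the point above: fitting twice with the same random_state should give the same projection. The Poisson count matrix and the helper name `project` are assumptions made for illustration; they are not part of the question's data.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical count matrix standing in for the CountVectorizer output.
rng = np.random.RandomState(0)
counts = rng.poisson(1.0, size=(20, 50)).astype(float)


def project(seed):
    """Project the rows of `counts` onto 2 components with a fixed seed."""
    return TruncatedSVD(n_components=2, random_state=seed).fit_transform(counts)


# Two independent fits with the same seed are compared element-wise.
same_seed_matches = np.allclose(project(42), project(42))
print(same_seed_matches)
```

Fixing the seed only makes the run repeatable. It does not say whether the result would survive a change of seed, which is what the rest of the answer is about.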


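For the output to be non-random, the starting element of the list must occur more than once in the list. That is, if the starting element of str1 were one of angl, versatile, or simpson, it would give non-random results. Since str2 has angl at the start of the list and angl is repeated several times elsewhere, it does not return random output.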

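Hence, the randomness is a measure of how dissimilar the occurrence counts of the elements in the given list are. In those cases, specifying RandomState is useful for getting a single, reproducible output.
[Thanks to @wen for pointing this out]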
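
One way to read the explanation above, sketched on a toy corpus: when the first document shares no token with the others, its projection onto the top two components is numerically zero, so Normalizer() ends up rescaling floating-point noise, and that noise depends on the seed. The corpora `disjoint` and `shared` and the helper `first_row_norm` are hypothetical and are not taken from the Home Depot data. The token pattern is the one from the question.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents. In `disjoint` the first document shares no
# token with the rest; in `shared` it shares the token "angl".
rest = [
    "strong tie angl connector",
    "strong tie angl connector joint",
    "nail screw steel",
    "nail screw steel zinc zinc",
]
disjoint = ["l bracket"] + rest
shared = ["angl bracket"] + rest


def first_row_norm(docs, seed):
    """Length of the first document's 2-D projection, before Normalizer()."""
    vectorizer = CountVectorizer(token_pattern=r"\d+\.\d+|\d+\/\d+|\b\w+\b")
    counts = vectorizer.fit_transform(docs).astype(float)
    svd = TruncatedSVD(n_components=2, random_state=seed)
    projected = svd.fit_transform(counts)
    return float(np.linalg.norm(projected[0]))


# Compare the two corpora over a handful of seeds.
disjoint_norms = [first_row_norm(disjoint, seed) for seed in range(5)]
shared_norms = [first_row_norm(shared, seed) for seed in range(5)]
print("disjoint:", disjoint_norms)
print("shared:  ", shared_norms)
```

If this reading is right, checking the row norms before calling Normalizer() is a cheap way to find the strings in a large pd.Series whose similarity scores are not meaningful.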

Regarding "python - Why do I get random results from seemingly non-random code with python sklearn?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38924726/
