
python - Create a very large sparse matrix CSV from a list of condensed data


I have a dictionary in the following format:

{
"sample1": set(["feature1", "feature2", "feature3"]),
"sample2": set(["feature1", "feature4", "feature5"]),
}

I have 20 million samples and 150K unique features.

I want to convert it into a CSV of the following format:

sample,feature1,feature2,feature3,feature4,feature5
sample1,1,1,1,0,0
sample2,1,0,0,1,1

What I have done so far:

ALL_FEATURES = list(set(features))
with open("features.csv", "w") as f:
    f.write("fvecmd5," + ",".join([str(x) for x in ALL_FEATURES]) + "\n")
    fvecs_lol = list(fvecs.items())
    fvecs_keys, fvecs_values = zip(*fvecs_lol)
    del fvecs_lol
    # one "0"/"1" flag per feature for every sample: this scans all
    # 150K features for each of the 20M samples
    tmp = [["1" if feature in featurelist else "0" for feature in ALL_FEATURES]
           for featurelist in fvecs_values]
    for i, entry in enumerate(tmp):
        f.write(fvecs_keys[i] + "," + ",".join(entry) + "\n")

But this runs very slowly. Is there a faster way, perhaps using NumPy/Cython?
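
A NumPy version of the same loop might look like the following minimal sketch (assuming `fvecs` and `ALL_FEATURES` as defined above; everything else is illustrative). It maps each feature to a fixed column index once, then sets only the bits present in each sample instead of testing all 150K features per row:

import numpy as np

# map each feature to a fixed column index, computed once
col = {feat: j for j, feat in enumerate(ALL_FEATURES)}

with open("features.csv", "w") as f:
    f.write("fvecmd5," + ",".join(map(str, ALL_FEATURES)) + "\n")
    row = np.zeros(len(ALL_FEATURES), dtype=np.uint8)
    for sample, feats in fvecs.items():
        row[:] = 0                                # reset the reusable row buffer
        row[[col[ft] for ft in feats]] = 1        # flip only this sample's bits
        f.write(sample + "," + ",".join(map(str, row)) + "\n")

This does O(len(feats)) work per sample rather than O(150K) membership tests, although joining 150K column strings per line still dominates the write.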

Best Answer

You can use sklearn.feature_extraction.text.CountVectorizer, which produces a sparse matrix, and then build a SparseDataFrame from it. First flatten each set into a single space-separated string so that CountVectorizer can tokenize it:

In [49]: s = pd.SparseSeries(d).astype(str).str.replace(r"[{,'}]",'')

In [50]: s
Out[50]:
sample1    feature1 feature2 feature3
sample2    feature1 feature5 feature4
dtype: object

In [51]: from sklearn.feature_extraction.text import CountVectorizer

In [52]: cv = CountVectorizer()

In [53]: r = pd.SparseDataFrame(cv.fit_transform(s),
    ...:                        s.index,
    ...:                        cv.get_feature_names(),
    ...:                        default_fill_value=0)

In [54]: r
Out[54]:
         feature1  feature2  feature3  feature4  feature5
sample1         1         1         1         0         0
sample2         1         0         0         1         1
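
Note that pd.SparseSeries and pd.SparseDataFrame were removed in pandas 1.0, and CountVectorizer.get_feature_names has been replaced by get_feature_names_out in recent scikit-learn. A minimal sketch of the same idea on current versions, assuming `d` is the {sample: set(features)} dict from the question:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# join each sample's feature set into one space-separated string
# so CountVectorizer can tokenize it
s = pd.Series({k: " ".join(v) for k, v in d.items()})

cv = CountVectorizer()
m = cv.fit_transform(s)  # scipy.sparse matrix of 0/1 counts

# build a sparse-backed DataFrame via the sparse accessor
r = pd.DataFrame.sparse.from_spmatrix(
    m, index=s.index, columns=cv.get_feature_names_out())
r.to_csv("features.csv", index_label="sample")

Keep in mind that CountVectorizer lowercases tokens by default (pass lowercase=False to preserve case), and that at 20M samples by 150K features the dense CSV itself will be enormous regardless of how it is built.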

Regarding "python - Create a very large sparse matrix CSV from a list of condensed data", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48329815/
