
machine-learning - What is the difference between HashingVectorizer and CountVectorizer in usage?

Reposted · Author: 行者123 · Updated: 2023-11-30 08:26:04

I am trying out various SVM variants in scikit-learn together with CountVectorizer and HashingVectorizer. Different examples call fit or fit_transform on them, and I am confused about when to use which.

Any clarification would be much appreciated.

Best answer

They serve a similar purpose. The documentation for HashingVectorizer lists some advantages and disadvantages of the hashing approach:

This strategy has several advantages:

  • it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory
  • it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
  • it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.

There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):

  • there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
  • there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
  • no IDF weighting as this would render the transformer stateful.
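The statelessness described above is also what resolves the fit vs. fit_transform confusion from the question. A minimal sketch (not from the original answer; the toy documents are illustrative) contrasting the two vectorizers:

```python
# Sketch: why CountVectorizer needs fit/fit_transform while
# HashingVectorizer can transform directly.
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["the cat sat", "the dog barked"]

# CountVectorizer is stateful: fit() learns a vocabulary, so you call
# fit_transform() on training data and plain transform() on new data.
cv = CountVectorizer()
X_train = cv.fit_transform(docs)           # learns vocabulary and vectorizes
X_new = cv.transform(["the cat barked"])   # reuses the learned vocabulary
print(sorted(cv.vocabulary_))              # ['barked', 'cat', 'dog', 'sat', 'the']

# HashingVectorizer is stateless: fit() is a no-op, so transform() alone
# is enough, even on documents the vectorizer has never seen.
hv = HashingVectorizer(n_features=2 ** 18)
X_hashed = hv.transform(docs)              # no vocabulary needed
print(X_hashed.shape)                      # (2, 262144)
```

Because HashingVectorizer computes nothing during fit, it can be dropped into streaming (partial_fit) pipelines, whereas CountVectorizer must see the full training corpus first.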

For this question, "machine-learning - What is the difference between HashingVectorizer and CountVectorizer in usage?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/30024122/
