
scala - Spark HashingTF result explanation

Reposted · Author: 行者123 · Updated: 2023-12-04 15:40:51

I tried the standard Spark HashingTF example on Databricks.

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
)).toDF("label", "sentence")

// Split each sentence into lowercase tokens.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// Hash each token into one of 20 buckets and count occurrences per bucket.
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
display(featurizedData)  // display() is Databricks-specific

I am having trouble understanding the result below.
With numFeatures set to 20:
[0,20,[0,5,9,17],[1,1,1,2]]
[0,20,[2,7,9,13,15],[1,1,3,1,1]]
[0,20,[4,6,13,15,18],[1,1,1,1,1]]
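
For reference, each row above is a sparse vector printed in the form (size, [indices], [values]): values(k) is the count of whatever terms hashed into bucket indices(k). A plain-Scala sketch of expanding the first row into a dense array (the helper name toDense is mine, not a Spark API):

```scala
object SparseDemo {
  // Expand a (size, indices, values) sparse triple into a dense array.
  def toDense(size: Int, indices: Seq[Int], values: Seq[Double]): Array[Double] = {
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  def main(args: Array[String]): Unit = {
    // [0,20,[0,5,9,17],[1,1,1,2]]: vector of size 20, four non-zero buckets.
    val dense = toDense(20, Seq(0, 5, 9, 17), Seq(1.0, 1.0, 1.0, 2.0))
    println(dense.mkString(", "))
    // "Hi I heard about Spark" has 5 tokens but only 4 non-zero buckets,
    // because two distinct tokens landed in bucket 17.
    println(s"total token count: ${dense.sum}")  // 5.0
  }
}
```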

If [0,5,9,17] are the hash values
and [1,1,1,2] are the frequencies, then:
17 has frequency 2,
9 has frequency 3 (it should have 2),
and 13 and 15 have 1 when they should have 2.

Maybe I am missing something. I could not find any documentation that explains this in detail.

Best answer

As mcelikkaya noted, the output frequencies are not what you would expect. This is caused by hash collisions when the number of features is small (20 in this case). I added some words to the input data (for illustration purposes) and increased numFeatures to 20,000, which then produces the correct frequencies:

+-----+---------------------------------------------------------+-------------------------------------------------------------------------+--------------------------------------------------------------------------------------+
|label|sentence |words |rawFeatures |
+-----+---------------------------------------------------------+-------------------------------------------------------------------------+--------------------------------------------------------------------------------------+
|0 |Hi hi hi hi I i i i i heard heard heard about Spark Spark|[hi, hi, hi, hi, i, i, i, i, i, heard, heard, heard, about, spark, spark]|(20000,[3105,9357,11777,11960,15329],[2.0,3.0,1.0,4.0,5.0]) |
|0 |I i wish Java could use case classes spark |[i, i, wish, java, could, use, case, classes, spark] |(20000,[495,3105,3967,4489,15329,16213,16342,19809],[1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0])|
|1 |Logistic regression models are neat |[logistic, regression, models, are, neat] |(20000,[286,1193,9604,13138,18695],[1.0,1.0,1.0,1.0,1.0]) |
+-----+---------------------------------------------------------+-------------------------------------------------------------------------+--------------------------------------------------------------------------------------+
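
The collision effect itself can be reproduced in plain Scala. The sketch below mimics the idea behind HashingTF, index = nonNegativeMod(hash(term), numFeatures); note that Spark's actual hash function (a seeded MurmurHash3 over the term's bytes) differs in its details, so the bucket indices printed here will not match Spark's output:

```scala
import scala.util.hashing.MurmurHash3

object CollisionDemo {
  // Same idea as HashingTF: hash the term, then take a non-negative
  // modulo to pick a bucket. Illustrative only, not bit-for-bit Spark.
  def bucket(term: String, numFeatures: Int): Int = {
    val h = MurmurHash3.stringHash(term)
    ((h % numFeatures) + numFeatures) % numFeatures
  }

  def main(args: Array[String]): Unit = {
    val words = "hi i heard about spark wish java could use case classes".split(" ")
    for (n <- Seq(20, 20000)) {
      // With only 20 buckets, distinct words are likely to share a bucket;
      // with 20,000 buckets, collisions become unlikely.
      val distinctBuckets = words.map(bucket(_, n)).toSet.size
      println(s"numFeatures=$n: ${words.length} words -> $distinctBuckets distinct buckets")
    }
  }
}
```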

Regarding "scala - Spark HashingTF result explanation", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41153131/
