
scala - How to count the words in each line of a text file using an RDD?

Reposted. Author: 行者123. Updated: 2023-12-05 08:23:24

Is there a way to use map and reduce to count word occurrences for each line of an RDD, rather than for the RDD as a whole?

For example, if an RDD[String] contains these two lines:

Let's have some fun.

To have fun you don't need any plans.

then the output should be something like a map of key-value pairs for each line:

("Let's",1)
("have",1)
("some",1)
("fun",1)

("To",1)
("have",1)
("fun",1)
("you",1)
("don't",1)
("need",1)
("plans",1)

Best answer

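If you are just starting out with Spark and nobody has told you to use the RDD API, don't use it. Spark SQL offers a nicer and often more efficient API for this and for many other distributed computations over large datasets.

Using the RDD API is like using assembler for something you could do in Scala (or another higher-level programming language). When starting your Spark journey, I would personally recommend beginning with Spark SQL's high-level API: DataFrames and Datasets.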


Given the input:

$ cat input.txt
Let's have some fun.

To have fun you don't need any plans.

and assuming you want to use the Dataset API, you could do the following:

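// Outside spark-shell, first: import org.apache.spark.sql.functions._ and import spark.implicits._
val lines = spark.read.text("input.txt").withColumnRenamed("value", "line")
// Split each line on whitespace and emit one row per word
val wordsPerLine = lines.withColumn("words", explode(split($"line", "\\s+")))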
scala> wordsPerLine.show(false)
+-------------------------------------+------+
|line |words |
+-------------------------------------+------+
|Let's have some fun. |Let's |
|Let's have some fun. |have |
|Let's have some fun. |some |
|Let's have some fun. |fun. |
| | |
|To have fun you don't need any plans.|To |
|To have fun you don't need any plans.|have |
|To have fun you don't need any plans.|fun |
|To have fun you don't need any plans.|you |
|To have fun you don't need any plans.|don't |
|To have fun you don't need any plans.|need |
|To have fun you don't need any plans.|any |
|To have fun you don't need any plans.|plans.|
+-------------------------------------+------+
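Note that the table above keeps the blank input line as a row of its own and leaves trailing punctuation attached ("fun." rather than "fun"), whereas the question's expected output has neither. A minimal sketch of one way to clean this up follows. It assumes an in-memory stand-in for input.txt, a local SparkSession, and a hypothetical name `cleanedWords`; the set of punctuation characters stripped is also an assumption, not something the original answer specifies.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, length, regexp_replace, split, trim}

// Assumption: a local session; in spark-shell this returns the existing `spark`.
val spark = SparkSession.builder().master("local[*]").appName("words-per-line").getOrCreate()
import spark.implicits._

// In-memory stand-in for input.txt, including the blank line.
val lines = Seq("Let's have some fun.", "", "To have fun you don't need any plans.").toDF("line")

val cleanedWords = lines
  .filter(length(trim($"line")) > 0)                                // drop blank lines
  .withColumn("words", explode(split(trim($"line"), "\\s+")))       // one row per token
  .withColumn("words", regexp_replace($"words", "[.,!?;:]+$", ""))  // strip trailing punctuation
  .filter(length($"words") > 0)                                     // drop tokens that were only punctuation

cleanedWords.show(false)
```

Apostrophes inside words ("Let's", "don't") are left untouched because the pattern only matches at the end of a token.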

scala> wordsPerLine.
groupBy("line", "words").
count.
withColumn("word_count", struct($"words", $"count")).
select("line", "word_count").
groupBy("line").
agg(collect_set("word_count")).
show(truncate = false)
+-------------------------------------+------------------------------------------------------------------------------+
|line |collect_set(word_count) |
+-------------------------------------+------------------------------------------------------------------------------+
|To have fun you don't need any plans.|[[fun,1], [you,1], [don't,1], [have,1], [plans.,1], [any,1], [need,1], [To,1]]|
|Let's have some fun. |[[have,1], [fun.,1], [Let's,1], [some,1]] |
| |[[,1]] |
+-------------------------------------+------------------------------------------------------------------------------+

Done. Pretty simple, isn't it?

See the functions object (for the explode and struct functions).
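For readers who do need the RDD API the question asked about, the per-line counting itself is ordinary Scala and needs no shuffle, because each line is processed independently. The following is a minimal sketch, not part of the original answer: `WordsPerLine` and `countWords` are hypothetical names, and the demo runs on a plain Seq so that it works without a Spark installation.

```scala
object WordsPerLine {
  // Counts how often each whitespace-separated token occurs within one line.
  def countWords(line: String): Map[String, Int] =
    line.split("\\s+")
      .filter(_.nonEmpty)            // split can yield empty tokens for blank or indented lines
      .groupBy(identity)             // token -> all of its occurrences in this line
      .map { case (word, occurrences) => word -> occurrences.length }

  def main(args: Array[String]): Unit = {
    // Plain Scala stand-in for the RDD[String] from the question.
    val lines = Seq("Let's have some fun.", "To have fun you don't need any plans.")
    lines.map(line => line -> countWords(line)).foreach(println)
  }
}
```

With an actual `RDD[String]` (say a hypothetical `rdd` obtained from `spark.sparkContext.textFile("input.txt")`), the same function could be applied as `rdd.map(line => line -> WordsPerLine.countWords(line))`, which yields one word-count map per line instead of a count over the whole RDD.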

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/43994499/
