scala - How to use reduceByKey on a RddPair<K,Tuple> in Scala

Reposted by: 行者123. Last updated: 2023-12-02 04:39:18


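I have a Cassandra table that I access through SparkContext.cassandraTable(), retrieving all of its CassandraRow objects.

Now I want to keep three pieces of information per row: (user, city, bytes). I store them like this: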

rddUsersFilter.map(row =>
(row.getString("user"),(row.getString("city"),row.getString("byte").replace(",","").toLong))).groupByKey

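This gives me an RDD[(String, Iterable[(String, Long)])]. Now, for each user, I want to sum all the bytes and build a map for the cities, e.g. "city" -> "occurrences" (how many times that city appears for that user).

Previously I split this work across two different RDDs, one to sum the bytes and another to build the map described above.

Example for the city occurrences: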

rddUsers.map(user => (user._1, user._2.size, user._2.groupBy(identity).map(city => (city._1,city._2.size))))

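That worked because I could reach the second element of the tuple with ._2. But now? My second element is an Iterable[(String, Long)], and I can no longer map over it the way I did before.

Is there a solution that retrieves all of this information with a single RDD and a single MapReduce pass?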
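For reference, the grouped Iterable[(String, Long)] can still be traversed per user. The following is a minimal sketch on a plain Scala collection standing in for the grouped RDD; the sample rows and the names `grouped` and `summary` are made up for illustration and do not come from the original data.

```scala
// Hypothetical sample rows with the same element type as the grouped RDD:
// (user, Iterable[(city, bytes)])
val grouped: Seq[(String, Iterable[(String, Long)])] = Seq(
  "user1" -> Seq(("city1", 100L), ("city1", 50L), ("city2", 25L)),
  "user2" -> Seq(("city1", 10L))
)

// For each user: total bytes and a city -> occurrences map
val summary: Seq[(String, Long, Map[String, Int])] = grouped.map {
  case (user, cityBytes) =>
    // Sum the second component (bytes) of every (city, bytes) pair
    val totalBytes = cityBytes.map(_._2).sum
    // Group the pairs by city and count how many pairs each city has
    val cityCounts = cityBytes.groupBy(_._1).map { case (city, hits) => city -> hits.size }
    (user, totalBytes, cityCounts)
}
```

The same function body should work inside a `map` over the grouped RDD, although groupByKey still shuffles every record across the cluster.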

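Best answer

You can do this easily by first aggregating the byte total and the occurrence count per (user, city) pair, and then grouping the results by user.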

// Sample data: (user, city, bytes)
val data = Array(("user1","city1",100),("user1","city1",100),
     ("user1","city1",100),("user1","city2",100),("user1","city2",100),
     ("user1","city3",100),("user1","city2",100),("user2","city1",100),
     ("user2","city2",100))
val rdd = sc.parallelize(data)

// Stage 1: key by (user, city) with value (occurrences, bytes), sum both
// per key, then re-key by user as (user, (city, occurrences, bytes))
val res = rdd.map(x => ((x._1, x._2), (1, x._3)))
             .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
             .map(x => (x._1._1, (x._1._2, x._2._1, x._2._2)))
             .groupByKey
// Stage 2: per user, build the city -> occurrences map and the byte total
val userCityUsageRdd = res.map(x => {
  val m = x._2.toList
  (x._1, m.map(y => y._1 -> y._2).toMap, m.map(y => y._3).sum)
})

Output

res20: Array[(String, scala.collection.immutable.Map[String,Int], Int)] = 
Array((user1,Map(city1 -> 3, city3 -> 1, city2 -> 3),700), 
      (user2,Map(city1 -> 1, city2 -> 1),200))
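
As a possible variation, the per-user map and byte total can also be built in one aggregation keyed by user, which is the shape that Spark's aggregateByKey expects. The sketch below only defines the two combining functions and checks them on a local collection; the names `Acc`, `zero`, `seqOp`, `combOp` and `local` are illustrative, bytes are assumed to be Long as in the question, and the Spark call is shown only in a comment under the assumption that `sc` is a live SparkContext.

```scala
// Accumulator per user: (city -> occurrences, total bytes)
type Acc = (Map[String, Int], Long)
val zero: Acc = (Map.empty[String, Int], 0L)

// Fold one (city, bytes) record into an accumulator
def seqOp(acc: Acc, rec: (String, Long)): Acc = {
  val (city, bytes) = rec
  (acc._1.updated(city, acc._1.getOrElse(city, 0) + 1), acc._2 + bytes)
}

// Merge two partial accumulators, e.g. coming from different partitions
def combOp(a: Acc, b: Acc): Acc = {
  val merged = b._1.foldLeft(a._1) { case (m, (city, n)) =>
    m.updated(city, m.getOrElse(city, 0) + n)
  }
  (merged, a._2 + b._2)
}

// Local check on the sample data from the answer, without Spark
val data = Seq(("user1","city1",100L),("user1","city1",100L),
  ("user1","city1",100L),("user1","city2",100L),("user1","city2",100L),
  ("user1","city3",100L),("user1","city2",100L),("user2","city1",100L),
  ("user2","city2",100L))
val local: Map[String, Acc] = data.groupBy(_._1).map { case (user, rows) =>
  user -> rows.map(r => (r._2, r._3)).foldLeft(zero)(seqOp)
}

// Assumed Spark usage (not executed here):
//   sc.parallelize(data).map { case (u, c, b) => (u, (c, b)) }
//     .aggregateByKey(zero)(seqOp, combOp)
```

Because records are folded into the accumulator on each partition before the shuffle, this form avoids materializing the full list of records per user that groupByKey produces.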

This question, "scala - How to use reduceByKey on a RddPair<K,Tuple> in Scala", corresponds to a similar question found on Stack Overflow: https://stackoverflow.com/questions/38788825/
