
algorithm - Spark : Counting co-occurrence - Algorithm for efficient multi-pass filtering of huge collections


There is a table with two columns, books and readers, where books and readers are the book and reader IDs, respectively:

   books readers
1:     1      30
2:     2      10
3:     3      20
4:     1      20
5:     1      10
6:     2      30

The record book = 1, reader = 30 means that the book with id = 1 was read by the user with id = 30. For each pair of books I need to count the number of readers who read both books, using the following algorithm:

for each book
  for each reader of the book
    for each other_book in books of the reader
      increment common_reader_count ((book, other_book), cnt)
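
For illustration, here is a minimal plain-Scala (non-Spark) sketch of the same counting idea; the variable names are my own and the data is the sample table above:

val recs = Seq((1, 30), (2, 10), (3, 20), (1, 20), (1, 10), (2, 30)) // (book, reader)

val readersOfBook = recs.groupBy(_._1).mapValues(_.map(_._2)) // book   -> readers of that book
val booksOfReader = recs.groupBy(_._2).mapValues(_.map(_._1)) // reader -> books that reader read

val commonReaderCount = scala.collection.mutable.Map[(Int, Int), Int]().withDefaultValue(0)
for {
  (book, readers) <- readersOfBook
  reader          <- readers
  otherBook       <- booksOfReader(reader)
  if otherBook != book // count only pairs of distinct books
} commonReaderCount((book, otherBook)) += 1
// e.g. commonReaderCount((1, 2)) == 2, because readers 10 and 30 read both book 1 and book 2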

The advantage of this algorithm is that it requires far fewer operations than counting over all book combinations taken two at a time.

To implement the above algorithm I group this data in two ways: 1) keyed by book, an RDD containing the readers of each book, and 2) keyed by reader, an RDD containing the books read by each reader, as in the following program:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.log4j.Logger
import org.apache.log4j.Level

object Small {

  case class Book(book: Int, reader: Int)
  case class BookPair(book1: Int, book2: Int, cnt: Int)

  val recs = Array(
    Book(book = 1, reader = 30),
    Book(book = 2, reader = 10),
    Book(book = 3, reader = 20),
    Book(book = 1, reader = 20),
    Book(book = 1, reader = 10),
    Book(book = 2, reader = 30))

  def main(args: Array[String]) {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
    // set up environment
    val conf = new SparkConf()
      .setAppName("Test")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    val data = sc.parallelize(recs)

    val bookMap = data.map(r => (r.book, r))
    val bookGrps = bookMap.groupByKey

    val readerMap = data.map(r => (r.reader, r))
    val readerGrps = readerMap.groupByKey

    // *** Calculate book pairs
    // Iterate book groups
    val allBookPairs = bookGrps.map(bookGrp => bookGrp match {
      case (book, recIter) =>
        // Iterate user groups
        recIter.toList.map(rec => {
          // Find readers for this book
          val aReader = rec.reader
          // Find all books (including this one) that this reader read
          val allReaderBooks = readerGrps.filter(readerGrp => readerGrp match {
            case (reader2, recIter2) => reader2 == aReader
          })
          val bookPairs = allReaderBooks.map(readerTuple => readerTuple match {
            case (reader3, recIter3) => recIter3.toList.map(rec => ((book, rec.book), 1))
          })
          bookPairs
        })
    })
    val x = allBookPairs.flatMap(identity)
    val y = x.map(rdd => rdd.first)
    val z = y.flatMap(identity)
    val p = z.reduceByKey((cnt1, cnt2) => cnt1 + cnt2)
    val result = p.map(bookPair => bookPair match {
      case ((book1, book2), cnt) => BookPair(book1, book2, cnt)
    })

    val resultCsv = result.map(pair => resultToStr(pair))
    resultCsv.saveAsTextFile("./result.csv")
  }

  def resultToStr(pair: BookPair): String = {
    val sep = "|"
    pair.book1 + sep + pair.book2 + sep + pair.cnt
  }
}

This implementation actually results in a different, inefficient algorithm:

for each book
  find each reader of the book, scanning all readers every time!
    for each other_book in books of the reader
      increment common_reader_count ((book, other_book), cnt)

This contradicts the main goal of the algorithm above because, instead of decreasing the number of operations, it increases it. Finding a user's books requires filtering all readers for every book, so the number of operations is ~ N * M, where N is the number of readers and M is the number of books.

Questions:

  1. Is there any way to implement the original algorithm in Spark without filtering the complete reader collection for every book?
  2. Are there any other algorithms to compute book pair counts efficiently?
  3. Also, when I actually run this code I get a filter exception that I cannot figure out. Any ideas?

Please see the exception log below:

15/05/29 18:24:05 WARN util.Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface eth0)
15/05/29 18:24:05 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/05/29 18:24:09 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/05/29 18:24:10 INFO Remoting: Starting remoting
15/05/29 18:24:10 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.0.2.15:38910]
15/05/29 18:24:10 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@10.0.2.15:38910]
15/05/29 18:24:12 ERROR executor.Executor: Exception in task 0.0 in stage 6.0 (TID 4)
java.lang.NullPointerException
at org.apache.spark.rdd.RDD.filter(RDD.scala:282)
at Small$$anonfun$4$$anonfun$apply$1.apply(Small.scala:58)
at Small$$anonfun$4$$anonfun$apply$1.apply(Small.scala:54)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at Small$$anonfun$4.apply(Small.scala:54)
at Small$$anonfun$4.apply(Small.scala:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

Update:

This code:

val df = sc.parallelize(Array((1,30),(2,10),(3,20),(1,10),(2,30))).toDF("books","readers")
val results = df.join(
  df.select($"books" as "r_books", $"readers" as "r_readers"),
  $"readers" === $"r_readers" and $"books" < $"r_books"
)
.groupBy($"books", $"r_books")
.agg($"books", $"r_books", count($"readers"))

gives the following result:

books r_books COUNT(readers)
1 2 2

So COUNT here is the number of times the two books (here 1 and 2) were read together, i.e. the number of such pairs.

Best Answer

This sort of thing is a lot easier if you convert the original RDD to a DataFrame:

val df = sc.parallelize(
  Array((1,30),(2,10),(3,20),(1,10), (2,30))
).toDF("books","readers")
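
One setup note (not part of the original answer, and assuming Spark 1.3+ as suggested by the logs in the question): toDF on an RDD of tuples and the $"column" syntax require the SQLContext implicits to be in scope, and count / countDistinct come from org.apache.spark.sql.functions:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{count, countDistinct}

val sqlContext = new SQLContext(sc)   // sc is the SparkContext from the question
import sqlContext.implicits._         // enables rdd.toDF(...) and the $"column" syntax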

Once you've done that, just do a self-join on the DataFrame to make book pairs, then count how many readers read each book pair:

val results = df.join(
  df.select($"books" as "r_books", $"readers" as "r_readers"),
  $"readers" === $"r_readers" and $"books" < $"r_books"
).groupBy(
  $"books", $"r_books"
).agg(
  $"books", $"r_books", count($"readers")
)

As for a little more explanation of the join, note that I am joining df back to itself -- a self-join: df.join(df.select(...), ...). What you are looking to do is to stitch together book #1 -- $"books" -- with a second book -- $"r_books" -- from the same reader -- $"readers" === $"r_readers". But if you joined only on $"readers" === $"r_readers", you would join each book back to itself. Instead, I use $"books" < $"r_books" to ensure that the ordering in the book pairs is always (<lower_id>,<higher_id>).

Once you do the join, you get a DataFrame with one row for every reader of every book pair. The groupBy and agg functions do the actual counting of the number of readers per book pair.
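
As a small usage note (not in the original answer): printing the result on the sample data should give exactly the single row shown in the update above:

results.show()
// books r_books COUNT(readers)
// 1     2       2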

As an aside, if a reader read the same book twice, I believe you would end up double-counting, which may or may not be what you want. If that's not what you want, just change count($"readers") to countDistinct($"readers").
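
For example, applying that change to the answer's code would look like this (a sketch assuming the same df and the implicits and functions imports shown above):

// Same self-join, but each reader contributes at most once per book pair.
val resultsDistinct = df.join(
  df.select($"books" as "r_books", $"readers" as "r_readers"),
  $"readers" === $"r_readers" and $"books" < $"r_books"
).groupBy(
  $"books", $"r_books"
).agg(
  $"books", $"r_books", countDistinct($"readers")
)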

If you want to learn more about the agg functions count() and countDistinct() and a bunch of other fun stuff, check out the scaladoc for org.apache.spark.sql.functions.
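
As for question 1 in the post, a pure-RDD sketch of the same self-join idea is also possible. This is my own sketch, not part of the accepted answer; it assumes the sc from the question and the (book, reader) tuples from the update. Keying by reader and joining the RDD with itself on the reader key avoids filtering the reader collection for every book:

// RDD-only sketch: key by reader, self-join on the reader key, count by book pair.
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed on older Spark versions)

val rdd = sc.parallelize(Array((1, 30), (2, 10), (3, 20), (1, 10), (2, 30))) // (book, reader)
val byReader = rdd.map { case (book, reader) => (reader, book) }             // (reader, book)

val commonReaderCounts = byReader.join(byReader)            // (reader, (book1, book2))
  .filter { case (_, (b1, b2)) => b1 < b2 }                 // keep each unordered pair once
  .map { case (_, (b1, b2)) => ((b1, b2), 1) }
  .reduceByKey(_ + _)                                       // ((book1, book2), common readers)

// If duplicate (book, reader) rows should not be double-counted, call rdd.distinct() first.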

Regarding algorithm - Spark: Counting co-occurrence - Algorithm for efficient multi-pass filtering of huge collections, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30534068/
