
scala - Aggregate duplicate records by maintaining the order and also include duplicate records

Reposted. Author: 行者123. Last updated: 2023-12-01 01:53:07

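I'm trying to solve an interesting problem. It is easy to just do a groupBy for aggregations such as sum, count, and so on, but this problem is slightly different. Let me explain:

Here is my list of tuples: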

val repeatSmokers: List[(String, String, String, String, String, String)] =
List(
("ID76182", "sachin", "kita MR.", "56308", "1990", "300"),
("ID76182", "KOUN", "Jana MR.", "56714", "1990", "100"),
("ID76182", "GANGS", "SKILL", "27539", "1990", "255"),
("ID76182", "GANGS", "SKILL", "27539", "1990", "110"),
("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "20"),
("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "6750"),
("ID76182", "DOWNES", "RYAN", "47542", "1990", "2090"),
("ID76182", "DRAGON", "WARS", "49337", "1990", "200"),
("ID76182", "HULK", "PAIN MR.", "47542", "1990", "280"),
("ID76182", "JAMES", "JIM", "30548", "1990", "300"),
("ID76182", "KIMMELSHUE", "RUTH", "55345", "1990", "2600"),
("ID76182", "DRAGON", "WARS", "49337", "1990", "370"),
("ID76182", "COOPER", "ANADA", "45873", "1990", "2600"),
("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "2600"),
("ID76182", "HULK", "PAIN MR.", "47542", "1990", "256")
)

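The schema of these records is (Idnumber, name, test_code, year, amount), where the name is stored as two fields. From these elements I want only the repeated records. The way we define a unique combination in the list above is by taking the name and test_code combination, for example (sachin, kita MR., 56308). This means that if the same name and test_code repeat, it is a repeat-smoker record. For simplicity, you can assume test_code alone is the unique value: if it repeats, you can say it is a repeat-smoker record.

Below is the exact output I expect: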
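To make the definition of a repeat-smoker record concrete, here is a minimal sketch that finds the name and test_code combinations occurring more than once. The object name `RepeatKeys` and the method `repeatedKeys` are hypothetical, introduced only for illustration; the key is taken as fields 2 to 4 of each tuple, which is an assumption based on the sample data.

```scala
object RepeatKeys {
  // (Idnumber, last name, first name, test_code, year, amount), all as strings
  type RawRecord = (String, String, String, String, String, String)
  // Assumed key: (last name, first name, test_code)
  type NameAndCode = (String, String, String)

  // Returns every key that appears in at least two records.
  def repeatedKeys(records: List[RawRecord]): Set[NameAndCode] =
    records
      .groupBy(r => (r._2, r._3, r._4))
      .filter { case (_, rs) => rs.size > 1 }
      .keySet
      .toSet
}
```

Note that this only identifies which keys repeat; it does not preserve the order of the records, which is the harder part of the problem.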

ID76182,27539,1990,255,1 
ID76182,27539,1990,365,2
ID76182,45873,1990,20,1
ID76182,45873,1990,6770,2
ID76182,45873,1990,9370,3
ID76182,49337,1990,200,1
ID76182,49337,1990,570,2
ID76182,47542,1990,280,1
ID76182,47542,1990,536,2

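Finally, the challenging part here is to maintain the order and a running total on each repeat-smoker record, and also to add the number of occurrences.

For example, the schema of this record is: ID76182,47542,1990,536,2

IDNumber, test_code, year, amount, occurrences

Since it occurred twice, we see the 2 above.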
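The running total and occurrence number for a single key can be sketched with `scanLeft`. This is only an illustration of the arithmetic for one group of amounts; the object name `RunningTotals` and the method `runningTotals` are hypothetical.

```scala
object RunningTotals {
  // For amounts in their original order, returns (running sum, occurrence number) pairs.
  def runningTotals(amounts: List[Int]): List[(Int, Int)] =
    amounts
      .scanLeft(0)(_ + _) // prefix sums, starting with the seed 0
      .tail               // drop the seed so each entry lines up with an amount
      .zipWithIndex
      .map { case (sum, i) => (sum, i + 1) } // occurrence numbers start at 1
}
```

For the three amounts of test_code 45873 in the sample (20, 6750, 2600), this yields (20,1), (6770,2), (9370,3), which matches the amount and occurrence columns of the expected output.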

Note:

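The output can be a list of any collection type, but it should be in the same format as I mentioned above.

Best answer

Here is some Scala code, although it is really Java code written in Scala: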

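import java.util.ArrayList
import java.util.LinkedHashMap
// Provides .asScala on Scala 2.13+; on Scala 2.12 and earlier use scala.collection.JavaConverters._ instead
import scala.jdk.CollectionConverters._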


type RawRecord = (String, String, String, String, String, String)
type Record = (String, String, String, String, Int, Int)
type RecordKey = (String, String, String, String)
type Output = (String, String, String, String, Int, Int, Int)
val keyF: Record => RecordKey = r => (r._1, r._2, r._3, r._4)
val repeatSmokersRaw: List[RawRecord] =
List(
("ID76182", "sachin", "kita MR.", "56308", "1990", "300"),
("ID76182", "KOUN", "Jana MR.", "56714", "1990", "100"),
("ID76182", "GANGS", "SKILL", "27539", "1990", "255"),
("ID76182", "GANGS", "SKILL", "27539", "1990", "110"),
("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "20"),
("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "6750"),
("ID76182", "DOWNES", "RYAN", "47542", "1990", "2090"),
("ID76182", "DRAGON", "WARS", "49337", "1990", "200"),
("ID76182", "HULK", "PAIN MR.", "47542", "1990", "280"),
("ID76182", "JAMES", "JIM", "30548", "1990", "300"),
("ID76182", "KIMMELSHUE", "RUTH", "55345", "1990", "2600"),
("ID76182", "DRAGON", "WARS", "49337", "1990", "370"),
("ID76182", "COOPER", "ANADA", "45873", "1990", "2600"),
("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "2600"),
("ID76182", "HULK", "PAIN MR.", "47542", "1990", "256")
)
val repeatSmokers = repeatSmokersRaw.map(r => (r._1, r._2, r._3, r._4, r._5.toInt, r._6.toInt))

// LinkedHashMap iterates in insertion order; each value holds (output rows so far, count, running sum)
val acc = new LinkedHashMap[RecordKey, (ArrayList[Output], Int, Int)]
repeatSmokers.foreach(r => {
  val key = keyF(r)
  var cur = acc.get(key)
  if (cur == null) {
    // first time this key is seen: start with an empty list, count 0 and sum 0
    cur = (new ArrayList[Output](), 0, 0)
  }
  val nextCnt = cur._2 + 1
  val sum = cur._3 + r._6
  val output = (r._1, r._2, r._3, r._4, r._5, sum, nextCnt)
  cur._1.add(output)
  acc.put(key, (cur._1, nextCnt, sum))
})
// keep only keys seen more than once, then flatten the per-key lists in insertion order
val result = acc.values().asScala.filter(p => p._2 > 1).flatMap(p => p._1.asScala)
// or if you are clever you can merge filter and flatMap as
// val result = acc.values().asScala.flatMap(p => if (p._1.size > 1) p._1.asScala else Nil)

println(result.mkString("\n"))

This prints:

(ID76182,GANGS,SKILL,27539,1990,255,1)
(ID76182,GANGS,SKILL,27539,1990,365,2)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,20,1)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,6770,2)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,9370,3)
(ID76182,DRAGON,WARS,49337,1990,200,1)
(ID76182,DRAGON,WARS,49337,1990,570,2)
(ID76182,HULK,PAIN MR.,47542,1990,280,1)
(ID76182,HULK,PAIN MR.,47542,1990,536,2)

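The main trick in this code is to use Java's LinkedHashMap as the accumulator collection, because it preserves insertion order. An additional trick is to store a list inside each entry (since I'm using Java collections, I decided to use ArrayList as the inner accumulator, but you can use whatever you like). So the idea is to build a map of key => list of smokers, and additionally to store for each key the current counter and the current sum, so that an "aggregated" smoker can be added to the list. Once the map is built, go through it to filter out the keys that have not accumulated at least 2 records, and then flatten the map of lists into a single list (this is where LinkedHashMap matters, because insertion order is preserved during iteration).

Regarding "scala - Aggregate duplicate records by maintaining the order and also include duplicate records", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48782282/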
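The same idea can also be sketched with Scala's own `scala.collection.mutable.LinkedHashMap`, which likewise iterates in insertion order, so no Java interop is needed. This is not the answer's code but a variation on it: the object name `RepeatSmokersSketch` and the method `aggregate` are hypothetical, and instead of storing the counter and sum next to each list, the sketch derives them from the list itself (its size and the sum in its last row).

```scala
import scala.collection.mutable

object RepeatSmokersSketch {
  type Record = (String, String, String, String, Int, Int)
  type RecordKey = (String, String, String, String)
  type Output = (String, String, String, String, Int, Int, Int)

  def aggregate(records: Seq[Record]): List[Output] = {
    // Keys are visited later in the order they were first inserted.
    val acc = mutable.LinkedHashMap.empty[RecordKey, mutable.ArrayBuffer[Output]]
    records.foreach { r =>
      val key: RecordKey = (r._1, r._2, r._3, r._4)
      val rows = acc.getOrElseUpdate(key, mutable.ArrayBuffer.empty[Output])
      // Running sum = sum stored in the previous row for this key (0 if none) + this amount.
      val sum = rows.lastOption.map(_._6).getOrElse(0) + r._6
      // Occurrence number = rows already stored for this key + 1.
      rows += ((r._1, r._2, r._3, r._4, r._5, sum, rows.size + 1))
    }
    // Drop keys with a single record, then flatten the rest in insertion order.
    acc.valuesIterator.filter(_.size > 1).flatMap(_.iterator).toList
  }
}
```

Calling `RepeatSmokersSketch.aggregate(repeatSmokers)` on the converted list from the answer should give the same nine rows as the printed output above, as a `List` rather than an `Iterable`.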
