
scala - Remove duplicates from Spark SQL when joining two DataFrames


I have two Spark DataFrames that I need to join. I only want to select from df2 the values that exist in df1, and the rows should not be duplicated.

For example:

df1:

+-------------+---------------+----------+
|a |b |val |
+-------------+---------------+----------+
| 202003101750| 202003101700|1712384842|
| 202003101740| 202003101700|1590554927|
| 202003101730| 202003101700|1930860788|
| 202003101730| 202003101600| 101713|
| 202003101720| 202003101700|1261542412|
| 202003101720| 202003101600| 1824155|
| 202003101710| 202003101700| 912601761|
+-------------+---------------+----------+

df2:

+-------------+---------------+
|a |b |
+-------------+---------------+
| 202003101800| 202003101700|
| 202003101800| 202003101700|
| 202003101750| 202003101700|
| 202003101750| 202003101700|
| 202003101750| 202003101700|
| 202003101750| 202003101700|
| 202003101740| 202003101700|
| 202003101740| 202003101700|
+-------------+---------------+
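
For reference, here is a minimal sketch (the SparkSession setup and local master are assumptions, not part of the original post) that reproduces the two example DataFrames:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dedup-join").master("local[*]").getOrCreate()
import spark.implicits._

// df1: join keys (a, b) plus a value column
val df1 = Seq(
  ("202003101750", "202003101700", 1712384842L),
  ("202003101740", "202003101700", 1590554927L),
  ("202003101730", "202003101700", 1930860788L),
  ("202003101730", "202003101600", 101713L),
  ("202003101720", "202003101700", 1261542412L),
  ("202003101720", "202003101600", 1824155L),
  ("202003101710", "202003101700", 912601761L)
).toDF("a", "b", "val")

// df2: join keys only, with duplicate (a, b) rows
val df2 = Seq(
  ("202003101800", "202003101700"),
  ("202003101800", "202003101700"),
  ("202003101750", "202003101700"),
  ("202003101750", "202003101700"),
  ("202003101750", "202003101700"),
  ("202003101750", "202003101700"),
  ("202003101740", "202003101700"),
  ("202003101740", "202003101700")
).toDF("a", "b")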

This is what I am doing:

df1.join(df2, Seq("a", "b"), "leftouter").where(col("val").isNotNull)

But my output has several duplicate rows:

+-------------+---------------+----------+
|a |b |val |
+-------------+---------------+----------+
| 202003101750| 202003101700|1712384842|
| 202003101750| 202003101700|1712384842|
| 202003101750| 202003101700|1712384842|
| 202003101750| 202003101700|1712384842|
| 202003101740| 202003101700|1590554927|
| 202003101740| 202003101700|1590554927|
| 202003101740| 202003101700|1590554927|
| 202003101740| 202003101700|1590554927|
+-------------+---------------+----------+

I am trying to achieve something like an except, as if val were dropped from df1, but except does not seem to work. For example, this is the desired operation:

df1.drop(col("val")).except(df2)

The schema of df1 is as follows:

root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
|-- val: long (nullable = true)

Also, what exactly is the difference between a left outer join and except? Expected output:

+-------------+---------------+----------+
|a |b |val |
+-------------+---------------+----------+
| 202003101750| 202003101700|1712384842|
| 202003101740| 202003101700|1590554927|
+-------------+---------------+----------+

Best Answer

A left outer join returns all rows from the left table and the matching rows from the right table.

except returns the rows of the first DataFrame that are not present in the second DataFrame, with duplicates removed.
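
As a quick illustration on the sample data (a sketch; only the key columns are selected so that both sides have the same schema, which except requires):

df1.select("a", "b").except(df2).show()
// returns each (a, b) pair that appears in df1 but not in df2, deduplicated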

In your case, you can use an inner (or) outer join together with dropDuplicates.

df1.join(df2, Seq("a", "b"), "inner").dropDuplicates().show()
//+------------+------------+----------+
//| a| b| val|
//+------------+------------+----------+
//|202003101740|202003101700|1590554927|
//|202003101750|202003101700|1712384842|
//+------------+------------+----------+

df1.join(df2, Seq("a", "b"), "rightouter").where(col("val").isNotNull).dropDuplicates().show()
//+------------+------------+----------+
//| a| b| val|
//+------------+------------+----------+
//|202003101740|202003101700|1590554927|
//|202003101750|202003101700|1712384842|
//+------------+------------+----------+
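
Another option, not mentioned in the original answer: a left semi join keeps each row of df1 that has at least one match in df2 and never multiplies rows, so no dropDuplicates call is needed:

df1.join(df2, Seq("a", "b"), "left_semi").show()
//+------------+------------+----------+
//| a| b| val|
//+------------+------------+----------+
//|202003101750|202003101700|1712384842|
//|202003101740|202003101700|1590554927|
//+------------+------------+----------+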

Regarding scala - Remove duplicates from Spark SQL when joining two DataFrames, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/61472104/
