google-bigquery - BigQuery: Deleting Duplicates in Partitioned Table


I have a BQ table that is partitioned by insertion time.
I am trying to delete duplicates from the table. These are true duplicates: for two duplicated rows, all columns are equal. Of course, having a unique key would have helped :-(

At first I tried a SELECT query to enumerate the duplicates and remove them:

SELECT
  * EXCEPT(row_number)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id_column) AS row_number
  FROM
    `mytable`)
WHERE
  row_number = 1

This produces the unique rows, but it creates a new table that does not carry the partitioning - so that's no good.

I have seen this answer here, which states that the only way to preserve the partitioning is to go over the partitions one by one with the above query and save each result to the corresponding destination table partition.
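For reference, that per-partition approach looks roughly like the sketch below: run the deduplication query filtered to a single partition (the date and id_column here are just placeholders) and write the result back over that same partition by setting the query's destination to the matching partition decorator (e.g. mytable$20181206) with WRITE_TRUNCATE:

-- Per-partition dedup sketch: one run per partition, with the result
-- written back to the corresponding mytable$YYYYMMDD partition
SELECT
  * EXCEPT(row_number)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id_column) AS row_number
  FROM
    `mytable`
  WHERE
    _PARTITIONTIME = TIMESTAMP('2018-12-06'))
WHERE
  row_number = 1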

What I would really like to do is delete the duplicate rows with a DML DELETE. I tried something like what this answer suggested:
DELETE
FROM `mytable` AS d
WHERE (SELECT ROW_NUMBER() OVER (PARTITION BY id_column)
       FROM `mytable` AS d2
       WHERE d.id = d2.id) > 1;

But the accepted answer's approach does not work and results in a BQ error:
Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN
It would be great if someone could suggest a simpler way (DML or otherwise) to handle this, so that I do not need to go over all the partitions individually.

Best Answer

It's a bit of a hack, but you can use a MERGE statement to delete all of the contents of the table and re-insert only the distinct rows, atomically. Here is an example:

-- Create a table with some duplicate rows
CREATE TABLE dataset.PartitionedTable
PARTITION BY date AS
SELECT x, CONCAT('foo', CAST(x AS STRING)) AS y, DATE_SUB(CURRENT_DATE(), INTERVAL x DAY) AS date
FROM UNNEST(GENERATE_ARRAY(1, 10)) AS x, UNNEST(GENERATE_ARRAY(1, 10));
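A quick sanity check: the second UNNEST only multiplies the rows, so at this point every (x, y, date) combination should appear 10 times:

-- Optional check: each distinct row should currently have 10 copies
SELECT x, y, date, COUNT(*) AS copies
FROM dataset.PartitionedTable
GROUP BY x, y, date;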

Now for the MERGE part:
-- Execute a MERGE statement where all original rows are deleted,
-- then replaced with new, deduplicated rows:
MERGE dataset.PartitionedTable AS t1
USING (SELECT DISTINCT * FROM dataset.PartitionedTable) AS t2
ON FALSE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
WHEN NOT MATCHED BY SOURCE THEN DELETE
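Because the ON FALSE condition never matches, every existing row in the target falls under NOT MATCHED BY SOURCE and is deleted, while every row of the deduplicated subquery falls under NOT MATCHED BY TARGET and is inserted, all within a single atomic statement. Afterwards, the total and distinct row counts should agree:

-- Optional check: after the MERGE, total and distinct counts should match
SELECT
  (SELECT COUNT(*) FROM dataset.PartitionedTable) AS total_rows,
  (SELECT COUNT(*)
   FROM (SELECT DISTINCT * FROM dataset.PartitionedTable)) AS distinct_rows;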

Regarding google-bigquery - BigQuery: Deleting Duplicates in Partitioned Table, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53650596/
