gpt4 book ai didi

postgresql - 从另一个表填充随机数据

转载 作者:行者123 更新时间:2023-11-29 11:28:23 25 4
gpt4 key购买 nike

update dataset1.test
set column4 = (select column1
from dataset2
order by random()
limit 1
)

我必须更新第 4 列的数据集 1,每行更新数据集 2 列中的一个随机条目。但是到目前为止,在上面的查询中,我在数据集 1 的所有行中只得到一个随机条目,而且我都一样希望它是随机的。

最佳答案

设置

让我们首先假设您的表和数据如下。请注意,我假设 dataset1 有一个主键(它可以是一个复合键,但为了简单起见,我们将其设为整数):

CREATE TABLE dataset1
(
id INTEGER PRIMARY KEY,
column4 TEXT
) ;

CREATE TABLE dataset2
(
column1 TEXT
) ;

我们用示例数据填充两个表

INSERT INTO dataset1
(id, column4)
SELECT
i, 'column 4 for id ' || i
FROM
generate_series(101, 120) AS s(i);

INSERT INTO dataset2
(column1)
SELECT
'SOMETHING ' || i
FROM
generate_series (1001, 1020) AS s(i) ;

健全性检查:

SELECT count(DISTINCT column4) FROM dataset1 ;
| count || ----: ||    20 |

Case 1: number of rows in dataset1 <= rows in dataset2

We'll perform a complete shuffling. Values from dataset2 will be used once, and no more than once.

EXPLANATION

In order to make an update that shuffles all the values from column4 in arandom fashion, we need some intermediate steps.

First, for the dataset1, we need to create a list (relation) of tuples (id, rn), thatare just:

(id_1,   1),
(id_2, 2),
(id_3, 3),
...
(id_20, 20)

id_1, ..., id_20dataset1 上的 ID。它们可以是任何类型,它们不必是连续的,它们可以是复合的。

对于 dataset2,我们需要创建另一个 (column_1,rn) 列表,如下所示:

(column1_1,  17),
(column1_2, 3),
(column1_3, 11),
...
(column1_20, 15)

在这种情况下,第二列包含所有值 1 .. 20,但已打乱顺序。

一旦我们有了这两个关系,我们就JOIN它们ON ... rn。在实践中,这会生成另一个包含 (id, column1) 的元组列表,其中配对是随机完成的。我们使用这些对来更新 dataset1

真正的查询

这一切都可以通过使用一些 CTE(WITH 语句)来保持中间关系来完成(很明显,我希望):

WITH original_keys AS
(
-- This creates tuples (id, rn),
-- where rn increases from 1 to number or rows
SELECT
id,
row_number() OVER () AS rn
FROM
dataset1
)
, shuffled_data AS
(
-- This creates tuples (column1, rn)
-- where rn moves between 1 and number of rows, but is randomly shuffled
SELECT
column1,
-- The next statement is what *shuffles* all the data
row_number() OVER (ORDER BY random()) AS rn
FROM
dataset2
)
-- You update your dataset1
-- with the shuffled data, linking back to the original keys
UPDATE
dataset1
SET
column4 = shuffled_data.column1
FROM
shuffled_data
JOIN original_keys ON original_keys.rn = shuffled_data.rn
WHERE
dataset1.id = original_keys.id ;

请注意,技巧是通过以下方式执行的:

row_number() OVER (ORDER BY random()) AS rn

row_number() window function从 1 开始产生与行数一样多的连续数字。这些数字随机打乱,因为 OVER 子句获取所有数据并随机排序。

支票

我们可以再次检查:

SELECT count(DISTINCT column4) FROM dataset1 ;
| count || ----: ||    20 |
SELECT * FROM dataset1 ;
 id | column4       --: | :-------------101 | SOMETHING 1016102 | SOMETHING 1009103 | SOMETHING 1003...118 | SOMETHING 1012119 | SOMETHING 1017120 | SOMETHING 1011

ALTERNATIVE

Note that this can also be done with subqueries, by simple substitution, instead of CTEs. That might improve performance in some occasions:

UPDATE
dataset1
SET
column4 = shuffled_data.column1
FROM
(SELECT
column1,
row_number() OVER (ORDER BY random()) AS rn
FROM
dataset2
) AS shuffled_data
JOIN
(SELECT
id,
row_number() OVER () AS rn
FROM
dataset1
) AS original_keys ON original_keys.rn = shuffled_data.rn
WHERE
dataset1.id = original_keys.id ;

再一次...

SELECT * FROM dataset1;
 id | column4       --: | :-------------101 | SOMETHING 1011102 | SOMETHING 1018103 | SOMETHING 1007...118 | SOMETHING 1020119 | SOMETHING 1002120 | SOMETHING 1016

You can check the whole setup and experiment at dbfiddle here

NOTE: if you do this with very large datasets, don't expect it to be extremely fast. Shuffling a very big deck of cards is expensive.


Case 2: number of rows in dataset1 > rows in dataset2

In this case, values for column4 can be repeated several times.

The easiest possibility I can think of (probably, not an efficient one, but easy to understand) is to create a function random_column1, marked as VOLATILE:

CREATE FUNCTION random_column1() 
RETURNS TEXT
VOLATILE -- important!
LANGUAGE SQL
AS
$$
SELECT
column1
FROM
dataset2
ORDER BY
random()
LIMIT
1 ;
$$ ;

并用它来更新:

UPDATE
dataset1
SET
column4 = random_column1();

这样,dataset2 中的某些值可能根本不会被使用,而其他被使用不止一次。

dbfiddle here

关于postgresql - 从另一个表填充随机数据,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/45510958/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com