
csv - Importing a CSV file (with empty strings and duplicates) into DynamoDB

I have a CSV file I want to import into Amazon DynamoDB. So I uploaded it to S3, set up an EMR cluster, and created an external table like this:

hive> CREATE EXTERNAL TABLE s3_table_myitems (colA BIGINT, colB STRING, colC STRING, colD DOUBLE, colE DOUBLE, colF STRING, colG STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('serialization.null.format'='""')
STORED AS TEXTFILE
LOCATION 's3://bucketname/dirname/'
TBLPROPERTIES ('skip.header.line.count'='1');

Any of the columns in the CSV can be empty, but DynamoDB can't handle empty strings ("com.amazonaws.AmazonServiceException: One or more parameter values were invalid: An AttributeValue may not contain an empty string").

Here's what Amazon says:

We will consider this optional "ignore empty string" behavior in a future release. … As a workaround, you could … transform empty attribute values into NULLs. For example, you can … use a more complex SELECT expression to turn empty strings into something else, including setting them to NULL.
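
One way to read that suggestion, as a rough sketch (a CASE WHEN per nullable column, reusing the column names from the table above; untested here):

hive> SELECT
    colA,
    CASE WHEN colB = '' THEN NULL ELSE colB END AS colB,
    CASE WHEN colC = '' THEN NULL ELSE colC END AS colC,
    -- ...same pattern for the remaining nullable columns...
    CASE WHEN colG = '' THEN NULL ELSE colG END AS colG
FROM s3_table_myitems;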

Here's what I came up with, but it looks ugly:

hive> INSERT INTO TABLE ddb_tbl_ingredients
SELECT
regexp_replace(colA, '^$', 'NULL'),
regexp_replace(colB, '^$', 'NULL'),
regexp_replace(colC, '^$', 'NULL'),
regexp_replace(colD, '^$', 'NULL'),
regexp_replace(colE, '^$', 'NULL'),
regexp_replace(colF, '^$', 'NULL'),
regexp_replace(colG, '^$', 'NULL')
FROM s3_table_ingredients;

Is there a better solution to the overall problem (short of preprocessing the CSV), or at least better SELECT syntax?


Edit: I ended up having to deal with duplicates as well ("com.amazonaws.AmazonServiceException: Provided list of item keys contains duplicates").

For posterity, here's my full flow. I'd love to hear of a better way to do this, whether aesthetically or performance-wise. The task sounds simple ("import a CSV file into DynamoDB"), but it has taken several hours so far :P

# source
hive> CREATE EXTERNAL TABLE s3_table_myitems (colA STRING, colB STRING, colC DOUBLE, colD DOUBLE, colE STRING, colF STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('serialization.null.format'='""')
STORED AS TEXTFILE
LOCATION 's3://bucketname/dirname/'
TBLPROPERTIES ('skip.header.line.count'='1');

# destination
hive> CREATE EXTERNAL TABLE ddb_tbl_myitems (colA STRING, colB STRING, colC DOUBLE, colD DOUBLE, colE STRING, colF STRING)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "myitems",
"dynamodb.column.mapping" = "colA:colA,colB:colB,colC:colC,colD:colD,colE:colE,colF:colF");

# remove dupes - http://stackoverflow.com/a/34165762/594211
hive> CREATE TABLE tbl_myitems_deduped AS
SELECT colA, min(colB) AS colB, min(colC) AS colC, min(colD) AS colD, min(colE) AS colE, min(colF) AS colF
FROM (SELECT colA, colB, colC, colD, colE, colF, rank() OVER
(PARTITION BY colA ORDER BY colB, colC, colD, colE, colF)
AS col_rank FROM s3_table_myitems) t
WHERE t.col_rank = 1
GROUP BY colA;
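
An alternative sketch of the dedup step that may read more simply: row_number() picks exactly one full row per colA, so the outer min()/GROUP BY isn't needed (note this keeps one whole row per key instead of per-column minimums across tied rows):

hive> CREATE TABLE tbl_myitems_deduped AS
SELECT colA, colB, colC, colD, colE, colF
FROM (SELECT colA, colB, colC, colD, colE, colF,
      row_number() OVER (PARTITION BY colA ORDER BY colB, colC, colD, colE, colF) AS row_num
      FROM s3_table_myitems) t
WHERE t.row_num = 1;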

# replace empty strings with placeholder 'NULL'
hive> CREATE TABLE tbl_myitems_noempty AS
SELECT colA,
regexp_replace(colB, '^$', 'NULL') AS colB,
regexp_replace(colC, '^$', 'NULL') AS colC,
regexp_replace(colD, '^$', 'NULL') AS colD,
regexp_replace(colE, '^$', 'NULL') AS colE,
regexp_replace(colF, '^$', 'NULL') AS colF
FROM tbl_myitems_deduped
WHERE LENGTH(colA) > 0;

# ...other preprocessing here...

# insert to DB
hive> INSERT INTO TABLE ddb_tbl_myitems
SELECT * FROM tbl_myitems_noempty;

Note: colA is the partition key.

Best Answer

You can add an extra table property to the CREATE TABLE statement that treats the specified character as a null value.

TBLPROPERTIES('serialization.null.format'='');
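
For example, a sketch of where that property could go on the question's source table (note the question already tried a similar setting via SERDEPROPERTIES, so whether OpenCSVSerde honors it is worth testing):

hive> CREATE EXTERNAL TABLE s3_table_myitems (colA STRING, colB STRING, colC DOUBLE, colD DOUBLE, colE STRING, colF STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://bucketname/dirname/'
TBLPROPERTIES ('skip.header.line.count'='1',
               'serialization.null.format'='');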

Regarding "csv - Importing a CSV file (with empty strings and duplicates) into DynamoDB", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37566932/
