
hadoop - Sqoop import: composite primary key and textual primary key

Stack: HDP-2.3.2.0-2950 installed using Ambari 2.1

The source database schema is on SQL Server and contains several tables whose primary keys are:

  • A single varchar column
  • Composite - two varchar columns, or one varchar + one int column, or
    two int columns. There is also one large table with three columns in
    the PK: one int + two varchar columns

  • According to the Sqoop documentation:
    Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.

    The first question is: what is expected by "manually choose a splitting column" - how do I sacrifice the PK and use just one of its columns, or am I missing some concept?

    The SQL Server table is (only two columns, and they form a composite primary key):
    ChassiNo   varchar(8)    Unchecked
    ECU_Name   nvarchar(15)  Unchecked

    I went ahead with the import; the source table has 7909097 records:
    sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' --username someusname --password somepass --as-textfile --fields-terminated-by '|&|'  --table ChassiECU --num-mappers 8  --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 --verbose

    The worrying warnings and the incorrect mapper input and record counts:
    16/05/13 10:59:04 WARN manager.CatalogQueryManager: The table ChassiECU contains a multi-column primary key. Sqoop will default to the column ChassiNo only for this job.
    16/05/13 10:59:08 WARN db.TextSplitter: Generating splits for a textual index column.
    16/05/13 10:59:08 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
    16/05/13 10:59:08 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
    16/05/13 10:59:38 INFO mapreduce.Job: Counters: 30
    File System Counters
    FILE: Number of bytes read=0
    FILE: Number of bytes written=1168400
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=1128
    HDFS: Number of bytes written=209961941
    HDFS: Number of read operations=32
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=16
    Job Counters
    Launched map tasks=8
    Other local map tasks=8
    Total time spent by all maps in occupied slots (ms)=62785
    Total time spent by all reduces in occupied slots (ms)=0
    Total time spent by all map tasks (ms)=62785
    Total vcore-seconds taken by all map tasks=62785
    Total megabyte-seconds taken by all map tasks=128583680
    Map-Reduce Framework
    Map input records=15818167
    Map output records=15818167
    Input split bytes=1128
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=780
    CPU time spent (ms)=45280
    Physical memory (bytes) snapshot=2219433984
    Virtual memory (bytes) snapshot=20014182400
    Total committed heap usage (bytes)=9394716672
    File Input Format Counters
    Bytes Read=0
    File Output Format Counters
    Bytes Written=209961941
    16/05/13 10:59:38 INFO mapreduce.ImportJobBase: Transferred 200.2353 MB in 32.6994 seconds (6.1235 MB/sec)
    16/05/13 10:59:38 INFO mapreduce.ImportJobBase: Retrieved 15818167 records.

    Created the table:
    CREATE EXTERNAL TABLE IF NOT EXISTS ChassiECU(`ChassiNo` varchar(8),
    `ECU_Name` varchar(15)) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/dataload/tohdfs/reio/odpdw/may2016/ChassiECU';

    Awful result (but no errors) - the problem: 15818167 vs 7909097 (SQL Server) records:
     > select count(1) from ChassiECU;
    Query ID = hive_20160513110313_8e294d83-78aa-4e52-b90f-b5640268b8ac
    Total jobs = 1
    Launching Job 1 out of 1
    Tez session was closed. Reopening...
    Session re-established.
    Status: Running (Executing on YARN cluster with App id application_1446726117927_0059)
    --------------------------------------------------------------------------------
    VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
    --------------------------------------------------------------------------------
    Map 1 .......... SUCCEEDED 14 14 0 0 0 0
    Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
    --------------------------------------------------------------------------------
    VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 6.12 s
    --------------------------------------------------------------------------------
    OK
    _c0
    15818167

    Surprisingly, if the composite key includes an int column (used for splitting), I get record counts that are either exact or off by fewer than 10, but I am still concerned about even those!

    How should I proceed?

    Best answer

    Specify the split column manually. The split column is not necessarily equal to the PK. You can have a composite PK and some int split column. You can specify any integer column, or even a simple function (a simple function such as substring or cast, not an aggregate or analytic one). Ideally, the split column should be an evenly distributed integer.
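    To make "specify the split column manually" concrete, here is a minimal sketch of the command line; SomeIntCol is a hypothetical, evenly distributed integer column (the connection details are copied from the question), and the split column is simply passed with --split-by - it does not have to be part of the PK:

    # Sketch only: SomeIntCol is a hypothetical integer column used purely for splitting.
    sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
        --username someusname --password somepass \
        --table ChassiECU \
        --split-by SomeIntCol \
        --num-mappers 8 --as-textfile \
        --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016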
    For example, if your split column contains a few rows with the value -1 and 10M rows with values 10000 - 10000000, and num-mappers=8, then sqoop will split the dataset unevenly across the mappers:

  • the 1st mapper will get the few rows with -1,
  • the 2nd-7th mappers will get 0 rows,
  • the 8th mapper will get almost 10M rows,

  • This will result in data skew, and the 8th mapper will run forever or
    even fail. I have also had duplicates when using a non-integer split
    column with MS-SQL. So, use an integer split column. In your case, with
    a table of only two varchar columns, you can
    (1) add a surrogate int PK and use it as the split column as well, or
    (2) split the data manually with a custom query plus a WHERE clause and
    run sqoop several times with num-mappers=1, or
    (3) apply some deterministic, integer, non-aggregate function to your
    varchar column, for example cast(substr(...) as int) or
    second(timestamp_col), datepart(second, date), etc., as the split column
    (sketches of (2) and (3) follow after this list).
    For Teradata you can use the AMP number: HASHAMP(HASHBUCKET(HASHROW(string_column_list)))
    to get an integer AMP number from a list of non-integer key columns and
    rely on the TD distribution across AMPs. I used simple functions directly
    as the split column, without adding them to the query as a derived column.
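    Sketches of options (2) and (3) for this particular ChassiECU table follow. The WHERE ranges and the SUBSTRING positions are hypothetical assumptions (they presume ChassiNo can be partitioned alphabetically and that its leading characters are digits), so adapt them to the real data:

    # Option (2), sketch: partition manually with a WHERE clause and run several
    # single-mapper imports into separate target directories.
    sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
        --username someusname --password somepass \
        --query "SELECT ChassiNo, ECU_Name FROM ChassiECU WHERE ChassiNo < 'M' AND \$CONDITIONS" \
        --num-mappers 1 \
        --target-dir /dataload/tohdfs/reio/odpdw/may2016/ChassiECU_part1
    # ...then repeat with WHERE ChassiNo >= 'M' (and further ranges) into other target dirs.

    # Option (3), sketch: a deterministic integer expression over the varchar key
    # as the split column; per the answer a simple function can be used directly
    # in --split-by, though support for expressions may depend on the Sqoop version
    # and connector, and this assumes characters 1-4 of ChassiNo are digits.
    sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
        --username someusname --password somepass \
        --table ChassiECU \
        --split-by "CAST(SUBSTRING(ChassiNo, 1, 4) AS INT)" \
        --num-mappers 8 \
        --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016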

    Regarding hadoop - Sqoop import: composite primary key and textual primary key, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58721468/
