
hadoop - Sqoop import: composite primary key and textual primary key

Stack: HDP-2.3.2.0-2950 installed using Ambari 2.1

The source database schema is on SQL Server and contains several tables whose primary keys are:

  • A single varchar column
  • Composite - two varchar columns, or one varchar + one int column, or
    two int columns. There is also one large table with three columns in
    the PK: one int + two varchar columns

  • According to the Sqoop documentation:
    Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.

    The first question is: what is expected by "manually choose a splitting column" - how do I sacrifice the PK and use just one of its columns, or am I missing some concept?

    The SQL Server table is (only two columns, and they form a composite primary key):
    ChassiNo   varchar(8)    Unchecked
    ECU_Name   nvarchar(15)  Unchecked

    I went ahead with the import; the source table has 7909097 records:
    sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' --username someusname --password somepass --as-textfile --fields-terminated-by '|&|'  --table ChassiECU --num-mappers 8  --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 --verbose

    The worrying warnings and the incorrect mapper input and record counts:
    16/05/13 10:59:04 WARN manager.CatalogQueryManager: The table ChassiECU contains a multi-column primary key. Sqoop will default to the column ChassiNo only for this job.
    16/05/13 10:59:08 WARN db.TextSplitter: Generating splits for a textual index column.
    16/05/13 10:59:08 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
    16/05/13 10:59:08 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
    16/05/13 10:59:38 INFO mapreduce.Job: Counters: 30
    File System Counters
    FILE: Number of bytes read=0
    FILE: Number of bytes written=1168400
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=1128
    HDFS: Number of bytes written=209961941
    HDFS: Number of read operations=32
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=16
    Job Counters
    Launched map tasks=8
    Other local map tasks=8
    Total time spent by all maps in occupied slots (ms)=62785
    Total time spent by all reduces in occupied slots (ms)=0
    Total time spent by all map tasks (ms)=62785
    Total vcore-seconds taken by all map tasks=62785
    Total megabyte-seconds taken by all map tasks=128583680
    Map-Reduce Framework
    Map input records=15818167
    Map output records=15818167
    Input split bytes=1128
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=780
    CPU time spent (ms)=45280
    Physical memory (bytes) snapshot=2219433984
    Virtual memory (bytes) snapshot=20014182400
    Total committed heap usage (bytes)=9394716672
    File Input Format Counters
    Bytes Read=0
    File Output Format Counters
    Bytes Written=209961941
    16/05/13 10:59:38 INFO mapreduce.ImportJobBase: Transferred 200.2353 MB in 32.6994 seconds (6.1235 MB/sec)
    16/05/13 10:59:38 INFO mapreduce.ImportJobBase: Retrieved 15818167 records.

    Created the table:
    CREATE EXTERNAL TABLE IF NOT EXISTS ChassiECU(`ChassiNo` varchar(8),
    `ECU_Name` varchar(15)) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/dataload/tohdfs/reio/odpdw/may2016/ChassiECU';

    Awful result (but no errors) - the problem: 15818167 vs 7909097 (SQL Server) records:
     > select count(1) from ChassiECU;
    Query ID = hive_20160513110313_8e294d83-78aa-4e52-b90f-b5640268b8ac
    Total jobs = 1
    Launching Job 1 out of 1
    Tez session was closed. Reopening...
    Session re-established.
    Status: Running (Executing on YARN cluster with App id application_1446726117927_0059)
    --------------------------------------------------------------------------------
    VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
    --------------------------------------------------------------------------------
    Map 1 .......... SUCCEEDED 14 14 0 0 0 0
    Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
    --------------------------------------------------------------------------------
    VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 6.12 s
    --------------------------------------------------------------------------------
    OK
    _c0
    15818167

    Surprisingly, if the composite key includes an int column (used for splitting), I get record counts that are either exact or off by fewer than 10, but I am still concerned about even those!

    How should I proceed?

    Best answer

    Specify the split column manually. The split column is not necessarily equal to the PK. You can have a composite PK and some int split column. You can specify any integer column, or even a simple function (a simple function such as substring or cast, not an aggregate or analytic one). Ideally, the split column should be an evenly distributed integer.
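    To make "specify the split column manually" concrete, here is a minimal sketch of the command line; SomeIntCol is a hypothetical, evenly distributed integer column (the connection details are copied from the question), and the split column is simply passed with --split-by - it does not have to be part of the PK:

    # Sketch only: SomeIntCol is a hypothetical integer column used purely for splitting.
    sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
        --username someusname --password somepass \
        --table ChassiECU \
        --split-by SomeIntCol \
        --num-mappers 8 --as-textfile \
        --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016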
    For example, if your split column contains a few rows with the value -1 and 10M rows with values 10000 - 10000000, and num-mappers=8, then sqoop will split the dataset unevenly across the mappers:

  • the 1st mapper will get the few rows with -1,
  • the 2nd-7th mappers will get 0 rows,
  • the 8th mapper will get almost 10M rows,

  • This will result in data skew, and the 8th mapper will run forever or
    even fail. I have also had duplicates when using a non-integer split
    column with MS-SQL. So, use an integer split column. In your case, with
    a table of only two varchar columns, you can
    (1) add a surrogate int PK and use it as the split column as well, or
    (2) split the data manually with a custom query plus a WHERE clause and
    run sqoop several times with num-mappers=1, or
    (3) apply some deterministic, integer, non-aggregate function to your
    varchar column, for example cast(substr(...) as int) or
    second(timestamp_col), datepart(second, date), etc., as the split column
    (sketches of (2) and (3) follow after this list).
    For Teradata you can use the AMP number: HASHAMP(HASHBUCKET(HASHROW(string_column_list)))
    to get an integer AMP number from a list of non-integer key columns and
    rely on the TD distribution across AMPs. I used simple functions directly
    as the split column, without adding them to the query as a derived column.
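    Sketches of options (2) and (3) for this particular ChassiECU table follow. The WHERE ranges and the SUBSTRING positions are hypothetical assumptions (they presume ChassiNo can be partitioned alphabetically and that its leading characters are digits), so adapt them to the real data:

    # Option (2), sketch: partition manually with a WHERE clause and run several
    # single-mapper imports into separate target directories.
    sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
        --username someusname --password somepass \
        --query "SELECT ChassiNo, ECU_Name FROM ChassiECU WHERE ChassiNo < 'M' AND \$CONDITIONS" \
        --num-mappers 1 \
        --target-dir /dataload/tohdfs/reio/odpdw/may2016/ChassiECU_part1
    # ...then repeat with WHERE ChassiNo >= 'M' (and further ranges) into other target dirs.

    # Option (3), sketch: a deterministic integer expression over the varchar key
    # as the split column; per the answer a simple function can be used directly
    # in --split-by, though support for expressions may depend on the Sqoop version
    # and connector, and this assumes characters 1-4 of ChassiNo are digits.
    sqoop import --connect 'jdbc:sqlserver://somedbserver;database=somedb' \
        --username someusname --password somepass \
        --table ChassiECU \
        --split-by "CAST(SUBSTRING(ChassiNo, 1, 4) AS INT)" \
        --num-mappers 8 \
        --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016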

    Regarding hadoop - Sqoop import: composite primary key and textual primary key, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58721468/
