
apache-spark - Spark SQL: transferring data between Cassandra tables


Please find the Cassandra table below.

I am trying to copy data from one Cassandra table to another Cassandra table with the same structure.

Please help me.

CREATE TABLE data2 (
    d_no text,
    d_type text,
    sn_perc int,
    tse_dt timestamp,
    f_lvl text,
    ign_f boolean,
    lk_loc text,
    lk_ts timestamp,
    mi_rem text,
    nr_fst text,
    perm_stat text,
    rec_crt_dt timestamp,
    sr_stat text,
    solr_query text,
    tp_dat text,
    tp_ts timestamp,
    tr_rem text,
    tr_type text,
    PRIMARY KEY (d_no, d_type)
) WITH CLUSTERING ORDER BY (d_type ASC);
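
Since the copy requires both tables to have the same structure, a quick sanity check can confirm that the source and target expose identical columns. A minimal sketch, assuming the target table data3 already exists in keyspace hr and that the Cassandra connection settings are configured on the Spark session:

// Sketch: verify that data2 and data3 expose the same columns before copying.
// Assumes data3 already exists in keyspace hr, and that connection settings
// come from the Spark session configuration.
val srcCols = spark.read.format("org.apache.spark.sql.cassandra")
  .option("keyspace", "hr").option("table", "data2").load().columns.toSet
val dstCols = spark.read.format("org.apache.spark.sql.cassandra")
  .option("keyspace", "hr").option("table", "data3").load().columns.toSet
require(srcCols == dstCols,
  s"column mismatch: ${(srcCols diff dstCols) ++ (dstCols diff srcCols)}")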

Data is inserted with:

Insert into data2(all column names) values('64FCFCFC','HUM',4,'1970-01-02 05:30:00','NA',true,'NA','1970-01-02 05:40:00','NA','NA','NA','1970-02-01 05:30:00','NA','NA','NA','1970-02-03 05:30:00','NA','NA');

Note: for the 4th column (a timestamp), when I insert a value such as '1970-01-02 05:30:00', it also shows up correctly in the DataFrame; but after writing from the DataFrame to Cassandra, select * from table shows it stored as 1970-01-02 00:00:00.000000+0000.

The same thing happens for every timestamp column.
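
The 05:30 shift is consistent with the timestamp string being parsed in the JVM's local time zone (IST is UTC+05:30) and Cassandra then storing and displaying the value in UTC, which it always does. A minimal sketch, assuming you want the strings interpreted as UTC end to end (spark.sql.session.timeZone is a standard Spark SQL setting, available since Spark 2.2):

// Sketch: pin Spark SQL's session time zone to UTC so timestamp strings are
// not parsed in the driver/executor JVM's local zone (assumed here to be IST).
spark.conf.set("spark.sql.session.timeZone", "UTC")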

pom.xml

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.3.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector -->
    <dependency>
        <groupId>com.datastax.spark</groupId>
        <artifactId>spark-cassandra-connector_2.11</artifactId>
        <version>2.3.1</version>
    </dependency>
</dependencies>

I want to read these values and write them to another Cassandra table using Spark with Scala. See the code below:

val df2 = spark.read
  .format("org.apache.spark.sql.cassandra")
  .option("spark.cassandra.connection.host", "hostname")
  .option("spark.cassandra.connection.port", "9042")
  .option("spark.cassandra.auth.username", "usr")
  .option("spark.cassandra.auth.password", "pas")
  .option("keyspace", "hr")
  .option("table", "data2")
  .load()

val df3 = df2 // ... some processing on df2 (elided in the question)

df3.write
  .format("org.apache.spark.sql.cassandra")
  .mode("append")
  .option("spark.cassandra.connection.host", "hostname")
  .option("spark.cassandra.connection.port", "9042")
  .option("spark.cassandra.auth.username", "usr")
  .option("spark.cassandra.auth.password", "pas")
  .option("spark.cassandra.output.ignoreNulls", "true")
  .option("confirm.truncate", "true")
  .option("keyspace", "hr")
  .option("table", "data3")
  .save()

But when I try to insert data with the code above, I get the following error:

java.lang.IllegalArgumentException: requirement failed: Invalid row size: 18 instead of 17.
at scala.Predef$.require(Predef.scala:224)
at com.datastax.spark.connector.writer.SqlRowWriter.readColumnValues(SqlRowWriter.scala:23)
at com.datastax.spark.connector.writer.SqlRowWriter.readColumnValues(SqlRowWriter.scala:12)
at com.datastax.spark.connector.writer.BoundStatementBuilder.bind(BoundStatementBuilder.scala:99)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:106)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:31)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.foreach(GroupingBatchBuilder.scala:31)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:233)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:210)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:112)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:145)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111)
at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:210)
at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:197)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:183)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Best Answer

This is a known issue (SPARKC-541): you are copying data from a table with DSE Search enabled into a table that does not have it. The source DataFrame therefore carries the extra solr_query pseudo-column (18 columns) while the target table has only 17, which is exactly what the "Invalid row size: 18 instead of 17" error reports. You just need to drop this column during the transformation:

val df3 = df2.drop("solr_query") // ... your transformations

Alternatively, you can simply use a newer driver (2.3.1 if you are on the OSS driver) or the corresponding DSE release that includes this fix.
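
Putting it together, a minimal sketch of the fixed write path, reusing df2 from the question's code (connection options omitted for brevity; they are the same as in the read above):

df2
  .drop("solr_query") // drop the DSE Search pseudo-column the target lacks
  .write
  .format("org.apache.spark.sql.cassandra")
  .mode("append")
  .option("keyspace", "hr")
  .option("table", "data3")
  .save()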

Regarding apache-spark - Spark SQL: transferring data between Cassandra tables, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51929246/
