
r - Can't create a dplyr src backed by SparkSQL in the dplyr.spark.hive package


Recently I found the great dplyr.spark.hive package, which enables dplyr frontend operations with a spark or hive backend.

The package's README has information on how to install it:

options(repos = c("http://r.piccolboni.info", unlist(options("repos"))))
install.packages("dplyr.spark.hive")

There are also many examples of how to use dplyr.spark.hive once one is already connected to a hiveServer - check this.

But I am not able to connect to a hiveServer, so I cannot benefit from the great power of this package...

I've tried commands like the ones below, with no success. Does anyone have a solution or a comment about what I am doing wrong?

> library(dplyr.spark.hive, 
+ lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
>
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
>
> my_db = src_SparkSQL()
Error in .jfindClass(as.character(driverClass)[1]) : class not found
>
> my_db = src_SparkSQL(host = 'jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl',
+ port = 10000)
Error in .jfindClass(as.character(driverClass)[1]) : class not found
>
> my_db = src_SparkSQL(start.server = TRUE)
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
>
> my_db = src_SparkSQL(start.server = TRUE,
+ list(spark.num.executors='5', spark.executor.cores='5', master="yarn-client"))
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1

EDIT 2

I have set additional system variables and paths like this, but now I receive a warning telling me that some kind of Java logging configuration is not specified, even though I believe it is:

> library(dplyr.spark.hive, 
+ lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
3: package ‘SparkR’ was built under R version 3.2.1
>
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HADOOP_HOME="/usr/share/hadoop")
> Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop")
> Sys.setenv(PATH='/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/share/hadoop/bin:/opt/hive/bin')
>
>
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

And my log properties file is not empty:

-bash-4.2$ wc /etc/hadoop/log4j.properties 
179 432 6581 /etc/hadoop/log4j.properties

EDIT 3

My exact call to src_SparkSQL() is:

> detach("package:SparkR", unload=TRUE)
Warning message:
package ‘SparkR’ was built under R version 3.2.1
> detach("package:dplyr", unload=TRUE)
> library(dplyr.spark.hive, lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

And then the process never finishes. The same settings work for beeline with parameters like these:

beeline  -u "jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl" -n mkosinski --outputformat=tsv --incremental=true -f sql_statement.sql > sql_output

But I am not able to pass the user name and dbname to src_SparkSQL(), so I tried to use the code from inside that function manually. I ran into the same problem: the code below also never finishes.

host = 'tools-1.hadoop.srv'
port = 10000
driverclass = "org.apache.hive.jdbc.HiveDriver"
Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
library(RJDBC)
dr = JDBC(driverclass, Sys.getenv("HADOOP_JAR"))
url = paste0("jdbc:hive2://", host, ":", port)
class = "Hive"
con.class = paste0(class, "Connection")  # class = "Hive"
# dbConnect_retry =
#   function(dr, url, retry) {
#     if (retry > 0)
#       tryCatch(
#         dbConnect(drv = dr, url = url),
#         error = function(e) {
#           Sys.sleep(0.1)
#           dbConnect_retry(dr = dr, url = url, retry - 1)
#         })
#     else dbConnect(drv = dr, url = url)
#   }
#################
## con = new(con.class, dbConnect_retry(dr, url, retry = 100))
#################
con = new(con.class, dbConnect(dr, url, user = "mkosinski", dbname = "loghost"))

Maybe the url should also contain /loghost - the dbname?
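For example, something like this untested sketch, where the /loghost suffix and the auth=noSasl flag are copied from the working beeline call above:

# Untested guess: mirror the beeline URL by appending the database name
# and the noSasl auth flag before connecting.
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")
con = new(con.class, dbConnect(dr, url, user = "mkosinski"))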

Best Answer

I see now that you tried multiple things and got multiple errors. Let me comment on them error by error.

my_db = src_SparkSQL()
Error in .jfindClass(as.character(driverClass)[1]) : class not found

The RJDBC object can not be created. Unless we solve this, nothing else will work, workarounds or not. Did you set HADOOP_JAR, for instance with Sys.setenv(HADOOP_JAR = "../spark/assembly/target/scala-2.10/spark-assembly-1.5.0-hadoop2.6.0.jar")? Sorry, I seem to have skipped this in the instructions. Will fix.
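For the Spark build used in the question, that presumably translates to the following (the jar path is the one from EDIT 2 above, not verified here):

# Point RJDBC at the Spark assembly jar, which bundles the Hive JDBC driver.
Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")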

my_db = src_SparkSQL(host = 'jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl',
                     port = 10000)
Error in .jfindClass(as.character(driverClass)[1]) : class not found

Same problem. Note that the host and port arguments do not accept URL syntax, just a host name and a port. The URL is formed internally.
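So the call should look roughly like this sketch, using the host name from the question, once the class-not-found problem is fixed:

# Pass a plain host name and port; src_SparkSQL() builds the JDBC URL itself.
my_db = src_SparkSQL(host = "tools-1.hadoop.srv", port = 10000)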

my_db = src_SparkSQL(start.server = TRUE)
Error in start.server() :
  Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
Stop the thriftserver first, or connect to the existing one instead, but you still have to fix the class-not-found problem.
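If you do want start.server = TRUE to launch a fresh server, something like this should free it up first (a sketch assuming the standard Spark sbin scripts; the process id comes from the error above):

# Stop the thriftserver already running as process 37580, then retry.
system("/opt/spark-1.5.0-bin-hadoop2.4/sbin/stop-thriftserver.sh")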

my_db = src_SparkSQL(start.server = TRUE,
                     list(spark.num.executors='5', spark.executor.cores='5', master="yarn-client"))
Error in start.server() :
  Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1

Same as above.

The plan:

  • Set HADOOP_JAR. Find the host and port where the thriftserver is running, if they are not the defaults. Try src_SparkSQL with start.server = FALSE. If that works, you are done; otherwise go to step 2 (see the sketch after this list).
  • Stop the existing thriftserver. Retry src_SparkSQL with start.server = TRUE.
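A compact sketch of step 1, pulling together the values that appear in the question (the jar path, host, and port are the asker's, untested here):

library(dplyr.spark.hive)
# The assembly jar bundling the Hive JDBC driver (from EDIT 2/3 above).
Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
# Connect to the thriftserver that is already running instead of starting one.
my_db = src_SparkSQL(host = "tools-1.hadoop.srv", port = 10000, start.server = FALSE)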

Let me know how it goes.

Regarding "r - Can't create a dplyr src backed by SparkSQL in the dplyr.spark.hive package", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34267400/
