
json - java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.Text. JSON SerDe error

Reposted. Author: 可可西里. Updated: 2023-11-01 16:39:11

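I am new to handling JSON data in Hive. I am developing a Spark application that takes JSON data and stores it in a Hive table. I have a JSON like this:

[image: Json of Jsons]

When expanded, it looks like this:

[image: hierarchy]

I am able to read the JSON into a DataFrame and save it to a location on HDFS. Getting Hive to read the data is the hard part.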

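For example, after searching online, I tried the following:

Use STRUCT for all the JSON fields, then access the elements with column.element.

For example:

web_app_security would be the name of a column (of type STRUCT) in the table, and the JSON objects nested inside it, such as config_web_cms_authentication and web_threat_intel_alert_external, would also be structs (with rating and rating_numeric as fields).

I tried to create the table with the JSON SerDe. Here is my table definition: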

CREATE EXTERNAL TABLE jsons (
web_app_security struct<config_web_cms_authentication: struct<rating: string, rating_numeric: float>, web_threat_intel_alert_external: struct<rating: string, rating_numeric: float>, web_http_security_headers: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>,
dns_security struct<domain_hijacking_protection: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float, dns_hosting_providers: struct<rating:string, rating_numeric: float>>,
email_security struct<rating: string, email_encryption_enabled: struct<rating: string, rating_numeric: float>, rating_numeric: float, email_hosting_providers: struct<rating: string, rating_numeric: float>, email_authentication: struct<rating: string, rating_numeric: float>>,
threat_intell struct<rating: string, threat_intel_alert_internal_3: struct<rating: string, rating_numeric: float>, threat_intel_alert_internal_1: struct<rating: string, rating_numeric: float>, rating_numeric: float, threat_intel_alert_internal_12: struct<rating: string, rating_numeric: float>, threat_intel_alert_internal_6: struct<rating: string, rating_numeric: float>>,
data_loss struct<data_loss_6: struct<rating: string, rating_numeric: float>, rating: string, data_loss_36plus: struct<rating: string, rating_numeric: float>, rating_numeric: float, data_loss_36: struct<rating: string, rating_numeric: float>, data_loss_12: struct<rating: string, rating_numeric: float>, data_loss_24: struct<rating: string, rating_numeric: float>>,
system_hosting struct<host_hosting_providers: struct<rating: string, rating_numeric: float>, hosting_countries: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>,
defensibility struct<attack_surface_web_ip: struct<rating: string, rating_numeric: float>, shared_hosting: struct<rating: string, rating_numeric: float>, defensibility_hosting_providers: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float, attack_surface_web_hostname: struct<rating: string, rating_numeric: float>>,
software_patching struct<patching_web_cms: struct<rating: string, rating_numeric: float>, rating: string, patching_web_server: struct<rating: string, rating_numeric: float>, patching_vuln_open_ssl: struct<rating: string, rating_numeric: float>, patching_app_server: struct<rating: string, rating_numeric: float>, rating_numeric: float>,
governance struct<governance_customer_base: struct<rating: string, rating_numeric: float>, governance_security_certifications: struct<rating: string, rating_numeric: float>, governance_regulatory_requirements: struct<rating: string, rating_numeric: float>, rating: string, rating_numeric: float>
)ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS orc
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis'

I tried to parse the rows with the JSON SerDe. After saving some data to the table, I get the following error when I try to query it:

Error: java.io.IOException: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.Text (state=,code=0)

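I am not sure whether I am doing this the right way.

I am also open to any other way of storing the data in the table. Any help would be appreciated. Thanks.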

Best answer

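That is because you are mixing ORC as the storage format (STORED AS orc) with JSON as the SerDe (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'). This overrides ORC's default SerDe, OrcSerde, but not its input (OrcInputFormat) and output (OrcOutputFormat) formats. As a result, OrcInputFormat hands OrcStruct rows to JsonSerDe, which expects Text, hence the ClassCastException.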
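One way to see the mismatch is to inspect the storage metadata Hive recorded for the table. The statement below is a small sketch against the jsons table from the question. Based on the explanation above, the output would be expected to list JsonSerDe as the SerDe library alongside OrcInputFormat and OrcOutputFormat, although the exact layout of the output depends on the Hive version.

```sql
-- Show the SerDe library, InputFormat and OutputFormat stored for the table.
DESCRIBE FORMATTED jsons;
```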

You either need to use ORC storage without overriding its default SerDe. In that case, make sure your Spark application writes ORC to the table, not JSON.
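The ORC option could look roughly like the sketch below. The table name jsons_orc and the location are hypothetical, and only the system_hosting column from the original definition is repeated; the remaining columns would follow the same pattern. The important part is that there is no ROW FORMAT SERDE clause, so STORED AS ORC supplies OrcSerde together with the ORC input and output formats.

```sql
-- Hypothetical table name and location, chosen for illustration only.
-- No ROW FORMAT SERDE clause: ORC keeps its default SerDe.
CREATE EXTERNAL TABLE jsons_orc (
  system_hosting struct<
    host_hosting_providers: struct<rating: string, rating_numeric: float>,
    hosting_countries: struct<rating: string, rating_numeric: float>,
    rating: string,
    rating_numeric: float>
)
STORED AS ORC
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis_orc';
```

With this layout the Spark job has to write ORC files to that location, and the struct field types in the table should match the types Spark actually writes.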

Or, if you want the data stored as JSON, use JsonSerDe with plain text files as the storage (STORED AS TEXTFILE).
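The JSON option could be sketched as follows. The table name jsons_text and the location are again hypothetical, the openx JsonSerDe jar is assumed to be available to Hive (for example through ADD JAR), and only one column from the original definition is repeated.

```sql
-- Hypothetical table name and location, chosen for illustration only.
-- JsonSerDe is paired with text storage, so it receives Text rows.
CREATE EXTERNAL TABLE jsons_text (
  system_hosting struct<
    host_hosting_providers: struct<rating: string, rating_numeric: float>,
    hosting_countries: struct<rating: string, rating_numeric: float>,
    rating: string,
    rating_numeric: float>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'hdfs://nameservice1/data/gis/final/rr_current_analysis_json';

-- Nested fields are read with the column.element notation from the question.
SELECT system_hosting.hosting_countries.rating FROM jsons_text LIMIT 10;
```

For this to work, each line of the files should hold one complete JSON object, which is the layout Spark's JSON writer produces.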


The Hive Developer Guide explains how SerDes and storage work: https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HiveSerDe

Regarding "json - java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.Text. JSON SerDe error", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45123464/
