
scala - How to programmatically create a custom org.apache.spark.sql.types.StructType schema object starting from a JSON file


I have to create a custom org.apache.spark.sql.types.StructType schema object using the information in a JSON file. The JSON file can contain anything, so I have parameterized its path in a properties file.

This is what the properties file looks like:

//path to the schema for the output file (by default the schema is inferred from the target Parquet). If present, the schema must be in JSON format, applicable to a DataFrame (see StructType.fromJson)
schema.parquet=/Users/XXXX/Desktop/generated_schema.json
writing.mode=overwrite
separator=;
header=false
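
As a side note, java.util.Properties only treats lines starting with # or ! as comments, so the // line above is actually parsed as a key. A minimal sketch of loading this file from Scala (the file path is taken from the job log shown later, and the variable names match the snippets below; both are assumptions about the surrounding program):

import java.io.FileInputStream
import java.util.Properties

// Load the parameterization file and pull out the four settings.
val props = new Properties()
val propsIn = new FileInputStream("/Users/aisidoro/Desktop/mra-csv-converter/parametrizacion.properties")
try props.load(propsIn) finally propsIn.close()

val mra_schema_parquet = props.getProperty("schema.parquet")
val writing_mode = props.getProperty("writing.mode", "overwrite")
val separator = props.getProperty("separator", ";")
val header = props.getProperty("header", "false")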

The file generated_schema.json looks like this:

{"type" : "struct","fields" : [ {"name" : "codigo","type" : "string","nullable" : true}, {"name":"otro", "type":"string", "nullable":true}, {"name":"vacio", "type":"string", "nullable":true},{"name":"final","type":"string","nullable":true} ]}

So, this is how I thought I could solve it:

import org.apache.hadoop.fs.{FSDataInputStream, Path}
import org.apache.spark.sql.types.{DataType, StructType}

val path: Path = new Path(mra_schema_parquet)
val fileSystem = path.getFileSystem(sc.hadoopConfiguration)
val inputStream: FSDataInputStream = fileSystem.open(path)
val schema_json = Stream.cons(inputStream.readLine(), Stream.continually( inputStream.readLine))

System.out.println("schema_json looks like " + schema_json.head)

val mySchemaStructType :DataType = DataType.fromJson(schema_json.head)

/*
After this line, mySchemaStructType has four StructField objects inside it, the same ones that appear in schema_json.
*/
logger.info(mySchemaStructType)

val myStructType = new StructType()
myStructType.add("mySchemaStructType",mySchemaStructType)

/*

After this line, myStructType has zero StructFields! The bug must be here: myStructType should contain the four StructFields that represent the loaded schema JSON! But how can I construct the necessary StructType object?

*/

myDF = loadCSV(sqlContext, path_input_csv, separator, myStructType, header)
System.out.println("myDF.schema.json looks like " + myDF.schema.json)
inputStream.close()

myDF.write
.format("com.databricks.spark.csv")
.option("header", header)
.option("delimiter",delimiter)
.option("nullValue","")
.option("treatEmptyValuesAsNulls","true")
.mode(saveMode)
.parquet(pathParquet)

When the code runs the last line, .parquet(pathParquet), this exception is thrown:

**parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message root {
}**

The output of this code looks like this:

16/11/11 13:57:04 INFO AnotherCSVtoParquet$: The job started using this propertie file: /Users/aisidoro/Desktop/mra-csv-converter/parametrizacion.properties
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: path_input_csv is /Users/aisidoro/Desktop/mra-csv-converter/cds_glcs.csv
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: path_output_parquet is /Users/aisidoro/Desktop/output900000
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: mra_schema_parquet is /Users/aisidoro/Desktop/mra-csv-converter/generated_schema.json
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: writting_mode is overwrite
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: separator is ;
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: header is false
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: ATTENTION! aplying mra_schema_parquet /Users/aisidoro/Desktop/mra-csv-converter/generated_schema.json
schema_json looks like {"type" : "struct","fields" : [ {"name" : "codigo","type" : "string","nullable" : true}, {"name":"otro", "type":"string", "nullable":true}, {"name":"vacio", "type":"string", "nullable":true},{"name":"final","type":"string","nullable":true} ]}
16/11/11 13:57:12 INFO AnotherCSVtoParquet$: StructType(StructField(codigo,StringType,true), StructField(otro,StringType,true), StructField(vacio,StringType,true), StructField(final,StringType,true))
16/11/11 13:57:13 INFO AnotherCSVtoParquet$: loadCSV. header is false, inferSchema is false pathCSV is /Users/aisidoro/Desktop/mra-csv-converter/cds_glcs.csv separator is ;
myDF.schema.json looks like {"type":"struct","fields":[]}

The schema_json object and the myDF.schema.json object should have the same content, shouldn't they? But that is not what happens, and I think this is what ends up raising the error.
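
The mismatch comes from StructType being immutable: add() returns a new StructType instead of mutating the receiver, so the result of myStructType.add(...) above is silently discarded and myStructType stays empty. A minimal sketch of the two working alternatives, assuming schema_json.head holds the JSON string shown above:

import org.apache.spark.sql.types.{DataType, StringType, StructType}

// DataType.fromJson already returns the complete StructType; a cast is enough.
val mySchema = DataType.fromJson(schema_json.head).asInstanceOf[StructType]

// Or, when building by hand, keep the value that add() returns instead of dropping it.
val byHand = new StructType()
  .add("codigo", StringType, nullable = true)
  .add("otro", StringType, nullable = true)
  .add("vacio", StringType, nullable = true)
  .add("final", StringType, nullable = true)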

Finally, the job crashes with this exception:

**parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message root {
}**

The thing is, if I don't supply any JSON schema file the job runs fine, but with this schema...

Can someone help me? I just want to create some Parquet files starting from a CSV file and a JSON schema file.

Thank you.

The dependencies are:

<spark.version>1.5.0-cdh5.5.2</spark.version>
<databricks.version>1.5.0</databricks.version>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>${spark.version}</version>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.10</artifactId>
    <version>${databricks.version}</version>
</dependency>

UPDATE

I can see there is an open issue about this:

https://github.com/databricks/spark-csv/issues/61

Best Answer

Since you said custom schema, you can do something like this:

import org.apache.spark.sql.types.{StringType, StructType}

val schema = (new StructType).add("field1", StringType).add("field2", StringType)
sqlContext.read.schema(schema).json("/json/file/path").show
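
For a quick sanity check, the resulting schema can be printed as a tree:

schema.printTreeString()
// root
//  |-- field1: string (nullable = true)
//  |-- field2: string (nullable = true)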

Also, take a look at this and this.

You can also create nested JSON schemas, like the one below.

For example:

{
  "field1": {
    "field2": {
      "field3": "create",
      "field4": 1452121277
    }
  }
}

import org.apache.spark.sql.types.{LongType, StringType, StructType}

val schema = (new StructType)
  .add("field1", (new StructType)
    .add("field2", (new StructType)
      .add("field3", StringType)
      .add("field4", LongType)))
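
Coming back to the original problem, here is a minimal end-to-end sketch under the question's setup (mra_schema_parquet, path_input_csv, path_output_parquet, sc and sqlContext are the names taken from the question and its log): read the schema JSON, cast it to StructType, apply it while reading the CSV with spark-csv, and write Parquet:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.types.{DataType, StructType}

// Read the whole schema file, not just its first line.
val schemaPath = new Path(mra_schema_parquet)
val fs = schemaPath.getFileSystem(sc.hadoopConfiguration)
val schemaIn = fs.open(schemaPath)
val schemaJson = try scala.io.Source.fromInputStream(schemaIn).mkString finally schemaIn.close()

// DataType.fromJson yields the complete StructType directly.
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", ";")
  .schema(schema)
  .load(path_input_csv)

df.write.mode("overwrite").parquet(path_output_parquet)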

Original question on Stack Overflow: https://stackoverflow.com/questions/40526208/
