
java - Unable to read JSON file: Spark Structured Streaming using Java

Reposted. Author: 行者123. Updated: 2023-11-30 05:39:22

I have a Python script that fetches stock data from the NYSE every minute and writes it to a new single-line file each time. Each file contains data for four stocks - MSFT, ADBE, GOOGL, and FB - in the following JSON format:

[{"symbol": "MSFT", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "126.0800", "high": "126.1000", "low": "126.0500", "close": "126.0750", "volume": "57081"}}, {"symbol": "ADBE", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "279.2900", "high": "279.3400", "low": "279.2600", "close": "279.3050", "volume": "12711"}}, {"symbol": "GOOGL", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "1166.4100", "high": "1166.7400", "low": "1166.2900", "close": "1166.7400", "volume": "8803"}}, {"symbol": "FB", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "192.4200", "high": "192.5000", "low": "192.3600", "close": "192.4800", "volume": "33490"}}]

I am trying to read this stream of files into a Spark Structured Streaming DataFrame, but I cannot define the correct schema for it. After searching online, this is what I have so far:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class Driver1 {

    public static void main(String args[]) throws InterruptedException, StreamingQueryException {

        SparkSession session = SparkSession.builder()
                .appName("Spark_Streaming")
                .master("local[2]")
                .getOrCreate();
        Logger.getLogger("org").setLevel(Level.ERROR);

        StructType priceData = new StructType()
                .add("open", DataTypes.DoubleType)
                .add("high", DataTypes.DoubleType)
                .add("low", DataTypes.DoubleType)
                .add("close", DataTypes.DoubleType)
                .add("volume", DataTypes.LongType);

        StructType schema = new StructType()
                .add("symbol", DataTypes.StringType)
                .add("timestamp", DataTypes.StringType)
                .add("stock", priceData);

        Dataset<Row> rawData = session.readStream().format("json").schema(schema).json("/home/abhinavrawat/streamingData/data/*");
        rawData.printSchema();
        rawData.writeStream().format("console").start().awaitTermination();
        session.close();

    }

}

The output I get is this:

root
|-- symbol: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- stock: struct (nullable = true)
| |-- open: double (nullable = true)
| |-- high: double (nullable = true)
| |-- low: double (nullable = true)
| |-- close: double (nullable = true)
| |-- volume: long (nullable = true)

-------------------------------------------
Batch: 0
-------------------------------------------
+------+-------------------+-----+
|symbol| timestamp|stock|
+------+-------------------+-----+
| MSFT|2019-05-02 15:59:00| null|
| ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
| FB|2019-05-02 15:59:00| null|
| MSFT|2019-05-02 15:59:00| null|
| ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
| FB|2019-05-02 15:59:00| null|
| MSFT|2019-05-02 15:59:00| null|
| ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
| FB|2019-05-02 15:59:00| null|
| MSFT|2019-05-02 15:59:00| null|
| ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
| FB|2019-05-02 15:59:00| null|
| MSFT|2019-05-02 15:59:00| null|
| ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
| FB|2019-05-02 15:59:00| null|
+------+-------------------+-----+

I even tried reading the JSON string as a text file first and then applying the schema (the way it is done with Kafka streaming)...

Dataset<Row> rawData = session.readStream().format("text").load("/home/abhinavrawat/streamingData/data/*");
Dataset<Row> raw2 = rawData.select(org.apache.spark.sql.functions.from_json(rawData.col("value"), schema));
raw2.writeStream().format("console").start().awaitTermination();

This gives the following output; in this case the rawData DataFrame holds the JSON data as plain strings:

+--------------------+
|jsontostructs(value)|
+--------------------+
| null|
| null|
| null|
| null|
| null|

Please help me figure this out.

Best Answer

Just figured it out. Keep the following two things in mind:

  1. When defining the schema, make sure the field names are exactly the same as in the JSON file (here, the nested struct must be named priceData, not stock).

  2. Start with StringType for every field; you can then apply transformations to cast them back to the specific data types you need.

This is what worked for me:

StructType priceData = new StructType()
        .add("open", DataTypes.StringType)
        .add("high", DataTypes.StringType)
        .add("low", DataTypes.StringType)
        .add("close", DataTypes.StringType)
        .add("volume", DataTypes.StringType);

StructType schema = new StructType()
        .add("symbol", DataTypes.StringType)
        .add("timestamp", DataTypes.StringType)
        .add("priceData", priceData);

Dataset<Row> rawData = session.readStream().format("json").schema(schema).json("/home/abhinavrawat/streamingData/data/*");
rawData.writeStream().format("console").start().awaitTermination();
session.close();

See the output:

+------+-------------------+--------------------+
|symbol| timestamp| priceData|
+------+-------------------+--------------------+
| MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
| ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
| FB|2019-05-02 15:59:00|[192.4200, 192.50...|
| MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
| ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
| FB|2019-05-02 15:59:00|[192.4200, 192.50...|
| MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
| ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
| FB|2019-05-02 15:59:00|[192.4200, 192.50...|
| MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
| ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
| FB|2019-05-02 15:59:00|[192.4200, 192.50...|
| MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
| ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
| FB|2019-05-02 15:59:00|[192.4200, 192.50...|
+------+-------------------+--------------------+

You can now flatten the priceData column using priceData.open, priceData.close, etc.
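A minimal sketch of that flattening step, continuing from the answer's code (it assumes the rawData Dataset read with the all-String schema above; the output column aliases are illustrative):

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;

// Pull each nested field out of the priceData struct with dotted column
// paths, and cast the String columns back to numeric types as per tip 2.
Dataset<Row> flattened = rawData.select(
        col("symbol"),
        col("timestamp").cast(DataTypes.TimestampType).alias("timestamp"),
        col("priceData.open").cast(DataTypes.DoubleType).alias("open"),
        col("priceData.high").cast(DataTypes.DoubleType).alias("high"),
        col("priceData.low").cast(DataTypes.DoubleType).alias("low"),
        col("priceData.close").cast(DataTypes.DoubleType).alias("close"),
        col("priceData.volume").cast(DataTypes.LongType).alias("volume"));

flattened.printSchema();
```

With this, flattened is a flat five-column numeric DataFrame that can be aggregated or written out like any other streaming Dataset.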

Regarding "java - Unable to read JSON file: Spark Structured Streaming using java", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/55979976/
