
java - Read a fixed-length text file with a newline character as one of the attribute values into a JavaRDD

Reposted. Author: 行者123. Updated: 2023-11-30 12:04:05

I have a fixed-width text file with 100-byte records. The structure is shown below. I need to read the data into a JavaRDD.

RecType - String, 1 byte
Date - String, 8 bytes
Productnumber - String, 15 bytes
TAG - String, 11 bytes
Filler1 - String, 1 byte
Contract - String, 11 bytes
Code - String, 3 bytes
Version - String, 3 bytes
newline - String, 1 byte
FILENAME - String, 25 bytes
Recnumber - String, 4 bytes
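
To make the layout above easier to work with, the following is a minimal sketch that turns the listed widths into starting byte offsets. It assumes the fields are contiguous and appear in the order listed; the class name `RecordLayout` is hypothetical. Note that the listed widths add up to 83 bytes rather than 100, so the sketch derives the record length from the list instead of hard-coding it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RecordLayout {
    // Field names and widths in bytes, copied from the layout in the question.
    public static final String[] NAMES = {
        "RecType", "Date", "Productnumber", "TAG", "Filler1", "Contract",
        "Code", "Version", "newline", "FILENAME", "Recnumber"
    };
    public static final int[] WIDTHS = {1, 8, 15, 11, 1, 11, 3, 3, 1, 25, 4};

    /** Returns field name -> starting byte offset, in declaration order. */
    public static Map<String, Integer> offsets() {
        Map<String, Integer> result = new LinkedHashMap<>();
        int offset = 0;
        for (int i = 0; i < NAMES.length; i++) {
            result.put(NAMES[i], offset);
            offset += WIDTHS[i];
        }
        return result;
    }

    /** Total record length implied by the listed widths. */
    public static int recordLength() {
        int total = 0;
        for (int width : WIDTHS) {
            total += width;
        }
        return total;
    }

    public static void main(String[] args) {
        offsets().forEach((name, off) ->
            System.out.println(name + " starts at byte " + off));
        System.out.println("Record length: " + recordLength());
    }
}
```

Under these assumptions the embedded newline sits at byte offset 53 of every record, which is why a line-oriented reader cuts each record in two.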

Sample data from the file:

020190718000000000000001CHATHOLDER SUBCONTRACT1MNV3.0
LOGFILEGENAT07312019050AM00001020190718000000000000001CHATHOLDER SUBCONTRACT1MNV3.0
LOGFILEGENAT07312019050AM00002020190718000000000000001CHATHOLDER SUBCONTRACT1MNV3.0
LOGFILEGENAT07312019050AM00003020190718000000000000002CHATHOLDER SUBCONTRACT1MNV3.0
LOGFILEGENAT07312019051AM00004

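As you can see, each record starts on one line and ends on the next, and the next record begins at the very next byte. The file contains 4 records, each starting with the string 020190718.

How can I read these records into a JavaRDD?

This is what I am trying: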

JavaRDD1 = SparkUtils.getSession().read().textFile(filepath)
    .javaRDD()
    .map(x -> { return FunctiontoParse(x); });

But it processes only one line at a time instead of reading the whole record.

Please help.

Best answer

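You may want to see this post. If everything fits into a string, using wholeTextFiles() will work. If you want the data to stay binary, you need to read it as binary. I used JavaSparkContext.binaryFiles(filepath, numPartitions) instead. This reads each whole file as bytes and lets you parse it however you need.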
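
Once the whole file is available as bytes, the remaining work is cutting it into fixed-length records. The following is a minimal sketch of that step in plain Java, with no Spark dependency. The class name `FixedWidthSplitter` is hypothetical, and rejecting a file whose size is not a multiple of the record length is a choice made for this sketch; a real file with a trailing newline or padding would need different handling.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FixedWidthSplitter {
    /**
     * Cuts data into consecutive chunks of recordLength bytes.
     * Newline bytes are treated as ordinary data, not as separators.
     */
    public static List<byte[]> split(byte[] data, int recordLength) {
        if (recordLength <= 0) {
            throw new IllegalArgumentException("recordLength must be positive");
        }
        if (data.length % recordLength != 0) {
            throw new IllegalArgumentException(
                "data length " + data.length + " is not a multiple of " + recordLength);
        }
        List<byte[]> records = new ArrayList<>();
        for (int start = 0; start < data.length; start += recordLength) {
            records.add(Arrays.copyOfRange(data, start, start + recordLength));
        }
        return records;
    }

    public static void main(String[] args) {
        // Two toy 6-byte records, each with a newline as its fourth byte.
        byte[] file = "AAA\nBBCCC\nDD".getBytes(StandardCharsets.ISO_8859_1);
        for (byte[] rec : split(file, 6)) {
            // Show the embedded newline as a visible escape sequence.
            System.out.println(
                new String(rec, StandardCharsets.ISO_8859_1).replace("\n", "\\n"));
        }
    }
}
```

Because the split is driven purely by byte position, the embedded newline stays inside each record and never acts as a record boundary.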

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.input.PortableDataStream;
import scala.Tuple2;

JavaSparkContext jsc = JavaSparkContext.fromSparkContext(SparkContext.getOrCreate());
// From here, each file becomes one record in the resulting RDD. Each record is a
// (filename, file contents) pair holding the contents of an entire file.
JavaPairRDD<String, PortableDataStream> rawBinaryInputFiles = jsc.binaryFiles(HDFSinputFolder, numPartitions);
// Now use your function to parse each file. Keep in mind that each RDD record holds an entire file,
// so you still need to parse out the individual records. Since they are fixed width in bytes, that should be pretty simple.
// Create a custom wrapper object to hold the values and populate it.

JavaRDD<YourCustomWrapperObject> records = rawBinaryInputFiles.flatMap(new FlatMapFunction<Tuple2<String, PortableDataStream>, YourCustomWrapperObject>() {

    @Override
    public Iterator<YourCustomWrapperObject> call(Tuple2<String, PortableDataStream> t) throws Exception {
        List<YourCustomWrapperObject> results = new ArrayList<YourCustomWrapperObject>();
        byte[] bytes = t._2().toArray(); // convert the PortableDataStream to a byte array
        // The best option here, IMO, is to create a wrapper object, populate it from the byte array, and return it.
        YourCustomWrapperObject obj = new YourCustomWrapperObject();
        // populate....
        results.add(obj);
        return results.iterator(); // call() is declared to return an Iterator, not the List itself
    }
});
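
One way the `populate....` step could look is sketched below in plain Java. The class `FixedWidthRecord` is a hypothetical stand-in for `YourCustomWrapperObject`. It assumes the field widths listed in the question are accurate (83 bytes in total) and that the data is in a single-byte encoding, read here as ISO-8859-1; both are assumptions, not something the original answer states. The record in `main` is synthetic and shaped to fit those widths.

```java
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

public class FixedWidthRecord implements Serializable {
    // Sum of the field widths listed in the question (assumed accurate).
    public static final int LENGTH = 83;

    public final String recType, date, productNumber, tag, contract,
                        code, version, fileName, recNumber;

    /** Reads consecutive fixed-width fields from a byte array. */
    private static final class Cursor {
        private final byte[] bytes;
        private int pos = 0;

        Cursor(byte[] bytes) { this.bytes = bytes; }

        String take(int width) {
            String field = new String(bytes, pos, width, StandardCharsets.ISO_8859_1);
            pos += width;
            return field;
        }
    }

    public FixedWidthRecord(byte[] bytes) {
        if (bytes.length != LENGTH) {
            throw new IllegalArgumentException("expected " + LENGTH + " bytes, got " + bytes.length);
        }
        Cursor c = new Cursor(bytes);
        recType = c.take(1);
        date = c.take(8);
        productNumber = c.take(15);
        tag = c.take(11);
        c.take(1);                 // Filler1, skipped
        contract = c.take(11);
        code = c.take(3);
        version = c.take(3);
        c.take(1);                 // embedded newline, skipped
        fileName = c.take(25);
        recNumber = c.take(4);
    }

    public static void main(String[] args) {
        // Synthetic record built to match the listed widths.
        String raw = "0" + "20190718" + "000000000000001" + "CHATHOLDER " + " "
                + "SUBCONTRACT" + "1MN" + "3.0" + "\n"
                + "LOGFILEGENAT07312019050AM" + "0001";
        FixedWidthRecord r = new FixedWidthRecord(raw.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(r.date + " " + r.fileName + " " + r.recNumber);
    }
}
```

Inside `call`, the file's byte array would first be cut into `LENGTH`-byte slices, with one such object built per slice and added to `results`. The class implements `Serializable` because Spark may need to serialize RDD elements when moving them between nodes.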

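Regarding "java - Read a fixed-length text file with a newline character as one of the attribute values into a JavaRDD", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57294619/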
