java - How to read in an RCFile

Reposted. Author: 可可西里. Updated: 2023-11-01 16:14:14

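I am trying to read a small RCFile (about 200 rows of data) into a HashMap to perform a map-side join, but I am having a lot of trouble getting the data in the file into a usable state.

Here is what I have so far, most of which is taken from this example: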

    public void configure(JobConf job)
    {
        try
        {
            FileSystem fs = FileSystem.get(job);
            RCFile.Reader rcFileReader = new RCFile.Reader(fs, new Path("/path/to/file"), job);
            int counter = 1;
            while (rcFileReader.next(new LongWritable(counter)))
            {
                System.out.println("Fetching data for row " + counter);
                BytesRefArrayWritable dataRead = new BytesRefArrayWritable();
                rcFileReader.getCurrentRow(dataRead);
                System.out.println("dataRead: " + dataRead + " dataRead.size(): " + dataRead.size());
                for (int i = 0; i < dataRead.size(); i++)
                {
                    BytesRefWritable bytesRefRead = dataRead.get(i);
                    byte b1[] = bytesRefRead.getData();
                    Text returnData = new Text(b1);
                    System.out.println("READ-DATA = " + returnData.toString());
                }
                counter++;
            }
        }
        catch (IOException e)
        {
            throw new Error(e);
        }
    }

However, the output I get has all the data of each column concatenated together in the first row, and no data at all in any of the other rows.

Fetching data for row 1
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@7f26d3df dataRead.size(): 5
READ-DATA = 191606656066860670
READ-DATA = United StatesAmerican SamoaGuamNorthern Mariana Islands
READ-DATA = USASGUMP
READ-DATA = USSouth PacificSouth PacificSouth Pacific
READ-DATA = 19888
Fetching data for row 2
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@1cb1a4e2 dataRead.size(): 0
Fetching data for row 3
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@52c00025 dataRead.size(): 0
Fetching data for row 4
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@3b49a794 dataRead.size(): 0

How do I read this data properly, so that I can access one row at a time, for example

(191, United States, US, US, 19)?

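Best answer

Because RCFiles are columnar, the row-wise read path differs considerably from the write path. We can still use the RCFile.Reader class to read an RCFile row by row (RCFileRecordReader is not needed). In addition, however, we need a ColumnarSerDe to convert the columnar data into row-wise data.

Below is the most simplified code we could arrive at for reading an RCFile row by row. See the inline code comments for details.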
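To make the columnar layout concrete, here is a toy model in plain Java. It is only a sketch and not Hive's implementation: the `Column` class and the `toRows` helper are hypothetical names, and the field boundaries (US, AS, GU, MP and the four country names) are inferred from the concatenated output in the question. It shows that decoding a whole column buffer reproduces the concatenated strings seen in the question, while slicing by per-field lengths recovers one value per row.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnarToyModel {

    /** One column of a row group: all values stored back to back, plus each value's byte length. */
    static final class Column {
        final byte[] data;
        final int[] lengths;

        Column(String... values) {
            StringBuilder joined = new StringBuilder();
            lengths = new int[values.length];
            for (int i = 0; i < values.length; i++) {
                lengths[i] = values[i].getBytes(StandardCharsets.UTF_8).length;
                joined.append(values[i]);
            }
            data = joined.toString().getBytes(StandardCharsets.UTF_8);
        }

        /** Decoding the whole buffer ignores field boundaries and yields the concatenation. */
        String wholeBuffer() {
            return new String(data, StandardCharsets.UTF_8);
        }

        /** Decoding a single field needs its start offset and its length. */
        String field(int row) {
            int start = 0;
            for (int i = 0; i < row; i++) {
                start += lengths[i];
            }
            return new String(data, start, lengths[row], StandardCharsets.UTF_8);
        }
    }

    /** Rebuilds row-wise records by taking the same row index from every column. */
    static List<List<String>> toRows(List<Column> columns, int rowCount) {
        List<List<String>> rows = new ArrayList<>();
        for (int r = 0; r < rowCount; r++) {
            List<String> row = new ArrayList<>();
            for (Column column : columns) {
                row.add(column.field(r));
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        Column names = new Column("United States", "American Samoa", "Guam", "Northern Mariana Islands");
        Column codes = new Column("US", "AS", "GU", "MP");

        // Whole-buffer decoding: the same shape as the READ-DATA lines in the question.
        System.out.println(names.wholeBuffer());
        System.out.println(codes.wholeBuffer());

        // Offset-and-length decoding: one record per row.
        for (List<String> row : toRows(Arrays.asList(names, codes), 4)) {
            System.out.println(row);
        }
    }
}
```

In the real API the SerDe plays the role of `toRows`, using the offsets and lengths that the reader tracks for each field.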

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Properties;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.RCFile;
    import org.apache.hadoop.hive.serde2.SerDeException;
    import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
    import org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe;
    import org.apache.hadoop.hive.serde2.columnar.ColumnarStruct;
    import org.apache.hadoop.hive.serde2.lazy.LazyString;
    import org.apache.hadoop.io.LongWritable;

    private static void readRCFileByRow(String pathStr)
            throws IOException, SerDeException {

        final Configuration conf = new Configuration();
        final Properties tbl = new Properties();

        /*
         * Set the column names (comma-separated) and the column types
         * (colon-separated here). The actual column names are not important,
         * as long as the column count is correct.
         *
         * For types, this example uses strings. A byte[] can be stored as a
         * string by encoding the bytes to ASCII (such as hex or Base64).
         *
         * The number of columns and the number of types must match exactly.
         */
        tbl.setProperty("columns", "col1,col2,col3,col4,col5");
        tbl.setProperty("columns.types", "string:string:string:string:string");

        /*
         * We need a ColumnarSerDe to deserialize the columnar data into
         * row-wise data.
         */
        ColumnarSerDe serDe = new ColumnarSerDe();
        serDe.initialize(conf, tbl);

        Path path = new Path(pathStr);
        FileSystem fs = FileSystem.get(conf);
        final RCFile.Reader reader = new RCFile.Reader(fs, path, conf);

        // Reuse one key and one cols object for every row.
        final LongWritable key = new LongWritable();
        final BytesRefArrayWritable cols = new BytesRefArrayWritable();

        try {
            while (reader.next(key)) {
                System.out.println("Getting next row.");

                /*
                 * IMPORTANT: Pass the same cols object to the getCurrentRow API;
                 * do not create a new BytesRefArrayWritable() each time. One call
                 * to getCurrentRow(cols) can potentially read more than one
                 * column value, which the SerDe below takes care to read one
                 * by one.
                 */
                reader.getCurrentRow(cols);

                final ColumnarStruct row = (ColumnarStruct) serDe.deserialize(cols);
                final ArrayList<Object> objects = row.getFieldsAsList();
                for (final Object object : objects) {
                    // Lazy decompression happens here
                    final String payload =
                            ((LazyString) object).getWritableObject().toString();
                    System.out.println("Value:" + payload);
                }
            }
        } finally {
            // Release the underlying file stream even if reading fails.
            reader.close();
        }
    }

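In this code, getCurrentRow() still reads the data column-wise, and we need the SerDe to convert it into rows. Also, calling getCurrentRow() does not mean that all the fields in the row have been decompressed. In fact, under lazy decompression, a column is not decompressed until one of its fields is deserialized. For this purpose we use ColumnarStruct.getFieldsAsList() to obtain a list of references to the lazy objects. The actual read happens in the getWritableObject() call on the LazyString reference.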
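The idea of deferring work until a field is actually requested can be illustrated with a small plain-Java sketch. This is an analogy only, not how Hive implements LazyString or column decompression: the `LazyText` class and its `isDecoded()` method are hypothetical, and decoding bytes to a String stands in for the real decompression and deserialization step.

```java
import java.nio.charset.StandardCharsets;

class LazyFieldSketch {

    /** Holds a reference into a shared buffer and decodes it only on first access. */
    static final class LazyText {
        private final byte[] data;
        private final int start;
        private final int length;
        private String cached; // stays null until get() is called

        LazyText(byte[] data, int start, int length) {
            this.data = data;
            this.start = start;
            this.length = length;
        }

        /** Reports whether the expensive step has already run. */
        boolean isDecoded() {
            return cached != null;
        }

        /** Performs the decoding on the first call and reuses the result afterwards. */
        String get() {
            if (cached == null) {
                cached = new String(data, start, length, StandardCharsets.UTF_8);
            }
            return cached;
        }
    }

    public static void main(String[] args) {
        // One shared column buffer; every field is two bytes long (assumed for the demo).
        byte[] column = "USASGUMP".getBytes(StandardCharsets.UTF_8);
        LazyText[] fields = new LazyText[4];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = new LazyText(column, i * 2, 2);
        }

        // Holding the references costs nothing; nothing has been decoded yet.
        System.out.println("decoded before access: " + fields[1].isDecoded());

        // The work happens here, analogous to the getWritableObject() call above.
        System.out.println("value: " + fields[1].get());
        System.out.println("decoded after access: " + fields[1].isDecoded());
    }
}
```

The practical consequence is the same as in the answer: obtaining the list of field references is cheap, and the cost is paid only for the fields whose values are actually read.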

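Another way to achieve the same result is to use a StructObjectInspector together with the copyToStandardObject API. However, I find the approach above simpler.

Regarding "java - How to read in an RCFile", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/25416114/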
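Whichever approach produces the rows, the question's stated goal was a HashMap for a map-side join. The following is a minimal sketch of that last step in plain Java, assuming each row has already been converted to a list of strings. The `buildLookup` and `join` helpers, the choice of the country code as the join key, and the `order-1` record are all hypothetical; only the sample row (191, United States, US, US, 19) comes from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapSideJoinSketch {

    /** Indexes the small table by the value in keyColumn. A later row with the same key replaces an earlier one. */
    static Map<String, List<String>> buildLookup(List<List<String>> rows, int keyColumn) {
        Map<String, List<String>> lookup = new HashMap<>();
        for (List<String> row : rows) {
            lookup.put(row.get(keyColumn), row);
        }
        return lookup;
    }

    /** Inner join of one large-table record: returns record + matching row, or null when there is no match. */
    static List<String> join(List<String> record, int keyColumn, Map<String, List<String>> lookup) {
        List<String> match = lookup.get(record.get(keyColumn));
        if (match == null) {
            return null;
        }
        List<String> joined = new ArrayList<>(record);
        joined.addAll(match);
        return joined;
    }

    public static void main(String[] args) {
        // Small table: in a real job these rows would come from the RCFile, loaded once in configure().
        List<List<String>> smallTable = new ArrayList<>();
        smallTable.add(Arrays.asList("191", "United States", "US", "US", "19"));
        Map<String, List<String>> lookup = buildLookup(smallTable, 2);

        // Large-table records would arrive one at a time in map(); these two are made up.
        System.out.println(join(Arrays.asList("order-1", "US"), 1, lookup));
        System.out.println(join(Arrays.asList("order-2", "ZZ"), 1, lookup));
    }
}
```

Since the RCFile here holds only about 200 rows, keeping the whole lookup table in memory per mapper is a reasonable trade-off; for a much larger table a different join strategy would be worth considering.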
