java - How to read in an RCFile

Reposted by: 可可西里 · Updated: 2023-11-01 16:14:14

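I am trying to read a small RCFile (about 200 rows of data) into a HashMap to perform a map-side join, but I am having a lot of trouble getting the data in the file into a usable state.

Here is what I have so far, most of which was lifted from this example: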

    public void configure(JobConf job)
    {
        try
        {
            FileSystem fs = FileSystem.get(job);
            RCFile.Reader rcFileReader = new RCFile.Reader(fs, new Path("/path/to/file"), job);
            int counter = 1;
            while (rcFileReader.next(new LongWritable(counter)))
            {
                System.out.println("Fetching data for row " + counter);
                BytesRefArrayWritable dataRead = new BytesRefArrayWritable();
                rcFileReader.getCurrentRow(dataRead);
                System.out.println("dataRead: " + dataRead + " dataRead.size(): " + dataRead.size());
                for (int i = 0; i < dataRead.size(); i++)
                {
                    BytesRefWritable bytesRefRead = dataRead.get(i);
                    byte b1[] = bytesRefRead.getData();
                    Text returnData = new Text(b1);
                    System.out.println("READ-DATA = " + returnData.toString());
                }
                counter++;
            }
        }
        catch (IOException e)
        {
            throw new Error(e);
        }
    }

However, the output I get has all the data of each column concatenated together in the first row, and no data at all in any of the other rows.

Fetching data for row 1
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@7f26d3df dataRead.size(): 5
READ-DATA = 191606656066860670
READ-DATA = United StatesAmerican SamoaGuamNorthern Mariana Islands
READ-DATA = USASGUMP
READ-DATA = USSouth PacificSouth PacificSouth Pacific
READ-DATA = 19888
Fetching data for row 2
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@1cb1a4e2 dataRead.size(): 0
Fetching data for row 3
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@52c00025 dataRead.size(): 0
Fetching data for row 4
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@3b49a794 dataRead.size(): 0

How do I read this data properly so that I can access one row at a time, e.g.

(191, United States, US, US, 19)?

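Best answer

Because RCFiles are columnar, the row-wise read path is quite different from the write path. We can still use the RCFile.Reader class to read an RCFile row by row (RCFileRecordReader is not needed). In addition, though, we need a ColumnarSerDe to convert the columnar data into row-wise data.

Below is the most simplified code we could come up with for reading an RCFile row by row. See the inline code comments for details.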
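To make the column-versus-row distinction concrete, here is a toy sketch in plain Java with no Hive classes. It assumes a simplified layout in which each column of a row group is one contiguous byte buffer and each cell is identified by a length within that buffer; the class name ColumnSliceDemo and its methods are hypothetical. The sample values are taken from two of the columns in the question's output. Converting a whole column buffer to text yields the concatenated strings seen in the question, while slicing by running offset recovers one value per row.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Toy model of a row group: one byte buffer per column plus per-row value lengths. */
final class ColumnSliceDemo {

    /** Rebuilds row-wise records by slicing each column buffer at a running offset. */
    static List<String[]> toRows(byte[][] columnBuffers, int[][] valueLengths) {
        int rowCount = valueLengths[0].length;
        int[] offsets = new int[columnBuffers.length]; // next unread byte in each column
        List<String[]> rows = new ArrayList<>();
        for (int r = 0; r < rowCount; r++) {
            String[] row = new String[columnBuffers.length];
            for (int c = 0; c < columnBuffers.length; c++) {
                int len = valueLengths[c][r];
                // Take only this cell's bytes, not the whole column buffer.
                row[c] = new String(columnBuffers[c], offsets[c], len, StandardCharsets.UTF_8);
                offsets[c] += len;
            }
            rows.add(row);
        }
        return rows;
    }

    /** Two columns from the question's output, with the value lengths that split them. */
    static List<String[]> sampleRows() {
        byte[][] buffers = {
            "United StatesAmerican SamoaGuamNorthern Mariana Islands"
                .getBytes(StandardCharsets.UTF_8),
            "USASGUMP".getBytes(StandardCharsets.UTF_8)
        };
        int[][] lengths = {
            {13, 14, 4, 24}, // United States, American Samoa, Guam, Northern Mariana Islands
            {2, 2, 2, 2}     // US, AS, GU, MP
        };
        return toRows(buffers, lengths);
    }

    public static void main(String[] args) {
        for (String[] row : sampleRows()) {
            System.out.println(String.join(", ", row));
        }
    }
}
```

In the real API this bookkeeping is what the SerDe does for you, which is why the answer deserializes through ColumnarSerDe instead of converting raw buffers by hand.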

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Properties;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.RCFile;
    import org.apache.hadoop.hive.serde2.SerDeException;
    import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
    import org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe;
    import org.apache.hadoop.hive.serde2.columnar.ColumnarStruct;
    import org.apache.hadoop.hive.serde2.lazy.LazyString;
    import org.apache.hadoop.io.LongWritable;

    private static void readRCFileByRow(String pathStr)
        throws IOException, SerDeException {

      final Configuration conf = new Configuration();

      final Properties tbl = new Properties();

      /*
       * Set the column names (comma-separated) and the column types
       * (colon-separated here). The actual names of the columns are not
       * important, as long as the column count is correct.
       *
       * For types, this example uses strings. A byte[] can be stored as a
       * string by encoding the bytes to ASCII (for example as a hex string
       * or Base64).
       *
       * The number of columns and the number of types must match exactly.
       */
      tbl.setProperty("columns", "col1,col2,col3,col4,col5");
      tbl.setProperty("columns.types", "string:string:string:string:string");

      /*
       * We need a ColumnarSerDe to deserialize the columnar data into
       * row-wise data.
       */
      ColumnarSerDe serDe = new ColumnarSerDe();
      serDe.initialize(conf, tbl);

      Path path = new Path(pathStr);
      FileSystem fs = FileSystem.get(conf);
      final RCFile.Reader reader = new RCFile.Reader(fs, path, conf);

      final LongWritable key = new LongWritable();
      final BytesRefArrayWritable cols = new BytesRefArrayWritable();

      try {
        while (reader.next(key)) {
          System.out.println("Getting next row.");

          /*
           * IMPORTANT: Pass the same cols object to the getCurrentRow API; do
           * not create a new BytesRefArrayWritable() each time. This is because
           * one call to getCurrentRow(cols) can potentially read more than one
           * column value, which the SerDe below takes care to read one by one.
           */
          reader.getCurrentRow(cols);

          final ColumnarStruct row = (ColumnarStruct) serDe.deserialize(cols);
          final ArrayList<Object> objects = row.getFieldsAsList();
          for (final Object object : objects) {
            // Lazy decompression happens here
            final String payload =
              ((LazyString) object).getWritableObject().toString();
            System.out.println("Value:" + payload);
          }
        }
      } finally {
        // Release the underlying file stream once all rows are read.
        reader.close();
      }
    }

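In this code, getCurrentRow still reads the data column by column, and we need the SerDe to turn it into rows. Also, calling getCurrentRow() does not mean that all the fields in that row have been decompressed. In fact, with lazy decompression, a column is not decompressed until one of its fields is deserialized. That is why we use ColumnarStruct.getFieldsAsList() to get a list of references to the lazy objects. The actual read happens in the getWritableObject() call on the LazyString reference.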
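The lazy behaviour described here can be pictured with a small stand-alone sketch. This is not Hive's implementation: LazyColumnDemo and LazyColumn are hypothetical names, and a Supplier stands in for the real decompression step. The point is only that the expensive work runs on first access to a column, and never for a column that is not touched.

```java
import java.util.function.Supplier;

final class LazyColumnDemo {

    /** Holds a column whose expensive decode step is deferred until first access. */
    static final class LazyColumn {
        private final Supplier<String[]> decoder; // stand-in for decompression
        private String[] values;                  // stays null until the first get()
        private int decodeCount = 0;

        LazyColumn(Supplier<String[]> decoder) {
            this.decoder = decoder;
        }

        String get(int row) {
            if (values == null) {
                values = decoder.get(); // decode once, on demand
                decodeCount++;
            }
            return values[row];
        }

        int decodeCount() {
            return decodeCount;
        }
    }

    public static void main(String[] args) {
        LazyColumn names = new LazyColumn(() -> new String[] {"United States", "Guam"});
        LazyColumn codes = new LazyColumn(() -> new String[] {"US", "GU"});

        // Only the names column is read, so only it gets decoded.
        System.out.println(names.get(0));
        System.out.println(names.get(1));
        System.out.println("names decoded " + names.decodeCount() + " time(s)");
        System.out.println("codes decoded " + codes.decodeCount() + " time(s)");
    }
}
```

Applied to the answer's loop, this suggests that skipping fields you do not need also skips the cost of decompressing their columns.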

Another way to achieve the same thing is to use a StructObjectInspector with the copyToStandardObject API. But I find the approach above simpler.

This article is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/25416114/
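Once each row is available as plain strings, loading them into a HashMap for the map-side join mentioned in the question is straightforward. The sketch below is a minimal illustration in plain Java, assuming the rows have already been extracted as String arrays; LookupTableDemo, buildLookup and the choice of key column are hypothetical, and the sample values come from the question's output.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LookupTableDemo {

    /** Indexes rows by the value in keyColumn; a later duplicate key overwrites an earlier one. */
    static Map<String, String[]> buildLookup(List<String[]> rows, int keyColumn) {
        Map<String, String[]> lookup = new HashMap<>();
        for (String[] row : rows) {
            lookup.put(row[keyColumn], row);
        }
        return lookup;
    }

    public static void main(String[] args) {
        // Rows as (name, code) pairs, standing in for deserialized RCFile rows.
        List<String[]> rows = Arrays.asList(
            new String[] {"United States", "US"},
            new String[] {"American Samoa", "AS"},
            new String[] {"Guam", "GU"},
            new String[] {"Northern Mariana Islands", "MP"});

        Map<String, String[]> byCode = buildLookup(rows, 1);

        // During the join, each incoming record looks up its match by key.
        String[] match = byCode.get("GU");
        System.out.println(match == null ? "no match" : match[0]);
    }
}
```

In a mapper, a table like this would typically be built once in configure() and then consulted for every input record.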
