scala - 如何强制 spark/hadoop 忽略文件上的 .gz 扩展名并将其读取为未压缩的纯文本？-6ren

scala - 如何强制 spark/hadoop 忽略文件上的 .gz 扩展名并将其读取为未压缩的纯文本？

转载作者：可可西里更新时间：2023-11-01 14:19:42

我的代码如下:

val lines: RDD[String] = sparkSession.sparkContext.textFile("s3://mybucket/file.gz")

URL 以 .gz 结尾，但这是遗留代码的结果。该文件是纯文本，不涉及压缩。然而，spark 坚持将其作为 GZIP 文件读取，这显然失败了。我怎样才能让它忽略扩展名并简单地将文件作为文本读取？

基于 this article我已经尝试在不包括 GZIP 编解码器的各个地方设置配置，例如:

sparkContext.getConf.set("spark.hadoop.io.compression.codecs", classOf[DefaultCodec].getCanonicalName)

这似乎没有任何效果。

由于文件在 S3 上，我不能在不复制整个文件的情况下简单地重命名它们。

最佳答案

第一个解决方案:Shading GzipCodec

这个想法是通过包含在你自己的来源这个 java file并替换这一行:

public String getDefaultExtension() {
  return ".gz";
}

与:

public String getDefaultExtension() {
  return ".whatever";
}

在构建您的项目时，这将有效地使用您对 GzipCodec 的定义，而不是依赖项提供的定义(这是 GzipCodec 的阴影)。

这样，在解析您的文件时，textFile() 将被迫应用默认编解码器，因为 gzip 编解码器不再适合您的文件命名。

此解决方案的不便之处在于您将无法在同一个应用程序中处理真正的 gzip 文件。

第二种解决方案:将 newAPIHadoopFile 与自定义/修改的 TextInputFormat 结合使用

您可以使用 newAPIHadoopFile (而不是 textFile)与自定义/修改 TextInputFormat这强制使用 DefaultCodec(纯文本)。

我们将根据默认的 (TextInputFormat) 编写我们自己的行阅读器。这个想法是删除 TextInputFormat 的一部分，它发现它被命名为 .gz 并因此在读取文件之前解压缩文件。

不是调用 sparkContext.textFile，

// plain text file with a .gz extension:
sparkContext.textFile("s3://mybucket/file.gz")

我们可以使用底层的sparkContext.newAPIHadoopFile，它允许我们指定如何读取输入:

import org.apache.hadoop.mapreduce.lib.input.FakeGzInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}

sparkContext
  .newAPIHadoopFile(
    "s3://mybucket/file.gz",
    classOf[FakeGzInputFormat], // This is our custom reader
    classOf[LongWritable],
    classOf[Text],
    new Configuration(sparkContext.hadoopConfiguration)
  )
  .map { case (_, text) => text.toString }

调用 newAPIHadoopFile 的常用方法是使用 TextInputFormat。这是包装如何读取文件以及根据文件扩展名选择压缩编解码器的部分。

我们称它为 FakeGzInputFormat 并按以下方式实现它作为 TextInputFormat 的扩展(这是一个 Java 文件，让我们把它放在包 src/main/java/org/apache/hadoop/mapreduce/lib/input 中):

package org.apache.hadoop.mapreduce.lib.input;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

import com.google.common.base.Charsets;

public class FakeGzInputFormat extends TextInputFormat {

    public RecordReader<LongWritable, Text> createRecordReader(
        InputSplit split,
        TaskAttemptContext context
    ) {

        String delimiter =
            context.getConfiguration().get("textinputformat.record.delimiter");

        byte[] recordDelimiterBytes = null;
        if (null != delimiter)
            recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);

        // Here we use our custom `FakeGzLineRecordReader` instead of
        // `LineRecordReader`:
        return new FakeGzLineRecordReader(recordDelimiterBytes);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return true; // plain text is splittable (as opposed to gzip)
    }
}

事实上，我们必须更深入一层，同时替换默认的 LineRecordReader (Java) 和我们自己的(我们称它为 FakeGzLineRecordReader)。

由于LineRecordReader很难继承，我们可以复制LineRecordReader(在src/main/java/org/apache/hadoop/mapreduce/lib/input)并通过强制使用默认编解码器(纯文本)稍微修改(并简化)initialize(InputSplit genericSplit, TaskAttemptContext context) 方法:

(与原始 LineRecordReader 相比的唯一变化已给出解释正在发生的事情的评论)

package org.apache.hadoop.mapreduce.lib.input;

import java.io.IOException;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Seekable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

@InterfaceAudience.LimitedPrivate({"MapReduce", "Pig"})
@InterfaceStability.Evolving
public class FakeGzLineRecordReader extends RecordReader<LongWritable, Text> {
    private static final Logger LOG =
            LoggerFactory.getLogger(FakeGzLineRecordReader.class);
    public static final String MAX_LINE_LENGTH =
            "mapreduce.input.linerecordreader.line.maxlength";

    private long start;
    private long pos;
    private long end;
    private SplitLineReader in;
    private FSDataInputStream fileIn;
    private Seekable filePosition;
    private int maxLineLength;
    private LongWritable key;
    private Text value;
    private byte[] recordDelimiterBytes;

    public FakeGzLineRecordReader(byte[] recordDelimiter) {
        this.recordDelimiterBytes = recordDelimiter;
    }

    // This has been simplified a lot since we don't need to handle compression
    // codecs.
    public void initialize(
        InputSplit genericSplit,
        TaskAttemptContext context
    ) throws IOException {

        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
        start = split.getStart();
        end = start + split.getLength();
        final Path file = split.getPath();

        final FileSystem fs = file.getFileSystem(job);
        fileIn = fs.open(file);

        fileIn.seek(start);
        in = new UncompressedSplitLineReader(
            fileIn, job, this.recordDelimiterBytes, split.getLength()
        );
        filePosition = fileIn;

        if (start != 0) {
            start += in.readLine(new Text(), 0, maxBytesToConsume(start));
        }
        this.pos = start;
    }

    // Simplified as input is not compressed:
    private int maxBytesToConsume(long pos) {
        return (int) Math.max(Math.min(Integer.MAX_VALUE, end - pos), maxLineLength);
    }

    // Simplified as input is not compressed:
    private long getFilePosition() {
        return pos;
    }

    private int skipUtfByteOrderMark() throws IOException {
        int newMaxLineLength = (int) Math.min(3L + (long) maxLineLength,
                Integer.MAX_VALUE);
        int newSize = in.readLine(value, newMaxLineLength, maxBytesToConsume(pos));
        pos += newSize;
        int textLength = value.getLength();
        byte[] textBytes = value.getBytes();
        if ((textLength >= 3) && (textBytes[0] == (byte)0xEF) &&
                (textBytes[1] == (byte)0xBB) && (textBytes[2] == (byte)0xBF)) {
            LOG.info("Found UTF-8 BOM and skipped it");
            textLength -= 3;
            newSize -= 3;
            if (textLength > 0) {
                textBytes = value.copyBytes();
                value.set(textBytes, 3, textLength);
            } else {
                value.clear();
            }
        }
        return newSize;
    }

    public boolean nextKeyValue() throws IOException {
        if (key == null) {
            key = new LongWritable();
        }
        key.set(pos);
        if (value == null) {
            value = new Text();
        }
        int newSize = 0;
        while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
            if (pos == 0) {
                newSize = skipUtfByteOrderMark();
            } else {
                newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
                pos += newSize;
            }

            if ((newSize == 0) || (newSize < maxLineLength)) {
                break;
            }

            LOG.info("Skipped line of size " + newSize + " at pos " +
                    (pos - newSize));
        }
        if (newSize == 0) {
            key = null;
            value = null;
            return false;
        } else {
            return true;
        }
    }

    @Override
    public LongWritable getCurrentKey() {
        return key;
    }

    @Override
    public Text getCurrentValue() {
        return value;
    }

    public float getProgress() {
        if (start == end) {
            return 0.0f;
        } else {
            return Math.min(1.0f, (getFilePosition() - start) / (float)(end - start));
        }
    }

    public synchronized void close() throws IOException {
        try {
            if (in != null) {
                in.close();
            }
        } finally {}
    }
}

关于scala - 如何强制 spark/hadoop 忽略文件上的 .gz 扩展名并将其读取为未压缩的纯文本？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/49110384/

文章推荐： hadoop - Hive 执行 "insert into ... values ..."非常慢

文章推荐： Python ctypes : SetWindowsHookEx callback function never called

文章推荐： c++ - 使用 CreateWindowEx() 创建的窗口中的默认按钮

文章推荐： java - PySpark:无法创建 SparkSession。(Java 网关错误)

.htaccess - 强制 https，强制到子文件夹，强制 www，不要破坏站点
我一直很难编辑我的 .htaccess 文件来一起做这三件事。我已经能够分别获得每个部分，但我只是不明白逻辑流程如何使它们全部工作。这是我能够使用 bluehost support 上的演示进行整合
vba - 强制"is"以保存并另存为无宏工作簿
我制作的宏将模板工作簿保存为两个单独的文件。每个测试保存一个(位置 1、2、3 或 4)，然后在另一个宏中使用每个测试的数据。第二个是保留用于备份的原始数据文件。现在的问题是每次我在每个位置运行测试并
types - 包括模块、强制
我正在写一篇关于如何使用 OCaml 的模块系统而不是 Java 的 OO 系统(一个有趣的视角)的博客文章。我遇到了一些我不理解的关于强制的事情。下面是一个基本模块和两个包含它的模块: module
c - 强制 if 语句只执行一次
我有一段将被执行多次(5,000+)的代码，以及一个仅在第一次为真的 if 语句。我曾想过使用“FIRST”变量并每次都进行比较，但每次都检查它似乎是一种浪费，即使我知道它不需要。 bool FIRS
强制 P4CONFIG 未设置？
首先，我是 Perforce 的新手，我主要通过其文档进行学习。因此，我们即将从 CVS 迁移到 Perforce，我最近学到了一个避免更改每个工作区的 P4CLIENT 的好方法，即在工作区根目录
java - 强制 FileNotFoundException
我正在为一段代码编写测试，其中包含我试图涵盖的 IOException 捕获。 try/catch 看起来像这样: try { oos = new ObjectOutputStream(new
javascript - 强制 $.each() 循环等待动画完成后再继续
我正在尝试在新闻项目滚动之间添加延迟。我知道 $.each() 通过不等待动画完成来完成其工作，但我想知道如何制作它，以便一次向上滚动一个项目并等到最后一个动画完成后再继续在循环中。 $(functi
c# - 强制/要求方法的排序列表参数
假设已经编写了一个方法，需要一个排序列表作为其输入之一。当然这将在代码中进行注释和记录，param 将被命名为“sortedList”，但如果有人忘记，则会出现错误。有没有办法强制输入必须排序？我正
Apache 强制 SSL
我正在尝试将传入请求重定向到 https://www.domain.com/和所有 https://www.domain.com/ {所有页面}并且没有什么麻烦。我试过的方法: 添加此行:Redire
python - 强制 raw_input
我将如何实现以下内容: title_selection = raw_input("Please type in the number of your title and press Enter.\n%
javascript - 在浏览器中自动完成 ="off"强制
我有一个登录表单，我需要强制关闭自动完成功能。我试过了 jquery: $('#login').attr("autocomplete", "off"); HTML: Javascript:docume
Git 强制 merge
我想知道我应该怎么做才能强制从 dev 分支 merge 到我的 master 分支？使用“git merge dev”会导致很多冲突。但是，我不想单独处理它们。相反，我只是想使用我的 dev 分支中
c# - 强制 Nuget 包使用特定版本的子依赖项？
当安装 Hl7.Fhir.DSTU2 和 Hl7.Fhir.R4 这两个 Nuget 包时，我们得到如下信息: DSTU2 包似乎在使用 Hl7.Fhir.Support.Poco 版本 3.4.0
reactjs - 强制 react 功能组件触发重新渲染
我正在尝试让一个功能组件在 testFn 执行时强制重新渲染。我想使用状态来做到这一点(如果有更好的方法请说出来)，这似乎成功地强制重新渲染但只有两次，然后什么都没有。我构建了一个简单的演示来模拟这
g++ - 强制 g++ 为未使用的函数生成代码
默认情况下，g++ 似乎会省略未使用的类内定义方法的代码。示例 from my previous question : struct Foo { void bar() {} void baz(
html - 强制 div 高度以适应透视内容
我正在尝试使用 here 中介绍的技术使我的网站背景以比内容慢的速度滚动。我不希望背景固定，只希望更慢。这是 HTML 的样子: .parallax { perspective: 1px;
dart - 强制 ListView 缩小以包含子项
我能找到的最相似的问题是 'how to create a row of scrollable text boxes or widgets in flutter inside a ListView?'
javascript - 强制(修复)导入完全在一行中
我有以下 eslint 配置: "object-curly-newline": ["error", { "ImportDeclaration": "never",
tinymce - 强制 TinyMCE 去除数据属性
我正在使用 TinyMCE 插件并将 valid_elements 选项设置为: "a[href|target:_blank],strong/b,em/i,br,p,ul,ol,li" 即使没有列出数
强制 : wanted to Multiline description
您好，我想使用以下命令放置多行描述 p4 --field Description="MY CLN Header \\n my CLN complete description in two -thre

可可西里

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

scala - 如何强制 spark/hadoop 忽略文件上的 .gz 扩展名并将其读取为未压缩的纯文本？