
scala - How do I read PDF and XML files in Apache Spark with Scala?

Reposted. Author: 行者123. Updated: 2023-12-01 09:16:47

My sample code for reading text files is:

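import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.HadoopRDD

val text = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)
// Pair every line with the path of the file it came from.
val rddwithPath = text.asInstanceOf[HadoopRDD[LongWritable, Text]].mapPartitionsWithInputSplit { (inputSplit, iterator) =>
  val file = inputSplit.asInstanceOf[FileSplit]
  iterator.map { tpl => (file.getPath.toString, tpl._2.toString) }
}.reduceByKey((a, b) => a) // keeps only one line per file path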

How can I do the same with PDF and XML files?

Best answer

You can use Tika to parse both PDF and XML:

Apache Tika - a content analysis toolkit
See:
- https://tika.apache.org/1.9/api/org/apache/tika/parser/xml/
- http://tika.apache.org/0.7/api/org/apache/tika/parser/pdf/PDFParser.html
- https://tika.apache.org/1.9/api/org/apache/tika/parser/AutoDetectParser.html
Below is an example of integrating Spark with Tika:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.input.PortableDataStream
import org.apache.tika.metadata._
import org.apache.tika.parser._
import org.apache.tika.sax.WriteOutContentHandler
import java.io._

object TikaFileParser {

  // Parses one (path, stream) pair with Tika and prints the extracted text.
  def tikaFunc(a: (String, PortableDataStream)): Unit = {

    // binaryFiles returns paths such as "file:/home/user/documents/a.pdf".
    // drop(5) strips the "file:" prefix, so this only works on a local file system.
    val file: File = new File(a._1.drop(5))
    val myparser: AutoDetectParser = new AutoDetectParser()
    val stream: InputStream = new FileInputStream(file)
    val handler: WriteOutContentHandler = new WriteOutContentHandler(-1) // -1 = no write limit
    val metadata: Metadata = new Metadata()
    val context: ParseContext = new ParseContext()

    try {
      myparser.parse(stream, handler, metadata, context)
    } finally {
      stream.close() // close the stream even if parsing fails
    }

    println(handler.toString())
    println("------------------------------------------------")
  }

  def main(args: Array[String]): Unit = {

    val filesPath = "/home/user/documents/*"
    val conf = new SparkConf().setAppName("TikaFileParser")
    val sc = new SparkContext(conf)
    val fileData = sc.binaryFiles(filesPath)
    fileData.foreach(x => tikaFunc(x))
  }
}
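
The example above opens each file through java.io.File, which assumes that the path starts with "file:" and that the file sits on the local disk of whichever executor runs the task. It also prints on the executors, so in cluster mode the output ends up in the executor logs rather than on the driver. A minimal sketch of a variant that reads through the PortableDataStream supplied by binaryFiles and returns the text as an RDD might look like the following. The object name TikaTextExtractor and the helpers parse and extractText are hypothetical names introduced here for illustration, and the sketch assumes Spark and the Tika parser modules are on the classpath.

```scala
import java.io.InputStream
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.input.PortableDataStream
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.{AutoDetectParser, ParseContext}
import org.apache.tika.sax.WriteOutContentHandler

// Hypothetical object name; not part of the original answer.
object TikaTextExtractor {

  // Runs Tika's auto-detecting parser over a stream and returns the extracted text.
  def parse(stream: InputStream): String = {
    val handler = new WriteOutContentHandler(-1) // -1 = no write limit, as in the original
    new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext())
    handler.toString
  }

  // Opens the stream that Spark provides instead of building a java.io.File,
  // so the file is read through the Hadoop file system layer.
  def extractText(entry: (String, PortableDataStream)): (String, String) = {
    val (path, data) = entry
    val stream = data.open()
    try {
      (path, parse(stream))
    } finally {
      stream.close() // close the stream even if parsing fails
    }
  }

  def main(args: Array[String]): Unit = {
    val filesPath = "/home/user/documents/*" // same glob as the original example
    val sc = new SparkContext(new SparkConf().setAppName("TikaTextExtractor"))
    // RDD of (file path, extracted text)
    val texts = sc.binaryFiles(filesPath).map(extractText)
    // Bring a few results back to the driver to inspect them.
    texts.take(5).foreach { case (path, text) =>
      println(s"$path: ${text.length} characters")
    }
    sc.stop()
  }
}
```

Returning (path, text) pairs keeps the result available for further Spark transformations, such as filtering by path or saving with saveAsTextFile, rather than ending the pipeline at a println.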

Regarding "scala - How do I read PDF and XML files in Apache Spark with Scala?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42000832/
