
grails - Grails Tika content manipulation

Reposted. Author: 行者123. Last updated: 2023-12-02 14:49:52

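I am trying to parse a .docx file into XML. I can parse it and render the XML on a separate page. What I really want is to display only the content of <body> in my template, without the metadata. How can I do that? I tried using BodyContentHandler, but it strips out the XML tags.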

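Thanks.

Edit

I have some simple code in my controller, although I have made a mess of it. This is what I had before. I take the file from a temp folder and send it to tikaService (I copied the service from GitHub).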
Controller

def parse(Document documentInstance) {
    def file = new File(documentInstance.fullPath)
    def result = tikaService.parseFile(file)
    render(view:"parse", text: result, contentType: "text/xml", encoding: "UTF-8")
}
Service
class TikaService {

    static transactional = false

    String parseFile(File file, TikaConfig tikaConfig, Metadata metadata) {
        SAXTransformerFactory factory = SAXTransformerFactory.newInstance()
        TransformerHandler handler = factory.newTransformerHandler()
        handler.transformer.setOutputProperty(OutputKeys.METHOD, "xml")
        handler.transformer.setOutputProperty(OutputKeys.INDENT, "yes")

        StringWriter sw = new StringWriter()
        handler.result = new StreamResult(sw)

        Parser parser = new AutoDetectParser(tikaConfig)
        ParseContext pc = new ParseContext()
        try {
            parser.parse(new FileInputStream(file), handler, metadata, pc)
            return sw.toString()
        } catch (Exception e) {
            log.error("Failed to parse file ${file.absolutePath}", e)
            throw e
        }
    }

    String parseFile(File file) {
        TikaConfig tikaConfig = new TikaConfig()
        Metadata tikaMeta = new Metadata()
        return parseFile(file, tikaConfig, tikaMeta)
    }
}

If I use render, I get

this

When I output the result in parse.gsp with ${result}, I get

this

I hope that explains the problem. Thanks.

Edit 2

XML output
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Revision-Number" content="0"/>
<meta name="Last-Printed" content="1601-01-01T00:00:00Z"/>
<meta name="cp:revision" content="0"/>
<meta name="meta:print-date" content="1601-01-01T00:00:00Z"/>
<meta name="meta:creation-date" content="2013-03-20T15:29:13Z"/>
<meta name="dcterms:modified" content="1601-01-01T00:00:00Z"/>
<meta name="meta:save-date" content="1601-01-01T00:00:00Z"/>
<meta name="dc:creator" content="ingo "/>
<meta name="Last-Modified" content="1601-01-01T00:00:00Z"/>
<meta name="Author" content="ingo "/>
<meta name="dcterms:created" content="2013-03-20T15:29:13Z"/>
<meta name="date" content="1601-01-01T00:00:00Z"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser"/>
<meta name="modified" content="1601-01-01T00:00:00Z"/>
<meta name="creator" content="ingo "/>
<meta name="Creation-Date" content="2013-03-20T15:29:13Z"/>
<meta name="meta:author" content="ingo "/>
<meta name="Content-Type" content="application/msword"/>
<meta name="Last-Save-Date" content="1601-01-01T00:00:00Z"/>
<title/>
</head>
<body>
<p class="überschrift_1"><b>Tika Parser Test </b></p>
<p class="standard">This is a simple test document</p>
</body>
</html>
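
Given output like the XML above, one possible way to keep the tags but drop the metadata is to post-process the string: parse it as namespace-aware XML, locate the XHTML <body> element, and serialize only its children. The following is a minimal sketch using only JDK classes, not Tika itself; the names `BodyExtractor` and `extractBody` are hypothetical and do not come from the question or from Tika.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of Tika): keeps only what is inside <body>.
public class BodyExtractor {

    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Returns the serialized children of the first XHTML body element, or "" if none exists. */
    public static String extractBody(String xhtml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // Tika's elements live in the XHTML namespace
        // Reject DOCTYPE declarations; assumes the JDK's built-in parser is in use.
        dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        NodeList bodies = doc.getElementsByTagNameNS(XHTML_NS, "body");
        if (bodies.getLength() == 0) {
            return "";
        }

        // Identity transformer: serializes each child node as-is, without an XML declaration.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        NodeList children = bodies.item(0).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            serializer.transform(new DOMSource(children.item(i)), new StreamResult(out));
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><meta name=\"Author\" content=\"ingo\"/><title/></head>"
                + "<body><p class=\"standard\">This is a simple test document</p></body>"
                + "</html>";
        System.out.println(extractBody(xhtml));
    }
}
```

In the controller from the question, the string returned by tikaService.parseFile(file) could presumably be passed through such a helper before it reaches the view. This costs a second parse of the output, which may matter for large documents.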

Edit 3

Controller
import javax.xml.transform.OutputKeys
import javax.xml.transform.sax.SAXTransformerFactory
import javax.xml.transform.sax.TransformerHandler
import javax.xml.transform.stream.StreamResult

import org.apache.tika.config.TikaConfig
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.parser.ParseContext
import org.apache.tika.parser.Parser
import org.apache.tika.sax.BodyContentHandler
import org.apache.tika.sax.ToXMLContentHandler
import org.apache.tika.sax.ToHTMLContentHandler

def parse(Document documentInstance) {
    def file = new File(documentInstance.fullPath)
    BodyContentHandler handler = new BodyContentHandler(new ToHTMLContentHandler())
    AutoDetectParser parser = new AutoDetectParser()
    FileInputStream inputstream = new FileInputStream(file)

    Metadata metadata = new Metadata()
    parser.parse(inputstream, handler, metadata)
}

Error
Namespace http://www.w3.org/1999/xhtml not declared

Best answer

First of all, it looks like the example given in the Tika documentation is wrong:

link to the bug ticket



Here is a workaround for the problem:

link to solution


ToHTMLContentHandler toHtmlContentHandler = new ToHTMLContentHandler(outputStream, "UTF-8");
WriteOutContentHandler handler = new WriteOutContentHandler(toHtmlContentHandler, (int) 4000000);
ContentHandler bodyHandler = new BodyContentHandler(handler);
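
As an alternative to the handler chain above, the same effect can be sketched without any Tika handler classes: a small SAX ContentHandler that forwards only the events occurring inside <body> to a downstream serializer, such as the TransformerHandler used in the question's service. This is only an illustration built on JDK classes; `BodyOnlyHandler` and `bodyOf` are hypothetical names, and the demo driver feeds the filter from a plain SAX parser rather than from Tika.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical filter (not part of Tika): passes on only the descendants of <body>.
public class BodyOnlyHandler extends DefaultHandler {

    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    private final ContentHandler downstream;
    // 0 = outside <body>, 1 = directly inside <body>, >1 = inside a descendant element
    private int depthInBody = 0;

    public BodyOnlyHandler(ContentHandler downstream) {
        this.downstream = downstream;
    }

    @Override public void startDocument() throws SAXException { downstream.startDocument(); }
    @Override public void endDocument() throws SAXException { downstream.endDocument(); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (depthInBody > 0) {
            depthInBody++;
            downstream.startElement(uri, localName, qName, atts);
        } else if (XHTML_NS.equals(uri) && "body".equals(localName)) {
            depthInBody = 1; // entered <body>; the <body> tag itself is not forwarded
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depthInBody > 1) {
            downstream.endElement(uri, localName, qName);
        }
        if (depthInBody > 0) {
            depthInBody--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depthInBody > 0) {
            downstream.characters(ch, start, length);
        }
    }

    /** Demo driver: streams an XHTML string through the filter and returns the body content. */
    public static String bodyOf(String xhtml) throws Exception {
        SAXTransformerFactory stf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler serializer = stf.newTransformerHandler();
        // Output properties are set before the result is attached, as in the question's service.
        serializer.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.setResult(new StreamResult(out));

        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(new BodyOnlyHandler(serializer));
        reader.parse(new InputSource(new StringReader(xhtml)));
        return out.toString();
    }
}
```

In the service from the question, the filter would presumably wrap the existing handler, along the lines of parser.parse(stream, new BodyOnlyHandler(handler), metadata, pc). That wiring is an assumption and has not been checked against a specific Tika version.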

Hope this helps!

Regarding grails - Grails Tika content manipulation, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/35876259/
