
java - Apache POI - Convert *.doc to *.html with images

Reposted. Author: 塔克拉玛干. Updated: 2023-11-01 22:29:49

I have a DOC file that contains some images. How can I convert it to HTML with the images included?

I tried the example from: Convert Word doc to HTML programmatically in Java

public class Converter {
    ...

    private File docFile, htmlFile;

    try {
        FileInputStream fos = new FileInputStream(docFile.getAbsolutePath());
        HWPFDocument doc = new HWPFDocument(fos);
        Document newDoc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDoc);
        wordToHtmlConverter.processDocument(doc);

        StringWriter stringWriter = new StringWriter();

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.transform(
            new DOMSource(wordToHtmlConverter.getDocument()),
            new StreamResult(stringWriter)
        );

        String html = stringWriter.toString();

        try {
            BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(htmlFile), "UTF-8")
            );
            out.write(html);
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

        JEditorPane jEditorPane = new JEditorPane();
        jEditorPane.setContentType("text/html");
        jEditorPane.setEditable(false);
        jEditorPane.setPage(htmlFile.toURI().toURL());

        JScrollPane jScrollPane = new JScrollPane(jEditorPane);

        JFrame jFrame = new JFrame("display html file");
        jFrame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        jFrame.getContentPane().add(jScrollPane);
        jFrame.setSize(512, 342);
        jFrame.setVisible(true);

    } catch (Exception e) {
        e.printStackTrace();
    }
    ...
}

But the images are missing.

The documentation for the WordToHtmlConverter class says:

...this implementation doesn't create images or links to them. This can be changed by overriding AbstractWordConverter.processImage(Element, boolean, Picture) method.

How can I convert a DOC file with images to HTML?

Best answer

Extend WordToHtmlConverter and override processImageWithoutPicturesManager:

import java.util.Base64;

import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.usermodel.Picture;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class InlineImageWordToHtmlConverter extends WordToHtmlConverter {

    public InlineImageWordToHtmlConverter(Document document) {
        super(document);
    }

    // Invoked for a picture when no PicturesManager is configured; embeds the
    // picture in the HTML as a base64 data URI instead of leaving it out.
    @Override
    protected void processImageWithoutPicturesManager(Element currentBlock,
            boolean inlined, Picture picture) {
        Element imgNode = currentBlock.getOwnerDocument().createElement("img");
        // The basic encoder emits a single line; getMimeEncoder() would insert
        // CRLF line breaks every 76 characters into the src attribute.
        String base64 = Base64.getEncoder().encodeToString(picture.getRawContent());
        imgNode.setAttribute("src", "data:" + picture.getMimeType() + ";base64," + base64);
        currentBlock.appendChild(imgNode);
    }

}
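
To illustrate the kind of data URI the overridden method puts into the src attribute, here is a minimal standalone sketch that uses only the JDK. The class name DataUriSketch and the helper toDataUri are hypothetical and not part of Apache POI, and the sample bytes merely stand in for real picture content.

```java
import java.util.Base64;

// Hypothetical helper, not part of Apache POI: shows how a data URI is assembled.
class DataUriSketch {

    // Builds "data:<mime>;base64,<payload>" as a single unbroken line.
    static String toDataUri(String mimeType, byte[] content) {
        return "data:" + mimeType + ";base64,"
                + Base64.getEncoder().encodeToString(content);
    }

    public static void main(String[] args) {
        byte[] sample = {1, 2, 3}; // stand-in for picture bytes
        System.out.println(toDataUri("image/png", sample));

        // For longer payloads the MIME encoder wraps its output with CRLF,
        // while the basic encoder keeps everything on one line.
        byte[] longer = new byte[100];
        System.out.println(Base64.getMimeEncoder().encodeToString(longer).contains("\r\n"));
        System.out.println(Base64.getEncoder().encodeToString(longer).contains("\r\n"));
    }
}
```

Embedding images this way keeps the HTML self-contained, at the cost of a larger file, since base64 adds roughly a third to the size of each picture.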

Use the new class when parsing the document, as shown below:

HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(new FileInputStream("D:/temp/Temp.doc"));
WordToHtmlConverter wordToHtmlConverter = new InlineImageWordToHtmlConverter(
        DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .newDocument());
wordToHtmlConverter.processDocument(wordDocument);
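
The snippet above stops once the document has been processed. As in the question's code, the DOM returned by wordToHtmlConverter.getDocument() still has to be serialized to HTML with a Transformer. The following sketch shows that step with the JDK only. It assumes a small hand-built DOM as a stand-in for a real converted document, and the names HtmlSerializationSketch, serialize, and sampleDocument are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch: serializes a DOM that contains an inline image to HTML text.
class HtmlSerializationSketch {

    // Serializes the DOM with the "html" output method, as the question's code does.
    static String serialize(Document document) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(writer));
            return writer.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    // Stand-in for the converter's output: <html><body><img src="data:..."></body></html>
    static Document sampleDocument() {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = document.createElement("html");
            Element body = document.createElement("body");
            Element img = document.createElement("img");
            img.setAttribute("src", "data:image/png;base64,AQID");
            body.appendChild(img);
            html.appendChild(body);
            document.appendChild(html);
            return document;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(sampleDocument()));
    }
}
```

With a real conversion, the argument to serialize would be wordToHtmlConverter.getDocument() instead of the sample DOM.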

Regarding java - Apache POI - Convert *.doc to *.html with images, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/13815119/
