
java - Apache POI - How to remove all links from a Word document

Reposted. Author: 行者123. Updated: 2023-11-30 04:54:17

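I want to remove all hyperlinks from a Word document while keeping the text. I have these two methods for reading Word documents with the .doc and .docx extensions.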

private void readDocXExtensionDocument() {
    File inputFile = new File(inputFolderDir, "test.docx");
    try {
        XWPFDocument document = new XWPFDocument(OPCPackage.open(new FileInputStream(inputFile)));
        XWPFWordExtractor extractor = new XWPFWordExtractor(document);
        extractor.setFetchHyperlinks(true);
        String context = extractor.getText();
        System.out.println(context);
    } catch (InvalidFormatException e) {
        e.printStackTrace();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

private void readDocExtensionDocument() {
    File inputFile = new File(inputFolderDir, "test.doc");
    POIFSFileSystem fs;
    try {
        fs = new POIFSFileSystem(new FileInputStream(inputFile));
        HWPFDocument document = new HWPFDocument(fs);
        WordExtractor wordExtractor = new WordExtractor(document);
        String[] paragraphs = wordExtractor.getParagraphText();
        System.out.println("Word document has " + paragraphs.length + " paragraphs");
        for (int i = 0; i < paragraphs.length; i++) {
            paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
            System.out.println(paragraphs[i]);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Is it possible to remove all links from a Word document using the Apache POI library? If not, is there another library that provides this?

Best answer

My solution, at least for the .docx case, is to use a regular expression. Take a look at this:

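import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

// inputFolderDir is defined elsewhere in the asker's class.
private void readDocXExtensionDocument() {
    // Matches anything enclosed in angle brackets.
    Pattern p = Pattern.compile("<(.+?)>");
    File inputFile = new File(inputFolderDir, "test.docx");
    try {
        XWPFDocument document = new XWPFDocument(OPCPackage.open(new FileInputStream(inputFile)));
        XWPFWordExtractor extractor = new XWPFWordExtractor(document);
        extractor.setFetchHyperlinks(true);
        String context = extractor.getText();
        Matcher m = p.matcher(context); // the matcher keeps scanning the original string
        while (m.find()) {
            String link = m.group(0); // the bracketed part
            String textString = m.group(1); // the text of the link without the brackets
            // String.replace is literal; replaceAll would read URL characters such as ? and . as regex syntax.
            context = context.replace(link, ""); // ordering important. Link then textString
            context = context.replace(textString, "");
        }
        System.out.println(context);
    } catch (InvalidFormatException | IOException e) {
        e.printStackTrace();
    }
}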
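
A note on the replacement calls in a loop like this one: `String.replaceAll` treats its first argument as a regular expression, so a URL containing `?`, `+` or `.` is not matched literally, whereas `String.replace` (or a pattern wrapped in `Pattern.quote`) is. The sketch below shows the difference. The class name, method names and sample URL are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

final class LiteralReplaceDemo {
    // Removes every literal occurrence of target; no regex interpretation.
    static String removeLiteral(String text, String target) {
        return text.replace(target, "");
    }

    // Same effect through the regex API, with the target quoted first.
    static String removeQuoted(String text, String target) {
        return text.replaceAll(Pattern.quote(target), "");
    }

    public static void main(String[] args) {
        String text = "Search <http://example.com/?q=1> here";
        String link = "<http://example.com/?q=1>";
        // Unquoted, "/?" is read as "optional slash", so nothing matches and the text is unchanged.
        System.out.println(text.replaceAll(link, ""));
        System.out.println(removeLiteral(text, link));
        System.out.println(removeQuoted(text, link));
    }
}
```

Either literal variant removes the bracketed link; which one to use is mostly a matter of taste.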

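The only caveat with this approach is that any content inside angle brackets that is not a link will be removed as well. If you have a better idea of what kinds of links may appear, you can try a more specific regular expression than the one I provided.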
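
As one possible shape for such a more specific expression, here is a minimal sketch that only removes bracketed text starting with a URL scheme. It assumes the extracted text carries each link target as a URL in angle brackets, optionally preceded by a space. The scheme list (http, https, ftp, mailto) and the class and method names are illustrative assumptions, not something the original answer specifies.

```java
import java.util.regex.Pattern;

final class BracketedLinkStripper {
    // An optional leading space, then "<scheme...>" with no spaces or nested brackets inside.
    // Bracketed text without one of these schemes, such as "<this>", is left alone.
    private static final Pattern BRACKETED_URL =
            Pattern.compile(" ?<(?:https?://|ftp://|mailto:)[^<>\\s]+>");

    static String stripBracketedUrls(String text) {
        return BRACKETED_URL.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "See the docs <https://poi.apache.org/> and keep <this> as is.";
        // prints: See the docs and keep <this> as is.
        System.out.println(stripBracketedUrls(sample));
    }
}
```

Adjust the scheme list to whatever link types your documents actually contain; anything not listed stays in the text.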

Regarding java - Apache POI - How to remove all links from a Word document, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/9067932/
