
java - Need a clear example of how to get the word count of DOC and DOCX files

Reposted. Author: 太空宇宙. Updated: 2023-11-04 14:53:39

I am able to read a DOC file and get its word count, but the count is wrong.

My code:

import java.io.File;
import java.io.FileInputStream;

import org.apache.poi.hpsf.PropertySet;
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.poifs.filesystem.DirectoryEntry;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.poifs.filesystem.DocumentInputStream;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordCounter {
    public static void main(String[] args) throws Throwable {
        processDOC();
    }

    private static void processDOC() throws Throwable {
        File file = new File("/Users/yjiang/Desktop/whatever.doc");
        File file2 = new File("/Users/yjiang/Desktop/Test.docx");
        File file3 = new File("/Users/yjiang/Desktop/QB Tests 4-14-2014.xls");
        File file4 = new File("/Users/yjiang/Desktop/QB Tests 4-14-2014.xlsx");

        // Attempt 1: read the word count stored in the OLE2 summary-information stream.
        try {
            FileInputStream fs = new FileInputStream(file);
            POIFSFileSystem poifsFileSystem = new POIFSFileSystem(fs);
            DirectoryEntry directoryEntry = poifsFileSystem.getRoot();
            DocumentEntry documentEntry = (DocumentEntry) directoryEntry.getEntry(SummaryInformation.DEFAULT_STREAM_NAME);
            DocumentInputStream dis = new DocumentInputStream(documentEntry);
            PropertySet ps = new PropertySet(dis);
            SummaryInformation si = new SummaryInformation(ps);

            System.out.println(si.getWordCount());
        } catch (Exception e) {
            e.printStackTrace();
        }

        // Attempt 2: read the word counts stored in the document properties.
        try {
            HWPFDocument hwpfDocument = new HWPFDocument(new FileInputStream(file));
            // MS Word's own word count reports 71 words; this returned 57.
            System.out.println(hwpfDocument.getDocProperties().getCWords());
            System.out.println(hwpfDocument.getDocProperties().getCWordsFtnEnd());
            // MS Word's own word count reports 71 words; this returned 57.
            XWPFDocument xwpfDocument = new XWPFDocument(new FileInputStream(file2));
            System.out.println(xwpfDocument.getProperties().getExtendedProperties().getUnderlyingProperties().getWords());

            System.out.println();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

"whatever.doc" contains 71 words, but when I run the code it returns only 57.


It also seems I cannot use the same approach to read a DOCX file. When I run it, I get the following:

org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)

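Could someone give an example?

Best answer

I also found that the built-in word counters give odd counts, but text extraction seems more reliable, so I use this solution: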

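import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

import org.apache.poi.extractor.POITextExtractor; // org.apache.poi.POITextExtractor before POI 5.0
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public long getWordCount(File file) throws IOException {
    POITextExtractor textExtractor;
    if (file.getName().endsWith(".docx")) {
        XWPFDocument doc = new XWPFDocument(new FileInputStream(file));
        textExtractor = new XWPFWordExtractor(doc);
    } else if (file.getName().endsWith(".doc")) {
        textExtractor = new WordExtractor(new FileInputStream(file));
    } else {
        throw new IllegalArgumentException("Not a MS Word file.");
    }

    // Close the extractor (and the document behind it) once the text has been read.
    try (POITextExtractor extractor = textExtractor) {
        // Split on whitespace and count only tokens containing at least one letter or digit.
        return Arrays.stream(extractor.getText().split("\\s+"))
                .filter(s -> s.matches("^.*[\\p{L}\\p{N}].*$"))
                .count();
    }
}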
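The OfficeXmlFileException quoted in the question comes from handing a .docx file to the OLE2 side of POI: a .doc file is an OLE2 compound file, whereas a .docx file is a ZIP-based package. getWordCount above picks the extractor from the file extension. As a possible alternative for files whose extension may be missing or wrong, the sketch below looks at the leading bytes instead. It uses only the JDK (Java 11+ for readNBytes and Path.of). WordFormatSniffer, Container and sniff are hypothetical names rather than POI types, and a ZIP signature only shows that the file is a ZIP archive, not that it is a valid Word document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class WordFormatSniffer {
    /** Container kinds this sketch distinguishes (hypothetical enum, not part of POI). */
    enum Container { OLE2, ZIP_BASED, UNKNOWN }

    // The 8-byte signature at the start of an OLE2 compound file (.doc, .xls, ...).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // The local-file-header signature "PK\3\4" that starts a typical ZIP archive (.docx, .xlsx, ...).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a file from its first bytes. */
    static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.ZIP_BASED;
        }
        return Container.UNKNOWN;
    }

    /** Reads up to 8 bytes from the file and classifies them. */
    static Container sniff(Path path) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            return sniff(in.readNBytes(8));
        }
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            // Mask to compare the byte as an unsigned value.
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(arg + " -> " + sniff(Path.of(arg)));
        }
    }
}
```

A caller could then route OLE2 files to WordExtractor and ZIP_BASED files to XWPFWordExtractor, and reject UNKNOWN ones.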

The regex in the filter step of getWordCount can be adjusted if needed, but overall it has proven fairly robust.
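To show what that regex does without any POI dependency, here is a minimal sketch that applies the same split-and-filter logic to a plain string. WordCountSketch and countWords are hypothetical names introduced for illustration, and the sample strings are made up.

```java
import java.util.Arrays;

class WordCountSketch {
    /**
     * Splits the text on runs of whitespace and counts the tokens that
     * contain at least one Unicode letter or digit, so stray punctuation
     * such as "-" or "!" is not counted as a word.
     */
    static long countWords(String text) {
        return Arrays.stream(text.split("\\s+"))
                .filter(token -> token.matches("^.*[\\p{L}\\p{N}].*$"))
                .count();
    }

    public static void main(String[] args) {
        System.out.println(countWords("Hello, world - 42 !"));  // prints 3
        System.out.println(countWords("  state-of-the-art  ")); // prints 1
        System.out.println(countWords("--- *** !!!"));          // prints 0
    }
}
```

Two behaviours that may be worth adjusting: a hyphenated compound counts as a single word, and Java's default \s does not match the non-breaking space (U+00A0), so words joined by one are counted together. Splitting on "[\\s\\u00A0]+" instead is one possible tweak if extracted text contains such characters.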

Regarding "java - Need a clear example of how to get the word count of DOC and DOCX files", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/23479409/
