gpt4 book ai didi

java - 如何使用 POI api 在 java 中读取 doc 和 docx 文件

转载 作者:塔克拉玛干 更新时间:2023-11-03 04:37:38 27 4
gpt4 key购买 nike

我正在尝试阅读 doc 和 docx 文件。这是代码:

  static String distination="E:\\         
static String docFileName="Requirements.docx";
public static void main(String[] args) throws FileNotFoundException, IOException {
// TODO code application logic here
ReadFile rf= new ReadFile();
rf.ReadFileParagraph(distination+docFileName);


}
public void ReadFileParagraph(String path) throws FileNotFoundException, IOException
{
FileInputStream fis;
File file = new File(path);
fis=new FileInputStream(file.getAbsolutePath());
String filename=file.getName();

String fileExtension=fileExtension(path);
if(fileExtension.equals("doc"))
{
HWPFDocument document=new HWPFDocument(fis);
WordExtractor DocExtractor = new WordExtractor(document);
ReadDocFile(DocExtractor,filename);

}
else if(fileExtension.equals("docx"))
{

XWPFDocument documentX = new XWPFDocument(fis);
List<XWPFParagraph> pera =documentX.getParagraphs();
ReadDocXFile(pera,filename);
}
else
{
System.out.println("format does not match");
}

}
public void ReadDocFile(WordExtractor extractor,String filename)
{

for (String paragraph : extractor.getParagraphText()) {
System.out.println("Peragraph: "+paragraph);
}
}
public void ReadDocXFile(List<XWPFParagraph> extractor,String filename)
{

for (XWPFParagraph paragraph : extractor) {
System.out.println("Question: "+paragraph.getParagraphText());
}

}
public String fileExtension(String filename)
{

String extension = filename.substring(filename.lastIndexOf(".") + 1, filename.length());
return extension;
}

当我想读取 docx 文件时,此代码会出现异常:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/xmlbeans/XmlException
at l3s.readfiles.db.ReadFile.ReadFileParagraph(ReadFile.java:52)
at autometictagdetection.TagDetection.main(TagDetection.java:36)
Caused by: java.lang.ClassNotFoundException: org.apache.xmlbeans.XmlException
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
... 2 more
Java Result: 1

另一个问题 是当我想读取 Doc 文件时,它读取某些文件非常好,但对于某些文件它给出了这样的异常

    Exception in thread "main" org.apache.poi.hwpf.OldWordFileFormatException: The               document is too old - Word 95 or older. Try HWPFOldDocument instead?
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:222)
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:186)
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:174)
at l3s.readfiles.db.ReadFile.ReadFileParagraph(ReadFile.java:44)
at autometictagdetection.TagDetection.main(TagDetection.java:36)
Java Result: 1

我在 http://poi.apache.org/hwpf/index.html 中看到 POI API 支持第 6 个词和第 95 个词.请问有人可以解决这两个问题吗?

最佳答案

需要核心 maven 依赖项,这是问题 1 的解决方案

<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi</artifactId>
<version>3.15</version>
</dependency>
<!-- For .DOCX FILES -->
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>3.15</version>
</dependency>
<!-- For .DOC FILES -->
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-scratchpad</artifactId>
<version>3.9</version>
</dependency>

For Problem 2 From the original source code , seems POI doesn't support documents way too old

  /**
* This constructor loads a Word document from a specific point
* in a POIFSFileSystem, probably not the default.
* Used typically to open embeded documents.
*
* @param directory The DirectoryNode that contains the Word document.
* @throws IOException If there is an unexpected IOException from the passed
* in POIFSFileSystem.
*/
public HWPFDocument(DirectoryNode directory) throws IOException
{
// Load the main stream and FIB
// Also handles HPSF bits
super(directory);

// Is this document too old for us?
if(_fib.getFibBase().getNFib() < 106) {
throw new OldWordFileFormatException("The document is too old - Word 95 or older. Try HWPFOldDocument instead?");
}

Source code for HWPFDocument

关于java - 如何使用 POI api 在 java 中读取 doc 和 docx 文件,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/17617173/

27 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com