
java - PDFBox: how to extract text from a document without storing it in an array?

Reposted. Author: 行者123. Updated: 2023-12-02 01:47:37

I am using PDFBox to extract text from a PDF document. Once the text is extracted, I insert it into a table in MySQL.

Code:

PDDocument document = PDDocument.load(new File(path1));

if (!document.isEncrypted()) {
    PDFTextStripper tStripper = new PDFTextStripper();
    String pdfFileInText = tStripper.getText(document);
    String lines[] = pdfFileInText.split("\\r?\\n");
    for (String line : lines) {
        String[] words = line.split(" ");

        String sql = "insert IGNORE into test.indextable values (?,?);";

        preparedStatement = con1.prepareStatement(sql);
        int i = 0;
        for (String word : words) {
            // check if one or more special characters at end of string then remove OR
            // check special characters in beginning of the string then remove
            // insert every word directly to table db
            word = word.replaceAll("([\\W]+$)|(^[\\W]+)", "");
            preparedStatement.setString(1, path1);
            preparedStatement.setString(2, word);

            /* preparedStatement.executeUpdate();
            System.out.print("Add ");*/

            preparedStatement.addBatch();

            i++;
            if (i % 1000 == 0) {
                preparedStatement.executeBatch();

                System.out.print("Add Thousand");
            }
        }

        if (i > 0) {
            preparedStatement.executeBatch();

            System.out.print("Add Remaining");
        }
    }
}

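The code works fine, but as you can see, if the document is large and contains around 10 million words, lines[] cannot cope and an out-of-memory error is thrown.

I can't think of a solution. Is there a way to extract the words and insert them into the database directly, or is that impossible?

Edit:

This is what I did:

The processText method: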

public void processText(String text) throws SQLException {

    String lines[] = text.split("\\r?\\n");
    for (String line : lines) {
        String[] words = line.split(" ");

        String sql = "insert IGNORE into test.indextable values (?,?);";

        preparedStatement = con1.prepareStatement(sql);
        int i = 0;
        for (String word : words) {

            // check if one or more special characters at end of string then remove OR
            // check special characters in beginning of the string then remove
            // insert every word directly to table db
            word = word.replaceAll("([\\W]+$)|(^[\\W]+)", "");
            preparedStatement.setString(1, path1);
            preparedStatement.setString(2, word);

            preparedStatement.addBatch();

            i++;
            if (i % 1000 == 0) {
                preparedStatement.executeBatch();

                System.out.print("Add Thousand");
            }
        }

        if (i > 0) {
            preparedStatement.executeBatch();

            System.out.print("Add Remaining");
        }
    }
    preparedStatement.close();
    System.out.println("Successfully committed changes to the database!");
}

The index method (which calls the method above):

public void index() throws Exception {
    // Connection con1 = con.connect();
    try {

        // Connection con1 = con.connect();
        Statement statement = con1.createStatement();

        ResultSet rs = statement.executeQuery("select * from filequeue where Status='Active' LIMIT 5");

        while (rs.next()) {
            // get the filepath of the PDF document
            path1 = rs.getString(2);
            int getNum = rs.getInt(1);
            // while running the process, update status : Processing
            //updateProcess_DB(getNum);
            Statement test = con1.createStatement();
            test.executeUpdate("update filequeue SET STATUS ='Processing' where UniqueID=" + getNum);

            try {
                // call the index function

                /*Indexing process = new Indexing();

                process.index(path1);*/

                PDDocument document = PDDocument.load(new File(path1));

                if (!document.isEncrypted()) {

                    PDFTextStripper tStripper = new PDFTextStripper();
                    for (int p = 1; p <= document.getNumberOfPages(); ++p) {
                        tStripper.setStartPage(p);
                        tStripper.setEndPage(p);
                        String pdfFileInText = tStripper.getText(document);
                        processText(pdfFileInText);
                    }

                }

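Best answer

Your current code takes the string pdfFileInText returned by tStripper.getText(document), which fetches the whole document at once. First, move everything you do with that string (starting with pdfFileInText.split) into a separate method, e.g. processText. Then change your code to this: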

PDFTextStripper tStripper = new PDFTextStripper();
for (int p = 1; p <= document.getNumberOfPages(); ++p)
{
    tStripper.setStartPage(p); // 1-based
    tStripper.setEndPage(p); // 1-based
    String pdfFileInText = tStripper.getText(document);
    processText(pdfFileInText);
}

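The new code processes each page separately. This way you can do the database inserts in smaller steps, and you only have to hold the words of one page in memory rather than all the words of the document.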
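
As a rough illustration of the per-page idea, here is a minimal sketch of what the body of processText could look like if the word cleanup and the batching were separated from the database code. The class name PageWordBatcher, the method forEachBatch and the sink callback are hypothetical names introduced for this example; they are not part of PDFBox or JDBC. Unlike the code in the question, the sketch splits on any whitespace and skips tokens that become empty after cleanup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper (not part of PDFBox or JDBC): cleans the words of ONE page
// and hands them to a sink in fixed-size batches, so only one page is in memory.
class PageWordBatcher {

    // Same cleanup as in the question: strip non-word characters from both ends.
    static String clean(String word) {
        return word.replaceAll("(\\W+$)|(^\\W+)", "");
    }

    // Splits the page text on whitespace and passes the cleaned words to the sink
    // in batches of at most batchSize. Returns the number of words forwarded.
    static int forEachBatch(String pageText, int batchSize, Consumer<List<String>> sink) {
        List<String> batch = new ArrayList<>(batchSize);
        int count = 0;
        for (String raw : pageText.split("\\s+")) {
            String word = clean(raw);
            if (word.isEmpty()) {
                continue; // token was empty or consisted only of punctuation
            }
            batch.add(word);
            count++;
            if (batch.size() == batchSize) {
                sink.accept(batch);                 // full batch: flush it
                batch = new ArrayList<>(batchSize); // start a fresh batch
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // flush the remaining words of this page
        }
        return count;
    }

    public static void main(String[] args) {
        String page = "Hello, world!\nThis is (one) page.";
        // Assumption: in real code the sink would add each word to a batch of a
        // PreparedStatement that was prepared once, then call executeBatch().
        forEachBatch(page, 3, batch -> System.out.println(batch));
    }
}
```

In the page loop above, processText(pdfFileInText) could call forEachBatch with a sink that does setString, addBatch and executeBatch on a single statement prepared once per document, rather than preparing a new statement for every line as the question's code does.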

Source: this question ("java - PDFBox: how to extract text from a document without storing it in an array?") comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/53535027/
