java - Pages are cropped in the new document when copying with PDFBox

Reposted. Author: 行者123. Updated: 2023-12-01 18:04:44

I am trying to split a single PDF into several, for example turning a 10-page document into 10 single-page documents.

PDDocument source = PDDocument.load(input_file);
PDDocument output = new PDDocument();
PDPage page = source.getPages().get(0);
output.addPage(page);
output.save(file);
output.close();

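The problem is that the page size in the new document differs from the original, so some of the text in the new document is cropped or missing. I am using PDFBox 2.0. How can I avoid this?

Update: Thanks, @mkl.

Splitter did the trick. Here is the updated, working part: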

public static void extractAndCreateDocument(SplitMeta meta, PDDocument source)
        throws IOException {

    File file = new File(meta.getFilename());

    Splitter splitter = new Splitter();
    splitter.setStartPage(meta.getStart());
    splitter.setEndPage(meta.getEnd());
    splitter.setSplitAtPage(meta.getEnd());

    List<PDDocument> docs = splitter.split(source);
    if (docs.size() > 0) {
        PDDocument output = docs.get(0);
        output.save(file);
        output.close();
    }
}

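public class SplitMeta {

    private String filename;
    private int start;
    private int end;

    public SplitMeta() {
    }

    // Getters called by extractAndCreateDocument; setters are omitted here.
    public String getFilename() { return filename; }
    public int getStart() { return start; }
    public int getEnd() { return end; }
}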

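Best answer

Unfortunately the OP did not provide a sample document to reproduce the issue, so I have to guess.

I assume the problem is caused by objects that are not linked directly to the page object but are inherited from its parents.

In that case PDDocument.addPage is the wrong choice, because this method only adds the given page object to the target document's page tree without taking inherited items into account.

Instead, one should use PDDocument.importPage, which is documented as: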
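To make the inheritance issue concrete, here is a minimal sketch in plain Java. It does not use PDFBox: `Node`, `shallowAdd`, `effectiveMediaBox`, and the `DEFAULT_BOX` fallback are hypothetical names for a toy model of a page tree, in which an attribute missing on a page is looked up on its ancestors. It only illustrates the assumed failure mode; it is not how PDFBox is implemented.

```java
import java.util.HashMap;
import java.util.Map;

public class PageTreeInheritanceDemo {

    /** Toy stand-in for a PDF page-tree node; NOT a PDFBox class. */
    static final class Node {
        final Node parent;
        final Map<String, String> entries = new HashMap<>();

        Node(Node parent) { this.parent = parent; }

        /** Looks up a key on this node, then on each ancestor in turn. */
        String resolve(String key) {
            for (Node n = this; n != null; n = n.parent) {
                String value = n.entries.get(key);
                if (value != null) return value;
            }
            return null;
        }
    }

    /** Assumed fallback of this toy model when no MediaBox is found anywhere. */
    static final String DEFAULT_BOX = "[0 0 612 792]";

    static String effectiveMediaBox(Node page) {
        String box = page.resolve("MediaBox");
        return box != null ? box : DEFAULT_BOX;
    }

    /** Mimics a shallow add: only the page's own entries travel to the new parent. */
    static Node shallowAdd(Node page, Node newParent) {
        Node copy = new Node(newParent);
        copy.entries.putAll(page.entries);
        return copy;
    }

    public static void main(String[] args) {
        Node sourceRoot = new Node(null);
        sourceRoot.entries.put("MediaBox", "[0 0 842 1191]"); // set on the parent only
        Node page = new Node(sourceRoot);                       // page has no MediaBox of its own

        Node targetRoot = new Node(null);                       // fresh, empty target tree
        Node added = shallowAdd(page, targetRoot);

        System.out.println("in source: " + effectiveMediaBox(page));
        System.out.println("in target: " + effectiveMediaBox(added));
    }
}
```

In the source tree the page resolves the large box from its parent. After the shallow add it falls back to the smaller default, which is the kind of size change that would crop content.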

/**
* This will import and copy the contents from another location. Currently the content stream is stored in a scratch
* file. The scratch file is associated with the document. If you are adding a page to this document from another
* document and want to copy the contents to this document's scratch file then use this method otherwise just use
* the {@link #addPage} method.
*
* Unlike {@link #addPage}, this method does a deep copy. If your page has annotations, and if
* these link to pages not in the target document, then the target document might become huge.
* What you need to do is to delete page references of such annotations. See
* <a href="http://stackoverflow.com/a/35477351/535646">here</a> for how to do this.
*
* @param page The page to import.
* @return The page that was imported.
*
* @throws IOException If there is an error copying the page.
*/
public PDPage importPage(PDPage page) throws IOException

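Actually, even this method may not suffice as is, because it does not consider all inherited attributes. Looking at the Splitter utility class, though, gives an impression of what one has to do: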

PDPage imported = getDestinationDocument().importPage(page);
imported.setCropBox(page.getCropBox());
imported.setMediaBox(page.getMediaBox());
// only the resources of the page will be copied
imported.setResources(page.getResources());
imported.setRotation(page.getRotation());
// remove page links to avoid copying not needed resources
processAnnotations(imported);

using the helper method

private void processAnnotations(PDPage imported) throws IOException
{
    List<PDAnnotation> annotations = imported.getAnnotations();
    for (PDAnnotation annotation : annotations)
    {
        if (annotation instanceof PDAnnotationLink)
        {
            PDAnnotationLink link = (PDAnnotationLink)annotation;
            PDDestination destination = link.getDestination();
            if (destination == null && link.getAction() != null)
            {
                PDAction action = link.getAction();
                if (action instanceof PDActionGoTo)
                {
                    destination = ((PDActionGoTo)action).getDestination();
                }
            }
            if (destination instanceof PDPageDestination)
            {
                // TODO preserve links to pages within the splitted result
                ((PDPageDestination) destination).setPage(null);
            }
        }
        // TODO preserve links to pages within the splitted result
        annotation.setPage(null);
    }
}
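The four setter calls in the Splitter excerpt above (crop box, media box, resources, rotation) correspond to the four page attributes the PDF specification treats as inheritable. The following plain-Java sketch shows that idea in a toy model: resolve each inheritable attribute through the source tree and write it directly onto the copy. `Node`, `INHERITABLE`, and `copyWithInherited` are hypothetical names for illustration and are not part of PDFBox.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MaterializeInheritedDemo {

    /** Toy stand-in for a page-tree node; NOT a PDFBox class. */
    static final class Node {
        final Node parent;
        final Map<String, String> entries = new LinkedHashMap<>();

        Node(Node parent) { this.parent = parent; }

        /** Looks up a key on this node, then on each ancestor in turn. */
        String resolve(String key) {
            for (Node n = this; n != null; n = n.parent) {
                String value = n.entries.get(key);
                if (value != null) return value;
            }
            return null;
        }
    }

    /** The page attributes that may be inherited from ancestor nodes. */
    static final List<String> INHERITABLE =
            Arrays.asList("Resources", "MediaBox", "CropBox", "Rotate");

    /** Copies the page's own entries, then pins every resolved inheritable value on the copy. */
    static Node copyWithInherited(Node page, Node newParent) {
        Node copy = new Node(newParent);
        copy.entries.putAll(page.entries);
        for (String key : INHERITABLE) {
            String value = page.resolve(key);
            if (value != null) {
                copy.entries.put(key, value);
            }
        }
        return copy;
    }

    public static void main(String[] args) {
        Node sourceRoot = new Node(null);
        sourceRoot.entries.put("MediaBox", "[0 0 842 1191]"); // inherited by the page
        sourceRoot.entries.put("Rotate", "90");               // inherited by the page
        Node page = new Node(sourceRoot);
        page.entries.put("CropBox", "[10 10 832 1181]");      // the page's own entry

        Node copy = copyWithInherited(page, new Node(null));
        System.out.println("MediaBox on copy: " + copy.entries.get("MediaBox"));
        System.out.println("Rotate on copy:   " + copy.entries.get("Rotate"));
        System.out.println("CropBox on copy:  " + copy.entries.get("CropBox"));
    }
}
```

Because the values now sit on the copied page itself, they no longer depend on the parent they were detached from.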

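Since you are trying to split a single PDF into several (for example a 10-page document into 10 single-page documents), you may want to use this Splitter utility class as is.

Tests

To test these methods I used the PDF Clown sample output AnnotationSample.Standard.pdf, because that library relies heavily on inheritance of page tree values. I copied the content of its only page to a new document using PDDocument.addPage, PDDocument.importPage, or Splitter, respectively, like this: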

PDDocument source = PDDocument.load(resource);
PDDocument output = new PDDocument();
PDPage page = source.getPages().get(0);
output.addPage(page);
output.save(new File(RESULT_FOLDER, "PageAddedFromAnnotationSample.Standard.pdf"));
output.close();

(CopyPages.java, test testWithAddPage)

PDDocument source = PDDocument.load(resource);
PDDocument output = new PDDocument();
PDPage page = source.getPages().get(0);
output.importPage(page);
output.save(new File(RESULT_FOLDER, "PageImportedFromAnnotationSample.Standard.pdf"));
output.close();

(CopyPages.java, test testWithImportPage)

PDDocument source = PDDocument.load(resource);
Splitter splitter = new Splitter();
List<PDDocument> results = splitter.split(source);
Assert.assertEquals("Expected exactly one result document from splitting a single page document.", 1, results.size());
PDDocument output = results.get(0);
output.save(new File(RESULT_FOLDER, "PageSplitFromAnnotationSample.Standard.pdf"));
output.close();

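(CopyPages.java, test testWithSplitter)

Only the final test copied the page faithfully.

Regarding "java - Pages are cropped in the new document when copying with PDFBox", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37526904/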
