
c# - How to find all occurrences of a specific text in a PDF and insert a page break above?

Reposted. Author: 行者123. Updated: 2023-11-30 23:09:27

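I have a tricky requirement with PDFs.

I need to search my PDF for a specific string - Property Number:

Each time this is found, I need to add a page break above it.

I have access to iText and Spire.PDF; I am looking at iText first.

From other posts here I have established that I need to use a PdfStamper.

The logic below adds a new page, and that works.

However, in my case I need just a page break rather than a blank page.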

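using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

var newFile = @"c:\temp\full.pdf";
var dest = @"c:\temp\dest.pdf";
var reader = new PdfReader(newFile);
if (File.Exists(dest))
{
    File.Delete(dest);
}

var stamper = new PdfStamper(reader, new FileStream(dest, FileMode.CreateNew));
// Iterate from the last position down so that each insert leaves the
// page numbers still to be processed unchanged.
var total = reader.NumberOfPages + 1;
for (var pageNumber = total; pageNumber > 0; pageNumber--)
{
    // The content is read here but not used yet.
    var pageContent = reader.GetPageContent(pageNumber);
    // Inserts a blank A4 page at this position (this is the unwanted blank page).
    stamper.InsertPage(pageNumber, PageSize.A4);
}

stamper.Close();
reader.Close();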

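In the example image, the single existing page should actually become 3 pages: a new page break is inserted above the first occurrence of Property Number,

and another page break is needed before the second occurrence.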


Best answer

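This answer shares a proof of concept, using iText and Java, for finding all occurrences of a specific text in a PDF and inserting a page break above each of them. Porting it to iTextSharp and C# should not be too difficult.

For production use, some additional code is required, because the code currently makes a few assumptions; for example, it assumes non-rotated pages. It also does not handle annotations at all.

The task is actually a combination of two tasks, finding text and inserting page breaks, so we need

  1. an extraction strategy for the locations of custom text, and
  2. a tool to cut pages.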

SearchTextLocationExtractionStrategy

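To extract the locations of custom text, we extend the iText LocationTextExtractionStrategy so that it can also report the locations of custom text strings, or more precisely of the matches of a regular expression: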

// Imports assume iText 5.5.x, where the parser uses iText's own com.itextpdf.awt.geom classes.
// TextChunk, TextChunkFilter, TextChunkLocation etc. are nested types inherited from the base class.
import java.lang.reflect.Field;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.itextpdf.awt.geom.Rectangle2D;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class SearchTextLocationExtractionStrategy extends LocationTextExtractionStrategy {
    public SearchTextLocationExtractionStrategy(Pattern pattern) {
        super(new TextChunkLocationStrategy() {
            public TextChunkLocation createLocation(TextRenderInfo renderInfo, LineSegment baseline) {
                // while baseLine has been changed to not neutralize
                // effects of rise, ascentLine and descentLine explicitly
                // have not: We want the actual positions.
                return new AscentDescentTextChunkLocation(baseline, renderInfo.getAscentLine(),
                        renderInfo.getDescentLine(), renderInfo.getSingleSpaceWidth());
            }
        });
        this.pattern = pattern;
    }

    // Reflective handles to private and package-private members of the base class
    static Field locationalResultField = null;
    static Method filterTextChunksMethod = null;
    static Method startsWithSpaceMethod = null;
    static Method endsWithSpaceMethod = null;
    static Field textChunkTextField = null;
    static Method textChunkSameLineMethod = null;
    static {
        try {
            locationalResultField = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
            locationalResultField.setAccessible(true);
            filterTextChunksMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("filterTextChunks",
                    List.class, TextChunkFilter.class);
            filterTextChunksMethod.setAccessible(true);
            startsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("startsWithSpace",
                    String.class);
            startsWithSpaceMethod.setAccessible(true);
            endsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("endsWithSpace", String.class);
            endsWithSpaceMethod.setAccessible(true);
            textChunkTextField = TextChunk.class.getDeclaredField("text");
            textChunkTextField.setAccessible(true);
            textChunkSameLineMethod = TextChunk.class.getDeclaredMethod("sameLine", TextChunk.class);
            textChunkSameLineMethod.setAccessible(true);
        } catch (NoSuchFieldException | SecurityException | NoSuchMethodException e) {
            // Proof of concept: only logged. A lookup failure here leaves the handles null.
            e.printStackTrace();
        }
    }

    @SuppressWarnings("unchecked")
    public Collection<TextRectangle> getLocations(TextChunkFilter chunkFilter) {
        Collection<TextRectangle> result = new ArrayList<>();
        try {
            List<TextChunk> filteredTextChunks = (List<TextChunk>) filterTextChunksMethod.invoke(this,
                    locationalResultField.get(this), chunkFilter);
            Collections.sort(filteredTextChunks);

            // sb holds the text of the current line; locations holds one entry per character of sb
            StringBuilder sb = new StringBuilder();
            List<AscentDescentTextChunkLocation> locations = new ArrayList<>();
            TextChunk lastChunk = null;
            for (TextChunk chunk : filteredTextChunks) {
                String chunkText = (String) textChunkTextField.get(chunk);
                if (lastChunk == null) {
                    // first chunk: nothing to compare with yet
                } else if ((boolean) textChunkSameLineMethod.invoke(chunk, lastChunk)) {
                    // we only insert a blank space if the trailing character of the previous string
                    // wasn't a space, and the leading character of the current string isn't a space
                    String lastChunkText = (String) textChunkTextField.get(lastChunk);
                    if (isChunkAtWordBoundary(chunk, lastChunk)
                            && !((boolean) startsWithSpaceMethod.invoke(this, chunkText))
                            && !((boolean) endsWithSpaceMethod.invoke(this, lastChunkText))) {
                        sb.append(' ');
                        LineSegment spaceBaseLine = new LineSegment(lastChunk.getEndLocation(),
                                chunk.getStartLocation());
                        locations.add(new AscentDescentTextChunkLocation(spaceBaseLine, spaceBaseLine, spaceBaseLine,
                                chunk.getCharSpaceWidth()));
                    }
                } else {
                    // a new line starts: search the completed line, then reset the buffers
                    collectMatches(sb, locations, result);
                    sb.setLength(0);
                    locations.clear();
                }
                sb.append(chunkText);
                locations.add((AscentDescentTextChunkLocation) chunk.getLocation());
                lastChunk = chunk;
            }
            // the last line is not followed by a line change, so search it here
            collectMatches(sb, locations, result);
        } catch (IllegalAccessException | IllegalArgumentException | InvocationTargetException e) {
            // Proof of concept: only logged; the matches collected so far are returned.
            e.printStackTrace();
        }
        return result;
    }

    // Adds one TextRectangle per pattern match in the given line of text.
    private void collectMatches(CharSequence text, List<AscentDescentTextChunkLocation> locations,
            Collection<TextRectangle> result) {
        assert text.length() == locations.size();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            if (matcher.start() == matcher.end())
                continue; // an empty match has no characters and hence no location
            int i = matcher.start();
            Vector baseStart = locations.get(i).getStartLocation();
            TextRectangle textRectangle = new TextRectangle(matcher.group(), baseStart.get(Vector.I1),
                    baseStart.get(Vector.I2));
            for (; i < matcher.end(); i++) {
                AscentDescentTextChunkLocation location = locations.get(i);
                textRectangle.add(location.getAscentLine().getBoundingRectange());
                textRectangle.add(location.getDescentLine().getBoundingRectange());
            }

            result.add(textRectangle);
        }
    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        // feed the base class one character at a time so that each character gets its own location
        for (TextRenderInfo info : renderInfo.getCharacterRenderInfos())
            super.renderText(info);
    }

    public static class AscentDescentTextChunkLocation extends TextChunkLocationDefaultImp {
        public AscentDescentTextChunkLocation(LineSegment baseLine, LineSegment ascentLine, LineSegment descentLine,
                float charSpaceWidth) {
            super(baseLine.getStartPoint(), baseLine.getEndPoint(), charSpaceWidth);
            this.ascentLine = ascentLine;
            this.descentLine = descentLine;
        }

        public LineSegment getAscentLine() {
            return ascentLine;
        }

        public LineSegment getDescentLine() {
            return descentLine;
        }

        final LineSegment ascentLine;
        final LineSegment descentLine;
    }

    public class TextRectangle extends Rectangle2D.Float {
        public TextRectangle(final String text, final float xStart, final float yStart) {
            super(xStart, yStart, 0, 0);
            this.text = text;
        }

        public String getText() {
            return text;
        }

        final String text;
    }

    final Pattern pattern;
}

( SearchTextLocationExtractionStrategy.java )
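
The core idea of the strategy, keeping one location per character so that a regex match can be turned into a rectangle, can be sketched without iText. The following is only an illustration under simplifying assumptions: `Glyph`, `layout` and `locate` are hypothetical names, glyphs have a fixed width, and `java.awt.geom.Rectangle2D` stands in for the rectangle type used in the class above.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the strategy's per-character bookkeeping; not iText API.
class MatchBoundsSketch {

    /** One character of a text line together with its box on the page. */
    static final class Glyph {
        final char ch;
        final Rectangle2D.Float box;

        Glyph(char ch, float x, float y, float width, float height) {
            this.ch = ch;
            this.box = new Rectangle2D.Float(x, y, width, height);
        }
    }

    /** Returns one bounding box per non-empty regex match in the line. */
    static List<Rectangle2D.Float> locate(List<Glyph> line, Pattern pattern) {
        StringBuilder text = new StringBuilder();
        for (Glyph glyph : line) {
            text.append(glyph.ch); // index i in text corresponds to line.get(i)
        }

        List<Rectangle2D.Float> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            if (matcher.start() == matcher.end()) {
                continue; // an empty match covers no glyph
            }
            Rectangle2D.Float bounds = new Rectangle2D.Float();
            bounds.setRect(line.get(matcher.start()).box);
            for (int i = matcher.start() + 1; i < matcher.end(); i++) {
                bounds.add(line.get(i).box); // grow to include each matched glyph
            }
            result.add(bounds);
        }
        return result;
    }

    /** Demo helper: lays out a string as fixed-width glyphs on one line. */
    static List<Glyph> layout(String s, float x, float y, float glyphWidth, float glyphHeight) {
        List<Glyph> line = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            line.add(new Glyph(s.charAt(i), x + i * glyphWidth, y, glyphWidth, glyphHeight));
        }
        return line;
    }

    public static void main(String[] args) {
        List<Glyph> line = layout("Ref: Property Number: 42", 36f, 700f, 6f, 10f);
        for (Rectangle2D.Float r : locate(line, Pattern.compile("Property Number"))) {
            System.out.println(r.getMinX() + " .. " + r.getMaxX() + ", top " + r.getMaxY());
        }
    }
}
```

The top edge of each rectangle (`getMaxY()`) is the value the splitting step later uses as the cut position.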

Because some required members of the base class are private or package-private, we have to use reflection to access them.
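
As a reminder of how this kind of reflective access works in general, here is a minimal sketch that is independent of iText. `Base` and `Derived` are hypothetical classes made up for the illustration; only the `java.lang.reflect` calls are real API.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

class ReflectionAccessSketch {

    /** Hypothetical library class whose useful members are not public. */
    static class Base {
        private final StringBuilder collected = new StringBuilder("abc ");

        private boolean endsWithSpace(String s) {
            return !s.isEmpty() && s.charAt(s.length() - 1) == ' ';
        }
    }

    static class Derived extends Base {
        // Look the members up once and cache the handles.
        private static final Field COLLECTED;
        private static final Method ENDS_WITH_SPACE;
        static {
            try {
                COLLECTED = Base.class.getDeclaredField("collected");
                COLLECTED.setAccessible(true);
                ENDS_WITH_SPACE = Base.class.getDeclaredMethod("endsWithSpace", String.class);
                ENDS_WITH_SPACE.setAccessible(true);
            } catch (ReflectiveOperationException e) {
                // Fail fast instead of leaving null handles behind.
                throw new ExceptionInInitializerError(e);
            }
        }

        boolean collectedEndsWithSpace() {
            try {
                String text = COLLECTED.get(this).toString();
                return (boolean) ENDS_WITH_SPACE.invoke(this, text);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(new Derived().collectedEndsWithSpace());
    }
}
```

Code like this is tied to the internals of one library version, so the member names should be re-checked whenever the library is upgraded.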

AbstractPdfPageSplittingTool

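The page splitting functionality of this tool has been extracted from the PdfVeryDenseMergeTool from this answer. In addition, the tool is abstract so that the positions of the page breaks can be customized.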

import java.io.IOException;
import java.io.OutputStream;

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;

public abstract class AbstractPdfPageSplittingTool {
    public AbstractPdfPageSplittingTool(Rectangle size, float top) {
        this.pageSize = size;
        this.topMargin = top;
    }

    public void split(OutputStream outputStream, PdfReader... inputs) throws DocumentException, IOException {
        try {
            openDocument(outputStream);
            for (PdfReader reader : inputs) {
                split(reader);
            }
        } finally {
            closeDocument();
        }
    }

    void openDocument(OutputStream outputStream) throws DocumentException {
        final Document document = new Document(pageSize, 36, 36, topMargin, 36);
        final PdfWriter writer = PdfWriter.getInstance(document, outputStream);
        document.open();
        this.document = document;
        this.writer = writer;
        newPage();
    }

    void closeDocument() {
        try {
            document.close();
        } finally {
            this.document = null;
            this.writer = null;
            this.yPosition = 0;
        }
    }

    void newPage() {
        document.newPage();
        yPosition = pageSize.getTop(topMargin);
    }

    void split(PdfReader reader) throws IOException {
        for (int page = 1; page <= reader.getNumberOfPages(); page++) {
            split(reader, page);
        }
    }

    void split(PdfReader reader, int page) throws IOException {
        PdfImportedPage importedPage = writer.getImportedPage(reader, page);
        PdfContentByte directContent = writer.getDirectContent();
        // the first slice of a source page starts at the very top of the target page
        yPosition = pageSize.getTop();

        Rectangle pageSizeToImport = reader.getPageSize(page);
        // split positions as y coordinates in descending order, from page top to page bottom
        float[] borderPositions = determineSplitPositions(reader, page);
        if (borderPositions == null || borderPositions.length < 2)
            return;

        for (int borderIndex = 0; borderIndex + 1 < borderPositions.length; borderIndex++) {
            float height = borderPositions[borderIndex] - borderPositions[borderIndex + 1];
            if (height <= 0)
                continue;

            // clip to the slice, then draw the whole source page shifted so the slice lands in the clip area
            directContent.saveState();
            directContent.rectangle(0, yPosition - height, pageSizeToImport.getWidth(), height);
            directContent.clip();
            directContent.newPath();

            directContent.addTemplate(importedPage, 0,
                    yPosition - (borderPositions[borderIndex] - pageSizeToImport.getBottom()));

            directContent.restoreState();
            newPage();
        }
    }

    protected abstract float[] determineSplitPositions(PdfReader reader, int page);

    Document document = null;
    PdfWriter writer = null;
    float yPosition = 0;

    final Rectangle pageSize;
    final float topMargin;
}

( AbstractPdfPageSplittingTool.java )
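
The arithmetic inside the slicing loop may be easier to follow in isolation. The sketch below is a simplified, iText-free illustration: `Slice` and `plan` are hypothetical names, and it assumes a single fixed `targetTop` for every slice, whereas the tool above places the first slice of each source page at the very top and later slices below the top margin.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper mirroring the loop in split(PdfReader, int); not part of iText.
class SliceGeometrySketch {

    /** Where one slice of the source page ends up on its own target page. */
    static final class Slice {
        final float clipBottom;      // lower edge of the clip rectangle on the target page
        final float height;          // height of the clip rectangle
        final float templateOffsetY; // vertical shift applied to the whole source page

        Slice(float clipBottom, float height, float templateOffsetY) {
            this.clipBottom = clipBottom;
            this.height = height;
            this.templateOffsetY = templateOffsetY;
        }
    }

    /**
     * @param borders      split positions in descending order, page top first, page bottom last
     * @param sourceBottom bottom coordinate of the source page
     * @param targetTop    y coordinate on the target page where each slice should start
     */
    static List<Slice> plan(float[] borders, float sourceBottom, float targetTop) {
        List<Slice> slices = new ArrayList<>();
        for (int i = 0; i + 1 < borders.length; i++) {
            float height = borders[i] - borders[i + 1];
            if (height <= 0) {
                continue; // empty slice, e.g. a match exactly at the page top
            }
            float clipBottom = targetTop - height;
            float offsetY = targetTop - (borders[i] - sourceBottom);
            slices.add(new Slice(clipBottom, height, offsetY));
        }
        return slices;
    }

    public static void main(String[] args) {
        float[] borders = { 842f, 600f, 300f, 0f };
        for (Slice s : plan(borders, 0f, 806f)) {
            System.out.println("clip from " + s.clipBottom + " height " + s.height
                    + ", draw page at y offset " + s.templateOffsetY);
        }
    }
}
```

Each slice becomes one output page, so n split positions inside a page yield up to n + 1 pages.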

Combined use

To implement the OP's task:

I need to search my pdf for a specific string - Property Number:

Each time this is found, I need to add a page break ABOVE

one can use the classes above like this:

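// Imports for this fragment (iText 5.5.x). TextRectangle is the nested class of
// SearchTextLocationExtractionStrategy; its import depends on the package you put that class in.
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

import com.itextpdf.text.PageSize;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;

AbstractPdfPageSplittingTool tool = new AbstractPdfPageSplittingTool(PageSize.A4, 36) {
    @Override
    protected float[] determineSplitPositions(PdfReader reader, int page) {
        Collection<TextRectangle> locations = Collections.emptyList();
        try {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            SearchTextLocationExtractionStrategy strategy = new SearchTextLocationExtractionStrategy(
                    Pattern.compile("Property Number"));
            parser.processContent(page, strategy, Collections.emptyMap()).getResultantText();
            locations = strategy.getLocations(null);
        } catch (IOException e) {
            // Proof of concept: only logged; the page is then left unsplit.
            e.printStackTrace();
        }

        // cut at the top edge of each match, i.e. directly above the matched text
        List<Float> borders = new ArrayList<>();
        for (TextRectangle rectangle : locations) {
            borders.add((float) rectangle.getMaxY());
        }

        // add the page top and bottom, then order everything from top to bottom
        Rectangle pageSize = reader.getPageSize(page);
        borders.add(pageSize.getTop());
        borders.add(pageSize.getBottom());
        Collections.sort(borders, Collections.reverseOrder());

        float[] result = new float[borders.size()];
        for (int i = 0; i < result.length; i++)
            result[i] = borders.get(i);
        return result;
    }
};

// RESULT and SOURCE are the output and input PDF paths
tool.split(new FileOutputStream(RESULT), new PdfReader(SOURCE));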

(SplitPages.java test method testSplitDocumentAboveAngestellter)
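
The border computation in `determineSplitPositions` can also be tried out without a PDF. The following sketch is a variation rather than the code above: `splitPositions` and its `padding` parameter are hypothetical additions, where `padding` moves each cut slightly above the match and the result is clamped to the page top.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; the padding parameter is an assumption, not part of the answer's code.
class SplitPositionsSketch {

    /**
     * @param matchTops  top edge (max y) of each text match on the page
     * @param pageTop    top coordinate of the page
     * @param pageBottom bottom coordinate of the page
     * @param padding    extra space to keep above each match
     * @return split positions in descending order, starting with pageTop and ending with pageBottom
     */
    static float[] splitPositions(List<Float> matchTops, float pageTop, float pageBottom, float padding) {
        List<Float> borders = new ArrayList<>();
        for (float top : matchTops) {
            borders.add(Math.min(top + padding, pageTop)); // never cut above the page
        }
        borders.add(pageTop);
        borders.add(pageBottom);
        borders.sort(Collections.reverseOrder());

        float[] result = new float[borders.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = borders.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        float[] positions = splitPositions(Arrays.asList(500f, 700f), 842f, 0f, 2f);
        System.out.println(Arrays.toString(positions));
    }
}
```

A match at the very top of the page produces two equal borders; the splitting tool skips the resulting zero-height slice, so no empty page is emitted for it.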

Regarding "c# - How to find all occurrences of a specific text in a PDF and insert a page break above?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45823741/
