c# - Getting the coordinates of a string using ITextExtractionStrategy and LocationTextExtractionStrategy in iTextSharp

Reposted by: 行者123. Updated: 2023-12-04 09:57:16

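I have a PDF file that I am reading into a string using ITextExtractionStrategy. I then take a substring from that string, for example My name is XYZ, and I need the rectangular coordinates of that substring in the PDF file, but I have not been able to get them. From searching I learned about LocationTextExtractionStrategy, but I do not know how to use it to get the coordinates.

Here is the code: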

ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);

string getcoordinate="My name is XYZ";

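How can I get the rectangular coordinates of this substring using iTextSharp?

Please help.

Best answer

Here is a very, very simple version of an implementation.

Before implementing it, it is very important to know that PDFs have zero concept of "words", "paragraphs", "sentences", and so on. Also, text within a PDF is not necessarily laid out left to right and top to bottom, and this has nothing to do with non-LTR languages. The phrase "Hello World" could be written into a PDF as: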

Draw H at (10, 10)
Draw ell at (20, 10)
Draw rld at (90, 10)
Draw o Wo at (50, 20)

It could also be written as:

Draw Hello World at (10,10)

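The ITextExtractionStrategy interface that you need to implement has a method called RenderText that is called once for every chunk of text within a PDF. Notice I said "chunk" and not "word". In the first example above, the method would be called four times for those two words. In the second example, it would be called once for those two words. This is the very important part to understand. PDFs do not have words, and because of this iTextSharp does not have words either. The "word" part is 100% up to you to solve.

Also along these lines, as I said above, PDFs do not have paragraphs. The reason to be aware of this is that PDFs cannot wrap text to a new line. Any time you see something that looks like a paragraph return, you are actually seeing a brand new text drawing command with a different y coordinate from the previous line. The original answer links to a further discussion of this point.

The code below is a very simple implementation. For it I am subclassing LocationTextExtractionStrategy, which already implements ITextExtractionStrategy. On each call to RenderText() I find the rectangle of the current chunk (using Mark's code, linked in the original answer) and store it for later. I am using this simple helper class to store those chunks and rectangles: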
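To make the "chunks, not words" point concrete, here is a minimal sketch of how reading order could be rebuilt from chunks that arrive in arbitrary order. It does not use iTextSharp at all: the `Chunk` class, the `ReadingOrder.ToLines` method, and the `yTolerance` parameter are hypothetical names made up for this illustration, and the coordinates are invented sample data. It assumes that chunks whose baselines are within a couple of units of each other belong to the same line.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for what one RenderText call reports:
// the chunk's text and the start point of its baseline.
public class Chunk {
    public string Text;
    public float X;
    public float Y;
    public Chunk(string text, float x, float y) {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class ReadingOrder {
    // Groups chunks into lines by y coordinate, then orders each line by x.
    // Assumption: chunks within yTolerance of the first chunk of a line share that line.
    public static List<string> ToLines(IEnumerable<Chunk> chunks, float yTolerance = 2f) {
        var lines = new List<string>();
        var current = new List<Chunk>();

        // PDF y coordinates grow upward, so the top line has the largest y
        foreach (var chunk in chunks.OrderByDescending(k => k.Y)) {
            if (current.Count > 0 && Math.Abs(current[0].Y - chunk.Y) > yTolerance) {
                lines.Add(Join(current));
                current.Clear();
            }
            current.Add(chunk);
        }
        if (current.Count > 0) {
            lines.Add(Join(current));
        }
        return lines;
    }

    // Concatenates the chunks of one line from left to right
    private static string Join(List<Chunk> line) {
        return string.Concat(line.OrderBy(k => k.X).Select(k => k.Text));
    }

    public static void Main() {
        var chunks = new List<Chunk> {
            new Chunk("H", 10, 700),
            new Chunk("ell", 20, 700),
            new Chunk("rld", 90, 700),
            new Chunk("o Wo", 50, 700),
            new Chunk("Second line", 10, 680)
        };
        foreach (var line in ToLines(chunks)) {
            Console.WriteLine(line); // prints "Hello World", then "Second line"
        }
    }
}
```

This only restores order. It does not insert spaces between chunks or handle rotated text, which is part of the work that stays with you.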

//Helper class that stores our rectangle and text
public class RectAndText {
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public RectAndText(iTextSharp.text.Rectangle rect, String text) {
        this.Rect = rect;
        this.Text = text;
    }
}

Here is the subclass:

using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
    }
}

Finally, here is how to use the above:

//Our test file
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");

//Create our test file, nothing special
using (var fs = new FileStream(testFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
    using (var doc = new Document()) {
        using (var writer = PdfWriter.GetInstance(doc, fs)) {
            doc.Open();

            doc.Add(new Paragraph("This is my sample file"));

            doc.Close();
        }
    }
}

//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();

//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
    var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}

//Loop through each chunk found
foreach (var p in t.myPoints) {
    Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}

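I cannot stress enough that the above does not take "words" into account; that part is up to you. The TextRenderInfo object passed to RenderText has a method called GetCharacterRenderInfos() that you may be able to use to get more information. You might also want to use GetBaseline() instead of GetDescentLine() if you do not care about descenders in the font.

Edit

(I had a really great lunch, so I am feeling a little more helpful.)

Here is an updated version of MyLocationTextExtractionStrategy that does what I described in the comments below: it takes a string to search for and searches each chunk for that string. For all the reasons listed above, this will not work in some/many/most/all cases. If the substring occurs multiple times in a single chunk, it will also return only the first instance. Ligatures and diacritics could also throw this off.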
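As one possible way to use GetCharacterRenderInfos() and GetBaseline() together, the sketch below splits each chunk into whitespace-delimited runs and stores one rectangle per run. The class name `WordLocationStrategy` and its `Words` list are made up for this illustration; `RectAndText` is the helper class shown earlier in this answer. It assumes iTextSharp 5.x, and it only looks inside a single chunk, so a word that the PDF draws in several pieces still comes out in several pieces.

```csharp
using System.Collections.Generic;
using System.Text;
using iTextSharp.text.pdf.parser;

// Sketch only: records a rectangle for every run of non-whitespace
// characters inside each chunk. Runs never span two chunks.
public class WordLocationStrategy : LocationTextExtractionStrategy {
    //Hold each run and its rectangle (RectAndText is the helper class from above)
    public List<RectAndText> Words = new List<RectAndText>();

    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        var run = new List<TextRenderInfo>();
        foreach (var ch in renderInfo.GetCharacterRenderInfos()) {
            if (string.IsNullOrWhiteSpace(ch.GetText())) {
                //Whitespace ends the current run
                Flush(run);
            } else {
                run.Add(ch);
            }
        }
        //Store whatever is left at the end of the chunk
        Flush(run);
    }

    private void Flush(List<TextRenderInfo> run) {
        if (run.Count == 0) {
            return;
        }

        var text = new StringBuilder();
        foreach (var ch in run) {
            text.Append(ch.GetText());
        }

        //Baseline instead of descent line, so the box ignores descenders
        var bottomLeft = run[0].GetBaseline().GetStartPoint();
        var topRight = run[run.Count - 1].GetAscentLine().GetEndPoint();

        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );

        this.Words.Add(new RectAndText(rect, text.ToString()));
        run.Clear();
    }
}
```

It would be passed to PdfTextExtractor.GetTextFromPage() the same way as the strategy above, after which `Words` holds one entry per run.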

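using System.Collections.Generic;
using System.Linq; //Skip, Take, First and Last below are LINQ extension methods
using iTextSharp.text.pdf.parser;

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {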
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //The string that we're searching for
    public String TextToSearchFor { get; set; }

    //How to compare strings
    public System.Globalization.CompareOptions CompareOptions { get; set; }

    public MyLocationTextExtractionStrategy(String textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None) {
        this.TextToSearchFor = textToSearchFor;
        this.CompareOptions = compareOptions;
    }

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //See if the current chunk contains the text
        var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions);

        //If not found bail
        if (startPosition < 0) {
            return;
        }

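        //Grab the individual characters
        var chars = renderInfo.GetCharacterRenderInfos().Skip(startPosition).Take(this.TextToSearchFor.Length).ToList();

        //Nothing to measure if the slice is empty (for example, an empty search string)
        if (chars.Count == 0) {
            return;
        }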

        //Grab the first and last character
        var firstChar = chars.First();
        var lastChar = chars.Last();


        //Get the bounding box for the chunk of text
        var bottomLeft = firstChar.GetDescentLine().GetStartPoint();
        var topRight = lastChar.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor));
    }
}

You would use this the same way as before, but now the constructor has a single required parameter:

var t = new MyLocationTextExtractionStrategy("sample");
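The updated strategy stops at the first match inside a chunk. As a rough sketch of how that limitation could be lifted, the helper below collects the start index of every non-overlapping match in a chunk's text, using the same CompareInfo.IndexOf approach. The `ChunkSearch` class and `FindAll` method are hypothetical names for this illustration, and it assumes, as the code above also does, that a match is exactly as long as the search string.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class ChunkSearch {
    // Returns the start index of every non-overlapping match of needle in chunkText
    public static List<int> FindAll(string chunkText, string needle, CompareOptions options = CompareOptions.None) {
        var hits = new List<int>();
        if (string.IsNullOrEmpty(chunkText) || string.IsNullOrEmpty(needle)) {
            return hits;
        }

        var compare = CultureInfo.CurrentCulture.CompareInfo;
        int start = 0;
        while (start < chunkText.Length) {
            int index = compare.IndexOf(chunkText, needle, start, options);
            if (index < 0) {
                break;
            }
            hits.Add(index);

            // Assumption: the match has the same length as the needle
            start = index + needle.Length;
        }
        return hits;
    }

    public static void Main() {
        var hits = FindAll("sample, Sample, sample", "sample", CompareOptions.IgnoreCase);
        foreach (var index in hits) {
            Console.WriteLine(index); // prints 0, then 8, then 16
        }
    }
}
```

Inside RenderText, each returned index could take the place of startPosition, so that one rectangle is stored per match instead of only the first.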

Source: a similar question on Stack Overflow, https://stackoverflow.com/questions/44405214/
