.NET - Extracting text and text rectangle coordinates from a PDF file with iTextSharp

Reposted by: 行者123 Updated: 2023-12-04 13:08:56

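I am trying to get all the words in a PDF file together with their position coordinates. I managed this on .NET using the Acrobat API. Now I am trying to get the same result with a free API such as iTextSharp (the .NET version of iText). I can get the text (line by line) using PRTokeniser, but I don't know how to get the coordinates of a line, let alone those of each word.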

Best answer

My account is too new to vote up Mark Storer's answer.

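I wasn't able to use LocationTextExtractionStrategy directly (I assume I must have been doing something wrong). With LocationTextExtractionStrategy I could get the text, but I couldn't figure out how to get the coordinates of each string (or line of strings).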

I ended up subclassing LocationTextExtractionStrategy and exposing the data I wanted, since it does hold that data internally.

I also wanted to use it in .NET, so here is a sloppy C# version that I put together.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

using iTextSharp.text.pdf.parser;

namespace PdfHelper
{
    /// <summary>
    /// Taken from http://www.java-frameworks.com/java/itext/com/itextpdf/text/pdf/parser/LocationTextExtractionStrategy.java.html
    /// </summary>
    class LocationTextExtractionStrategyEx : LocationTextExtractionStrategy
    {
        private List<TextChunk> m_locationResult = new List<TextChunk>();
        private List<TextInfo> m_TextLocationInfo = new List<TextInfo>();
        public List<TextChunk> LocationResult 
        {
            get { return m_locationResult; }
        }
        public List<TextInfo> TextLocationInfo
        {
            get { return m_TextLocationInfo; }
        }

        /// <summary>
        /// Creates a new LocationTextExtractionStrategyEx
        /// </summary>
        public LocationTextExtractionStrategyEx()
        {
        }

        /// <summary>
        /// Returns the result so far
        /// </summary>
        /// <returns>a String with the resulting text</returns>
        public override String GetResultantText()
        {
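            // Sort chunks by orientation, then by line, then by position along the line.
            m_locationResult.Sort();
            // Rebuild the per-line info from scratch so that repeated calls don't add duplicate entries.
            m_TextLocationInfo.Clear();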

            StringBuilder sb = new StringBuilder();
            TextChunk lastChunk = null;
            TextInfo lastTextInfo = null;
            foreach (TextChunk chunk in m_locationResult)
            {
                if (lastChunk == null)
                {
                    sb.Append(chunk.Text);
                    lastTextInfo = new TextInfo(chunk);
                    m_TextLocationInfo.Add(lastTextInfo);
                }
                else
                {
                    if (chunk.sameLine(lastChunk))
                    {
                        float dist = chunk.distanceFromEndOf(lastChunk);

                        if (dist < -chunk.CharSpaceWidth)
                        {
                            sb.Append(' ');
                            lastTextInfo.addSpace();
                        }
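                        //append a space if the trailing char of the prev string wasn't a space && the 1st char of the current string isn't a space
                        //(StartsWith/EndsWith are used instead of indexing so that an empty chunk cannot throw)
                        else if (dist > chunk.CharSpaceWidth / 2.0f && !chunk.Text.StartsWith(" ", StringComparison.Ordinal) && !lastChunk.Text.EndsWith(" ", StringComparison.Ordinal))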
                        {
                            sb.Append(' ');
                            lastTextInfo.addSpace();
                        }
                        sb.Append(chunk.Text);
                        lastTextInfo.appendText(chunk);
                    }
                    else
                    {
                        sb.Append('\n');
                        sb.Append(chunk.Text);
                        lastTextInfo = new TextInfo(chunk);
                        m_TextLocationInfo.Add(lastTextInfo);
                    }
                }
                lastChunk = chunk;
            }
            return sb.ToString();
        }

        /// <summary>
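        /// Records each rendered chunk of text with its baseline, ascent line and descent line.
        /// </summary>
        /// <param name="renderInfo">render information supplied by the content parser</param>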
        public override void RenderText(TextRenderInfo renderInfo)
        {
            LineSegment segment = renderInfo.GetBaseline();
            TextChunk location = new TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), renderInfo.GetAscentLine(), renderInfo.GetDescentLine());
            m_locationResult.Add(location);
        }

        public class TextChunk : IComparable, ICloneable
        {
            string m_text;
            Vector m_startLocation;
            Vector m_endLocation;
            Vector m_orientationVector;
            int m_orientationMagnitude;
            int m_distPerpendicular;
            float m_distParallelStart;
            float m_distParallelEnd;
            float m_charSpaceWidth;

            public LineSegment AscentLine;
            public LineSegment DecentLine;

            public object Clone()
            {
                TextChunk copy = new TextChunk(m_text, m_startLocation, m_endLocation, m_charSpaceWidth, AscentLine, DecentLine);
                return copy;
            }

            public string Text
            {
                get { return m_text; }
                set { m_text = value; }
            }
            public float CharSpaceWidth
            {
                get { return m_charSpaceWidth; }
                set { m_charSpaceWidth = value; }
            }
            public Vector StartLocation
            {
                get { return m_startLocation; }
                set { m_startLocation = value; }
            }
            public Vector EndLocation
            {
                get { return m_endLocation; }
                set { m_endLocation = value; }
            }

            /// <summary>
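            /// Represents a chunk of text, its orientation, and location relative to the orientation vector
            /// </summary>
            /// <param name="txt">the text of the chunk</param>
            /// <param name="startLoc">start point of the chunk's baseline</param>
            /// <param name="endLoc">end point of the chunk's baseline</param>
            /// <param name="charSpaceWidth">width of a single space character in the chunk's font</param>
            /// <param name="ascentLine">line along the top of the chunk's glyphs</param>
            /// <param name="decentLine">line along the bottom of the chunk's glyphs (the descent line)</param>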
            public TextChunk(string txt, Vector startLoc, Vector endLoc, float charSpaceWidth, LineSegment ascentLine, LineSegment decentLine)
            {
                m_text = txt;
                m_startLocation = startLoc;
                m_endLocation = endLoc;
                m_charSpaceWidth = charSpaceWidth;
                AscentLine = ascentLine;
                DecentLine = decentLine;

                m_orientationVector = m_endLocation.Subtract(m_startLocation).Normalize();
                m_orientationMagnitude = (int)(Math.Atan2(m_orientationVector[Vector.I2], m_orientationVector[Vector.I1]) * 1000);

                // see http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
                // the two vectors we are crossing are in the same plane, so the result will be purely
                // in the z-axis (out of plane) direction, so we just take the I3 component of the result
                Vector origin = new Vector(0, 0, 1);
                m_distPerpendicular = (int)(m_startLocation.Subtract(origin)).Cross(m_orientationVector)[Vector.I3];

                m_distParallelStart = m_orientationVector.Dot(m_startLocation);
                m_distParallelEnd = m_orientationVector.Dot(m_endLocation);
            }

            /// <summary>
            /// true if this location is on the same line as the other text chunk
            /// </summary>
            /// <param name="textChunkToCompare">the location to compare to</param>
            /// <returns>true if this location is on the same line as the other</returns>
            public bool sameLine(TextChunk textChunkToCompare)
            {
                if (m_orientationMagnitude != textChunkToCompare.m_orientationMagnitude) return false;
                if (m_distPerpendicular != textChunkToCompare.m_distPerpendicular) return false;
                return true;
            }

            /// <summary>
            /// Computes the distance between the end of 'other' and the beginning of this chunk
            /// in the direction of this chunk's orientation vector.  Note that it's a bad idea
            /// to call this for chunks that aren't on the same line and orientation, but we don't
            /// explicitly check for that condition for performance reasons.
            /// </summary>
            /// <param name="other">the chunk that precedes this one on the same line</param>
            /// <returns>the signed distance between the end of 'other' and the beginning of this chunk</returns>
            public float distanceFromEndOf(TextChunk other)
            {
                float distance = m_distParallelStart - other.m_distParallelEnd;
                return distance;
            }

            /// <summary>
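            /// Compares based on orientation, perpendicular distance, then parallel distance
            /// </summary>
            /// <param name="obj">the TextChunk to compare against</param>
            /// <returns>a negative value if this chunk sorts before obj, zero if both are the same instance, otherwise a positive value</returns>
            public int CompareTo(object obj)
            {
                if (obj == null) throw new ArgumentException("Object is not a TextChunk");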

                TextChunk rhs = obj as TextChunk;
                if (rhs != null)
                {
                    if (this == rhs) return 0;

                    int rslt;
                    rslt = m_orientationMagnitude - rhs.m_orientationMagnitude;
                    if (rslt != 0) return rslt;

                    rslt = m_distPerpendicular - rhs.m_distPerpendicular;
                    if (rslt != 0) return rslt;

                    // note: it's never safe to check floating point numbers for equality, and if two chunks
                    // are truly right on top of each other, which one comes first or second just doesn't matter
                    // so we arbitrarily choose this way.
                    rslt = m_distParallelStart < rhs.m_distParallelStart ? -1 : 1;

                    return rslt;
                }
                else
                {
                    throw new ArgumentException("Object is not a TextChunk");
                }
            }
        }

        public class TextInfo
        {
            public Vector TopLeft;
            public Vector BottomRight;
            private string m_Text;

            public string Text
            {
                get { return m_Text; }
            }

            /// <summary>
            /// Create a TextInfo.
            /// </summary>
            /// <param name="initialTextChunk"></param>
            public TextInfo(TextChunk initialTextChunk)
            {
                TopLeft = initialTextChunk.AscentLine.GetStartPoint();
                BottomRight = initialTextChunk.DecentLine.GetEndPoint();
                m_Text = initialTextChunk.Text;
            }

            /// <summary>
            /// Add more text to this TextInfo.
            /// </summary>
            /// <param name="additionalTextChunk"></param>
            public void appendText(TextChunk additionalTextChunk)
            {
                BottomRight = additionalTextChunk.DecentLine.GetEndPoint();
                m_Text += additionalTextChunk.Text;
            }

            /// <summary>
            /// Add a space to the TextInfo.  This will leave the endpoint out of sync with the text.
            /// The assumption is that you will add more text after the space which will correct the endpoint.
            /// </summary>
            public void addSpace()
            {
                m_Text += ' ';
            }


        }
    }
}
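
The three values that TextChunk computes in its constructor (orientation, perpendicular distance, parallel distance) decide which chunks count as the same line and in what order they are sorted. Below is a minimal standalone sketch of that arithmetic, assuming plain float coordinates instead of iTextSharp's Vector. The ChunkKeys class and its Compute method are hypothetical names introduced only for illustration.

```csharp
using System;

static class ChunkKeys
{
    // Hypothetical helper: reproduces the sort keys for a baseline running
    // from (x1, y1) to (x2, y2). Assumes the baseline has non-zero length.
    public static (int Orientation, int Perpendicular, float ParallelStart) Compute(
        float x1, float y1, float x2, float y2)
    {
        // Unit vector along the baseline.
        float dx = x2 - x1, dy = y2 - y1;
        float len = (float)Math.Sqrt(dx * dx + dy * dy);
        float ox = dx / len, oy = dy / len;

        // Angle of the baseline, scaled and truncated to an int.
        int orientation = (int)(Math.Atan2(oy, ox) * 1000);

        // z component of (start x orientation): the signed distance of the
        // baseline from the origin, truncated to an int.
        int perpendicular = (int)(x1 * oy - y1 * ox);

        // Projection of the start point onto the orientation vector.
        float parallelStart = ox * x1 + oy * y1;

        return (orientation, perpendicular, parallelStart);
    }
}
```

For a horizontal baseline from (72, 700) to (200, 700) this gives orientation 0, perpendicular -700 and parallel start 72. A baseline at y = 680 gets perpendicular -680, so the line higher on the page sorts first. Because the perpendicular value is truncated to an int, two baselines that differ by a fraction of a unit (for example 699.9 and 700.1) can still be treated as different lines.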

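I added a TextLocationInfo property that returns a list of text lines plus the coordinates of those lines (top-left and bottom-right corners), which you can use to build a bounding box.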
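
The two corners can be turned into a rectangle with a little arithmetic. This is a minimal sketch, assuming the x and y values have already been read from TopLeft and BottomRight and that the text is not rotated. LineBox and FromCorners are hypothetical names, not part of iTextSharp.

```csharp
using System;

static class LineBox
{
    // Builds an axis-aligned rectangle (left, bottom, width, height) from two
    // opposite corners. PDF user space normally has its origin at the
    // bottom-left of the page, so the smaller y value is the bottom edge.
    public static (float Left, float Bottom, float Width, float Height) FromCorners(
        float x1, float y1, float x2, float y2)
    {
        float left = Math.Min(x1, x2);
        float bottom = Math.Min(y1, y2);
        float width = Math.Abs(x2 - x1);
        float height = Math.Abs(y2 - y1);
        return (left, bottom, width, height);
    }
}
```

With a TextInfo named info, the inputs would be info.TopLeft[Vector.I1], info.TopLeft[Vector.I2], info.BottomRight[Vector.I1] and info.BottomRight[Vector.I2], using the same Vector indexer as the code above. TextLocationInfo is only filled in by GetResultantText, so read it after the text has been extracted, for example through PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy).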

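I also noticed something odd when I first played around with this. If I pulled the start and end points from the baseline, it looked like I was getting the same coordinates (I think the correct thing to do is to pull the points from the AscentLine and the DecentLine). On my first pass I just used the baseline. Oddly, I'm not seeing a difference in the resulting coordinates. So a word to the wary: I'm not sure the coordinates I'm providing are correct. I just think they are, or should be.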

Regarding ".NET - Extracting text and text rectangle coordinates from a PDF file with iTextSharp", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/4577789/
