
c# - Extracting Arabic text in C# using iTextSharp

Reposted. Author: 行者123. Updated: 2023-11-30 21:44:51

I have this code, which I use to get the text of a PDF. It works well for English PDFs, but when I try to extract Arabic text, it shows something like this:

") + n 9 n <+, + )+ $ # $ +$ F% 9& .< $ : ;"

// Requires: using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
using (PdfReader reader = new PdfReader(path))
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    String text = "";
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        text = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
    }
}

Best answer

I had to change the strategy like this:

var t = PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());
var te = Convert(t);

And this function reverses the Arabic words while leaving the English text as it is:

// Requires: using System.Text;
// Reverses every run of Arabic characters and copies all other characters unchanged.
private string Convert(string source)
{
    string arabicWord = string.Empty;
    StringBuilder sbDestination = new StringBuilder();

    foreach (var ch in source)
    {
        if (IsArabic(ch))
            arabicWord += ch;
        else
        {
            // A non-Arabic character ends the current Arabic run.
            if (arabicWord != string.Empty)
                sbDestination.Append(Reverse(arabicWord));

            sbDestination.Append(ch);
            arabicWord = string.Empty;
        }
    }

    // if the last word was arabic
    if (arabicWord != string.Empty)
        sbDestination.Append(Reverse(arabicWord));

    return sbDestination.ToString();
}


private bool IsArabic(char character)
{
    // Arabic block
    if (character >= 0x600 && character <= 0x6ff)
        return true;

    // Arabic Supplement block
    if (character >= 0x750 && character <= 0x77f)
        return true;

    // First part of Arabic Presentation Forms-A
    if (character >= 0xfb50 && character <= 0xfc3f)
        return true;

    // Arabic Presentation Forms-B
    if (character >= 0xfe70 && character <= 0xfefc)
        return true;

    return false;
}

// Reverse the characters of string
// Requires: using System.Linq;
string Reverse(string source)
{
    return new string(source.ToCharArray().Reverse().ToArray());
}
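
To see what the reversal does without a PDF at hand, here is a minimal self-contained sketch of the same idea. The class name `VisualOrderDemo` and the method name `ReverseArabicRuns` are made up for this illustration and are not part of the answer or of iTextSharp. It uses the same four character ranges as `IsArabic` above and assumes the extractor returned each Arabic word with its characters in reversed order.

```csharp
using System;
using System.Text;

// Hypothetical demo class: the names are illustrative, not from the answer.
public static class VisualOrderDemo
{
    // Same four ranges as the answer's IsArabic.
    public static bool IsArabic(char c)
    {
        return (c >= 0x600 && c <= 0x6ff)
            || (c >= 0x750 && c <= 0x77f)
            || (c >= 0xfb50 && c <= 0xfc3f)
            || (c >= 0xfe70 && c <= 0xfefc);
    }

    // Reverses each maximal run of Arabic characters;
    // every other character is copied through unchanged.
    public static string ReverseArabicRuns(string source)
    {
        StringBuilder result = new StringBuilder(source.Length);
        StringBuilder run = new StringBuilder();

        foreach (char ch in source)
        {
            if (IsArabic(ch))
            {
                // Prepending builds the run already reversed.
                run.Insert(0, ch);
            }
            else
            {
                // A non-Arabic character (including a space) ends the run.
                result.Append(run.ToString());
                result.Append(ch);
                run.Clear();
            }
        }

        // Flush a run that reaches the end of the input.
        result.Append(run.ToString());
        return result.ToString();
    }
}
```

For example, `VisualOrderDemo.ReverseArabicRuns("\u0645\u0627\u0644\u0633 PDF")` returns `"\u0633\u0644\u0627\u0645 PDF"`: the four Arabic characters are reversed and `PDF` is left alone. Since a space also ends a run, only the characters inside each word are reversed. The order of the words themselves stays as the extractor returned it.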

Regarding c# - Extracting Arabic text in C# using iTextSharp, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40596320/
