
jquery - Text matching not working for Arabic; the problem may be the regular expression for Arabic

Reposted. Author: 行者123. Updated: 2023-12-01 06:27:33

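I have been working on adding a feature to my multilingual website in which I have to highlight the matching tag keywords.

This feature works for the English version but not for the Arabic version.

I have set up a sample on JSFiddle.

Sample code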

    function HighlightKeywords(keywords)
    {
        var el = $("#article-detail-desc");
        var language = "ar-AE";
        var pid = 32;
        var issueID = 18;
        $(keywords).each(function()
        {
            // var pattern = new RegExp("("+this+")", "gi"); //breaks html
            var pattern = new RegExp("(\\b"+this+"\\b)(?![^<]*?>)", "gi"); //looks for match outside html tags
            var rs = "<a class='ad-keyword-selected' href='http://www.alshindagah.com/ar/search.aspx?Language="+language+"&PageId="+pid+"&issue="+issueID+"&search=$1' title='Search website for: $1'><span style='color:#990044; text-decoration:none;'>$1</span></a>";
            el.html(el.html().replace(pattern, rs));
        });
    }

    HighlightKeywords(["you","الهدف","طهران","سيما","حاليا","Hello","34","english"]);

    //Popup Tooltip for article keywords
    $(function() {
        $("#article-detail-desc").tooltip({
            position: {
                my: "center bottom-20",
                at: "center top",
                using: function( position, feedback ) {
                    $( this ).css( position );
                    $( "<div>" )
                        .addClass( "arrow" )
                        .addClass( feedback.vertical )
                        .addClass( feedback.horizontal )
                        .appendTo( this );
                }
            }
        });
    });

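I am storing the keywords in an array and then matching them against the text in a specific div.

I am not sure whether the problem is due to Unicode or something else. Help in this respect is appreciated.

Best answer

This answer has three parts:

  1. Why it doesn't work

  2. An example of how you could handle it in English (meant to be adapted to Arabic by someone who knows Arabic)

  3. A stab at an Arabic version by someone (me) who has no clue about Arabic :-)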

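Why it doesn't work

At least part of the problem is that you're relying on the \b assertion, which (like its counterparts \B, \w, and \W) is English-centric. You can't rely on it in other languages (or even, really, in English; see below).

Here's the definition of \b in the spec: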

The production Assertion :: \ b evaluates by returning an internal AssertionTester closure that takes a State argument x and performs the following:

  • Let e be x's endIndex.
  • Call IsWordChar(e–1) and let a be the Boolean result.
  • Call IsWordChar(e) and let b be the Boolean result.
  • If a is true and b is false, return true.
  • If a is false and b is true, return true.
  • Return false.

...where IsWordChar is defined further down as basically meaning one of these 63 characters:

a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
0  1  2  3  4  5  6  7  8  9  _

That is, the 26 English letters a to z in upper or lower case, the digits 0 to 9, and _. (This means you can't even rely on \b, \B, \w, or \W in English, because English has loan words like "Voilà", but that's another story.)
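
To make the effect of that definition concrete, here is a minimal sketch that can be pasted into a browser console or Node. The sample sentences are made up for illustration; only the keywords come from the question.

```javascript
// \b only treats [A-Za-z0-9_] as "word" characters, so a boundary
// exists only where one of those characters meets something else.
var englishText = "see you there";
var arabicText = "في طهران اليوم";
var loanWordText = "Voilà";

// An English keyword surrounded by spaces: boundaries on both sides.
var english = /\byou\b/.test(englishText);

// An Arabic keyword surrounded by spaces: neither the letters nor the
// spaces are "word" characters, so there is no boundary to find.
var arabic = /\bطهران\b/.test(arabicText);

// "à" is not a "word" character, so \b wrongly sees "Voil" as a whole word.
var loanWord = /\bVoil\b/.test(loanWordText);

console.log(english, arabic, loanWord); // true false true
```

The Arabic keyword is present in the text, yet the pattern cannot match it, which is the behaviour described in the question.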

A first example using English

You'll have to use a different mechanism for detecting word boundaries in Arabic. If you can come up with a character class that includes all of the Arabic "code points" (as Unicode puts it) that make up words, you could use code a bit like this:

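    var keywords = {
        "laboris": true,
        "laborum": true,
        "pariatur": true
        // ...and so on...
    };
    var text = ""; // placeholder: ...get the text to work on...
    text = text.replace(
        /([abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)([^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)?/g,
        replacer);

    function replacer(m, c0, c1) {
        // Check own properties only, so a word such as "constructor"
        // doesn't match something inherited from Object.prototype
        if (Object.prototype.hasOwnProperty.call(keywords, c0)) {
            c0 = '<a href="#">' + c0 + '</a>';
        }
        // c1 is undefined when nothing follows the last word
        return c0 + (c1 || "");
    }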

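Notes on that:

  • I've used the class [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_] to mean "a word character". Obviously you'd have to change this (markedly) for Arabic.
  • I've used the class [^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_] to mean "not a word character". It's the same as the previous class, but with the negation (^) at the start.
  • The regular expression finds any series of "word characters" followed by an optional series of non-word characters, using capture groups ((...)) for both.
  • String#replace calls the replacer function with the full text that matched, followed by each capture group, as arguments.
  • The replacer function looks up the first capture group (the word) in the keywords map to see whether it's a keyword. If so, it wraps it in an anchor.
  • The replacer function returns the possibly-wrapped word plus the non-word text that followed it.
  • String#replace uses the return value from replacer to replace the matched text.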
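
The steps in those notes can also be sketched as a small DOM-free helper that takes the "word character" class as a parameter, so the same code can serve more than one language. The name makeHighlighter and its arguments are hypothetical, not part of the original answer, and the character ranges are only the ones discussed here.

```javascript
// Hypothetical helper: builds a highlighter from the body of a
// "word character" class (the text that goes between [ and ]).
function makeHighlighter(wordClass, keywords) {
    // Group 1: a run of word characters. Group 2: the optional run of
    // non-word characters that follows it.
    var pattern = new RegExp("([" + wordClass + "]+)([^" + wordClass + "]+)?", "g");
    return function (text) {
        return text.replace(pattern, function (m, word, rest) {
            if (Object.prototype.hasOwnProperty.call(keywords, word)) {
                word = '<a href="#">' + word + '</a>';
            }
            // rest is undefined when the word is the last thing in the text
            return word + (rest || "");
        });
    };
}

var highlightEnglish = makeHighlighter("a-zA-Z0-9_", { "laboris": true });
var highlightArabic = makeHighlighter("\\u0600-\\u06ff", { "طهران": true });

var englishResult = highlightEnglish("ullamco laboris nisi");
var arabicResult = highlightArabic("في طهران اليوم");

console.log(englishResult); // ullamco <a href="#">laboris</a> nisi
console.log(arabicResult);
```

Passing the class as a string keeps the pattern and its negation in sync, which is easy to get wrong when the two long classes are typed out by hand.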

Here's a complete example of doing that:

    <!DOCTYPE html>
    <html>
    <head>
    <meta charset=utf-8 />
    <title>Replacing Keywords</title>
    </head>
    <body>
      <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>

      <script src="http://code.jquery.com/jquery-1.9.1.min.js"></script>
      <script>
        (function() {
          // Our keywords. There are lots of ways you can produce
          // this map, here I've just done it literally
          var keywords = {
            "laboris": true,
            "laborum": true,
            "pariatur": true
          };

          // Loop through all our paragraphs (okay, so we only have one)
          $("p").each(function() {
            var $this, text;

            // We'll use jQuery on `this` more than once,
            // so grab the wrapper
            $this = $(this);

            // Get the text of the paragraph
            // Note that this strips off HTML tags, a
            // real-world solution might need to loop
            // through the text nodes rather than act
            // on the full text all at once
            text = $this.text();

            // Do the replacements
            // These character classes match JavaScript's
            // definition of a "word" character and so are
            // English-centric, obviously you'd change that
            text = text.replace(
              /([abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)([^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)?/g,
              replacer);

            // Update the paragraph
            $this.html(text);
          });

          // Our replacer. We define it separately rather than
          // inline because it's shared by every paragraph
          function replacer(m, c0, c1) {
            // Is the word in our keywords map? (own properties only)
            if (Object.prototype.hasOwnProperty.call(keywords, c0)) {
              // Yes, wrap it
              c0 = '<a href="#">' + c0 + '</a>';
            }
            // c1 is undefined when nothing follows the last word
            return c0 + (c1 || "");
          }
        })();
      </script>
    </body>
    </html>
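
The comment in that example mentions that a real-world solution might need to loop through the text nodes instead of replacing the paragraph's whole text. A rough sketch of that idea follows, using the standard DOM TreeWalker API. The function names splitByKeywords and highlightTextNodes are hypothetical, and the sketch assumes the word pattern is passed in with the g flag.

```javascript
// Hypothetical helper: split text into segments, flagging keywords.
// wordPattern must be a global regex matching one run of word characters.
function splitByKeywords(text, keywords, wordPattern) {
    if (!wordPattern.global) { throw new Error("wordPattern needs the g flag"); }
    var segments = [], last = 0, match;
    wordPattern.lastIndex = 0;
    while ((match = wordPattern.exec(text)) !== null) {
        if (Object.prototype.hasOwnProperty.call(keywords, match[0])) {
            if (match.index > last) {
                segments.push({ text: text.slice(last, match.index), isKeyword: false });
            }
            segments.push({ text: match[0], isKeyword: true });
            last = match.index + match[0].length;
        }
    }
    if (last < text.length) {
        segments.push({ text: text.slice(last), isKeyword: false });
    }
    return segments;
}

// Hypothetical helper: wrap keywords in every text node under root,
// leaving the existing elements and their attributes untouched.
function highlightTextNodes(root, keywords, wordPattern) {
    var walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
    var nodes = [], node;
    // Collect first, so replacing nodes doesn't disturb the walk
    while ((node = walker.nextNode())) { nodes.push(node); }
    nodes.forEach(function (textNode) {
        var segments = splitByKeywords(textNode.nodeValue, keywords, wordPattern);
        if (!segments.some(function (s) { return s.isKeyword; })) { return; }
        var fragment = document.createDocumentFragment();
        segments.forEach(function (s) {
            if (s.isKeyword) {
                var a = document.createElement("a");
                a.href = "#";
                a.textContent = s.text; // inserted as text, never parsed as HTML
                fragment.appendChild(a);
            } else {
                fragment.appendChild(document.createTextNode(s.text));
            }
        });
        textNode.parentNode.replaceChild(fragment, textNode);
    });
}

// Only touch the page when a DOM is available
if (typeof document !== "undefined") {
    highlightTextNodes(document.body, { "laboris": true }, /[a-zA-Z0-9_]+/g);
}
```

Because the keywords are inserted as text nodes and elements, existing markup inside the paragraph survives and nothing from the page text is re-parsed as HTML.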

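An attempt at doing it in Arabic

I took a stab at an Arabic version. According to the Arabic script in Unicode page on Wikipedia, several code ranges are used, but all of the text in your example falls within the primary range of U+0600 to U+06FF.

Here's what I came up with: Fiddle (I prefer JSBin, which I used above, but I couldn't get the text to display the right way around.)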

    (function() {
      // Our keywords. There are lots of ways you can produce
      // this map, here I've just done it literally
      var keywords = {
        "الهدف": true,
        "طهران": true,
        "سيما": true,
        "حاليا": true
      };

      // Loop through all our paragraphs (okay, so we only have two)
      $("p").each(function() {
        var $this, text;

        // We'll use jQuery on `this` more than once,
        // so grab the wrapper
        $this = $(this);

        // Get the text of the paragraph
        // Note that this strips off HTML tags, a
        // real-world solution might need to loop
        // through the text nodes rather than act
        // on the full text all at once
        text = $this.text();

        // Do the replacements
        // These character classes just use the primary
        // Arabic range of U+0600 to U+06FF, you may
        // need to add others.
        text = text.replace(
          /([\u0600-\u06ff]+)([^\u0600-\u06ff]+)?/g,
          replacer);

        // Update the paragraph
        $this.html(text);
      });

      // Our replacer. We define it separately rather than
      // inline because it's shared by every paragraph
      function replacer(m, c0, c1) {
        // Is the word in our keywords map? (own properties only)
        if (Object.prototype.hasOwnProperty.call(keywords, c0)) {
          // Yes, wrap it
          c0 = '<a href="#">' + c0 + '</a>';
        }
        // c1 is undefined when nothing follows the last word
        return c0 + (c1 || "");
      }
    })();

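All I did to my English function above was:

  • Use [\u0600-\u06ff] for "a word character" and [^\u0600-\u06ff] for "not a word character". You may need to add some of the other Arabic ranges (such as the appropriate style of numerals), but again, all of the text in your example falls within that range.
  • Change the keywords to the Arabic ones from your example (only two of which seem to be in the text).

To my very non-Arabic-reading eyes, it seems to work.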
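
One caveat with the plain U+0600 to U+06FF range is that it also contains Arabic punctuation such as the comma (U+060C), so a keyword directly followed by that comma is read as part of a longer "word". On engines that support Unicode property escapes (ES2018 and later, which is an assumption about your target browsers), a possible alternative is to define word characters by general category instead. The sketch below is not from the original answer; the name highlightUnicode and the sample text are made up for illustration.

```javascript
// Word characters: any letter (L), combining mark (M) or number (N)
// in any script, plus "_". Requires the u flag.
var WORD = /[\p{L}\p{M}\p{N}_]+/gu;

// Hypothetical helper: wrap each whole word that is a keyword.
function highlightUnicode(text, keywords) {
    return text.replace(WORD, function (word) {
        return Object.prototype.hasOwnProperty.call(keywords, word)
            ? '<a href="#">' + word + '</a>'
            : word;
    });
}

var sampleKeywords = { "طهران": true, "you": true };
var sampleText = "طهران، you";

// The plain range swallows the Arabic comma along with the word...
var rangeMatch = sampleText.match(/[\u0600-\u06ff]+/)[0];
// ...while the category-based class stops at the punctuation.
var categoryMatch = sampleText.match(/[\p{L}\p{M}\p{N}_]+/u)[0];

var highlighted = highlightUnicode(sampleText, sampleKeywords);
console.log(rangeMatch.length, categoryMatch.length); // 6 5
console.log(highlighted);
```

This handles the English and Arabic keywords with one pattern, at the cost of dropping support for older browsers that lack the u flag or property escapes.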

Regarding "jquery - Text matching not working for Arabic; the problem may be the regular expression for Arabic", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/16664267/
