python - Counting letters in Welsh text

Reposted by: 行者123 Updated: 2023-12-01 23:17:57

How do I count the letters in Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch?

print(len('Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch'))

It says 58.
Well, if it were that easy I wouldn't be asking you, now would I?!
Wikipedia says ( https://en.wikipedia.org/wiki/Llanfairpwllgwyngyll#Placename_and_toponymy ):

The long form of the name is the longest place name in the United Kingdom and one of the longest in the world at 58 characters (51 "letters" since "ch" and "ll" are digraphs, and are treated as single letters in the Welsh language).

So I want to count that and get the answer 51.
Okey dokey.

print(len(['Ll','a','n','f','a','i','r','p','w','ll','g','w','y','n','g','y','ll','g','o','g','e','r','y','ch','w','y','r','n','d','r','o','b','w','ll','ll','a','n','t','y','s','i','l','i','o','g','o','g','o','g','o','ch']))
51

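Yeah, but that's cheating. Obviously I want to use the word as input, not the list.
Wikipedia also says that the digraphs in Welsh are ch, dd, ff, ng, ll, ph, rh, th:
https://en.wikipedia.org/wiki/Welsh_orthography#Digraphs
So off we go. Let's take the length of the string and then subtract the double counting.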

word='Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch'
count=len(word)
print('starting with count of',count)
for index in range(len(word)-1):
  substring=word[index]+word[index+1]
  if substring.lower() in ['ch','dd','ff','ng','ll','ph','rh','th']:
    print('taking off double counting of',substring)
    count=count-1
print(count)

This gets me this far:

starting with count of 58
taking off double counting of Ll
taking off double counting of ll
taking off double counting of ng
taking off double counting of ll
taking off double counting of ch
taking off double counting of ll
taking off double counting of ll
taking off double counting of ll
taking off double counting of ch
49

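It looks like I've subtracted too many. I'm supposed to get 51. One problem is that in llll it has found three ll pairs and taken off three instead of two. So that needs fixing. (Matches must not overlap.)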
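To make the overlap problem concrete, here is a small sketch that compares the pair-by-pair check used in the loop above with a non-overlapping count on just the `llll` run. The variable names are only for illustration.

```python
run = "llll"

# Overlapping scan: test every adjacent pair, the way the loop above does.
overlapping = sum(1 for i in range(len(run) - 1) if run[i:i + 2] == "ll")

# Non-overlapping scan: str.count moves past each match before searching again.
non_overlapping = run.count("ll")

print(overlapping, non_overlapping)  # 3 2
```

The pair-by-pair scan sees the middle `ll` as a third match, which is the extra subtraction.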
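And then there's another problem: the ng. Wikipedia didn't say anything about there being a letter "ng" in the name, but it's listed as one of the digraphs on the page I quoted above.
Wikipedia gives us another clue here: "additional information may be needed to distinguish a genuine digraph from a juxtaposition of letters". It gives the example of "llongyfarch", where the ng is just a "juxtaposition of letters", and "llong", where it is a digraph.
So it seems that 'Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch' is one of those words where the -ng- is just a "juxtaposition of letters".
Obviously there's no way the computer can know that. So I'm going to have to give it the "additional information" that Wikipedia talks about.
So anyway, I decided to look in an online dictionary, http://geiriadur.ac.uk/gpc/gpc.html . If you look up llongyfarch (the Wikipedia example with the "juxtaposition of letters"), it is displayed with a vertical line between the n and the g, but if you look up "llong" there is no such line.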
screenshot from dictionary (llongyfarch)

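So I've decided that what we need to do is provide the additional information by putting a | in the input string, like the dictionary does, just so the algorithm knows that the ng bit is really two letters. But obviously I don't want the | itself to be counted as a letter.
So now I've got these inputs: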

word='llong'
# ANSWER NEEDS TO BE 3 (ll o ng)

word='llon|gyfarch'
# ANSWER NEEDS TO BE 9 (ll o n g y f a r ch)

word='Llanfairpwllgwyn|gyllgogerychwyrndrobwllllantysiliogogogoch'
# ANSWER NEEDS TO BE 51 (Ll a n f a i r p w ll g w y n g y ll g o g e r y ch w y r n d r o b w ll ll a n t y s i l i o g o g o g o ch)

and still this list of digraphs:

['ch','dd','ff','ng','ll','ph','rh','th']

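and the rules are going to be:

ignore case

if you see a digraph, count it as 1

work from left to right, so that llll is ll + ll, not l + ll + l

if you see a |, don't count it, but don't ignore it completely either: it is there to stop ng from being a digraph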
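The four rules above can be sketched as a plain left-to-right scan. This is only one possible reading of them, not a definitive implementation: the function name `count_welsh_letters` is made up for illustration, and it assumes the input contains nothing but letters and `|` (any other character would be counted as a letter).

```python
DIGRAPHS = {"ch", "dd", "ff", "ng", "ll", "ph", "rh", "th"}

def count_welsh_letters(word):
    """Count letters, treating each digraph as one letter and '|' as a separator."""
    text = word.lower()  # ignore case
    count = 0
    i = 0
    while i < len(text):
        if text[i] == "|":
            # Not a letter, but it still sits between 'n' and 'g',
            # so the two-character slice below can never read 'ng' across it.
            i += 1
        elif text[i:i + 2] in DIGRAPHS:
            # A digraph counts once; skipping both characters prevents overlap,
            # so 'llll' is read as ll + ll.
            count += 1
            i += 2
        else:
            count += 1
            i += 1
    return count

print(count_welsh_letters("llong"))         # 3
print(count_welsh_letters("llon|gyfarch"))  # 9
```

Advancing the index by two after a digraph is what enforces the no-overlap rule, and the `|` is skipped without being counted.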

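and I want it to count 51, and to do it for the right reasons, not just by fluke.
Right now I am getting 51, but only by fluke: it counts the | as a letter (1 too high), and then it takes off one too many for the llll (1 too low), so the errors cancel out.
It gets llong right (3).
It gets llon|gyfarch wrong (10), because it counts the | again.
How can I fix it the right way?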

Best answer

Like many problems to do with strings, this can be done in a simple way with a regex.

>>> word = 'Llanfairpwllgwyn|gyllgogerychwyrndrobwllllantysiliogogogoch'
>>> import re
>>> pattern = re.compile(r'ch|dd|ff|ng|ll|ph|rh|th|[^\W\d_]', flags=re.IGNORECASE)
>>> len(pattern.findall(word))
51

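The character class [^\W\d_] matches any word character that is not a digit or an underscore, i.e. a letter, including letters with diacritics. Because the digraphs come before the character class in the alternation, each digraph is consumed as a single match, and the | is matched by nothing, so it is skipped while still keeping n and g apart.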
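As a quick check, the same pattern can be run against the two shorter inputs from the question. The sketch below also builds a second pattern with the alternatives reordered, to show why the digraphs need to be listed first. The names `LETTER`, `digraphs_first`, `letter_first` and `count_letters` are made up for this example.

```python
import re

DIGRAPHS = ["ch", "dd", "ff", "ng", "ll", "ph", "rh", "th"]
LETTER = r"[^\W\d_]"  # any single letter

# Same alternation as in the answer: digraphs are tried before a single letter.
digraphs_first = re.compile("|".join(DIGRAPHS + [LETTER]), flags=re.IGNORECASE)

# Reordered for comparison: the single-letter class is tried first.
letter_first = re.compile("|".join([LETTER] + DIGRAPHS), flags=re.IGNORECASE)

def count_letters(word, pattern=digraphs_first):
    """Count non-overlapping matches of the pattern in the word."""
    return len(pattern.findall(word))

print(count_letters("llong"))                # 3
print(count_letters("llon|gyfarch"))         # 9
print(count_letters("llong", letter_first))  # 5
```

Python's regex alternation takes the first alternative that matches at each position, so with the letter class first the digraph alternatives are never reached.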

Regarding python - Counting letters in Welsh text, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/63528797/
