
parsing - Is there a character set of all international sentence-ending punctuation?

Reposted. Author: 行者123. Updated: 2023-12-04 10:42:20

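I am trying to parse UTF-8 strings into "bite-sized" segments. For example, I would like to break a text into "sentences".

Is there a comprehensive collection of characters (or a regular expression) that corresponds to sentence endings in all languages? I am looking for something that catches the Latin full stop, exclamation mark, and question mark, the Chinese and Japanese full stops, and so on.

Something like the above, but for the equivalent of a comma, would be great too.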

Best answer

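You need to look at code points that have the \p{Sentence_Break=STerm} or \p{Sentence_Break=ATerm} property and that also have the \p{Terminal_Punctuation} property. Running the unichars script against Unicode v6.1, we learn that these code points meet all of those criteria: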

$ unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}'
U+00021 ‭ ! GC=Po SC=Common EXCLAMATION MARK
U+0002E ‭ . GC=Po SC=Common FULL STOP
U+0003F ‭ ? GC=Po SC=Common QUESTION MARK
U+00589 ‭ ։ GC=Po SC=Common ARMENIAN FULL STOP
U+0061F ‭ ؟ GC=Po SC=Common ARABIC QUESTION MARK
U+006D4 ‭ ۔ GC=Po SC=Arabic ARABIC FULL STOP
U+00700 ‭ ܀ GC=Po SC=Syriac SYRIAC END OF PARAGRAPH
U+00701 ‭ ܁ GC=Po SC=Syriac SYRIAC SUPRALINEAR FULL STOP
U+00702 ‭ ܂ GC=Po SC=Syriac SYRIAC SUBLINEAR FULL STOP
U+007F9 ‭ ߹ GC=Po SC=Nko NKO EXCLAMATION MARK
U+00964 ‭ । GC=Po SC=Common DEVANAGARI DANDA
U+00965 ‭ ॥ GC=Po SC=Common DEVANAGARI DOUBLE DANDA
U+0104A ‭ ၊ GC=Po SC=Myanmar MYANMAR SIGN LITTLE SECTION
U+0104B ‭ ။ GC=Po SC=Myanmar MYANMAR SIGN SECTION
U+01362 ‭ ። GC=Po SC=Ethiopic ETHIOPIC FULL STOP
U+01367 ‭ ፧ GC=Po SC=Ethiopic ETHIOPIC QUESTION MARK
U+01368 ‭ ፨ GC=Po SC=Ethiopic ETHIOPIC PARAGRAPH SEPARATOR
U+0166E ‭ ᙮ GC=Po SC=Canadian_Aboriginal CANADIAN SYLLABICS FULL STOP
U+01803 ‭ ᠃ GC=Po SC=Common MONGOLIAN FULL STOP
U+01809 ‭ ᠉ GC=Po SC=Mongolian MONGOLIAN MANCHU FULL STOP
U+01944 ‭ ᥄ GC=Po SC=Limbu LIMBU EXCLAMATION MARK
U+01945 ‭ ᥅ GC=Po SC=Limbu LIMBU QUESTION MARK
U+01AA8 ‭ ᪨ GC=Po SC=Tai_Tham TAI THAM SIGN KAAN
U+01AA9 ‭ ᪩ GC=Po SC=Tai_Tham TAI THAM SIGN KAANKUU
U+01AAA ‭ ᪪ GC=Po SC=Tai_Tham TAI THAM SIGN SATKAAN
U+01AAB ‭ ᪫ GC=Po SC=Tai_Tham TAI THAM SIGN SATKAANKUU
U+01B5A ‭ ᭚ GC=Po SC=Balinese BALINESE PANTI
U+01B5B ‭ ᭛ GC=Po SC=Balinese BALINESE PAMADA
U+01B5E ‭ ᭞ GC=Po SC=Balinese BALINESE CARIK SIKI
U+01B5F ‭ ᭟ GC=Po SC=Balinese BALINESE CARIK PAREREN
U+01C3B ‭ ᰻ GC=Po SC=Lepcha LEPCHA PUNCTUATION TA-ROL
U+01C3C ‭ ᰼ GC=Po SC=Lepcha LEPCHA PUNCTUATION NYET THYOOM TA-ROL
U+01C7E ‭ ᱾ GC=Po SC=Ol_Chiki OL CHIKI PUNCTUATION MUCAAD
U+01C7F ‭ ᱿ GC=Po SC=Ol_Chiki OL CHIKI PUNCTUATION DOUBLE MUCAAD
U+0203C ‭ ‼ GC=Po SC=Common DOUBLE EXCLAMATION MARK
U+0203D ‭ ‽ GC=Po SC=Common INTERROBANG
U+02047 ‭ ⁇ GC=Po SC=Common DOUBLE QUESTION MARK
U+02048 ‭ ⁈ GC=Po SC=Common QUESTION EXCLAMATION MARK
U+02049 ‭ ⁉ GC=Po SC=Common EXCLAMATION QUESTION MARK
U+02E2E ‭ ⸮ GC=Po SC=Common REVERSED QUESTION MARK
U+03002 ‭ 。 GC=Po SC=Common IDEOGRAPHIC FULL STOP
U+0A4FF ‭ ꓿ GC=Po SC=Lisu LISU PUNCTUATION FULL STOP
U+0A60E ‭ ꘎ GC=Po SC=Vai VAI FULL STOP
U+0A60F ‭ ꘏ GC=Po SC=Vai VAI QUESTION MARK
U+0A6F3 ‭ ꛳ GC=Po SC=Bamum BAMUM FULL STOP
U+0A6F7 ‭ ꛷ GC=Po SC=Bamum BAMUM QUESTION MARK
U+0A876 ‭ ꡶ GC=Po SC=Phags_Pa PHAGS-PA MARK SHAD
U+0A877 ‭ ꡷ GC=Po SC=Phags_Pa PHAGS-PA MARK DOUBLE SHAD
U+0A8CE ‭ ꣎ GC=Po SC=Saurashtra SAURASHTRA DANDA
U+0A8CF ‭ ꣏ GC=Po SC=Saurashtra SAURASHTRA DOUBLE DANDA
U+0A92F ‭ ꤯ GC=Po SC=Kayah_Li KAYAH LI SIGN SHYA
U+0A9C8 ‭ ꧈ GC=Po SC=Javanese JAVANESE PADA LINGSA
U+0A9C9 ‭ ꧉ GC=Po SC=Javanese JAVANESE PADA LUNGSI
U+0AA5D ‭ ꩝ GC=Po SC=Cham CHAM PUNCTUATION DANDA
U+0AA5E ‭ ꩞ GC=Po SC=Cham CHAM PUNCTUATION DOUBLE DANDA
U+0AA5F ‭ ꩟ GC=Po SC=Cham CHAM PUNCTUATION TRIPLE DANDA
U+0AAF0 ‭ ꫰ GC=Po SC=Meetei_Mayek MEETEI MAYEK CHEIKHAN
U+0AAF1 ‭ ꫱ GC=Po SC=Meetei_Mayek MEETEI MAYEK AHANG KHUDAM
U+0ABEB ‭ ꯫ GC=Po SC=Meetei_Mayek MEETEI MAYEK CHEIKHEI
U+0FE52 ‭ ﹒ GC=Po SC=Common SMALL FULL STOP
U+0FE56 ‭ ﹖ GC=Po SC=Common SMALL QUESTION MARK
U+0FE57 ‭ ﹗ GC=Po SC=Common SMALL EXCLAMATION MARK
U+0FF01 ‭ ! GC=Po SC=Common FULLWIDTH EXCLAMATION MARK
U+0FF0E ‭ . GC=Po SC=Common FULLWIDTH FULL STOP
U+0FF1F ‭ ? GC=Po SC=Common FULLWIDTH QUESTION MARK
U+0FF61 ‭ 。 GC=Po SC=Common HALFWIDTH IDEOGRAPHIC FULL STOP
U+11047 ‭ 𑁇 GC=Po SC=Brahmi BRAHMI DANDA
U+11048 ‭ 𑁈 GC=Po SC=Brahmi BRAHMI DOUBLE DANDA
U+110BE ‭ 𑂾 GC=Po SC=Kaithi KAITHI SECTION MARK
U+110BF ‭ 𑂿 GC=Po SC=Kaithi KAITHI DOUBLE SECTION MARK
U+110C0 ‭ 𑃀 GC=Po SC=Kaithi KAITHI DANDA
U+110C1 ‭ 𑃁 GC=Po SC=Kaithi KAITHI DOUBLE DANDA
U+11141 ‭ 𑅁 GC=Po SC=Chakma CHAKMA DANDA
U+11142 ‭ 𑅂 GC=Po SC=Chakma CHAKMA DOUBLE DANDA
U+11143 ‭ 𑅃 GC=Po SC=Chakma CHAKMA QUESTION MARK
U+111C5 ‭ 𑇅 GC=Po SC=Sharada SHARADA DANDA
U+111C6 ‭ 𑇆 GC=Po SC=Sharada SHARADA DOUBLE DANDA
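
Where a regex engine with full Unicode property support is not at hand, a hand-picked subset of the code points listed above can serve as a rough splitter. The Python sketch below is only an illustration: the `TERMINATORS` subset and the `split_sentences` helper are assumptions made for this example, not part of the original answer, and splitting on every terminator ignores abbreviations and decimal points such as "3.14".

```python
import re

# Hypothetical, hand-picked subset of the terminators listed above.
TERMINATORS = (
    "\u0021"  # EXCLAMATION MARK
    "\u002E"  # FULL STOP
    "\u003F"  # QUESTION MARK
    "\u0589"  # ARMENIAN FULL STOP
    "\u061F"  # ARABIC QUESTION MARK
    "\u06D4"  # ARABIC FULL STOP
    "\u0964"  # DEVANAGARI DANDA
    "\u3002"  # IDEOGRAPHIC FULL STOP
    "\uFF01"  # FULLWIDTH EXCLAMATION MARK
    "\uFF1F"  # FULLWIDTH QUESTION MARK
)

# A segment is a run of non-terminators followed by any terminators,
# or a stray run of terminators on its own.
_SEGMENT = re.compile(
    "[^{t}]+[{t}]*|[{t}]+".format(t=re.escape(TERMINATORS))
)


def split_sentences(text):
    """Split text into segments that end at one of TERMINATORS."""
    segments = (m.group().strip() for m in _SEGMENT.finditer(text))
    return [s for s in segments if s]


if __name__ == "__main__":
    for sentence in split_sentences("Hello world. 你好。How are you? 元気ですか？"):
        print(sentence)
```

Extending `TERMINATORS` with more entries from the list is straightforward; a production splitter would more likely follow the full Unicode sentence-break rules than a flat character set.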

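To go the other way, that is, to look up the properties of a given code point rather than the code points that have a given set of properties, use the companion uniprops script, which lists every property of the code points you give it: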
$ uniprops -a . \? \!
U+002E ‹.› \N{FULL STOP}
\pP \p{Po}
All Any ASCII Assigned Basic_Latin Case_Ignorable CI Common Zyyy Po P Gr_Base Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn
Pattern_Syntax PatSyn POSIX_Graph POSIX_Print POSIX_Punct Print Punctuation STerm Term Terminal_Punctuation X_POSIX_Graph X_POSIX_Print
X_POSIX_Punct
Age=1.1 Block=Basic_Latin Bidi_Class=Common_Separator BC=CS Bidi_Class=CS Block=ASCII BLK=ASCII Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=Na
East_Asian_Width=Narrow EA=Na Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U
Line_Break=Infix_Numeric LB=IS Line_Break=IS Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0
Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0
IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=AT Sentence_Break=ATerm SB=AT
Word_Break=MB Word_Break=MidNumLet WB=MB _Case_Ignorable _X_Begin
U+003F ‹?› \N{QUESTION MARK}
\pP \p{Po}
All Any ASCII Assigned Basic_Latin Common Zyyy Po P Gr_Base Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn Pattern_Syntax PatSyn
POSIX_Graph POSIX_Print POSIX_Punct Print Punctuation STerm Term Terminal_Punctuation X_POSIX_Graph X_POSIX_Print X_POSIX_Punct
Age=1.1 Block=Basic_Latin Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=ASCII BLK=ASCII Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=Na
East_Asian_Width=Narrow EA=Na Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U
Line_Break=EX Line_Break=Exclamation LB=EX Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0
Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0
IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=ST Sentence_Break=STerm SB=ST
Word_Break=Other WB=XX Word_Break=XX _X_Begin
U+0021 ‹!› \N{EXCLAMATION MARK}
\pP \p{Po}
All Any ASCII Assigned Basic_Latin Common Zyyy Po P Gr_Base Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn Pattern_Syntax PatSyn
POSIX_Graph POSIX_Print POSIX_Punct Print Punctuation STerm Term Terminal_Punctuation X_POSIX_Graph X_POSIX_Print X_POSIX_Punct
Age=1.1 Block=Basic_Latin Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=ASCII BLK=ASCII Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=Na
East_Asian_Width=Narrow EA=Na Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U
Line_Break=EX Line_Break=Exclamation LB=EX Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0
Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0
IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=ST Sentence_Break=STerm SB=ST
Word_Break=Other WB=XX Word_Break=XX _X_Begin
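
For readers without the Perl tools, a small part of this lookup can be approximated with Python's standard `unicodedata` module. This is a sketch under a stated limitation: `unicodedata` exposes the name, general category, bidi class, and East Asian width, but not Sentence_Break or Terminal_Punctuation, so it cannot replace uniprops. The `describe` helper is a name made up for this example.

```python
import unicodedata


def describe(ch):
    """Return a few properties of one character, in the spirit of uniprops."""
    return {
        "codepoint": "U+{:04X}".format(ord(ch)),
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "east_asian_width": unicodedata.east_asian_width(ch),
    }


if __name__ == "__main__":
    for ch in ".?!":
        print(describe(ch))
```

The values it reports for these three characters should agree with the matching GC, Bidi_Class, and East_Asian_Width entries in the uniprops output above.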

I suspect you should be looking more closely at the Sentence_Break property as a whole.

There is also a third script in the suite, uninames, which does this sort of thing:
$ uninames sentence
; 037E GREEK QUESTION MARK
= erotimatiko
* sentence-final punctuation
* 003B is the preferred character
x (question mark - 003F)
: 003B semicolon
⁚ 205A TWO DOT PUNCTUATION
* historically used to indicate the end of a sentence or change of speaker
* extends from baseline to cap height
x (presentation form for vertical two dot leader - FE30)
x (greek acrophonic epidaurean two - 1015B)
𑂾 110BE KAITHI SECTION MARK
* marks end of sentence
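
A rough Python analogue of a name search is sketched below. It is an approximation with a clear gap: uninames also searches the annotations shown above (aliases, cross references, usage notes), whereas Python's `unicodedata` carries only the character names, so annotation-only hits such as "marks end of sentence" will not be found. The `find_by_name` helper is a hypothetical name for this example.

```python
import sys
import unicodedata


def find_by_name(fragment):
    """Return (code point, name) pairs whose character name contains fragment."""
    fragment = fragment.upper()
    hits = []
    for cp in range(sys.maxunicode + 1):
        # Unnamed code points (unassigned, surrogates, controls) yield "".
        name = unicodedata.name(chr(cp), "")
        if fragment in name:
            hits.append((cp, name))
    return hits


if __name__ == "__main__":
    for cp, name in find_by_name("full stop"):
        print("U+{:04X} {}".format(cp, name))
```

The scan walks every code point, so it takes a moment to run; the exact list depends on the Unicode version bundled with the Python interpreter.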

I find these three programs indispensable for exploring Unicode properties. You can install them as the CPAN Unicode::Tussle suite, or look at them individually here.

Regarding "parsing - Is there a character set of all international sentence-ending punctuation?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/9506869/
