python - NLTK PunktSentenceTokenizer ellipsis splitting

Reposted by: 行者123 | Updated: 2023-11-28 19:18:44


I am using NLTK's PunktSentenceTokenizer and have run into a case where the text contains several sentences separated by an ellipsis (...). Here is the example I am working with:

>>> from nltk.tokenize import PunktSentenceTokenizer
>>> pst = PunktSentenceTokenizer()
>>> pst.sentences_from_text("Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...")
['Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...']

As you can see, the sentences are not separated. Is there a way to make this work as I expect (that is, to return a list of four items)?

Additional information: I tried the debug_decisions method to understand why the tokenizer made this decision. I got the following result:

>>> g = pst.debug_decisions("Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...")

>>> [x for x in g]
[{'break_decision': None,
  'collocation': False,
  'period_index': 27,
  'reason': 'default decision',
  'text': 'service... Cashier',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'cashier',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'},
 {'break_decision': None,
  'collocation': False,
  'period_index': 47,
  'reason': 'default decision',
  'text': 'rude... Drive',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'drive',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'},
 {'break_decision': None,
  'collocation': False,
  'period_index': 72,
  'reason': 'default decision',
  'text': 'hours... The',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'the',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'}]

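Unfortunately, I cannot work out what these dictionaries mean. The tokenizer does seem to detect the ellipses, but for some reason it decides not to split the sentences at them. Any ideas?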

Thanks!
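The records above are easier to compare when reduced to the fields that vary or explain the outcome. The sketch below uses `summarize_decisions`, a hypothetical helper that is not part of NLTK, and runs on one record copied from the output above so that it works without NLTK installed. In every record the reason is 'default decision', `type2_is_sent_starter` is False, and the orthographic heuristic is 'unknown'. That appears consistent with a tokenizer built without any training text, though this reading is an assumption rather than something the question confirms.

```python
def summarize_decisions(decisions):
    """Reduce Punkt debug records to (text, break_decision, reason, ortho_heuristic)."""
    return [
        (d["text"], d["break_decision"], d["reason"], d["type2_ortho_heuristic"])
        for d in decisions
    ]


# One record copied from the debug output in the question.
sample = [{
    "break_decision": None,
    "collocation": False,
    "period_index": 27,
    "reason": "default decision",
    "text": "service... Cashier",
    "type1": "...",
    "type1_in_abbrs": False,
    "type1_is_initial": False,
    "type2": "cashier",
    "type2_is_sent_starter": False,
    "type2_ortho_contexts": set(),
    "type2_ortho_heuristic": "unknown",
}]

summary = summarize_decisions(sample)
for row in summary:
    print(row)
```

With NLTK available, the same helper should accept the generator directly, as in `summarize_decisions(pst.debug_decisions(text))`.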

Best answer

Why not just use the split function? str.split('...')
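One caveat with a plain `str.split('...')` is that it discards the ellipsis itself, leaves leading spaces on the pieces, and produces a trailing empty string when the text ends in "...". A minimal sketch of an alternative, assuming every ellipsis followed by whitespace should count as a sentence boundary, is shown below. The name `split_after_ellipsis` is illustrative and not an NLTK function.

```python
import re

# Zero-width lookbehinds keep the ellipsis attached to the sentence it ends.
# Both "..." and the single character "…" (U+2026) are treated as an ellipsis.
_ELLIPSIS_BOUNDARY = re.compile(r"(?:(?<=\.\.\.)|(?<=…))\s+")


def split_after_ellipsis(text):
    """Split text at whitespace that directly follows an ellipsis."""
    return [part for part in _ELLIPSIS_BOUNDARY.split(text) if part]


text = ("Horrible customer service... Cashier was rude... "
        "Drive thru took hours... The tables were not clean...")

naive = text.split("...")          # drops the dots, keeps a trailing ''
kept = split_after_ellipsis(text)  # four sentences, dots preserved

print(naive)
print(kept)
```

This rule is deliberately blunt: it also splits at an ellipsis in the middle of a sentence (for example "Well... maybe"), so it suits review-style text like the example better than general prose.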

Edit: I got it working by training the tokenizer on the Reuters corpus; I think you could train it on your own corpus instead:

from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import reuters  # needs the corpus data: nltk.download('reuters')

pst = PunktSentenceTokenizer()
# Train Punkt on raw Reuters text instead of using it untrained.
pst.train(reuters.raw())
text = "Batts did not take questions or give details of the report's findings... He did say that the city's police department would continue to work on the case under the direction of the prosecutor's office. Gray was injured around the time he was arrested by Baltimore police and put in a police van on 12 April."
print(pst.sentences_from_text(text))

Result:

["Batts did not take questions or give details of the report's findings...", "He did say that the city's police department would continue to work on the case under the direction of the prosecutor's office.", 'Gray was injured around the time he was arrested by Baltimore police and put in a police van on 12 April.']

Regarding python - NLTK PunktSentenceTokenizer ellipsis splitting, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29970846/
