
python - NLTK-Python : How to format a raw text

Reposted. Author: 太空宇宙. Updated: 2023-11-04 04:22:44

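Do you know whether I can use NLTK (or any other NLP library) with Python to format a raw text (no punctuation, no capital letters, and no line breaks between paragraphs)?

I have gone through the documentation but could not find anything that helps with this task.

Example:

Input: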

python is an interpreted high-level general-purpose programming language created by guido van rossum and first released in 1991 python has a design philosophy that emphasizes code readability notably using significant whitespace it provides constructs that enable clear programming on both small and large scales in July 2018, van rossum stepped down as the leader in the language community

Output:

Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales. In July 2018, Van Rossum stepped down as the leader in the language community.

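Thanks,

Best answer

Interesting question. As for inserting the boundaries, you could train NLTK's tokenizer (or sentence splitter); there is plenty of documentation about it if you google it. One thing you could try is to take some sentence-split text, remove the punctuation, then train on it and see what you get. Something like the code below. As mentioned, the algorithm probably relies heavily on punctuation, and in any case the code below does not work for your example sentence, but if you train on other, larger, or different-domain text it may be worth a try. I am not entirely sure whether this would also work for inserting commas and other punctuation that is not sentence-final or sentence-initial.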

from nltk.corpus import gutenberg
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
import re

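# build one training text from every file in the Gutenberg corpus
# (requires the corpus to be installed, e.g. via nltk.download('gutenberg'))
text = ""
for file_id in gutenberg.fileids():
    text += gutenberg.raw(file_id)
# remove sentence-final punctuation at the end of a line
text = re.sub(r'[.?!]\n', '\n', text) # you will probably want to include some other potential sentence final punctuation here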
trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True
trainer.train(text)
tokenizer = PunktSentenceTokenizer(trainer.get_params())
sentences = "python is an interpreted high-level general-purpose programming language created by guido van rossum and first released in 1991 python has a design philosophy that emphasizes code readability notably using significant whitespace it provides constructs that enable clear programming on both small and large scales in July 2018, van rossum stepped down as the leader in the language community"
print(tokenizer.tokenize(sentences))
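
Once some splitter has produced a list of sentences, the capitalization and final-period part of the question can be handled with plain string operations. The following is only a minimal sketch under that assumption: `format_sentences` is a hypothetical helper (not part of NLTK), it expects the boundaries to already be correct, and it does not restore commas or capitalize proper nouns such as "guido van rossum".

```python
def format_sentences(sentences):
    """Capitalize each sentence and make sure it ends with punctuation.

    Hypothetical helper for illustration: `sentences` is assumed to be a
    list of strings, e.g. the output of tokenizer.tokenize(...).
    """
    formatted = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty fragments
        # upper-case only the first character, leave the rest untouched
        sentence = sentence[0].upper() + sentence[1:]
        # add a period unless the sentence already ends with . ? or !
        if sentence[-1] not in ".?!":
            sentence += "."
        formatted.append(sentence)
    return " ".join(formatted)


example = ["python has a design philosophy", "it provides constructs"]
print(format_sentences(example))
# prints: Python has a design philosophy. It provides constructs.
```

With the trained tokenizer from the answer, this could be called as `format_sentences(tokenizer.tokenize(sentences))`; the quality of the result depends entirely on how well the boundaries were found.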

Regarding python - NLTK-Python : How to format a raw text, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54139341/
