Both tokenizers work in almost the same way: they split a given sentence into words. The main difference is that TweetTokenizer is aimed at social media text, so it keeps hashtags intact, while word_tokenize doesn't.
I hope the example below clears up your doubts:
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import word_tokenize
tt = TweetTokenizer()
tweet = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <-- @remy: This is waaaaayyyy too much for you!!!!!!"
print(tt.tokenize(tweet))
print(word_tokenize(tweet))
# output
# ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--', '@remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!']
# ['This', 'is', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-P', '<', '3', 'and', 'some', 'arrows', '<', '>', '-', '>', '<', '--', '@', 'remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!', '!', '!', '!']
You can see that word_tokenize splits #dummysmiley into '#' and 'dummysmiley', while TweetTokenizer keeps it as the single token '#dummysmiley'. The same goes for emoticons such as ':-)' and handles such as '@remy'. TweetTokenizer is built mainly for analyzing tweets.
You can learn more about tokenizers from this link.
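Since TweetTokenizer is aimed at tweets, it may also be worth knowing that its constructor takes a few tweet-specific options. The snippet below is only a minimal sketch, assuming NLTK is installed; it uses the `strip_handles` and `reduce_len` keyword arguments of `TweetTokenizer`, and the variable names (`default_tt`, `clean_tt`) are just illustrative.

```python
from nltk.tokenize import TweetTokenizer

tweet = "@remy: This is waaaaayyyy too much for you!!!!!!"

# Default settings: handles and repeated characters are left untouched.
default_tt = TweetTokenizer()
default_tokens = default_tt.tokenize(tweet)

# strip_handles=True drops @mentions; reduce_len=True shortens runs of
# three or more repeated characters down to three.
clean_tt = TweetTokenizer(strip_handles=True, reduce_len=True)
clean_tokens = clean_tt.tokenize(tweet)

print(default_tokens)
print(clean_tokens)
```

Whether you want these options depends on the analysis: stripping handles helps when usernames are noise, but loses information if you care about who is being addressed.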
It also seems to handle contracted negations ("isn't", for example) differently:
from nltk.tokenize import TweetTokenizer, word_tokenize

text = ("The quick brown fox isn't jumping over the lazy dog, co-founder "
        "multi-word expression. #yes!")

# word_tokenize splits the contraction into 'is' and "n't"
standard_nltk = word_tokenize(text)
print(standard_nltk)
# output: ['The', 'quick', 'brown', 'fox', 'is', "n't", 'jumping', 'over',
# 'the', 'lazy', 'dog', ',', 'co-founder', 'multi-word', 'expression', '.',
# '#', 'yes', '!']

# TweetTokenizer keeps "isn't" and '#yes' as single tokens
tweet_tokenizer = TweetTokenizer()
twitter_nltk = tweet_tokenizer.tokenize(text)
print(twitter_nltk)
# output: ['The', 'quick', 'brown', 'fox', "isn't", 'jumping', 'over',
# 'the', 'lazy', 'dog', ',', 'co-founder', 'multi-word', 'expression', '.',
# '#yes', '!']
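To get a feel for why a regex-based tokenizer can keep "isn't" and "#yes" whole, here is a toy sketch using only the standard library. It is not NLTK's actual implementation; `TOY_PATTERN` and `toy_tokenize` are made-up names, and the pattern is far simpler than anything a real tokenizer uses. The point is just that the alternatives are tried in order, so the hashtag/handle pattern and the "word with inner apostrophe or hyphen" pattern win before the catch-all single-character pattern.

```python
import re

# Alternatives are tried left to right at each position.
TOY_PATTERN = re.compile(
    r"""
    [#@]\w+                          # hashtags and @handles stay whole
    | [A-Za-z]+(?:['-][A-Za-z]+)*    # words, allowing inner apostrophes/hyphens
    | \d+                            # runs of digits
    | \S                             # any other single non-space character
    """,
    re.VERBOSE,
)


def toy_tokenize(text):
    """Return all non-overlapping matches of TOY_PATTERN, left to right."""
    return TOY_PATTERN.findall(text)


text = ("The quick brown fox isn't jumping over the lazy dog, co-founder "
        "multi-word expression. #yes!")
tokens = toy_tokenize(text)
print(tokens)
```

word_tokenize, by contrast, follows Penn Treebank conventions, which is why it splits the contraction into 'is' and "n't".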
In addition to this answer, another great tutorial on TweetTokenizer can be found here; it focuses on the problems of tokenizing social media data.