
python - I'm trying to tokenize a text/json file. Not sure why, but only the first tweet gets tokenized. Code below


from nltk.tokenize import word_tokenize
import json
import re

emoticons_str = r"""
    (?:
        [:=;]              # Eyes
        [oO\-]?            # Nose (optional)
        [D\)\]\(\]/\\OpP]  # Mouth
    )"""

regex_str = [
    emoticons_str,
    r'<[^>]+>',  # HTML tags
    r'(?:@[\w_]+)',  # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",  # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',  # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])",  # words with - and '
    r'(?:[\w_]+)',  # other words
    r'(?:\S)'  # anything else
]

tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)


def tokenize(s):
    return tokens_re.findall(s)


def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens


#tweet = "RT @marcobonzanini: just an example! :D http://example.com #NLP"
#print(preprocess(tweet))
# ['RT', '@marcobonzanini', ':', 'just', 'an', 'example', '!', ':D', 'http://example.com', '#NLP']
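The lowercase flag can be checked the same way. Below is a minimal self-contained sketch (the regex list is abbreviated to a few of the patterns above, and the sample tweet is made up) showing that emoticons keep their case while every other token is lowercased:

```python
import re

# Abbreviated subset of the patterns above: emoticons are matched first
# and kept as-is; everything else gets lowercased.
emoticons_str = r"""
    (?:
        [:=;]              # Eyes
        [oO\-]?            # Nose (optional)
        [D\)\]\(\]/\\OpP]  # Mouth
    )"""

patterns = [emoticons_str, r'(?:@[\w_]+)', r'(?:\#+[\w_]+)',
            r'(?:[\w_]+)', r'(?:\S)']
tokens_re = re.compile(r'(' + '|'.join(patterns) + r')',
                       re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^' + emoticons_str + r'$',
                         re.VERBOSE | re.IGNORECASE)

tweet = "Loving #NLP :D @User"
tokens = [t if emoticon_re.search(t) else t.lower()
          for t in tokens_re.findall(tweet)]
print(tokens)  # ['loving', '#nlp', ':D', '@user']
```

Note that `:D` survives untouched because `emoticon_re.search` matches it, while `@User` and `#NLP` are lowercased.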

with open('../script/iphone.txt', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])
        # do_something_else(tokens)
        print(json.dumps(tokens, indent=4))

This is how the output looks (screenshot: only the first tweet's tokens are printed).

Best answer

I ran into the same problem, and it's because your json file has a blank line sitting between the JSON records. Try adding:

newline='\r\n'

So the code that reads the json file looks like this:

with open('data/stream_sample.json', 'r', newline='\r\n') as f:
    for line in f:
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])
        print(tokens)
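If you'd rather not depend on the file's newline convention at all, an alternative sketch (this is not the answerer's code; `iter_tweets` and the inline sample data are hypothetical) is to skip blank or malformed lines explicitly, which tolerates the stray empty line whether it is `\n` or `\r\n`:

```python
import io
import json

def iter_tweets(f):
    """Yield one parsed tweet per non-blank, well-formed JSON line."""
    for line in f:
        line = line.strip()
        if not line:                  # the blank line that broke parsing
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:  # skip truncated/corrupt records
            continue

# Simulated stream file with a blank line between records.
sample = io.StringIO('{"text": "first tweet"}\r\n\r\n{"text": "second tweet"}\r\n')
texts = [tweet['text'] for tweet in iter_tweets(sample)]
print(texts)  # ['first tweet', 'second tweet']
```

This way both records survive the empty line instead of parsing stopping after the first one.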

Hope this helps.

Regarding "python - I'm trying to tokenize a text/json file. Not sure why, but only the first tweet gets tokenized. Code below", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/36257964/
