python - Cleaning Twitter data with pandas in Python

Repost. Author: 行者123. Updated: 2023-12-05 09:36:56

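I am trying to clean Twitter data in a pandas DataFrame, and I seem to be missing a step. After processing all the tweets, I think I am failing to overwrite the old tweets with the new ones. When I save the file, the tweets have not changed. What am I missing?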

import pandas as pd
import re
import emoji
import nltk
nltk.download('words')
words = set(nltk.corpus.words.words())

trump_df = pd.read_csv('new_Trump.csv')
for tweet in trump_df['tweet']:
    tweet = re.sub("@[A-Za-z0-9]+","",tweet) #Remove @ sign
    tweet = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", tweet) #Remove http links
    tweet = " ".join(tweet.split())
    tweet = ''.join(c for c in tweet if c not in emoji.UNICODE_EMOJI) #Remove Emojis
    tweet = tweet.replace("#", "").replace("_", " ") #Remove hashtag sign but keep the text
    tweet = " ".join(w for w in nltk.wordpunct_tokenize(tweet) \
        if w.lower() in words or not w.isalpha()) #Remove non-english tweets (not 100% success)
    print(tweet)
trump_df.to_csv('new_Trump.csv')

Best answer

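As you suspected, you never store the cleaned data back. Let's create a function that does all the work and then apply it to the column with map. This is more efficient than looping over every value in the DataFrame and storing it in a list (Option B).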

def cleaner(tweet):
    tweet = re.sub("@[A-Za-z0-9]+", "", tweet)  # Remove @mentions
    tweet = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", tweet)  # Remove links
    tweet = " ".join(tweet.split())  # Collapse repeated whitespace
    # Relies on emoji.UNICODE_EMOJI, which newer releases of the emoji package
    # no longer provide; emoji.replace_emoji(tweet, replace='') is the replacement there.
    tweet = ''.join(c for c in tweet if c not in emoji.UNICODE_EMOJI)  # Remove emojis
    tweet = tweet.replace("#", "").replace("_", " ")  # Remove hashtag sign but keep the text
    # Keep English words and non-alphabetic tokens (not 100% reliable)
    tweet = " ".join(w for w in nltk.wordpunct_tokenize(tweet)
                     if w.lower() in words or not w.isalpha())
    return tweet

trump_df['tweet'] = trump_df['tweet'].map(cleaner)  # Write the cleaned text back
trump_df.to_csv('')  # specify location

This overwrites the tweet column with the cleaned text.
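
To see why the original loop left the file unchanged, here is a minimal sketch on a toy DataFrame. The frame `demo_df`, its sample tweets, and the simplified `strip_mentions` helper are assumptions made for illustration and are not part of the original data or answer. Rebinding the loop variable only changes a local name, whereas assigning the result of `map` replaces the column.

```python
import re

import pandas as pd

# Hypothetical sample data standing in for new_Trump.csv.
demo_df = pd.DataFrame({"tweet": ["@user1 hello   world", "good morning @user2"]})


def strip_mentions(tweet):
    """Simplified stand-in for cleaner(): drop @mentions, collapse whitespace."""
    tweet = re.sub(r"@[A-Za-z0-9]+", "", tweet)
    return " ".join(tweet.split())


# Rebinding the loop variable does not touch the DataFrame.
for tweet in demo_df["tweet"]:
    tweet = strip_mentions(tweet)
unchanged = demo_df["tweet"].tolist()

# Assigning the mapped Series back replaces the column.
demo_df["tweet"] = demo_df["tweet"].map(strip_mentions)
changed = demo_df["tweet"].tolist()

print(unchanged)
print(changed)
```

The first list still contains the raw tweets, and the second contains the cleaned ones, which is the difference between the question's loop and the answer's assignment.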

Option B:

As mentioned, I think this will prove less efficient, but it is as simple as creating a list before the for loop and filling it with each cleaned tweet.

clean_tweets = []
for tweet in trump_df['tweet']:
    tweet = re.sub("@[A-Za-z0-9]+","",tweet) #Remove @ sign
    ##Here's where all the cleaning takes place
    clean_tweets.append(tweet)
trump_df['tweet'] = clean_tweets
trump_df.to_csv('') #Specify location
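
As a quick check that the two options differ in style rather than in result, the sketch below runs both on the same toy column. The names `sample_df` and `strip_handles_and_links` and the sample tweets are hypothetical, and the helper applies only the link and mention regex plus whitespace cleanup, not the full cleaning pipeline.

```python
import re

import pandas as pd

# Hypothetical sample tweets used only for this comparison.
sample_df = pd.DataFrame({"tweet": ["@alice thanks!", "see https://t.co/abc now"]})


def strip_handles_and_links(tweet):
    """Apply the link/mention regex from the answer, then collapse whitespace."""
    tweet = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", tweet)
    return " ".join(tweet.split())


# Map-based approach: build the cleaned column in one call.
via_map = sample_df["tweet"].map(strip_handles_and_links).tolist()

# Option B: fill a list inside an explicit loop.
clean_tweets = []
for tweet in sample_df["tweet"]:
    clean_tweets.append(strip_handles_and_links(tweet))

print(via_map == clean_tweets)
```

Either result can be assigned to the column; what matters is that the cleaned values are written back before calling `to_csv`.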

For python - cleaning Twitter data with pandas, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/64719706/
