
python - Working with CSV: reading and writing data in the right order

Reposted. Author: 太空宇宙. Updated: 2023-11-03 16:46:14

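I have two .csv files of Twitter data. One contains tweet text; the other contains the IDs of those tweets. The text file is a smaller sample of the tweets in the file with the IDs. I'm trying to write a script that reads the text, looks up the corresponding ID in the other file, and then writes a new .csv file containing both the ID and the text for the smaller sample.

Here is what I have so far: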

import csv

# creates empty dictionary in which to store tweetIDs and tweet text
originals_data = {}

# declares an empty list to hold tweet text from coded datafile
# will be used to compare against the dictionary created earlier
coded_data = []
coded_all = []  # for all, not just text

# list to hold the IDs belonging to coded tweets for the round
tweet_IDs_for_coded = []

with open('first20.csv', 'rt') as round_in, open('gg_originals.csv', 'rt') as original_in:

    # reader object for gg_originals
    readOrigin = csv.reader(original_in, delimiter=',')
    # adds values from .csv file into the dictionary
    for row in readOrigin:
        originals_data[row[0]] = row[1]

    # reader object for round_x data
    readRound = csv.reader(round_in, delimiter=",")
    # appends the tweet text to a list
    for row in readRound:
        coded_data.append(row[0])

# iterates over id:text dictionary
for tweet_id in originals_data:
    # iterates over coded_data
    for tweet in coded_data:
        # When tweet in list matches text in dict, sends key to list
        if tweet == originals_data[tweet_id]:
            tweet_IDs_for_coded.append(tweet_id)

with open('first20.csv', 'rt') as round_in, open('test2.csv', 'wt') as output:
    # reader object for round_x data
    readRound = csv.reader(round_in, delimiter=",")
    # creates writer object to write new csv file with IDs
    writeNew = csv.writer(output, delimiter=",")
    # list that holds everything that's going into the csv file
    everything = []
    # sets row to equal a single row from round data
    row = next(readRound)
    row.insert(0, 'ID')
    # appends ID and then all existing data to list of rows
    everything.append(row)
    for i, row in enumerate(readRound):
        everything.append([str(tweet_IDs_for_coded[i])] + row)
    writeNew.writerows(everything)

The data in the population file (gg_originals.csv) looks like this:

tweet_id_str,text
534974890168700930,abcd
534267820071084033,abce
539572102441877504,abcf
539973576108294145,abcg
529278820876943361,abch
529583601244176384,abci
535172191743397888,abcj
532195210059874304,abck
537812033895669760,abcl
,
,

The text-only file is a subset of the population and looks like this:

text
abcl
abci
abcd

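What I've run so far does seem to get the right IDs, and it even writes them to a new column in the new .csv file. However, the IDs in the new file are not on the right rows: they show up next to text they don't actually correspond to, which is bad!

The new file should look like this: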

ID,text
537812033895669760,abcl
529583601244176384,abci
534974890168700930,abcd

Instead, it ends up like this:

ID,text
529583601244176384,abcl
537812033895669760,abci
534974890168700930,abcd

The correct IDs are found, but they are written to the wrong rows.

Best answer

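OK, this code (I think) does what you're trying to do. The reason I asked about your operating system is that 'wt' produces a double-spaced csv on Windows, so I had to use 'wb'. Also, putting an uppercase "ID" in cell A1 causes a file-type problem when the file is opened in Excel. All good fun :)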
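
A side note on the 'wb' workaround: it applies to Python 2. In Python 3 the csv module expects a text-mode file, and the documented way to avoid the doubled line endings is to open the output with newline=''. A minimal sketch, assuming Python 3; the output path and the 'tweet_id' header are made up for illustration, the header being chosen only so the file does not begin with the characters "ID", which Excel tends to read as a SYLK signature:

```python
import csv
import os
import tempfile

rows = [
    ["tweet_id", "text"],  # hypothetical header that does not start with "ID"
    ["537812033895669760", "abcl"],
    ["529583601244176384", "abci"],
]

# Hypothetical output path in a temp directory, used only for this sketch.
out_path = os.path.join(tempfile.mkdtemp(), "newline_demo.csv")

# newline="" stops Python from translating the "\r\n" that csv.writer emits,
# which is what produces "\r\r\n" (blank rows) on Windows otherwise.
with open(out_path, "w", newline="") as outfile:
    csv.writer(outfile).writerows(rows)

# Read the raw bytes back to inspect the line endings.
with open(out_path, "rb") as f:
    raw = f.read()

print(raw)
```

With this, each row ends in a single "\r\n" on every platform, so the file should no longer look double-spaced in Excel.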

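I ended up not having time to both track down your bug and still give an answer, so I've written the answer, and if I get a chance I'll come back and highlight where your version goes out of sync (I had never run into the SYLK error in Excel before, so I got distracted!).

I swapped your dictionary around. The tweet text itself is now the dictionary key, so there is no need to loop over the dictionary any more. It also means you only have to open first20.csv once. Your original approach was a bit convoluted.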
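
For what it's worth, the most likely place it goes out of sync is the nested loop: the outer loop walks the dictionary, so the IDs are collected in dictionary order (insertion order on Python 3.7+, arbitrary on older versions) rather than in the order of the sample file. A small sketch with the values from the question held in memory (headers left out, variable names are mine):

```python
# Population: (tweet ID, text) pairs, as in gg_originals.csv (header omitted).
originals = [
    ("534974890168700930", "abcd"),
    ("529583601244176384", "abci"),
    ("537812033895669760", "abcl"),
]
# Sample: tweet texts in the order they must appear in the output.
sample_texts = ["abcl", "abci", "abcd"]

id_to_text = dict(originals)

# Original approach: the outer loop is over the dictionary, so the IDs
# come out in dictionary order, not in sample order.
ids_dict_order = [
    tweet_id
    for tweet_id, text in id_to_text.items()
    for tweet in sample_texts
    if tweet == text
]

# Lookup keyed by text and driven by the sample: IDs come out in sample order.
text_to_id = {text: tweet_id for tweet_id, text in originals}
ids_sample_order = [text_to_id[text] for text in sample_texts]

print(ids_dict_order)
print(ids_sample_order)
```

Both lists contain the same IDs, which matches what you saw: the right IDs, attached to the wrong rows.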

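import csv

# map tweet text -> tweet ID so each sample row can be looked up directly
with open('gg_originals.csv', 'rt') as original_in:
    readOrigin = csv.reader(original_in, delimiter=',')
    originals_data = {row[1]: row[0] for row in readOrigin}

# read the sample texts, keeping the order they have in the file
with open('first20.csv', 'rt') as round_in:
    input_data = csv.reader(round_in)
    data_to_match = [row[0] for row in input_data]

# pair each text with its ID, in sample order (each output row is text, ID);
# the two header rows pair up as well, giving text,tweet_id_str as first row
compiled_list = []
for item in data_to_match:
    compiled_list.append([item, originals_data[item]])

# newline='' avoids the double-spaced rows on Windows in Python 3
with open('testoutput.csv', 'wt', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(compiled_list)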
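
If you want the ID in the first column and something more forgiving of the filler rows (the lines containing only ",") or of texts that have no match, a variant along these lines might work. This is only a sketch: build_rows and the 'tweet_id' header are names I made up, and it reads from in-memory strings so it can be run without the real files.

```python
import csv
import io


def build_rows(originals_file, sample_file):
    """Return [ID, text] rows in the order of the sample file."""
    text_to_id = {}
    for row in csv.DictReader(originals_file):
        # Skip filler rows such as "," where either field is empty or missing.
        if row["tweet_id_str"] and row["text"]:
            text_to_id[row["text"]] = row["tweet_id_str"]

    # Hypothetical header; it avoids starting the file with "ID" for Excel.
    rows = [["tweet_id", "text"]]
    for row in csv.DictReader(sample_file):
        # .get() leaves the ID blank instead of raising KeyError on a miss.
        rows.append([text_to_id.get(row["text"], ""), row["text"]])
    return rows


originals_csv = io.StringIO(
    "tweet_id_str,text\n"
    "534974890168700930,abcd\n"
    "529583601244176384,abci\n"
    "537812033895669760,abcl\n"
    ",\n"
)
sample_csv = io.StringIO("text\nabcl\nabci\nabcd\n")

result = build_rows(originals_csv, sample_csv)
for line in result:
    print(line)
```

With real files you would pass the opened file objects instead of the StringIO ones. One caveat: if two tweets have exactly the same text, the later ID silently replaces the earlier one in the dictionary.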

Regarding "python - Working with CSV: reading and writing data in the right order", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36267997/
