python - Extracting specific lines from a large file

Reposted. Author: 塔克拉玛干. Updated: 2023-11-03 06:14:44
I have a large file (5,000,000 lines) in the format:

'User ID,Mov ID,Rating,Timestamp'

I have another file (200,000 lines) with far fewer records, of the form:

'User ID, Mov ID'

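I have to generate a new file from the first one: if a (User ID, Mov ID) pair from the second file matches a record among the 5,000,000 lines of the first file, that record must not be included in the new file. In other words, the new file contains only records whose (User ID, Mov ID) pair does not appear in file 2 (200,000 lines).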

I am trying this naive approach, but it takes far too long. Is there a faster algorithm for this?

from sys import argv
import re

script, filename1, filename2 = argv

# Open the input files
testing_small = open(filename1)
ratings = open(filename2)
# Open the file to write the data to
ratings_training = open("ratings_training.csv", 'w')

for line_rating in ratings:
    flag = 0
    # Rescan the whole small file for every rating line (this is the slow part)
    testing_small.seek(0)
    for line_test in testing_small:
        matched_line = re.match(line_test.rstrip(), line_rating)
        if matched_line:
            flag = 1
            break
    if flag == 0:
        ratings_training.write(line_rating)

testing_small.close()
ratings.close()
ratings_training.close()

I can also use any Spark-based approach.

Best answer

For example:

# df1:
User_ID,Mov_ID,Rating,Timestamp
sam,apple,0.6,2017-03-17 09:04:39
sam,banana,0.7,2017-03-17 09:04:39
tom,apple,0.3,2017-03-17 09:04:39
tom,pear,0.9,2017-03-17 09:04:39

# df2:
User_ID,Mov_ID
sam,apple
sam,pear
tom,apple

In pandas:

import pandas as pd

df1 = pd.read_csv('./disk_file')
df2 = pd.read_csv('./tmp_file')
res = pd.merge(df1, df2, on=['User_ID', 'Mov_ID'], how='left', indicator=True)
res = res[res['_merge'] == 'left_only']
print(res)

Or in Spark:

from pyspark import SparkConf
from pyspark.sql import SparkSession

cfg = SparkConf().setAppName('MyApp')
spark = SparkSession.builder.config(conf=cfg).getOrCreate()

df1 = spark.read.load(path='file:///home/zht/PycharmProjects/test/disk_file', format='csv', sep=',', header=True)
df2 = spark.read.load(path='file:///home/zht/PycharmProjects/test/tmp_file', format='csv', sep=',', header=True)
res = df1.join(df2, on=[df1['User_ID'] == df2['User_ID'], df1['Mov_ID'] == df2['Mov_ID']], how='left_outer')
# Keep only the df1 rows with no match in df2 (their df2 columns are null after the left join)
res = res.filter(df2['User_ID'].isNull())
res.show()
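
If neither library is available, the same filtering can be sketched in plain Python. This is not part of the original answer; it is a minimal sketch that assumes the 200,000-line file fits in memory as a set of (User ID, Mov ID) pairs and that neither file has a header row. The function name filter_ratings and its parameters are hypothetical.

```python
import csv


def filter_ratings(ratings_path, exclude_path, out_path):
    """Copy rows from ratings_path whose (User ID, Mov ID) pair is absent from exclude_path.

    Assumes no header rows and that the first two columns are User ID and Mov ID.
    Returns the number of rows written.
    """
    # Load the smaller file once into a set; membership tests are O(1) on average.
    with open(exclude_path, newline="") as f:
        exclude = {
            (row[0].strip(), row[1].strip())
            for row in csv.reader(f)
            if len(row) >= 2
        }

    kept = 0
    # Stream the large file line by line so it never has to fit in memory.
    with open(ratings_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        for row in csv.reader(src):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            if (row[0].strip(), row[1].strip()) in exclude:
                continue  # pair appears in the exclusion file
            writer.writerow(row)
            kept += 1
    return kept
```

A call such as filter_ratings("ratings.csv", "testing_small.csv", "ratings_training.csv") (input file names are placeholders) makes one pass over each file, instead of rescanning the small file for every line of the large one as the nested loop in the question does.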

Regarding python - Extracting specific lines from a large file, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42845292/