
python - How to map/reduce with a sum and a max date?

Reposted. Author: 行者123. Updated: 2023-12-02 21:41:59

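I have a file that I need to map/reduce, where the output needs both a sum and the max of a date. I have the sum portion working, but I am not sure how to include the max date in the reduced output.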

The input data looks like this:

ID1,  ID2, date,                count
3000, 001, 2014-12-30 18:00:00, 2
3000, 001, 2015-01-01 10:00:00, 1
3000, 002, 2014-11-18 12:53:00, 5
3000, 002, 2014-12-20 20:14:00, 3

My mapper concatenates ID1 + ID2 so that rows can be grouped on the combined key. Its output looks like this:
key (ID1|ID2), value (count)
3000|001, 2
3000|001, 1
3000|002, 5
3000|002, 3
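
The mapping step described above can be sketched in Python. This is only a minimal sketch: it assumes comma-separated rows laid out like the sample input (the real files are tab-separated with different column positions), and `map_line` is a hypothetical helper name, not part of the original code.

```python
def map_line(line):
    """Turn one 'ID1, ID2, date, count' row into a (key, count) pair."""
    # Assumption: fields are comma-separated, as in the sample rows.
    id1, id2, _date, count = [field.strip() for field in line.split(",")]
    return f"{id1}|{id2}", int(count)


rows = [
    "3000, 001, 2014-12-30 18:00:00, 2",
    "3000, 001, 2015-01-01 10:00:00, 1",
    "3000, 002, 2014-11-18 12:53:00, 5",
    "3000, 002, 2014-12-20 20:14:00, 3",
]

for row in rows:
    key, count = map_line(row)
    # Emit tab-separated key/value pairs, one per input row.
    print(f"{key}\t{count}")
```

Note that the date is parsed but discarded here, which is why the reducer has nothing to take a maximum over.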

The reducer's output looks like this:
key (ID1|ID2), value (sum)
3000|001, 3
3000|002, 8

What I really need is output like this:
key (ID1|ID2), value (sum), date (max)
3000|001, 3, 2015-01-01 10:00:00
3000|002, 8, 2014-12-20 20:14:00
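
For reference, the desired result can be computed in memory with a plain dictionary. The sketch below is not a streaming reducer; it only pins down the expected values for the sample data, and `aggregate` and `FMT` are hypothetical names introduced for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout, taken from the sample rows


def aggregate(records):
    """Return {key: (sum of counts, latest date)} for (key, count, date) records."""
    result = {}
    for key, count, date in records:
        if key in result:
            total, latest = result[key]
            result[key] = (total + count, max(latest, date))
        else:
            result[key] = (count, date)
    return result


records = [
    ("3000|001", 2, datetime.strptime("2014-12-30 18:00:00", FMT)),
    ("3000|001", 1, datetime.strptime("2015-01-01 10:00:00", FMT)),
    ("3000|002", 5, datetime.strptime("2014-11-18 12:53:00", FMT)),
    ("3000|002", 3, datetime.strptime("2014-12-20 20:14:00", FMT)),
]

for key, (total, latest) in sorted(aggregate(records).items()):
    print(key, total, latest.strftime(FMT), sep=", ")
```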

The mapper and reducer are written in Ruby, but I would accept a working example in Python (I would translate it to Ruby).

Here is the mapper code:
require 'csv'

pattern = File.join(File.expand_path('data', File.dirname(__FILE__)), '*.txt')

Dir.glob(pattern).each do |file|
  # Options are passed as keyword arguments so this also runs on Ruby 3.
  CSV.foreach(file, col_sep: "\t", headers: false) do |row|
    puts [
      "#{row[6]}|#{row[3].rjust(8, '0')}", # key = ID1 | ID2
      row[7]                                # value = count
    ].join("\t")
  end
end

Reducer:
prev_key  = nil
key_total = 0

ARGF.each do |line|
  line = line.chomp
  # skip blank lines (chomp never returns nil, so test for empty instead)
  next if line.empty?

  (key, value) = line.split("\t")

  # check for new key
  if prev_key && key != prev_key && key_total > 0

    # output total for previous key
    puts [prev_key, key_total].join("\t")

    # reset key total for new key
    prev_key = key
    key_total = 0

  elsif !prev_key
    prev_key = key

  end

  # add to count for this current key
  key_total += value.to_i

end

# this is to catch the final counts after all records have been received
# (skipped when there was no input at all)
puts [prev_key, key_total].join("\t") if prev_key

Update

Here are the new mapper and reducer, based on the suggestions in the accepted answer:

Mapper:
require 'csv'

pattern = File.join(File.expand_path('data', File.dirname(__FILE__)), '*.txt')

Dir.glob(pattern).each do |file|
  # Options are passed as keyword arguments so this also runs on Ruby 3.
  CSV.foreach(file, col_sep: "\t", headers: false) do |row|
    date_time = "#{row[0]} #{row[1]}:00:00#{row[2]}" # %Y-%m-%d %H:%M:%S%z
    puts [
      "#{row[6]}|#{row[3].rjust(8, '0')}", # key = ID1 | ID2
      "#{row[7]}|#{date_time}",            # value = count | date_time
    ].join("\t")
  end
end

Reducer:
require 'date'

prev_key = nil
key_total = 0
dates = []

ARGF.each do |line|
  line = line.chomp
  # skip blank lines (chomp never returns nil, so test for empty instead)
  next if line.empty?

  (key, values) = line.split("\t")
  (value, date_time) = values.split('|')

  # check for new key
  if prev_key && key != prev_key && key_total > 0

    # output total and max date for previous key
    puts [prev_key.split('|'), key_total, dates.max].join("\t")

    # reset key total for new key
    prev_key = key
    key_total = 0

    # reset dates array for new key
    dates.clear

  elsif !prev_key
    prev_key = key

  end

  # add date to array for this current key
  dates << DateTime.strptime(date_time, '%Y-%m-%d %H:%M:%S%z')

  # add to count for this current key
  key_total += value.to_i

end

# this is to catch the final counts after all records have been received
# (skipped when there was no input at all)
puts [prev_key.split('|'), key_total, dates.max].join("\t") if prev_key

Best answer

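You just need to pair the date with the count and emit that pair as the value from the mapper. Then, in the reducer, extract the date and the count from the paired value. The sum is computed as before, and the max date is tracked across the input values for each key.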
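
The approach in this answer could look roughly like the following Python reducer. It is a sketch under stated assumptions, not the asker's code: the mapper is assumed to emit `key<TAB>count|date` lines already sorted by key (as a streaming framework would deliver them), and `parse` and `reduce_lines` are hypothetical names.

```python
from itertools import groupby


def parse(line):
    """Split 'key<TAB>count|date' into (key, count, date)."""
    key, value = line.rstrip("\n").split("\t", 1)
    count, date = value.split("|", 1)
    return key, int(count), date


def reduce_lines(lines):
    """Yield (key, total, max_date) for input already sorted by key."""
    parsed = (parse(line) for line in lines if line.strip())
    for key, group in groupby(parsed, key=lambda rec: rec[0]):
        total = 0
        max_date = ""
        for _, count, date in group:
            total += count
            # Assumption: dates are fixed-width 'YYYY-MM-DD HH:MM:SS' strings
            # in a single time zone, so string order is chronological order.
            max_date = max(max_date, date)
        yield key, total, max_date


sample = [
    "3000|001\t2|2014-12-30 18:00:00\n",
    "3000|001\t1|2015-01-01 10:00:00\n",
    "3000|002\t5|2014-11-18 12:53:00\n",
    "3000|002\t3|2014-12-20 20:14:00\n",
]

for key, total, max_date in reduce_lines(sample):
    print(f"{key}\t{total}\t{max_date}")
```

In a real job, `sys.stdin` could be passed to `reduce_lines` in place of `sample`. Tracking a single running maximum avoids holding every date for a key in memory; if the timestamps carry differing UTC offsets, they would need to be parsed before comparison rather than compared as strings.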

Regarding "python - How to map/reduce with a sum and a max date?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/28077686/
