
ruby - What is the semantically correct way to parse CSV from SQL Server 2008?

Reposted. Author: 数据小太阳. Updated: 2023-10-29 08:04:57

I have a CSV dump from SQL Server 2008 that contains lines like these:

Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00
Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00
Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00

parse_dbenhur is pretty, but can it be rewritten to support fields that contain both commas and quotes? parse_ugly is, well, ugly.

# @dbenhur's excellent answer, which works 100% for what i originally asked for
SEP = /(?:,|\Z)/
QUOTED = /"([^"]*)"/
UNQUOTED = /([^,]*)/
FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/
def parse_dbenhur(line)
  line.scan(FIELD)[0...-1].map { |matches| matches[0] || matches[1] }
end

def parse_ugly(line)
  dumb_fields = line.chomp.split(',').map { |v| v.gsub(/\s+/, ' ') }
  fields = []
  open = false
  dumb_fields.each_with_index do |v, i|
    open ? fields.last.concat(v) : fields.push(v)
    open = (v.start_with?('"') and (v.count('"') % 2 == 1) and dumb_fields[i+1] and dumb_fields[i+1].start_with?(' ')) || (open and !v.end_with?('"'))
  end
  fields.map { |v| (v.start_with?('"') and v.end_with?('"')) ? v[1..-2] : v }
end

lines = []
lines << 'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00'
lines << 'Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00'
lines << 'Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00'
lines << 'Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00'
lines << 'Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00'

require 'csv'
lines.each do |line|
  puts
  puts line
  begin
    c = CSV.parse_line(line)
    puts "#{c.to_csv.chomp} (size #{c.length})"
  rescue
    puts "FasterCSV says: #{$!}"
  end
  a = parse_ugly(line)
  puts "#{a.to_csv.chomp} (size #{a.length})"
  b = parse_dbenhur(line)
  puts "#{b.to_csv.chomp} (size #{b.length})"
end

Here is the output when I run it:

Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
FasterCSV says: Illegal quoting in line 1.
Plumbing,196222006P,"REPLACE LEAD WATER SERVICE W/1"" COPPER",1996-08-09 00:00:00 (size 4)
Plumbing,196222006P,"REPLACE LEAD WATER SERVICE W/1"" COPPER",1996-08-09 00:00:00 (size 4)

Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
FasterCSV says: Unclosed quoted field on line 1.
Construction,197133031B,"""MORGAN SHOES"" ALT",1997-05-13 00:00:00 (size 4)
Construction,197133031B,"""MORGAN SHOES"" ALT",1997-05-13 00:00:00 (size 4)

Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00
FasterCSV says: Missing or stray quote in line 1
Electrical,197135021E,"SERVICE ""OUTLETS""",1997-05-15 00:00:00 (size 4)
Electrical,197135021E,"""SERVICE"," ""OUTLETS""""",1997-05-15 00:00:00 (size 5)

Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00
FasterCSV says: Missing or stray quote in line 1
Electrical,197135021E,"SERVICE ""OUTLETS"" FOOBAR",1997-05-15 00:00:00 (size 4)
Electrical,197135021E,"""SERVICE"," ""OUTLETS"" FOOBAR""",1997-05-15 00:00:00 (size 5)

Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00
Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00 (size 4)
Construction,198120036B,"""""MERITER""","""DO IT CTR"""," """"NCR"""" AND """"TRACE"""" ALTERATION""",1998-04-30 00:00:00 (size 6)
Construction,198120036B,"""""""MERITER""""","""""DO IT CTR"""""," """"NCR"""" AND """"TRACE"""" ALTERATION""",1998-04-30 00:00:00 (size 6)

Update

Note that the CSV uses double quotes when a field contains a comma.
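For contrast, here is a minimal sketch of how Ruby's standard csv library writes and reads a field that contains both a comma and quotes. Standard quoting wraps the field in double quotes and doubles every inner quote, which is what the SQL Server export fails to do. The row values are illustrative, modeled on the sample lines above.

```ruby
require 'csv'

# Illustrative row; the third field contains both a comma and double quotes.
row = ['Electrical', '197135021E', 'SERVICE, "OUTLETS"', '1997-05-15 00:00:00']

# Array#to_csv (added by the csv library) quotes the field and doubles inner quotes.
encoded = row.to_csv
puts encoded
# Electrical,197135021E,"SERVICE, ""OUTLETS""",1997-05-15 00:00:00

# A conforming parser can therefore recover the original field exactly.
decoded = CSV.parse_line(encoded)
p decoded == row  # => true
```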

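Update 2

It is fine if the commas are removed from the offending fields; my parse_ugly method does not preserve them.

Update 3

I learned from the client that it is SQL Server 2008 exporting this strange CSV. This has been reported to Microsoft.

Update 4

@dbenhur's answer does exactly what I originally asked for, and it pointed out that I had neglected to show lines containing both commas and quotes. I will accept @dbenhur's answer, but I hope it can be improved to work on all of the lines above.

Hopefully the final update

This code works (and I think it is "semantically correct"):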

QUOTED = /"((?:[^"]|(?:""(?!")))*)"/
SEPQ = /,(?! )/
UNQUOTED = /([^,]*)/
SEPU = /,(?=(?:[^ ]|(?: +[^",]*,)))/
FIELD = /(?:#{QUOTED}#{SEPQ})|(?:#{UNQUOTED}#{SEPU})|\Z/

def parse_sql_server_2008_csv_line(line)
  line.scan(FIELD)[0...-1].map { |matches| (matches[0] || matches[1]).tr(',', ' ').gsub(/\s+/, ' ') }
end

Adapted from the answers by @dbenhur and @ghostdog74 in How can I process a CSV file with “bad commas”?

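Best answer

The following uses a regex and String#scan. I observe that in the broken CSV format you are dealing with, " only has quoting properties when it appears at both the beginning and the end of a field.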

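scan moves through the string, successively matching the regex, so the regex can assume that its starting match point is the beginning of a field. We construct the regex so that it matches either a balanced quoted field with no interior quotes (QUOTED) or a string of non-comma characters (UNQUOTED). Whichever alternative matches, it must be followed by a separator, which is either a comma or the end of the string (SEP).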
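A minimal sketch of how the alternation behaves at the start of a single field. It uses the same three patterns, held in lowercase local variables (an assumption made here only to keep the block self-contained); the sample strings are illustrative.

```ruby
# Same patterns as in the answer, under illustrative local names.
sep      = /(?:,|\Z)/
quoted   = /"([^"]*)"/
unquoted = /([^,]*)/
field    = /(?:#{quoted}|#{unquoted})#{sep}/

# A balanced quoted field followed by a separator is captured by group 1,
# with the surrounding quotes stripped.
balanced = field.match('"SERVICE, OUTLETS",1997-05-15 00:00:00').captures

# A closing quote that is not followed by a separator makes the quoted
# alternative fail, so the unquoted alternative takes the whole field,
# quotes included, in group 2.
fallback = field.match('"MORGAN SHOES" ALT,1997-05-13 00:00:00').captures

p balanced  # => ["SERVICE, OUTLETS", nil]
p fallback  # => [nil, "\"MORGAN SHOES\" ALT"]
```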

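Because UNQUOTED can match a zero-length field before a separator, the scan always matches an empty field at the end, which we discard with [0...-1]. scan produces an array of tuples; each tuple is an array of the capture groups, so we map over each element and pick the captured alternative with matches[0] || matches[1].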
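The raw result of scan makes the trailing empty match and the two capture groups visible. This is a small sketch on an illustrative three-field line, again with the patterns in local variables so the block runs on its own.

```ruby
# Same patterns as in the answer, under illustrative local names.
sep      = /(?:,|\Z)/
quoted   = /"([^"]*)"/
unquoted = /([^,]*)/
field    = /(?:#{quoted}|#{unquoted})#{sep}/

line = 'a,"b, c",d'

# Each tuple is [quoted_capture, unquoted_capture]; exactly one is non-nil.
raw = line.scan(field)
p raw
# => [[nil, "a"], ["b, c", nil], [nil, "d"], [nil, ""]]

# The last tuple is the zero-length match at the end of the string,
# so it is dropped before picking the non-nil capture from each tuple.
fields = raw[0...-1].map { |q, u| q || u }
p fields
# => ["a", "b, c", "d"]
```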

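None of your example lines shows a field that contains both a comma and a quote. I have no idea how such a field would legitimately be represented in this format, and this code probably will not recognize it correctly.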

SEP = /(?:,|\Z)/
QUOTED = /"([^"]*)"/
UNQUOTED = /([^,]*)/

FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/

def ugly_parse(line)
  line.scan(FIELD)[0...-1].map { |matches| matches[0] || matches[1] }
end

lines.each do |l|
  puts l
  puts ugly_parse(l).inspect
  puts
end

# Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00
# ["Electrical", "197135021E", "SERVICE, OUTLETS", "1997-05-15 00:00:00"]
#
# Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
# ["Plumbing", "196222006P", "REPLACE LEAD WATER SERVICE W/1\" COPPER", "1996-08-09 00:00:00"]
#
# Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
# ["Construction", "197133031B", "\"MORGAN SHOES\" ALT", "1997-05-13 00:00:00"]
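
The caveat about fields that contain both a comma and a quote can be reproduced with one of the lines from the question. This sketch uses the same patterns in local variables (illustrative names) and shows the field being split in two at the interior comma.

```ruby
# Same patterns as in the answer, under illustrative local names.
sep      = /(?:,|\Z)/
quoted   = /"([^"]*)"/
unquoted = /([^,]*)/
field    = /(?:#{quoted}|#{unquoted})#{sep}/

line = 'Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00'
fields = line.scan(field)[0...-1].map { |q, u| q || u }

# The quoted alternative fails because the first closing quote candidate is
# followed by O rather than a separator, so the unquoted alternative stops
# at the comma inside the field and yields five fields instead of four.
p fields.length  # => 5
p fields[2..3]   # => ["\"SERVICE", " \"OUTLETS\"\""]
```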

Regarding "ruby - What is the semantically correct way to parse CSV from SQL Server 2008?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/14573623/
