
ruby-on-rails - ruby : sanitize CSV with irregular fields

Reposted. Author: 太空宇宙. Updated: 2023-11-03 17:46:02

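I have a CSV file with very irregular entries. The first entry of each row has no quotes of its own, the whole line is wrapped in quotes, and every other field is wrapped in doubled double quotes, like this: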

# my_file.csv, opened with Sublime Text:

# Headers
"first_name,""last_name"",""username"",""phone_number"",""address"",""email_address"",""email_address_confirmed"",""joined_at"",""status"",""is_admin"",""accept_emails_from_admin"",""language"",""can_post_listings"""

# Sample entry
"Mr X,""Mr X"",""mrxxx"","""","""",""mr@mrx.com"",""true"",""2015-09-21 09:08:51 UTC"",""accepted"",""true"",""true"",""fr"",""true"""

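I could preprocess the file with something other than Ruby (Excel, a simple regex/replace, or whatever you can think of), but since I will probably have to do this several times, a Ruby solution would be great.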

Currently I am using

csv = File.open(csv_file_path)
CSV.parse(csv, :headers => true)

and I really don't know how to easily fix this discrepancy in the first entry of each row...

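The problem is that the CSV is not parsed correctly: each row is treated as a single string instead of an array with as many items as there are columns.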

# csv.headers : note this is an array with a single string
["first_name,\"last_name\",\"username\",\"phone_number\",\"address\",\"email_address\",\"email_address_confirmed\",\"joined_at\",\"status\",\"is_admin\",\"accept_emails_from_admin\",\"language\",\"can_post_listings\""]

# csv.to_a.last
["xxx,\"xxxx\",\"martin\",\"\",\"\",\"xxx@xxxx.com\",\"false\",\"2016-05-12 13:06:53 UTC\",\"pending_email_confirmation\",\"false\",\"true\",\"fr\",\"false\""]
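A minimal sketch of why the parser reports a single field, assuming rows shaped like the samples above: a line that is wrapped entirely in quotes, with every inner quote doubled, is one valid quoted CSV field. The three-column `line` below is a shortened, made-up sample in the same shape, not a row from the real file.

```ruby
require 'csv'

# Shortened sample in the same shape as the file's rows (three columns only).
line = '"Mr X,""Mr X"",""mrxxx"""'

# CSV.parse_line parses a single line into an array of fields.
fields = CSV.parse_line(line)

puts fields.length  # how many fields the parser sees
puts fields.first   # the inner text, with doubled quotes collapsed
```

The single field that comes back is itself ordinary CSV text, which is why removing the outer layer is enough to make the row parseable.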

EDIT: I tried the following

processed = File.readlines(path).map do |row|
  row.strip                 # strip newlines
     .gsub(/^\"|\"$/, '')   # remove outer quotes
     .gsub(/\"\"/, '"')     # fix double quotes
end
CSV.parse(processed.join('\n'))

and I get CSV::MalformedCSVError: Missing or stray quote in line 1
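One likely cause of this error is the `join('\n')` call: in Ruby, a single-quoted `'\n'` is a backslash followed by the letter n, not a newline, so all rows end up on one line with stray characters after a closing quote. The sketch below uses two made-up rows (`rows` is illustrative, not data from the file) to compare both spellings; the exact error message may differ between versions of the csv library.

```ruby
require 'csv'

# Two illustrative rows, already unwrapped.
rows = ['a,"b"', 'c,"d"']

literal = rows.join('\n')  # backslash + "n": everything stays on one line
newline = rows.join("\n")  # a real line break between rows

literal_error = nil
begin
  CSV.parse(literal)
rescue CSV::MalformedCSVError => e
  literal_error = e        # raised because text follows the closing quote of "b"
end

parsed = CSV.parse(newline)

puts literal.length - newline.length  # the literal version is one character longer
puts literal_error.class
p parsed
```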

Sample output

# File.readlines(path).first
# => "\"first_name,\"\"last_name\"\",\"\"username\"\",\"\"phone_number\"\",\"\"address\"\",\"\"email_address\"\",\"\"email_address_confirmed\"\",\"\"joined_at\"\",\"\"status\"\",\"\"is_admin\"\",\"\"accept_emails_from_admin\"\",\"\"language\"\",\"\"can_post_listings\"\"\"\n"

# processed.first
# => "first_name,\"last_name\",\"username\",\"phone_number\",\"address\",\"email_address\",\"email_address_confirmed\",\"joined_at\",\"status\",\"is_admin\",\"accept_emails_from_admin\",\"language\",\"can_post_listings\""

EDIT 2

Oops, sometimes I have nested commas, and @Dave's answer seems to fail in those cases. There is this field

""45, street_addr - Place""

which contains a comma that is not a separator. The full entry:

"Mr x,""Mr xx"",""bbernelin"","""",""45, street_addr - Place"",""xxx@xxx.fr"",""true"",""2016-04-13 11:14:08 UTC"",""accepted"",""false"",""true"",""fr"",""true"""

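One way to keep such embedded commas intact, sketched here under the assumption that every line is itself valid CSV consisting of one quoted field, is to let the CSV library do the unwrapping: parse the line once to strip the outer layer, then parse the result again. The variable names are illustrative; the entry is the full one quoted in the question.

```ruby
require 'csv'

# The full entry from the question, including the address with a comma.
raw = '"Mr x,""Mr xx"",""bbernelin"","""",""45, street_addr - Place"",' \
      '""xxx@xxx.fr"",""true"",""2016-04-13 11:14:08 UTC"",""accepted"",' \
      '""false"",""true"",""fr"",""true"""'

inner  = CSV.parse_line(raw).first  # first pass removes the outer quoting
fields = CSV.parse_line(inner)      # second pass splits the real columns

puts fields.length
puts fields[4]  # the address column
```

Because the inner address field is still quoted after the first pass, the second pass treats its comma as data rather than as a separator.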
Best answer

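As far as I can tell, the whole line is wrapped in quotes, and then some fields are wrapped in doubled double quotes. Fixing that makes the CSV parser happy, so this seems to work: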

require 'csv'

processed = DATA.map do |row|
  row.strip                 # strip newlines
     .gsub(/^\"|\"$/, '')   # remove outer quotes
     .gsub(/\"\"/, '"')     # collapse doubled quotes
end

# Double quotes are required here: '\n' would be a literal backslash and "n"
CSV.parse(processed.join("\n"), headers: true) do |row|
  p row
end

__END__
"first_name,""last_name"",""username"",""phone_number"",""address"",""email_address"",""email_address_confirmed"",""joined_at"",""status"",""is_admin"",""accept_emails_from_admin"",""language"",""can_post_listings"""
"Mr X,""Mr X"",""mrxxx"","""","""",""mr@mrx.com"",""true"",""2015-09-21 09:08:51 UTC"",""accepted"",""true"",""true"",""fr"",""true"""

Result:

#<CSV::Row "first_name":"Mr X" "last_name":"Mr X" "username":"mrxxx"
"phone_number":"" "address":"" "email_address":"mr@mrx.com"
"email_address_confirmed":"true" "joined_at":"2015-09-21 09:08:51 UTC"
"status":"accepted" "is_admin":"true" "accept_emails_from_admin":"true"
"language":"fr" "can_post_listings":"true">

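Since the question mentions having to repeat this cleanup, the approach could be wrapped in a small reusable method. The sketch below is an assumption-laden variant, not the answerer's code: `parse_wrapped_csv` is a hypothetical name, it unwraps each line with `CSV.parse_line` instead of the regexes, and it assumes no field contains an embedded line break. The header and row are shortened to three columns for illustration.

```ruby
require 'csv'

# Hypothetical helper: takes any enumerable of raw lines, unwraps each one,
# and returns a CSV::Table keyed by the header row.
def parse_wrapped_csv(lines)
  unwrapped = lines.map do |line|
    CSV.parse_line(line.strip)&.first  # nil for blank lines
  end.compact
  CSV.parse(unwrapped.join("\n"), headers: true)
end

# Shortened header and row in the same shape as the file.
lines = [
  '"first_name,""last_name"",""username"""',
  '"Mr X,""Mr X"",""mrxxx"""'
]

table = parse_wrapped_csv(lines)
p table.headers
puts table[0]['username']
```

For a real file, the same method could be fed line by line, for example `parse_wrapped_csv(File.foreach(csv_file_path))`.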
Regarding ruby-on-rails - ruby : sanitize CSV with irregular fields, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37265190/
