
regex - Formatting space-separated data into CSV

Reposted. Author: 行者123. Updated: 2023-12-01 10:43:50

For quite a while now I have been trying to format space-separated data into a CSV structure.

Starting point

The initial data looks like this:

Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE    Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment   
Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath Sankirana Homeopathic Clinic Kalyan Nagar, Bangalore INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment
Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family Dental Center Electronics City, Bangalore INR 200 MON-SUN10:00AM-8:00PM Book Appointment

It contains a lot of whitespace and unnecessary information. The information is laid out like this:

Doctor's name | Degree | Years of experience | Specialization | Hospital name | Address | Fees | Schedule | and an unnecessary book appointment field.

I want to convert it into the following format:

Doctor's name,Specialization,Hospital name,Address,Fees,Schedule

So the current data should end up looking like this:

 Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist,SHAKTHI E.N.T CARE,Malleswaram,INR 250,MON-SAT7:00PM-9:00PM
Dr. Hema Sanath,Homeopath,Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250,MON-SAT10:00AM-2:00PM6:30PM-8:00PM
Dr. Hema Ahuja,Dentist,V2 E City Family Dental Center,Electronics City,INR 200,MON-SUN10:00AM-8:00PM

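So far I have managed to remove the Book Appointment field.

Problem

However, I am having trouble separating out the hospital name, because the spacing around it varies a lot. Is this doable?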

Edit

The output of cat -A file is as follows:

 Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE ^I Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment $
Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath Sankirana Homeopathic Clinic ^I Kalyan Nagar, Bangalore INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment $
Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family Dental Center ^I Electronics City, Bangalore INR 200 MON-SUN10:00AM-8:00PM Book Appointment

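Best answer

There is no straightforward way to separate the specialization from the hospital name, but under a few assumptions you may be able to do it with perl: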

perl -pe 's/^(\S+\s+\S+\s+\S+).+experience\s([^\t]+?)\s+(\b[A-Z0-9]{2}[^\t]+?|(?:(?!\b[A-Z0-9]{2})[^\t])*)\s+\t\s+([^,]+,).+?(INR.+?PM)\s+.*/\1,\2,\3,\4\5/' file

which gives:

Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist,SHAKTHI E.N.T CARE,Malleswaram,INR 250 MON-SAT7:00PM-9:00PM
Dr. Hema Sanath,Homeopath,Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM
Dr. Hema Ahuja,Dentist,V2 E City Family Dental Center,Electronics City,INR 200 MON-SUN10:00AM-8:00PM
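
Note that this output keeps the fee and the schedule in a single last field, whereas the target format in the question lists them as separate columns. A minimal post-processing sketch in Python (not part of the original answer), assuming the last field always starts with INR followed by digits; the function name is made up for illustration:

```python
import re

# Assumption: the last field looks like "INR <digits> <schedule>".
FEE_SCHEDULE = re.compile(r"^(INR\s+\d+)\s+(\S.*)$")

def split_last_field(csv_line):
    """Turn '...,INR 250 MON-SAT...' into '...,INR 250,MON-SAT...'."""
    head, sep, last = csv_line.rpartition(",")
    m = FEE_SCHEDULE.match(last)
    if not sep or m is None:
        return csv_line  # leave rows we do not recognise untouched
    return "{},{},{}".format(head, m.group(1), m.group(2))

print(split_last_field(
    "Dr. Hema Ahuja,Dentist,V2 E City Family Dental Center,"
    "Electronics City,INR 200 MON-SUN10:00AM-8:00PM"
))
```

Rows whose last field does not fit the assumed shape are returned unchanged, so they can be inspected by hand.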

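Since this is a Perl-style regular expression, you can paste it into regex101 and use its regex debugger to see how it works. The regex is fairly simple, but it looks daunting because it is made up of many parts.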
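
To make the individual parts easier to read, here is a sketch that ports the same pattern to Python's re module in verbose mode, one commented line per part. The group names, the function name and the SAMPLE list are illustrative and not from the original answer; the rows are the ones from the question, with the tab that cat -A shows as ^I written as \t.

```python
import re

# Verbose-mode copy of the pattern; the group names are illustrative only.
ROW = re.compile(r"""
    ^(?P<name>\S+\s+\S+\s+\S+)        # first three tokens: title plus two name words
    .+experience\s                    # skip the degrees and years of experience
    (?P<spec>[^\t]+?)\s+              # specialization, kept as short as possible
    (?P<hospital>
        \b[A-Z0-9]{2}[^\t]+?          # rule 1: word starting with two capitals or digits
      | (?:(?!\b[A-Z0-9]{2})[^\t])*   # rule 2: otherwise, everything up to the tab
    )
    \s+\t\s+                          # the tab that follows the hospital name
    (?P<area>[^,]+,)                  # locality, with its trailing comma
    .+?                               # skip the city
    (?P<fee_schedule>INR.+?PM)        # fee and schedule, ending at a PM before whitespace
    \s+.*                             # discard the rest (Book Appointment)
""", re.VERBOSE)

def to_csv_line(line):
    """Mirror the substitution \\1,\\2,\\3,\\4\\5; return None if the row does not match."""
    m = ROW.match(line)
    if m is None:
        return None
    return "{},{},{},{}{}".format(
        m.group("name"), m.group("spec"), m.group("hospital"),
        m.group("area"), m.group("fee_schedule"))

SAMPLE = [
    "Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE \t Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment ",
    "Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath Sankirana Homeopathic Clinic \t Kalyan Nagar, Bangalore INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment ",
    "Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family Dental Center \t Electronics City, Bangalore INR 200 MON-SUN10:00AM-8:00PM Book Appointment",
]

for row in SAMPLE:
    print(to_csv_line(row))
```

As in the one-liner, a row only matches if its schedule ends in PM and the hospital name is followed by a tab; rows that do not match come back as None here instead of being passed through unchanged.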

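Caveat: the above separates the specialization based on two rules:

  1. It looks for the first occurrence of a space followed by two uppercase characters or digits and, if found, starts matching the hospital name there; or
  2. If there are no two consecutive uppercase characters or digits, it takes only the first word as the specialization and the rest as the hospital name.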

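I know it may not solve the whole problem, since there will always be lines that do not fit the rules above, but it should get you started on cleaning them up. Wherever the split is wrong (i.e. when the specialization consists of more than one word and the hospital name has no two consecutive uppercase characters/digits), one word of the specialization will be placed correctly and the rest will end up in the hospital name.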
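
The two rules and the failure mode just described can be tried in isolation. The sketch below is a Python port of only the specialization/hospital part of the pattern, anchored at the end of the string instead of at the tab. The function names are made up, and the "General Physician Apollo Clinic" input is a hypothetical row invented to show the misplacement, not data from the question.

```python
import re

# Only the specialization/hospital part of the pattern, ending at $ rather than the tab.
SPLIT = re.compile(
    r"^([^\t]+?)\s+(\b[A-Z0-9]{2}[^\t]+?|(?:(?!\b[A-Z0-9]{2})[^\t])*)$"
)

def split_spec_hospital(text):
    """Return (specialization, hospital) or None if the text cannot be split."""
    m = SPLIT.match(text)
    return (m.group(1), m.group(2)) if m else None

def used_fallback(hospital):
    """True when rule 2 fired, i.e. the hospital does not start with two capitals/digits."""
    return re.match(r"[A-Z0-9]{2}", hospital) is None

for text in [
    "Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE",  # rule 1
    "Homeopath Sankirana Homeopathic Clinic",               # rule 2, correct
    "General Physician Apollo Clinic",                      # rule 2, hypothetical misfire
]:
    spec, hospital = split_spec_hospital(text)
    print(spec, "|", hospital, "| review" if used_fallback(hospital) else "")
```

Flagging every row where the fallback rule fired is one possible way to build a list for manual review; it also flags rows that were split correctly, such as the Homeopath one, so it narrows the search rather than detecting errors.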

Regarding regex - Formatting space-separated data into CSV, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21624973/
