gpt4 book ai didi

bash - 如何用空格分隔 "sentence"中的单词?

转载 作者:行者123 更新时间:2023-11-29 08:50:20 25 4
gpt4 key购买 nike

背景

希望在 JasperServer 中自动创建域。域是用于创建临时报告的数据“ View ”。列的名称必须以人类可读的方式呈现给用户。

问题

从理论上讲,组织可能希望将超过 2,000 条数据包含在报告中。数据来自非人类友好的名称,例如:

payperiodmatchcode labordistributioncodedesc dependentrelationship actionendoption actionendoptiondesc addresstype addresstypedesc historytype psaddresstype rolename bankaccountstatus bankaccountstatusdesc bankaccounttype bankaccounttypedesc beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass beneficiaryclass beneficiaryclassdesc benefitactioncode benefitactioncodedesc benefitagecontrol benefitagecontroldesc ageconrolagelimit ageconrolnoticeperiod

问题

您将如何自动将此类名称更改为:

  • 支付期匹配代码
  • 劳动力分配代码描述
  • 依赖关系

想法

  • 使用谷歌的 Did you mean引擎,但我认为它违反了他们的服务条款:

    lynx -dump «url» | grep “你是说” | awk ...

语言

任何语言都可以,但 Perl 等文本解析器可能更适合。 (列名只有英文。)

不必要的完善

目标不是 100% 完美地分解单词;以下结果是可以接受的:

  • enrollmenteffectivedate -> 注册生效日期
  • enrollmentenddate -> 登记男性倾向日期
  • enrollmentrequirementset -> 入学要求集

无论如何,人类都需要仔细检查结果并纠正许多错误。将一组 2,000 个结果缩减为 600 个编辑将大大节省时间。专注于具有多种可能性(例如,治疗师姓名)的一些案例完全没有捕获要点。

最佳答案

有时,bruteforcing是可以接受的:

#!/usr/bin/perl

use strict; use warnings;
use File::Slurp;

my $dict_file = '/usr/share/dict/words';

my @identifiers = qw(
payperiodmatchcode labordistributioncodedesc dependentrelationship
actionendoption actionendoptiondesc addresstype addresstypedesc
historytype psaddresstype rolename bankaccountstatus
bankaccountstatusdesc bankaccounttype bankaccounttypedesc
beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass
beneficiaryclass beneficiaryclassdesc benefitactioncode
benefitactioncodedesc benefitagecontrol benefitagecontroldesc
ageconrolagelimit ageconrolnoticeperiod
);

my @mydict = qw( desc );

my $pat = join('|',
map quotemeta,
sort { length $b <=> length $a || $a cmp $b }
grep { 2 < length }
(@mydict, map { chomp; $_ } read_file $dict_file)
);

my $re = qr/$pat/;

for my $identifier ( @identifiers ) {
my @stack;
print "$identifier : ";
while ( $identifier =~ s/($re)\z// ) {
unshift @stack, $1;
}
# mark suspicious cases
unshift @stack, '*', $identifier if length $identifier;
print "@stack\n";
}

输出:

payperiodmatchcode : pay period match codelabordistributioncodedesc : labor distribution code descdependentrelationship : dependent relationshipactionendoption : action end optionactionendoptiondesc : action end option descaddresstype : address typeaddresstypedesc : address type deschistorytype : history typepsaddresstype : * ps address typerolename : role namebankaccountstatus : bank account statusbankaccountstatusdesc : bank account status descbankaccounttype : bank account typebankaccounttypedesc : bank account type descbeneficiaryamount : beneficiary amountbeneficiaryclass : beneficiary classbeneficiarypercent : beneficiary percentbenefitsubclass : benefit subclassbeneficiaryclass : beneficiary classbeneficiaryclassdesc : beneficiary class descbenefitactioncode : benefit action codebenefitactioncodedesc : benefit action code descbenefitagecontrol : benefit age controlbenefitagecontroldesc : benefit age control descageconrolagelimit : * ageconrol age limitageconrolnoticeperiod : * ageconrol notice period

另见 A Spellchecker Used to Be a Major Feat of Software Engineering .

关于bash - 如何用空格分隔 "sentence"中的单词?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/3856630/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com