
Java regex: extract a group from text up to, but not including, a specific character sequence (works like a backward match)

Reposted. Author: 行者123. Updated: 2023-12-01 07:05:55

I have read similar questions, but none of their solutions solved my problem. I am having trouble extracting a group from the following string:

    String str = "/a> ref|NP_010829.1| Irc4p [Saccharomyces cerevisiae S288c] >gi|74676333|sp|Q03036.1|IRC4_YEAST  RecName: Full=Uncharacterized protein IRC4;  AltName: Full=Increased recombination centers protein 4 >gi|1165295|gb|AAB64982.1|  Ydr540cp [Saccharomyces cerevisiae]  >gi|51012753|gb|AAT92670.1| YDR540C [Saccharomyces cerevisiae]  >gi|151942499|gb|EDN60855.1| conserved protein [Saccharomyces  cerevisiae YJM789] >gi|190404545|gb|EDV07812.1|  conserved hypothetical protein [Saccharomyces cerevisiae  RM11-1a] >gi|259145774|emb|CAY79038.1| Irc4p [Saccharomyces  cerevisiae EC1118] >gi|285811545|tpg|DAA12369.1| TPA:  Irc4p [Saccharomyces cerevisiae S288c] >gi|323309617|gb|EGA62826.1|  Irc4p [Saccharomyces cerevisiae FostersO] >gi|323338091|gb|EGA79326.1|  Irc4p [Saccharomyces cerevisiae Vin13]  >gi|365766295|gb|EHN07794.1| Irc4p [Saccharomyces cerevisiae  x Saccharomyces kudriavzevii VIN7] >gi|392300658|gb|EIW11749.1|  Irc4p [Saccharomyces cerevisiae CEN.PK113-7D]  >gi|584366859|gb|EWG86852.1| Irc4p [Saccharomyces cerevisiae  R008] >gi|584372222|gb|EWG92158.1| Irc4p [Saccharomyces  cerevisiae P301] >gi|584376691|gb|EWG96547.1| Irc4p  [Saccharomyces cerevisiae R103] >gi|584477456|gb|EWH19199.1|  Irc4p [Saccharomyces cerevisiae P283]";

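What I want is to parse the string and capture a group containing any characters up to the first occurrence of ">", producing the following string: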

result = "/a> ref|NP_010829.1| Irc4p [Saccharomyces cerevisiae S288c]";

I tried the following regex pattern with the replaceAll(regex, replacement) method:

str = str.replaceAll("^(.+)>.+", "$1");

where "^(.+)>.+" should match any characters up to the first occurrence of ">", but the group "^(.+)" instead runs all the way to the last occurrence of ">".

So the result is:

from: "/a> ref|NP_010829.1| Irc4p [Saccharomyces cerevisiae S288c] >gi|74676333|sp|Q03036.1|IRC4_YEAST  RecName: Full=Uncharacterized protein IRC4;  AltName: Full=Increased recombination centers protein 4 >gi|1165295|gb|AAB64982.1|  Ydr540cp [Saccharomyces cerevisiae]  >gi|51012753|gb|AAT92670.1| YDR540C [Saccharomyces cerevisiae]  >gi|151942499|gb|EDN60855.1| conserved protein [Saccharomyces  cerevisiae YJM789] >gi|190404545|gb|EDV07812.1|  conserved hypothetical protein [Saccharomyces cerevisiae  RM11-1a] >gi|259145774|emb|CAY79038.1| Irc4p [Saccharomyces  cerevisiae EC1118] >gi|285811545|tpg|DAA12369.1| TPA:  Irc4p [Saccharomyces cerevisiae S288c] >gi|323309617|gb|EGA62826.1|  Irc4p [Saccharomyces cerevisiae FostersO] >gi|323338091|gb|EGA79326.1|  Irc4p [Saccharomyces cerevisiae Vin13]  >gi|365766295|gb|EHN07794.1| Irc4p [Saccharomyces cerevisiae  x Saccharomyces kudriavzevii VIN7] >gi|392300658|gb|EIW11749.1|  Irc4p [Saccharomyces cerevisiae CEN.PK113-7D]  >gi|584366859|gb|EWG86852.1| Irc4p [Saccharomyces cerevisiae  R008] >gi|584372222|gb|EWG92158.1| Irc4p [Saccharomyces  cerevisiae P301] >gi|584376691|gb|EWG96547.1| Irc4p  [Saccharomyces cerevisiae R103] >gi|584477456|gb|EWH19199.1|  Irc4p [Saccharomyces cerevisiae P283]";
to: "/a> ref|NP_010829.1| Irc4p [Saccharomyces cerevisiae S288c] >gi|74676333|sp|Q03036.1|IRC4_YEAST RecName: Full=Uncharacterized protein IRC4; AltName: Full=Increased recombination centers protein 4 >gi|1165295|gb|AAB64982.1| Ydr540cp [Saccharomyces cerevisiae] >gi|51012753|gb|AAT92670.1| YDR540C [Saccharomyces cerevisiae] >gi|151942499|gb|EDN60855.1| conserved protein [Saccharomyces cerevisiae YJM789] >gi|190404545|gb|EDV07812.1| conserved hypothetical protein [Saccharomyces cerevisiae RM11-1a] >gi|259145774|emb|CAY79038.1| Irc4p [Saccharomyces cerevisiae EC1118] >gi|285811545|tpg|DAA12369.1| TPA: Irc4p [Saccharomyces cerevisiae S288c] >gi|323309617|gb|EGA62826.1| Irc4p [Saccharomyces cerevisiae FostersO] >gi|323338091|gb|EGA79326.1| Irc4p [Saccharomyces cerevisiae Vin13] >gi|365766295|gb|EHN07794.1| Irc4p [Saccharomyces cerevisiae x Saccharomyces kudriavzevii VIN7] >gi|392300658|gb|EIW11749.1| Irc4p [Saccharomyces cerevisiae CEN.PK113-7D] >gi|584366859|gb|EWG86852.1| Irc4p [Saccharomyces cerevisiae R008] >gi|584372222|gb|EWG92158.1| Irc4p [Saccharomyces cerevisiae P301] >gi|584376691|gb|EWG96547.1| Irc4p [Saccharomyces cerevisiae R103]";

To get my result, I would have to loop, checking str.contains(">") and then calling str.replaceAll("^(.+)>.+", "$1") to strip off one trailing sequence at a time, like a backward match.

Best answer

The problem is that the .+ in your regex

^(.+)>.+

is greedy, which means (as you have discovered) that it greedily consumes every instance of > except the last one. Changing this to reluctant

^(.+?)>.+

is what you want: it reluctantly captures only up to the first >.
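As a rough sketch of the reluctant pattern applied through replaceAll: note that the string exactly as posted in the question begins with "/a>", so its first ">" is the one inside "/a>" and the reluctant pattern stops right there. If the intended cut point is the first ">gi|" record separator (an assumption on my part, not something the question states), the delimiter can be spelled out in the pattern. The class name, method names, and the shortened sample string below are made up for illustration.

```java
class FirstRecord {
    // The reluctant pattern from this answer: keep only what precedes the first '>'.
    static String beforeFirstGt(String s) {
        return s.replaceAll("^(.+?)>.+", "$1");
    }

    // Assumption: records are separated by ">gi|". Keep what precedes the
    // first separator and drop any whitespace directly in front of it.
    static String beforeFirstGi(String s) {
        return s.replaceAll("^(.+?)\\s*>gi\\|.*", "$1");
    }

    public static void main(String[] args) {
        // Shortened version of the question's string, keeping the leading "/a>".
        String s = "/a> ref|NP_010829.1| Irc4p [Saccharomyces cerevisiae S288c] "
                 + ">gi|74676333|sp|Q03036.1|IRC4_YEAST >gi|1165295|gb|AAB64982.1| Ydr540cp";

        System.out.println(beforeFirstGt(s)); // prints "/a"
        System.out.println(beforeFirstGi(s)); // prints "/a> ref|NP_010829.1| Irc4p [Saccharomyces cerevisiae S288c]"
    }
}
```

If the real input does not carry the leading "/a>", the plain reluctant pattern is enough on its own.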

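  • Greedy captures as many elements as possible while still allowing the overall regex to match.
  • Reluctant captures as few elements as possible while still allowing the overall regex to match.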
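To see the two behaviours side by side, here is a minimal sketch that runs both patterns through java.util.regex and prints what group 1 captured. The sample string is a shortened, made-up line in the same shape as the question's data, and the class and helper names are only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Returns group 1 of the first match, or null if the pattern does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample with two '>' separators.
        String sample = "ref|NP_010829.1| Irc4p >gi|74676333| IRC4_YEAST >gi|1165295| Ydr540cp";

        // Greedy: the group runs up to the LAST '>' that still lets ">.+" match.
        System.out.println("[" + firstGroup("^(.+)>.+", sample) + "]");
        // prints "[ref|NP_010829.1| Irc4p >gi|74676333| IRC4_YEAST ]"

        // Reluctant: the group stops at the FIRST '>'.
        System.out.println("[" + firstGroup("^(.+?)>.+", sample) + "]");
        // prints "[ref|NP_010829.1| Irc4p ]"
    }
}
```

Both groups keep the space in front of the ">", so call trim() on the result if that trailing space is unwanted.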

Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference.

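Regarding "Java regex: extract a group from text up to, but not including, a specific character sequence (works like a backward match)", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/24724299/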
