java - How to extract text from messy strings in Java?

Reposted by: 行者123  Updated: 2023-12-01 10:07:26

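I am reading a text file that contains movie titles, years, languages and so on, and I am struggling to capture these attributes.

Suppose some of the strings look like this: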

    String s = "\"A Fatal Inversion\" (1992)";
    String d = "(aka \"Verhngnisvolles Erbe\" (1992))    (Germany)";
    String f = "\"#Yaprava\" (2013) ";
    String g = "(aka \"Love Heritage\" (2002)) (International: English title)";

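How can I get the title, the year, the country (if one is given) and the kind of title (if one is given) out of these strings?

I am not very good with regular expressions and patterns, and I don't know how to tell which kind of attribute I am looking at when it is not labeled. I am doing this because I am trying to generate XML from the text file. I have a DTD for it, but I am not sure whether I need to use it in this case.

Edit: here is what I have tried.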

    // Text between double quotes, e.g. the title in "\"#Yaprava\" (2013) "
    Pattern p = Pattern.compile("\"([^\"]*)\"");
    // First run of digits, e.g. the year
    Pattern number = Pattern.compile("\\d+");

    Matcher m = p.matcher(s);
    Matcher num = number.matcher(s);

    if (m.find()) {
        System.out.println(m.group(1));
    }

    if (num.find()) {
        System.out.println(num.group(0));
    }

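Best answer

I suggest you extract the year first, since its format looks fairly consistent. Then I would extract the country (if present) and assume that whatever remains is the title.

To extract the country, I suggest hard-coding a regex pattern with the names of the known countries. It may take a few iterations to work out what they are, because they appear to be very inconsistent.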
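One way to keep a hard-coded country pattern manageable is to build it from a list of names instead of editing the regex by hand. The following is only a sketch: the class `CountryPatternSketch`, its methods, and the sample list of names are assumptions made for illustration, and `List.of` requires Java 9 or later.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original answer.
class CountryPatternSketch {

    // Assumed sample of labels; extend it as new ones turn up in the data.
    static final List<String> KNOWN_COUNTRIES = List.of("Germany", "France", "USA");

    // Builds a pattern equivalent to \((Germany|France|USA)\).
    // Pattern.quote makes every name match literally, even if it
    // contains regex metacharacters such as '.' or ':'.
    static Pattern buildPattern(List<String> names) {
        String alternation = names.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\((" + alternation + ")\\)");
    }

    // Returns the first known country found in parentheses, or "" if none.
    static String findCountry(String s) {
        Matcher m = buildPattern(KNOWN_COUNTRIES).matcher(s);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // prints "Germany"
        System.out.println(findCountry("(aka \"Verhngnisvolles Erbe\" (1992))    (Germany)"));
        // prints "true": no known country in this string
        System.out.println(findCountry("\"#Yaprava\" (2013) ").isEmpty());
    }
}
```

Because every name is quoted, this variant only matches exact labels. A label with a variable tail, such as "International: English title", would still need its own hand-written alternative.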

This code is a bit ugly (but so is the data!):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Extraction {
    public final String original;
    public String year = "";
    public String title = "";
    public String country = "";

    private String remaining;

    public Extraction(String s) {
        this.original = s;
        this.remaining = s;
        extractBracketedYear();
        extractBracketedCountry();
        this.title = remaining;
    }

    private void extractBracketedYear() {
        Matcher matcher = Pattern.compile(" ?\\(([0-9]+)\\) ?").matcher(remaining);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            this.year = matcher.group(1);
            matcher.appendReplacement(sb, "");
        }
        matcher.appendTail(sb);
        remaining = sb.toString();
    }

    private void extractBracketedCountry() {
        Matcher matcher = Pattern.compile("\\((Germany|International: English.*?)\\)").matcher(remaining);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            this.country = matcher.group(1);
            matcher.appendReplacement(sb, "");
        }
        matcher.appendTail(sb);
        remaining = sb.toString();
    }

    public static void main(String... args) {

        for (String s : new String[] {
                "A Fatal Inversion (1992)",
                "(aka \"Verhngnisvolles Erbe\" (1992))    (Germany)",
                "\"#Yaprava\" (2013) ",
                "(aka \"Love Heritage\" (2002)) (International: English title)"}) {

            Extraction extraction = new Extraction(s);
            System.out.println("title   = " + extraction.title);
            System.out.println("country = " + extraction.country);
            System.out.println("year    = " + extraction.year);
            System.out.println();
        }
    }

}

Output:

title   = A Fatal Inversion
country = 
year    = 1992

title   = (aka "Verhngnisvolles Erbe")    
country = Germany
year    = 1992

title   = "#Yaprava"
country = 
year    = 2013

title   = (aka "Love Heritage") 
country = International: English title
year    = 2002

Once you have this data, you can process it further (for example "International: English title" -> "England").
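Such a follow-up step could look roughly like the sketch below. It takes the raw `title` and `country` values shown in the output above, maps a raw country label to a normalized one, and strips the `(aka ...)` wrapper and the surrounding quotes from the title. The class `PostProcessSketch`, its methods, and the alias table are assumptions made for illustration; the single mapping is the one suggested above. `Map.of` requires Java 9 or later.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper, not part of the original answer.
class PostProcessSketch {

    // Assumed alias table; only this one mapping is suggested by the answer.
    static final Map<String, String> COUNTRY_ALIASES =
            Map.of("International: English title", "England");

    // Matches a title that is wrapped as "(aka ...)".
    private static final Pattern AKA = Pattern.compile("^\\(aka\\s+(.*)\\)$");

    // Returns the mapped label, or the raw label if no alias is known.
    static String normalizeCountry(String raw) {
        return COUNTRY_ALIASES.getOrDefault(raw, raw);
    }

    // True if the raw title is an alternate ("aka") title.
    static boolean isAlternateTitle(String rawTitle) {
        return AKA.matcher(rawTitle.trim()).matches();
    }

    // Trims whitespace, unwraps "(aka ...)" and drops one pair of quotes.
    static String cleanTitle(String rawTitle) {
        String t = rawTitle.trim();
        Matcher m = AKA.matcher(t);
        if (m.matches()) {
            t = m.group(1).trim();
        }
        if (t.length() >= 2 && t.startsWith("\"") && t.endsWith("\"")) {
            t = t.substring(1, t.length() - 1);
        }
        return t;
    }

    public static void main(String[] args) {
        // prints "Love Heritage"
        System.out.println(cleanTitle("(aka \"Love Heritage\") "));
        // prints "true"
        System.out.println(isAlternateTitle("(aka \"Love Heritage\") "));
        // prints "England"
        System.out.println(normalizeCountry("International: English title"));
    }
}
```

The `isAlternateTitle` check is one possible way to recover the kind of title the question asks about, on the assumption that an `(aka ...)` wrapper always marks an alternate title.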

Regarding "java - How to extract text from messy strings in Java?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36349106/
