java - Extracting n-grams from a pattern

Reposted. Author: 行者123. Updated: 2023-11-29 09:12:36

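I am trying to extract n-grams from patterns that were themselves extracted from text documents. The patterns contain varying numbers of terms.

For example, given the pattern p = {t1, t2, t3}

and n-grams of up to 3 terms,

the result should look like this: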

t1
t2
t3

t1, t2
t2,t3

t1,t2,t3

I wrote some code, but it does not work properly.

public Hashtable<String, Pattern> findGrams(XMLDocument d) {
    ArrayList<Pattern> patterns = d.getPatterns();

    System.out.println("patterns " + d.getPatterns());

    ArrayList terms = new ArrayList();
    Hashtable Grams = new Hashtable();

    String s = "";

    // to extract all terms from the pattern
    for (int i = 0; i < patterns.size(); i++) {
        Pattern pat = (Pattern) patterns.get(i);
        terms.clear();
        for (int z = 0; z < pat.getNumitems(); z++) {
            terms.add(pat.getItems().get(z).toString());
        }

        // to generate grams from the pattern
        int j = 0;
        int x = 0;
        for (int y = 1; y <= ngram; y++) {

            for (x = 0; x < terms.size() & j != y; x++) {
                s = terms.get(x).toString();

                if (y > 1) {
                    for (j = x + 1; j < terms.size() & terms.indexOf(j) < ngram; j++) {
                        s = s + "," + terms.get(j).toString();
                    }
                }

                if (!Grams.contains(s)) {
                    System.out.println(s);
                    Grams.put(s, i);
                }
            }

        }
    }
    return (Grams);
}

Please help.

Best answer

I hope this does what you need.

import java.util.*;

public class Test {

    // Returns every run of n consecutive words in str.
    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i + n));
        return ngrams;
    }

    // Joins words[start] .. words[end - 1] with single spaces.
    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Print the 1-grams, 2-grams and 3-grams, one group per block.
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "t1 t2 t3"))
                System.out.println(ngram);
            System.out.println();
        }
    }
}
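
The answer above works on a space-separated string, whereas the question starts from a list of terms and wants comma-separated n-grams. A minimal sketch of one way to adapt it follows, assuming Java 8 or later and that the pattern's terms are already available as a List<String>. The names NGramSketch, ngrams and ngramsUpTo are hypothetical and not taken from the original post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical class name, used only for this illustration.
class NGramSketch {

    // Returns every contiguous run of exactly n terms, joined with commas.
    static List<String> ngrams(List<String> terms, int n) {
        List<String> result = new ArrayList<>();
        // The last valid start index is terms.size() - n.
        for (int start = 0; start + n <= terms.size(); start++) {
            result.add(String.join(",", terms.subList(start, start + n)));
        }
        return result;
    }

    // Collects the n-grams for n = 1 .. maxN, shortest first.
    static List<String> ngramsUpTo(List<String> terms, int maxN) {
        List<String> result = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            result.addAll(ngrams(terms, n));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("t1", "t2", "t3");
        // Prints: t1, t2, t3, t1,t2, t2,t3, t1,t2,t3 (one per line).
        for (String gram : ngramsUpTo(terms, 3)) {
            System.out.println(gram);
        }
    }
}
```

If each n-gram also needs to be mapped to the pattern it came from, as in the question's code, the duplicate check should probably use containsKey: Hashtable.contains tests the table's values, not its keys.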

A similar question to "java - Extracting n-grams from a pattern" can be found on Stack Overflow: https://stackoverflow.com/questions/11452290/
