
java - Finding collocation patterns in Java

Reposted. Author: 搜寻专家. Updated: 2023-11-01 03:09:48

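I am working on a project that needs collocations, and I wrote the following code to extract them. The code takes a string and returns a list of the collocation patterns found in that string. I use the Stanford POS tagger for tagging.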

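I would like your advice on the code: it seems slow when I process large amounts of text. Any suggestions for improving it would be greatly appreciated.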

import java.io.IOException;
import java.util.ArrayList;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

/**
 * A COLLOCATION is an expression consisting of two or more words that
 * correspond to some conventional way of saying things.
 *
 * This class uses the seven part-of-speech tag patterns for collocation
 * filtering that were suggested by Justeson and Katz (1995).
 * These patterns are:
 *
 * --------------------------------------------------
 * | Tag pattern | Example                          |
 * --------------------------------------------------
 * | AN          | linear function                  |
 * | NN          | regression coefficients          |
 * | AAN         | Gaussian random variable         |
 * | ANN         | cumulative distribution function |
 * | NAN         | mean squared error               |
 * | NNN         | class probability function       |
 * | NPN         | degrees of freedom               |
 * --------------------------------------------------
 * where A = adjective, P = preposition, and N = noun.
 *
 * The Stanford POS tagger is used for the extraction process.
 * See: http://nlp.stanford.edu/software/tagger.shtml#Download
 *
 * More on collocations: http://nlp.stanford.edu/fsnlp/promo/colloc.pdf
 * More on POS tags: http://acl.ldc.upenn.edu/J/J93/J93-2004.pdf
 */

public class GetCollocations {

    // Tags the text and returns every bigram or trigram whose tags match
    // one of the seven patterns listed above.
    public static ArrayList<String> GetCollocations(String text) throws IOException, ClassNotFoundException {
        // The tagger model is loaded from disk on every call.
        MaxentTagger tagger = new MaxentTagger("taggers/wsj-0-18-left3words.tagger");
        // Each token has the form word_TAG.
        String[] tagged = tagger.tagString(text).split("\\s+");

        ArrayList<String> collocations = new ArrayList<>();
        // Every pattern needs tagged[i + 1], so stop one token before the end.
        for (int i = 0; i < tagged.length - 1; i++) {
            // Trigram patterns also need tagged[i + 2].
            boolean hasThird = i + 2 < tagged.length;

            String pot = tagged[i].substring(tagged[i].indexOf("_") + 1);
            if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {

                pot = tagged[i + 1].substring(tagged[i + 1].indexOf("_") + 1);
                if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                    // N N
                    collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]));

                    if (hasThird) {
                        pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                        if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                            // N N N
                            collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                        }
                    }

                } else if (hasThird && (pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS"))) {
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        // N A N
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }

                } else if (hasThird && pot.equals("IN")) {
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        // N P N
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }
                }

            } else if (pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS")) {
                pot = tagged[i + 1].substring(tagged[i + 1].indexOf("_") + 1);
                if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                    // A N
                    collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]));

                    if (hasThird) {
                        pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                        if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                            // A N N
                            collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                        }
                    }

                } else if (hasThird && (pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS"))) {
                    pot = tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1);
                    if (pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS")) {
                        // A A N
                        collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
                    }
                }
            }
        }
        return collocations;
    }

    // Strips the _TAG suffix from a tagged token.
    public static String GetWordWithoutTag(String wordWithTag) {
        return wordWithTag.substring(0, wordWithTag.indexOf("_"));
    }
}

Best answer

If you are processing close to 15,000 words per second, your POS tagger is already at its limit. According to the Stanford POS tagger FAQ:

on a 2008 nothing-special Intel server, it tags about 15000 words per second
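
To see how close you are to that figure, it can help to time the tagging step on its own. The following is only a minimal sketch and not part of the original answer: `TaggerThroughput`, `countWords`, and `wordsPerSecond` are hypothetical names, and the `UnaryOperator<String>` stands in for a call such as `tagger::tagString` on a tagger that has already been constructed. The question's method builds a new `MaxentTagger` on every call, so constructing the tagger once, outside the timed region, keeps model loading out of the measurement.

```java
import java.util.function.UnaryOperator;

final class TaggerThroughput {

    /** Counts whitespace-separated tokens in the text. */
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Runs the tagger once over the text and returns words per second. */
    static double wordsPerSecond(UnaryOperator<String> tagger, String text) {
        long start = System.nanoTime();
        tagger.apply(text);
        // Guard against a zero elapsed time on very small inputs.
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        return countWords(text) * 1e9 / elapsedNanos;
    }

    public static void main(String[] args) {
        // Assumption: a fake tagger that appends _NN to every token.
        // Replace it with tagger::tagString on a tagger loaded once.
        UnaryOperator<String> fakeTagger = s -> s.replaceAll("(\\S+)", "$1_NN");

        StringBuilder sample = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sample.append("the quick brown fox jumps over the lazy dog ");
        }
        System.out.printf("%.0f words/second%n", wordsPerSecond(fakeTagger, sample.toString()));
    }
}
```

A single timed run on the JVM is noisy, so it is probably worth repeating the call a few times before trusting the number.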

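The rest of your algorithm looks fine, but if you really want to squeeze some more juice out of it, you could preallocate an array as a static class variable instead of using an ArrayList. Essentially you trade an upfront memory cost for not having to instantiate an ArrayList on every call or pay for growing it as elements are added (each add is amortized O(1), but a resize copies the whole backing array, which is O(n)).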
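
One way to get most of that benefit while keeping a list is to reuse a single presized buffer. This is a rough sketch under stated assumptions, not the code from the question: `ReusableResultBuffer` and `collectBigrams` are hypothetical names, and the method joins plain adjacent words only to show the allocation pattern.

```java
import java.util.ArrayList;
import java.util.List;

final class ReusableResultBuffer {

    // Allocated once; the initial capacity of 1024 is an arbitrary guess.
    private final ArrayList<String> buffer = new ArrayList<>(1024);

    /**
     * Refills the shared buffer with all adjacent word pairs.
     * clear() keeps the backing array, so no reallocation happens
     * unless an input needs more room than any earlier one.
     */
    List<String> collectBigrams(String[] words) {
        buffer.clear();
        buffer.ensureCapacity(words.length);
        for (int i = 0; i + 1 < words.length; i++) {
            buffer.add(words[i] + " " + words[i + 1]);
        }
        return buffer;
    }

    public static void main(String[] args) {
        ReusableResultBuffer results = new ReusableResultBuffer();
        System.out.println(results.collectBigrams(new String[] {"mean", "squared", "error"}));
        System.out.println(results.collectBigrams(new String[] {"linear", "function"}));
    }
}
```

The trade-off is that the returned list is overwritten by the next call and the buffer is not thread-safe, so callers that keep results around need to copy them.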

Also, just as a suggestion to improve readability, you could use a few private methods to check which part of speech the pot variable holds:

// True for the Penn Treebank noun tags.
private static boolean _isNoun(String pot) {
    return pot.equals("NN") || pot.equals("NNS") || pot.equals("NNP") || pot.equals("NNPS");
}

// True for the Penn Treebank adjective tags.
private static boolean _isAdjective(String pot) {
    return pot.equals("JJ") || pot.equals("JJR") || pot.equals("JJS");
}
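
The repeated `substring`/`indexOf` expressions can be pulled into helpers in the same way. The sketch below is one possible shape, with hypothetical names (`PosTags`, `tagOf`, `wordOf`); it keeps the tag lists in sets and splits on the last underscore, on the assumption that a word may itself contain an underscore while the tag does not.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class PosTags {

    private static final Set<String> NOUN_TAGS =
            new HashSet<>(Arrays.asList("NN", "NNS", "NNP", "NNPS"));
    private static final Set<String> ADJECTIVE_TAGS =
            new HashSet<>(Arrays.asList("JJ", "JJR", "JJS"));

    /** Returns the tag of a word_TAG token, or "" if there is no underscore. */
    static String tagOf(String token) {
        int cut = token.lastIndexOf('_');
        return cut < 0 ? "" : token.substring(cut + 1);
    }

    /** Returns the word of a word_TAG token, or the token itself if untagged. */
    static String wordOf(String token) {
        int cut = token.lastIndexOf('_');
        return cut < 0 ? token : token.substring(0, cut);
    }

    static boolean isNoun(String tag) {
        return NOUN_TAGS.contains(tag);
    }

    static boolean isAdjective(String tag) {
        return ADJECTIVE_TAGS.contains(tag);
    }

    public static void main(String[] args) {
        String token = "snake_case_NN";
        System.out.println(wordOf(token) + " -> " + tagOf(token) + ", noun: " + isNoun(tagOf(token)));
    }
}
```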

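Also, if I am not mistaken, you should be able to simplify what you are doing by combining some of the if statements. This will not really speed up your code, but it will make it nicer to work with. Please read it carefully; I only tried to simplify your logic to make my point. Keep in mind that the code below is untested: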

public static ArrayList<String> GetCollocations(String text) throws IOException, ClassNotFoundException {
    MaxentTagger tagger = new MaxentTagger("taggers/wsj-0-18-left3words.tagger");
    String[] tagged = tagger.tagString(text).split("\\s+");
    ArrayList<String> collocations = new ArrayList<>();

    // Every pattern needs tagged[i + 1], so stop one token before the end.
    for (int i = 0; i < tagged.length - 1; i++) {
        String first = tagged[i].substring(tagged[i].indexOf("_") + 1);

        if (_isNoun(first) || _isAdjective(first)) {
            String second = tagged[i + 1].substring(tagged[i + 1].indexOf("_") + 1);
            // Empty when there is no third token, so the trigram checks below fail.
            String third = i + 2 < tagged.length
                    ? tagged[i + 2].substring(tagged[i + 2].indexOf("_") + 1)
                    : "";

            if (_isNoun(second)) {
                // A N, N N
                collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]));
            }

            if ((_isNoun(second) || _isAdjective(second)) && _isNoun(third)) {
                // A A N, A N N, N A N, N N N
                collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
            } else if (_isNoun(first) && second.equals("IN") && _isNoun(third)) {
                // N P N
                collocations.add(GetWordWithoutTag(tagged[i]) + " " + GetWordWithoutTag(tagged[i + 1]) + " " + GetWordWithoutTag(tagged[i + 2]));
            }
        }
    }
    return collocations;
}
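
Taking the simplification one step further, the seven patterns can be stored as data instead of nested if statements. The following is a sketch of that idea, not something from the question or tested against the Stanford tagger: `PatternCollocations`, `classOf`, and `extract` are hypothetical names, and the input is assumed to be already tagged text in `word_TAG` form, so the tagger is left out entirely. Each tag is mapped to A, N, P, or x (anything else), and every window of two or three tokens is looked up in a set of allowed patterns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PatternCollocations {

    // The seven Justeson and Katz tag patterns from the question.
    private static final Set<String> PATTERNS =
            new HashSet<>(Arrays.asList("AN", "NN", "AAN", "ANN", "NAN", "NNN", "NPN"));

    /** Maps a Penn Treebank tag to A, N, P, or x for everything else. */
    static char classOf(String tag) {
        if (tag.startsWith("NN")) return 'N'; // NN, NNS, NNP, NNPS
        if (tag.startsWith("JJ")) return 'A'; // JJ, JJR, JJS
        if (tag.equals("IN")) return 'P';
        return 'x';
    }

    /** Extracts collocations from already tagged text such as "linear_JJ function_NN". */
    static List<String> extract(String taggedText) {
        List<String> result = new ArrayList<>();
        String trimmed = taggedText.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] tokens = trimmed.split("\\s+");
        String[] words = new String[tokens.length];
        StringBuilder classes = new StringBuilder(tokens.length);
        for (int i = 0; i < tokens.length; i++) {
            // Split on the last underscore; untagged tokens get class x.
            int cut = tokens[i].lastIndexOf('_');
            words[i] = cut < 0 ? tokens[i] : tokens[i].substring(0, cut);
            classes.append(cut < 0 ? 'x' : classOf(tokens[i].substring(cut + 1)));
        }
        for (int i = 0; i < tokens.length; i++) {
            // Try the bigram first, then the trigram, as the original loop does.
            for (int len = 2; len <= 3 && i + len <= tokens.length; len++) {
                if (PATTERNS.contains(classes.substring(i, i + len))) {
                    result.add(String.join(" ", Arrays.copyOfRange(words, i, i + len)));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [mean squared error, squared error]
        System.out.println(extract("the_DT mean_NN squared_JJ error_NN of_IN the_DT model_NN"));
    }
}
```

Adding or removing a pattern then only means editing the `PATTERNS` set; like the combined if statements, this is about readability rather than speed.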

Regarding java - Finding collocation patterns in Java, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/13186995/
