
Java program suddenly slows down while indexing a corpus for k-grams


There is a problem that is puzzling me. I am indexing a corpus of text files (17,000 files), and while doing this I am also storing all the k-grams (k-length parts of the words) of every word in a HashMap to be used later:

public void insert( String token ) {
    // For example, "car" should result in "^c", "ca", "ar" and "r$" for a 2-gram index.

    // Check if the token has already been seen. If it has, all the
    // k-grams for it have already been added.
    if (term2id.get(token) != null) {
        return;
    }

    id2term.put(++lastTermID, token);
    term2id.put(token, lastTermID);

    // Is the word long enough? For example, "a" can be bigrammed and trigrammed but not four-grammed.
    // K must be <= token.length + 2. For "ab", K must be <= 4.
    List<KGramPostingsEntry> postings = null;
    if (K > token.length() + 2) {
        return;
    } else if (K == token.length() + 2) {
        // Insert the one k-gram "^<String token>$" into the index.
        String kgram = "^" + token + "$";
        postings = index.get(kgram);
        SortedSet<String> kgrams = new TreeSet<String>();
        kgrams.add(kgram);
        term2KGrams.put(token, kgrams);
        if (postings == null) {
            KGramPostingsEntry newEntry = new KGramPostingsEntry(lastTermID);
            ArrayList<KGramPostingsEntry> newList = new ArrayList<KGramPostingsEntry>();
            newList.add(newEntry);
            index.put("^" + token + "$", newList);
        }
        // No need to do anything if the posting already exists, so no else clause.
        // There is only one possible term in this case.
        // Return since we are done.
        return;
    } else {
        // We get here if there is more than one k-gram in our term.
        // Insert all k-grams in the token into the index.
        int start = 0;
        int end = start + K;
        // Add ^ and $ to the token.
        String wrappedToken = "^" + token + "$";
        int noOfKGrams = wrappedToken.length() - end + 1;
        // Get the k-grams.
        String kGram;
        int startCurr, endCurr;
        SortedSet<String> kgrams = new TreeSet<String>();

        for (int i = 0; i < noOfKGrams; i++) {

            startCurr = start + i;
            endCurr = end + i;

            kGram = wrappedToken.substring(startCurr, endCurr);
            kgrams.add(kGram);

            postings = index.get(kGram);
            KGramPostingsEntry newEntry = new KGramPostingsEntry(lastTermID);
            // If this k-gram has been seen before:
            if (postings != null) {
                // Add this token to the existing postings list.
                // We can be sure that the list doesn't contain the token
                // already, else we would previously have terminated the
                // execution of this function.
                int lastTermInPostings = postings.get(postings.size() - 1).tokenID;
                if (lastTermID == lastTermInPostings) {
                    continue;
                }
                postings.add(newEntry);
                index.put(kGram, postings);
            }
            // If this k-gram has not been seen before:
            else {
                ArrayList<KGramPostingsEntry> newList = new ArrayList<KGramPostingsEntry>();
                newList.add(newEntry);
                index.put(kGram, newList);
            }
        }

        Clock c = Clock.systemDefaultZone();
        long timestart = c.millis();

        System.out.println(token);
        term2KGrams.put(token, kgrams);

        long timestop = c.millis();
        System.out.printf("time taken to put: %d\n", timestop - timestart);
        System.out.print("put ");
        System.out.println(kgrams);
        System.out.println();

    }

}

The insertion into the HashMap happens on the line term2KGrams.put(token, kgrams); (there are two of them in the code snippet). When indexing, everything seems to work fine until suddenly, at around 15,000 indexed files, things go bad. Everything slows down enormously, and the program simply doesn't finish within a reasonable time.

To try to understand this problem, I added some prints at the end of the function. This is the output they generate:

http://soccer.org
time taken to put: 0
put [.or, //s, /so, ://, ^ht, cce, cer, er., htt, occ, org, p:/, r.o, rg$, soc, tp:, ttp]

aysos
time taken to put: 0
put [^ay, ays, os$, sos, yso]

http://www.davisayso.org/contacts.htm
time taken to put: 0
put [.da, .ht, .or, //w, /co, /ww, ://, ^ht, act, avi, ays, con, cts, dav, g/c, htm, htt, isa, nta, o.o, ont, org, p:/, rg/, s.h, say, so., tac, tm$, tp:, ts., ttp, vis, w.d, ww., www, yso]

playsoccer
time taken to put: 0
put [^pl, ays, cce, cer, er$, lay, occ, pla, soc, yso]

This looks fine to me: the puts don't seem to be taking a long time, and the k-grams (trigrams in this case) are correct.

But one can see strange behaviour in the speed at which my computer prints this information. In the beginning, everything is printed at super-high speed. But at 15,000 files, that speed stops; instead, my computer starts printing a few lines at a time, which of course means that indexing the remaining 2,000 files of the corpus will take an eternity.

Another interesting thing I observed was when I did a keyboard interrupt (Ctrl+C) after it had been printing slowly and irregularly for a while, as described. It gave me this message:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.lang.StringLatin1.newString(StringLatin1.java:549)
sahandzarrinkoub@Sahands-MBP:~/Documents/Programming/Information Retrieval/lab3 2$ sh compile_all.sh
Note: ir/PersistentHashedIndex.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.

Does this mean I'm out of memory? Is that the issue? If so, that's surprising, because I've been storing quite a lot of things in memory before this, such as a HashMap containing the document IDs of every word in the corpus, a HashMap containing every word in which every k-gram appears, and so on.

Please let me know what you think and what I can do to fix this problem.

Best Answer

To understand this, you must first understand that Java does not allocate memory dynamically (or at least not indefinitely). By default, the JVM is configured to start with a minimum heap size and a maximum heap size. When some allocation would exceed the maximum heap size, you get an OutOfMemoryError.
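A minimal sketch of that behaviour (not the asker's code; the class name OomDemo is invented for this illustration): the program below just keeps allocating arrays it never releases, so running it with a small heap, e.g. java -Xmx64m OomDemo, reproduces the same "OutOfMemoryError: Java heap space" within seconds.

import java.util.ArrayList;
import java.util.List;

public class OomDemo {
    public static void main(String[] args) {
        // Hold a reference to every allocation so the garbage collector
        // can never reclaim anything.
        List<long[]> hoard = new ArrayList<>();
        while (true) {
            hoard.add(new long[1_000_000]); // roughly 8 MB per iteration
        }
    }
}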

You can change the minimum and maximum heap size of your execution with the VM parameters -Xms and -Xmx, respectively. An example for an execution with at least 2 GB but at most 4 GB of heap would be

java -Xms2g -Xmx4g ...
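If you want to verify from inside the program that these flags were picked up, a small sketch using the standard Runtime API (the class name HeapInfo is just illustrative):

public class HeapInfo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects -Xmx; totalMemory() is the heap currently
        // reserved by the JVM; freeMemory() is the unused part of that.
        System.out.printf("max heap:   %d MB%n", rt.maxMemory() / (1024 * 1024));
        System.out.printf("total heap: %d MB%n", rt.totalMemory() / (1024 * 1024));
        System.out.printf("free heap:  %d MB%n", rt.freeMemory() / (1024 * 1024));
    }
}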

You can find more options on the man page for java.

Before changing the heap memory, however, take a close look at your system resources, especially at whether your system starts swapping. If your system swaps, a larger heap size may let the program run longer, but with equally poor performance. The only options then would be to optimize your program to use less memory, or to upgrade your machine's RAM.
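One memory optimization that could apply to an index like the asker's, sketched here under the assumption that many k-grams repeat across tokens: canonicalize the k-gram strings, so that each distinct k-gram is stored as a single shared String instance rather than as a fresh substring copy per token. The names KGramCanonicalizer and canonical are invented for this sketch.

import java.util.HashMap;
import java.util.Map;

public class KGramCanonicalizer {
    // Maps each distinct k-gram to one shared String instance.
    private final Map<String, String> pool = new HashMap<>();

    public String canonical(String kGram) {
        // On first sight the k-gram is stored as its own value;
        // every later duplicate resolves to that same object.
        return pool.computeIfAbsent(kGram, s -> s);
    }
}

Passing each wrappedToken.substring(startCurr, endCurr) through canonical() before putting it into the index and the TreeSet would make duplicate k-grams share one String instead of each token allocating its own copies.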

Regarding the Java program suddenly slowing down while indexing a corpus for k-grams, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49958954/
