
java - Co-occurrence of words in a sentence

Reposted. Author: 搜寻专家. Updated: 2023-11-01 03:48:29

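I have a large set of sentences (10,000) in a file. The file contains one sentence per line. Across the whole set, I want to find out which words occur together in a sentence and how often they do.

Example sentences: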

"Proposal 201 has been accepted by the Chief today.", 
"Proposal 214 and 221 are accepted, as per recent Chief decision",
"This proposal has been accepted by the Chief.",
"Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
"Proposal 214, ValueMania, has been accepted by the Chief."};

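I want to produce the following output. I should be able to supply three starting words as arguments to the program: "Chief, accepted, Proposal"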

Chief accepted Proposal            5
Chief accepted Proposal has 3
Chief accepted Proposal has been 3

...
...
for all combinations.

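I know that the number of combinations could be huge.

I have searched online but could not find anything. I have written some code, but I cannot get my head around the problem. Perhaps someone who knows this domain can help.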

ReadFileLinesIntoArray rf = new ReadFileLinesIntoArray();

try {
    String[] tmp = rf.readFromFile("c:/scripts/SelectedSentences.txt");
    for (String t : tmp) {
        String[] keys = t.split(" ");
        String[] uniqueKeys;
        int count = 0;
        System.out.println(t);
        uniqueKeys = getUniqueKeys(keys);
        for (String key : uniqueKeys) {
            if (null == key) {
                break;
            }
            for (String s : keys) {
                if (key.equals(s)) {
                    count++;
                }
            }
            System.out.println("Count of [" + key + "] is : " + count);
            count = 0;
        }
    }
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}

// Returns the distinct words of keys in first-seen order; unused trailing slots stay null.
private static String[] getUniqueKeys(String[] keys) {
    String[] uniqueKeys = new String[keys.length];

    uniqueKeys[0] = keys[0];
    int uniqueKeyIndex = 1;
    boolean keyAlreadyExists = false;

    for (int i = 1; i < keys.length; i++) {
        // Only the first uniqueKeyIndex slots are filled, so stop before the null slot.
        for (int j = 0; j < uniqueKeyIndex; j++) {
            if (keys[i].equals(uniqueKeys[j])) {
                keyAlreadyExists = true;
            }
        }

        if (!keyAlreadyExists) {
            uniqueKeys[uniqueKeyIndex] = keys[i];
            uniqueKeyIndex++;
        }
        keyAlreadyExists = false;
    }
    return uniqueKeys;
}

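Can anyone help with the code?

Best answer

You can apply a standard information-retrieval data structure here, namely an inverted index. Here is how to do it.

Consider your original sentences. Number them with integer identifiers, like so: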

  1. "Proposal 201 has been accepted by the Chief today.",
  2. "Proposal 214 and 221 are accepted, as per recent Chief decision",
  3. "This proposal has been accepted by the Chief.",
  4. "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
  5. "Proposal 214, ValueMania, has been accepted by the Chief."

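For every pair of words you encounter in a sentence, add it to an inverted index that maps the pair to a set (a collection of unique items) of sentence identifiers. A sentence with N distinct words yields N-choose-2 pairs, that is, N(N-1)/2.

A suitable Java data structure is Map<String, Map<String, Set<Integer>>>. Order each pair alphabetically, so that the pair "has" and "Proposal" appears only as ("has", "Proposal") and never as ("Proposal", "has").

The map will contain entries such as the following: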

"has", "Proposal" --> Set(1, 5)
"accepted", "Proposal" --> Set(1, 2, 5)
"accepted", "has" --> Set(1, 3, 5)
etc.

For example, the word pair "has" and "Proposal" maps to the set (1, 5), meaning both words occur in sentences 1 and 5.
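The index construction described above can be sketched roughly as follows. This is a minimal illustration, not the answerer's code: the class name `PairIndex`, the helpers `tokenize`, `build` and `lookup`, and the tokenization rule (runs of letters or digits, case preserved) are all assumptions. Pairs are ordered case-insensitively so that "has" sorts before "Proposal", as in the entries listed above.

```java
import java.util.*;

class PairIndex {
    // Case-insensitive alphabetical order with a case-sensitive tie-break, so that
    // "has" sorts before "Proposal" while "Proposal" and "proposal" stay distinct.
    static final Comparator<String> ORDER =
        String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.<String>naturalOrder());

    // Assumption: a word is a run of letters or digits; punctuation is dropped, case is kept.
    static List<String> tokenize(String sentence) {
        List<String> words = new ArrayList<>();
        for (String w : sentence.split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) words.add(w);
        }
        return words;
    }

    // Maps each ordered word pair (first -> second) to the ids of the sentences containing both.
    static Map<String, Map<String, Set<Integer>>> build(List<String> sentences) {
        Map<String, Map<String, Set<Integer>>> index = new HashMap<>();
        for (int id = 1; id <= sentences.size(); id++) {
            // Deduplicate and sort the words so each pair is stored in one canonical order.
            TreeSet<String> unique = new TreeSet<>(ORDER);
            unique.addAll(tokenize(sentences.get(id - 1)));
            List<String> words = new ArrayList<>(unique);
            for (int i = 0; i < words.size(); i++) {
                for (int j = i + 1; j < words.size(); j++) {
                    index.computeIfAbsent(words.get(i), k -> new HashMap<>())
                         .computeIfAbsent(words.get(j), k -> new TreeSet<>())
                         .add(id);
                }
            }
        }
        return index;
    }

    // Looks up a pair regardless of the order in which the two words are given.
    static Set<Integer> lookup(Map<String, Map<String, Set<Integer>>> index, String a, String b) {
        boolean inOrder = ORDER.compare(a, b) <= 0;
        return index.getOrDefault(inOrder ? a : b, Collections.emptyMap())
                    .getOrDefault(inOrder ? b : a, Collections.emptySet());
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
            "Proposal 201 has been accepted by the Chief today.",
            "Proposal 214 and 221 are accepted, as per recent Chief decision",
            "This proposal has been accepted by the Chief.",
            "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
            "Proposal 214, ValueMania, has been accepted by the Chief.");
        Map<String, Map<String, Set<Integer>>> index = build(sentences);
        System.out.println(lookup(index, "has", "Proposal"));      // [1, 5]
        System.out.println(lookup(index, "accepted", "Proposal")); // [1, 2, 5]
        System.out.println(lookup(index, "accepted", "has"));      // [1, 3, 5]
    }
}
```

Because case is preserved in this sketch, "Proposal" and "proposal" are indexed as different words, which matches the sets shown above. Lowercasing tokens before indexing would be one way to make matching case-insensitive.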

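Now suppose you want to find the number of co-occurrences of the words in the list "accepted", "has", and "Proposal". Generate all pairs from this list and intersect their respective sets (using Java's Set.retainAll()). The result here ends up as the set (1, 5). Its size is 2, meaning that two sentences contain "accepted", "has", and "Proposal".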
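The intersection step might look like the sketch below. The class and method names (`CoOccurrenceQuery`, `sentencesContainingAll`) are hypothetical, and the index in `main` is filled by hand with only the three entries listed earlier rather than built from the sentences. Note that `retainAll` mutates the set it is called on, so the first pair's set is copied before intersecting.

```java
import java.util.*;

class CoOccurrenceQuery {
    // Same canonical pair order as assumed for the index: case-insensitive alphabetical,
    // with a case-sensitive tie-break.
    static final Comparator<String> ORDER =
        String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.<String>naturalOrder());

    // Intersects the sentence-id sets of every pair drawn from words.
    // With fewer than two words there are no pairs, so the result is empty.
    static Set<Integer> sentencesContainingAll(
            Map<String, Map<String, Set<Integer>>> index, List<String> words) {
        Set<Integer> result = null;
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                String a = words.get(i);
                String b = words.get(j);
                boolean inOrder = ORDER.compare(a, b) <= 0;
                Set<Integer> ids = index
                    .getOrDefault(inOrder ? a : b, Collections.emptyMap())
                    .getOrDefault(inOrder ? b : a, Collections.emptySet());
                if (result == null) {
                    result = new TreeSet<>(ids); // copy, so the index itself is never modified
                } else {
                    result.retainAll(ids);       // keep only ids present in both sets
                }
            }
        }
        if (result == null) {
            return new TreeSet<>();
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-filled with the three entries listed in the answer.
        Map<String, Map<String, Set<Integer>>> index = new HashMap<>();
        index.computeIfAbsent("has", k -> new HashMap<>())
             .put("Proposal", new TreeSet<>(Arrays.asList(1, 5)));
        index.computeIfAbsent("accepted", k -> new HashMap<>())
             .put("Proposal", new TreeSet<>(Arrays.asList(1, 2, 5)));
        index.computeIfAbsent("accepted", k -> new HashMap<>())
             .put("has", new TreeSet<>(Arrays.asList(1, 3, 5)));

        Set<Integer> ids = sentencesContainingAll(index, Arrays.asList("accepted", "has", "Proposal"));
        System.out.println(ids + " -> " + ids.size()); // [1, 5] -> 2
    }
}
```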

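To generate all pairs, simply iterate over the map. To generate all word tuples of size N, you will need to combine iteration with recursion.

Regarding java - Co-occurrence of words in a sentence, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35932539/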
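One possible shape for that recursion is sketched below, under several assumptions: the names (`TupleEnumerator`, `enumerate`, `extend`) are hypothetical, tokens are runs of letters or digits with case preserved, and the caller supplies at least two seed words plus a maximum tuple size. Each recursive step adds one more word to the tuple, intersects the running set of sentence ids with that word's pair sets, and abandons the branch as soon as the set becomes empty.

```java
import java.util.*;

class TupleEnumerator {
    static final Comparator<String> ORDER =
        String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.<String>naturalOrder());

    // Pair index: ordered word pair -> ids of the sentences that contain both words.
    static Map<String, Map<String, Set<Integer>>> build(List<String> sentences) {
        Map<String, Map<String, Set<Integer>>> index = new HashMap<>();
        for (int id = 1; id <= sentences.size(); id++) {
            TreeSet<String> unique = new TreeSet<>(ORDER);
            for (String w : sentences.get(id - 1).split("[^\\p{L}\\p{N}]+")) {
                if (!w.isEmpty()) unique.add(w);
            }
            List<String> words = new ArrayList<>(unique);
            for (int i = 0; i < words.size(); i++) {
                for (int j = i + 1; j < words.size(); j++) {
                    index.computeIfAbsent(words.get(i), k -> new HashMap<>())
                         .computeIfAbsent(words.get(j), k -> new TreeSet<>())
                         .add(id);
                }
            }
        }
        return index;
    }

    static Set<Integer> lookup(Map<String, Map<String, Set<Integer>>> index, String a, String b) {
        boolean inOrder = ORDER.compare(a, b) <= 0;
        return index.getOrDefault(inOrder ? a : b, Collections.emptyMap())
                    .getOrDefault(inOrder ? b : a, Collections.emptySet());
    }

    // Records every extension of tuple (up to maxSize words) together with its sentence count.
    static void extend(List<String> tuple, Set<Integer> ids, List<String> vocab, int from,
                       int maxSize, Map<String, Map<String, Set<Integer>>> index,
                       Map<List<String>, Integer> counts) {
        if (tuple.size() >= maxSize) return;
        for (int i = from; i < vocab.size(); i++) {
            String w = vocab.get(i);
            if (tuple.contains(w)) continue;
            Set<Integer> narrowed = new TreeSet<>(ids);
            for (String t : tuple) narrowed.retainAll(lookup(index, t, w));
            if (narrowed.isEmpty()) continue; // no sentence contains all of these words
            tuple.add(w);
            counts.put(new ArrayList<>(tuple), narrowed.size());
            extend(tuple, narrowed, vocab, i + 1, maxSize, index, counts);
            tuple.remove(tuple.size() - 1);   // backtrack before trying the next word
        }
    }

    // Returns tuple -> number of sentences, for the seed and all of its extensions.
    static Map<List<String>, Integer> enumerate(List<String> sentences, List<String> seed, int maxSize) {
        Map<String, Map<String, Set<Integer>>> index = build(sentences);
        SortedSet<String> vocab = new TreeSet<>(ORDER);
        for (Map.Entry<String, Map<String, Set<Integer>>> e : index.entrySet()) {
            vocab.add(e.getKey());
            vocab.addAll(e.getValue().keySet());
        }
        Set<Integer> ids = null; // sentences containing every seed word
        for (int i = 0; i < seed.size(); i++) {
            for (int j = i + 1; j < seed.size(); j++) {
                Set<Integer> pairIds = lookup(index, seed.get(i), seed.get(j));
                if (ids == null) ids = new TreeSet<>(pairIds); else ids.retainAll(pairIds);
            }
        }
        Map<List<String>, Integer> counts = new LinkedHashMap<>();
        if (ids == null || ids.isEmpty()) return counts; // needs two or more co-occurring seed words
        counts.put(new ArrayList<>(seed), ids.size());
        extend(new ArrayList<>(seed), ids, new ArrayList<>(vocab), 0, maxSize, index, counts);
        return counts;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
            "Proposal 201 has been accepted by the Chief today.",
            "Proposal 214 and 221 are accepted, as per recent Chief decision",
            "This proposal has been accepted by the Chief.",
            "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
            "Proposal 214, ValueMania, has been accepted by the Chief.");
        enumerate(sentences, Arrays.asList("Chief", "accepted", "Proposal"), 5)
            .forEach((tuple, n) -> System.out.println(String.join(" ", tuple) + " " + n));
    }
}
```

With case preserved, the seed "Chief accepted Proposal" is counted in 3 sentences rather than 5, because sentences 3 and 4 contain lowercase "proposal". Lowercasing tokens and query words would give case-insensitive counts. The number of tuples grows quickly with the maximum size, so the cap and the empty-set pruning matter on a 10,000-sentence file.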
