
php - Comparing a large number of texts (clustering) with a matrix

Reposted. Author: 行者123. Updated: 2023-12-04 02:27:28

I have the following PHP function to calculate the relation between two texts:

function check($terms_in_article1, $terms_in_article2) {
    $length1 = count($terms_in_article1); // number of words in article 1
    $length2 = count($terms_in_article2); // number of words in article 2
    if ($length1 == 0 || $length2 == 0) {
        return 0; // an empty article would otherwise cause a division by zero
    }
    // vocabulary: every distinct term that occurs in either article
    $all_terms = array_unique(array_merge($terms_in_article1, $terms_in_article2));
    // initialise both term-frequency vectors with zeros
    $term_vector1 = array();
    $term_vector2 = array();
    foreach ($all_terms as $term) {
        $term_vector1[$term] = 0;
        $term_vector2[$term] = 0;
    }
    // count how often each term occurs in each article
    foreach ($terms_in_article1 as $term) {
        $term_vector1[$term]++;
    }
    foreach ($terms_in_article2 as $term) {
        $term_vector2[$term]++;
    }
    // dot product of the two term-frequency vectors
    $score = 0;
    foreach ($all_terms as $term) {
        $score += $term_vector1[$term] * $term_vector2[$term];
    }
    $score = $score / ($length1 * $length2); // normalise by both article lengths
    $score *= 500; // for better readability
    return $score;
}

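The variable $terms_in_articleX must be an array containing every single word that appears in the text, one entry per occurrence.

Assuming I have a database of 20,000 texts, this function would take a very long time to run through all the pairs.

How can I speed this up? Should I add all the texts to one huge matrix instead of always comparing only two texts? It would be great if you had some code-based approaches, preferably in PHP.

I hope you can help me. Thanks in advance!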

Best answer

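You can split each text at the moment you add it. A simple example: preg_match_all('/\w+/', $text, $matches); Of course, real splitting is not that simple, but it is possible; just adjust the pattern :)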
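As a rough illustration of the splitting step, here is a minimal sketch built on the same pattern. The function name tokenize and the lowercasing step are assumptions added for the example, not part of the answer; the plain \w pattern is ASCII-oriented, so non-English text would need a different pattern.

```php
<?php
// Hypothetical helper (not from the original answer): split a text into
// lowercase word tokens, one array entry per occurrence.
function tokenize(string $text): array
{
    // \w+ matches runs of letters, digits and underscores.
    preg_match_all('/\w+/', strtolower($text), $matches);
    return $matches[0];
}

$terms = tokenize('The quick brown fox. The lazy dog!');
print_r($terms);
```

The resulting array has the shape that check() expects for $terms_in_articleX.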

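Create a table with the columns id (int, primary key, auto-increment) and value (varchar, unique), plus a link table with the columns word_id (int), text_id (int) and word_count (int). Then, after splitting each text, fill the tables with the new values.

After that you can do whatever you want with the data, and it will be fast because you are working with indexed integers (IDs) in the database.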
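One possible way to prepare the rows for the link table is sketched below. The helper name link_rows and the row layout are assumptions made for this example; looking up or inserting the integer ID for each word and running the actual INSERT statements is left to whatever database layer you use.

```php
<?php
// Hypothetical helper: turn one text's word list into link-table rows.
// Each row holds the word itself, the text id and the occurrence count;
// replacing the word by its integer id is left to the database layer.
function link_rows(int $textId, array $words): array
{
    $rows = [];
    // array_count_values() maps each distinct word to its number of occurrences.
    foreach (array_count_values($words) as $word => $count) {
        // Numeric words become integer keys, so cast back to string.
        $rows[] = ['word' => (string) $word, 'text_id' => $textId, 'word_count' => $count];
    }
    return $rows;
}

$rows = link_rows(1, ['php', 'matrix', 'php']);
print_r($rows);
```

Storing one row per distinct word with a count, instead of one row per occurrence, keeps the link table small and lets the database multiply counts directly.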

Update: here are the tables and queries:

CREATE TABLE `terms` (
    id int(11) NOT NULL auto_increment,
    value char(255) NOT NULL,
    PRIMARY KEY (`id`),
    UNIQUE KEY `value` (`value`)
);

CREATE TABLE `terms_in_articles` (
term int(11) NOT NULL,
article int(11) NOT NULL,
cnt int(11) NOT NULL default '1',
UNIQUE KEY `term` (`term`,`article`)
);


/* Returns all unique terms in both articles (your $all_terms);
   DISTINCT removes terms that occur in both articles */
SELECT DISTINCT t.id, t.value
FROM terms t, terms_in_articles a
WHERE a.term = t.id AND a.article IN (1, 2);

/* Returns your $term_vector1, $term_vector2 */
SELECT article, term, cnt
FROM terms_in_articles
WHERE article IN (1, 2) ORDER BY article;

/* Returns article and total count of term entries in it ($length1, $length2) */
SELECT article, SUM(cnt) AS total
FROM terms_in_articles
WHERE article IN (1, 2) GROUP BY article;

/* Returns your $score before normalisation; divide it by ($length1 * $length2)
   from the previous query. No GROUP BY is needed because the unique key on
   (term, article) allows at most one row per term and article. */
SELECT SUM(tmp.term_score) * 500 AS total_score FROM
(
    SELECT (a1.cnt * a2.cnt) AS term_score
    FROM terms_in_articles a1, terms_in_articles a2
    WHERE a1.article = 1 AND a2.article = 2 AND a1.term = a2.term
) AS tmp;

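I hope this helps. The last two queries are enough to perform your task; the other queries are there just in case. Of course, you can also compute further statistics, such as "most popular terms", and so on.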
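To show how the results of the last two queries fit together, here is a minimal sketch. The function name final_score and the array shapes are assumptions for the example; plain values stand in for the rows you would fetch from the database.

```php
<?php
// Hypothetical helper: combine the results of the last two queries.
// $totalScore is total_score from the last query (already multiplied by 500).
// $lengths maps article id => total from the SUM(cnt) query.
function final_score(float $totalScore, array $lengths, int $article1, int $article2): float
{
    // An article without rows in terms_in_articles has length 0.
    $denominator = ($lengths[$article1] ?? 0) * ($lengths[$article2] ?? 0);
    if ($denominator == 0) {
        return 0.0; // no terms, so treat the similarity as zero
    }
    return $totalScore / $denominator;
}

// Article 1 contains "a a b" (3 terms), article 2 contains "a b b c" (4 terms).
// Shared terms give 2*1 + 1*2 = 4, so total_score is 4 * 500 = 2000.
$score = final_score(2000.0, [1 => 3, 2 => 4], 1, 2);
echo $score, PHP_EOL;
```

This yields the same value that check() returns for the same two term lists, since both compute 500 times the dot product divided by the product of the lengths.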

Source: a similar question on Stack Overflow, "php - Comparing a large number of texts (clustering) with a matrix": https://stackoverflow.com/questions/901730/
