
php - Dividing the total number of documents by the number of documents containing a stem

Reposted. Author: 行者123. Updated: 2023-11-29 02:31:39

I have 2 tables:

tb_sentence :

================================
|id|doc_id|sentence_id|sentence|
================================
| 1| 1 | 0 | AB |
| 2| 1 | 1 | CD |
| 3| 2 | 0 | EF |
| 4| 2 | 1 | GH |
| 5| 2 | 2 | IJ |
| 6| 2 | 3 | KL |
================================

First, I count the number of sentences in each doc_id and store the counts in the variable $total_sentence. So the value of $total_sentence is Array ( [0] => 2 [1] => 4 )
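For reference, a minimal sketch of that counting step in plain PHP. The rows are hard-coded from the tb_sentence table instead of being fetched from MySQL, and the helper name count_sentences_per_doc is hypothetical. It keys the result by doc_id rather than by position, which is an assumption made here so that each count can later be matched to its document.

```php
<?php
// Rows copied from tb_sentence (doc_id, sentence_id); hard-coded for illustration.
$sentences = [
    ['doc_id' => 1, 'sentence_id' => 0],
    ['doc_id' => 1, 'sentence_id' => 1],
    ['doc_id' => 2, 'sentence_id' => 0],
    ['doc_id' => 2, 'sentence_id' => 1],
    ['doc_id' => 2, 'sentence_id' => 2],
    ['doc_id' => 2, 'sentence_id' => 3],
];

// Hypothetical helper: number of sentences per doc_id.
function count_sentences_per_doc(array $rows): array
{
    $totals = [];
    foreach ($rows as $row) {
        $docId = $row['doc_id'];
        $totals[$docId] = ($totals[$docId] ?? 0) + 1;
    }
    return $totals;
}

$total_sentence = count_sentences_per_doc($sentences);
print_r($total_sentence); // keys are doc_id values: [1 => 2, 2 => 4]
```

Keying by doc_id means a later lookup such as $total_sentence[2] returns the sentence count of document 2 directly.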

The second table is tb_stem:

============================
|id|stem|doc_id|sentence_id|
============================
|1 | B | 1 | 0 |
|2 | A | 1 | 1 |
|3 | C | 2 | 0 |
|4 | A | 2 | 1 |
|5 | E | 2 | 2 |
|6 | C | 2 | 3 |
|7 | D | 2 | 4 |
|8 | G | 2 | 5 |
|9 | A | 2 | 6 |
============================

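Second, I need to group the stem data within each doc_id, and then count the number of sentence_id values for each of the resulting stems ($token). The idea is to divide the total number of documents by the number of documents that contain the stem. Code: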

$query1 = mysql_query("SELECT DISTINCT(stem) AS unique FROM `tb_stem` group by stem,doc_id ");
while ($row = mysql_fetch_array($query1)) {
$token = $row['unique']; //the result $token must be : ABACDEG
}

$query2 = mysql_query("SELECT stem, COUNT( DISTINCT sentence_id ) AS ndw FROM `tb_stem` WHERE stem = '$token' GROUP BY stem, doc_id");
while ($row = mysql_fetch_array($query2)) {
$ndw = $row['ndw']; //the result must be : 1122111
}

$idf = log($total_sentence / $ndw)+1; //$total_sentence for doc_id = 1 must be divide $ndw with the doc_id = 2, etc

But the results are not separated by document, as the following table shows:

============================
|id|word|doc_id| ndw |idf |
============================
|1 | A | | | |
|2 | B | | | |
|3 | C | | | |
|4 | D | | | |
|5 | E | | | |
|6 | G | | | |
============================

The result should be:

============================
|id|word|doc_id| ndw |idf |
============================
|1 | A | 1 | | |
|2 | B | 1 | | |
|3 | A | 2 | | |
|4 | C | 2 | | |
|5 | D | 2 | | |
|6 | E | 2 | | |
|7 | G | 2 | | |
============================

Please help me, thank you :)

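The formula for idf is idf = log(N/df), where N is the number of documents and df is the number of documents in which the term (t) appears. Each sentence is treated as a document. Here is an example of the idf calculation:

Document: Do you read poetry while flying? Many people find it relaxing to read on long flights.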

=================================================
| Term | Document1(D1)| D2| df | idf |
=================================================
| find | 0 | 1 | 1 |log(2/1)|
| fly | 1 | 1 | 2 |log(2/2)|
| long | 0 | 1 | 1 |log(2/1)|
| people | 0 | 1 | 1 |log(2/1)|
| poetry | 1 | 0 | 1 |log(2/1)|
| read | 1 | 1 | 2 |log(2/2)|
| relax | 0 | 1 | 1 |log(2/1)|
=================================================
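The worked example can be reproduced with a short PHP sketch. The stem lists per sentence are taken from the table; the function names document_frequency and idf are hypothetical, and the natural logarithm (PHP's default for log()) is an assumption, since the formula does not state a base. The question's PHP adds 1 to the result, while this sketch follows the formula as written.

```php
<?php
// Each sentence is treated as one document; stems as listed in the table.
$docs = [
    'D1' => ['fly', 'poetry', 'read'],
    'D2' => ['find', 'fly', 'long', 'people', 'read', 'relax'],
];

// Hypothetical helper: df = number of documents in which each term appears.
function document_frequency(array $docs): array
{
    $df = [];
    foreach ($docs as $terms) {
        // array_unique so a term repeated inside one document counts once
        foreach (array_unique($terms) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }
    ksort($df);
    return $df;
}

// idf = log(N / df), natural logarithm assumed.
function idf(int $n, int $df): float
{
    return log($n / $df);
}

$n  = count($docs);               // N = 2
$df = document_frequency($docs);
foreach ($df as $term => $count) {
    printf("%-7s df=%d idf=%.4f\n", $term, $count, idf($n, $count));
}
```

Terms that occur in both sentences (fly, read) get log(2/2) = 0, and the remaining terms get log(2/1), matching the last column of the table.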

Best answer

This query will give you the table you are looking for:

SELECT t1.doc_id, t2.token AS word, t2.token_freq AS df,
       LOG(t1.docs / t2.token_freq) AS idf
FROM
  -- number of sentences in each document
  (SELECT doc_id, COUNT(sentence_id) AS docs FROM tb_sentence GROUP BY doc_id) AS t1
JOIN
  -- number of distinct sentences per document that contain each stem
  (SELECT stem AS token, doc_id, COUNT(DISTINCT sentence_id) AS token_freq
   FROM tb_stem GROUP BY doc_id, stem) AS t2
  ON t1.doc_id = t2.doc_id
ORDER BY t1.doc_id, t2.token;

Note: `unique`, used as an alias in the original query, is a reserved word in MySQL and will give you an error.
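To check the per-document grouping without a database, the same computation can be sketched in plain PHP over the sample rows from the question. This is an illustration under assumptions, not the answer's method: the rows are hard-coded, the function name idf_per_document is hypothetical, and df is taken as the number of distinct sentence_id values per (doc_id, stem), as in the question's own COUNT( DISTINCT sentence_id ).

```php
<?php
// Sentences per doc_id, from tb_sentence.
$sentenceCounts = [1 => 2, 2 => 4];

// Rows copied from tb_stem as [stem, doc_id, sentence_id].
$stems = [
    ['B', 1, 0], ['A', 1, 1], ['C', 2, 0],
    ['A', 2, 1], ['E', 2, 2], ['C', 2, 3],
    ['D', 2, 4], ['G', 2, 5], ['A', 2, 6],
];

// Hypothetical helper: one result row per (doc_id, stem), sorted by both.
function idf_per_document(array $stems, array $sentenceCounts): array
{
    // $seen[doc_id][stem][sentence_id] = true collects distinct sentences
    $seen = [];
    foreach ($stems as [$stem, $docId, $sentenceId]) {
        $seen[$docId][$stem][$sentenceId] = true;
    }
    ksort($seen);

    $rows = [];
    foreach ($seen as $docId => $byStem) {
        ksort($byStem);
        foreach ($byStem as $stem => $sentenceIds) {
            $df = count($sentenceIds);
            $rows[] = [
                'doc_id' => $docId,
                'word'   => $stem,
                'df'     => $df,
                // natural log assumed, matching PHP's log() default
                'idf'    => log($sentenceCounts[$docId] / $df),
            ];
        }
    }
    return $rows;
}

$rows = idf_per_document($stems, $sentenceCounts);
foreach ($rows as $r) {
    printf("%d %s df=%d idf=%.4f\n", $r['doc_id'], $r['word'], $r['df'], $r['idf']);
}
```

The rows come out in the order the question asks for: A and B for doc_id 1, then A, C, D, E, G for doc_id 2. Separately, the mysql_* functions used in the question were removed in PHP 7, so running the query from current PHP would need mysqli or PDO instead.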

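Regarding "php - Dividing the total number of documents by the number of documents containing a stem", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/12386129/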
