
algorithm - How do Amazon's Statistically Improbable Phrases work?

Reposted. Author: 塔克拉玛干. Updated: 2023-11-03 02:15:55

How does something like Statistically Improbable Phrases work?

According to Amazon:

Amazon.com's Statistically Improbable Phrases, or "SIPs", are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.

SIPs are not necessarily improbable within a particular book, but they are improbable relative to all books in Search Inside!. For example, most SIPs for a book on taxes are tax related. But because we display SIPs in order of their improbability score, the first SIPs will be on tax topics that this book mentions more often than other tax books. For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements.

For Joel's first book, for example, the SIPs are: leaky abstractions, antialiased text, own dog food, bug count, daily builds, bug database, software schedules

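One interesting complication is that these are phrases of either 2 or 3 words. That makes the problem harder, because such phrases can overlap with or contain one another.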

Best answer

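It is a lot like the way Lucene ranks documents for a given search query. Lucene uses a metric called TF-IDF, where TF is term frequency and IDF is inverse document frequency. The former ranks a document higher the more often the query terms occur in that document; the latter ranks a document higher if its query terms appear infrequently across all documents. The specific way IDF is calculated is log(number of documents / number of documents containing the term), i.e. the log of the inverse of the fraction of documents in which the term appears.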
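To make the formula concrete, here is a minimal sketch of the TF-IDF idea described above. It is not Lucene's exact scoring and not Amazon's actual SIP algorithm; the function names and the corpus numbers are hypothetical, chosen only for illustration.

```python
import math

def idf(num_docs, docs_with_term):
    """Inverse document frequency: log(N / number of docs containing the term)."""
    return math.log(num_docs / docs_with_term)

def tf_idf(term_count_in_doc, num_docs, docs_with_term):
    """Raw term count in one document, weighted by how rare the term is overall."""
    return term_count_in_doc * idf(num_docs, docs_with_term)

# Hypothetical corpus of 1000 books; both phrases occur 12 times in one book.
rare_phrase_score = tf_idf(12, 1000, 3)      # phrase found in only 3 books
common_phrase_score = tf_idf(12, 1000, 900)  # phrase found in 900 books

print(rare_phrase_score > common_phrase_score)
```

With equal counts inside the book, the phrase that appears in fewer books gets the higher score, which is the behaviour the SIP description asks for.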

So in your example, those phrases are SIPs for Joel's book because they are rare phrases (appearing in few books) that occur many times in his book.

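Edit: in response to the question about 2-grams and 3-grams, overlap doesn't matter. Consider the sentence "My two dogs are brown". Here the list of 2-grams is ["my two", "two dogs", "dogs are", "are brown"], and the list of 3-grams is ["my two dogs", "two dogs are", "dogs are brown"]. As I mentioned in my comment, with overlapping you get N-1 2-grams and N-2 3-grams for a stream of N words. Because a 2-gram can only equal another 2-gram, and likewise for 3-grams, you can handle each of these cases separately. When processing 2-grams, every "word" is a 2-gram, and so on.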
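The overlapping n-gram lists above can be produced with a short sliding window. This is only a sketch: the `ngrams` helper is a hypothetical name, and splitting on whitespace stands in for whatever tokenization a real system would use.

```python
from collections import Counter

def ngrams(words, n):
    """Return every overlapping run of n consecutive words, joined by spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "my two dogs are brown".split()  # N = 5 words

bigrams = ngrams(words, 2)   # N - 1 = 4 two-word phrases
trigrams = ngrams(words, 3)  # N - 2 = 3 three-word phrases

# Each n-gram is then counted as if it were a single "word".
bigram_counts = Counter(bigrams)
trigram_counts = Counter(trigrams)

print(bigrams)
print(trigrams)
```

Keeping the 2-gram and 3-gram counts in separate tables reflects the point that a 2-gram is only ever compared with other 2-grams.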

For more on "algorithm - How do Amazon's Statistically Improbable Phrases work?", see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/2009498/
