python - How to efficiently identify substrings in the order of the string in Python

Reposted by: 塔克拉玛干 | Updated: 2023-11-03 04:52:57

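This is related to my previous question: How to identify substrings in the order of the string?

Given a set of sentences and a set of selected_concepts, I want to identify the selected_concepts in the order in which they appear in each of the sentences.

The code below does this correctly.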

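output = []
for sentence in sentences:
    # Collect (position, concept) pairs for every concept found in this sentence.
    sentence_tokens = []
    for item in selected_concepts:
        index = sentence.find(item)  # first occurrence, or -1 if absent
        if index >= 0:
            sentence_tokens.append((index, item))
    # Sort by position, then keep only the concept strings.
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    output.append(sentence_tokens)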

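However, my real dataset has 13242627 selected_concepts and 1234952 sentences, so I would like to know whether this code can be optimised to run in less time. As I understand it, this is O(n^2), so my concern is time complexity (space complexity is not an issue for me).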
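To put a rough number on that concern: the nested loop calls sentence.find once per (sentence, concept) pair, so the work grows with the product of the two collection sizes rather than with a single n. A back-of-the-envelope sketch using the counts quoted above (the variable names are illustrative only):

```python
# Counts quoted in the question.
num_concepts = 13_242_627
num_sentences = 1_234_952

# The nested loop performs one str.find call per (sentence, concept) pair.
find_calls = num_concepts * num_sentences

print(f"{find_calls:,} find calls (~{find_calls:.2e})")
```

That is on the order of 10^13 substring searches before any sorting is done.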

A sample is given below.

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']

selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']

output = [['data mining','process','patterns','methods','machine learning','database systems'],['data mining','interdisciplinary subfield','information'],['data mining','knowledge discovery','databases process']]

Best answer

How about using a precompiled regex?

Here is an example:

import re

sentences = [
    'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
    'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
    'data mining is the analysis step of the knowledge discovery in databases process or kdd']

selected_concepts = [
    'machine learning',
    'patterns',
    'data mining',
    'methods',
    'database systems',
    'interdisciplinary subfield',
    'knowledege discovery',  # misspelled in the question ("knowledge"), so it never matches
    'databases process',
    'information',
    'process']

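# Escape each concept so that it is matched literally.
re_concepts = [re.escape(t) for t in selected_concepts]

# Compile one alternation pattern once and reuse its bound findall method.
find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall

output = [find_all_concepts(sentence) for sentence in sentences]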

You get:

[['data mining',
  'process',
  'patterns',
  'methods',
  'machine learning',
  'database systems'],
 ['data mining', 'interdisciplinary subfield', 'information', 'information'],
 ['data mining', 'databases process']]
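Two behaviours of this output differ from the question's loop. findall reports every non-overlapping occurrence, so 'information' appears twice, whereas the loop records each concept at most once. Alternation also tries the alternatives from left to right, so a shorter concept listed before a longer one that starts with it would win. The following is a minimal sketch, not part of the original answer, assuming you want one entry per concept and the longest concept at each position; compile_concepts and find_concepts_once are hypothetical helper names.

```python
import re

def compile_concepts(concepts):
    """Build one alternation pattern, longest concepts first (assumed preference)."""
    ordered = sorted(set(concepts), key=len, reverse=True)
    return re.compile('|'.join(re.escape(c) for c in ordered))

def find_concepts_once(pattern, sentence):
    """Return matched concepts in order of appearance, each at most once."""
    seen = set()
    result = []
    for match in pattern.finditer(sentence):
        concept = match.group(0)
        if concept not in seen:
            seen.add(concept)
            result.append(concept)
    return result

sentences = [
    'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
    'data mining is the analysis step of the knowledge discovery in databases process or kdd']

selected_concepts = [
    'data mining',
    'interdisciplinary subfield',
    'databases process',
    'information',
    'process']

pattern = compile_concepts(selected_concepts)
output = [find_concepts_once(pattern, sentence) for sentence in sentences]
```

Because regex matches do not overlap, 'process' is still not reported separately when it is consumed as part of 'databases process'.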

Regarding "python - How to efficiently identify substrings in the order of the string in Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54059612/
