
Python3 - Grouping similar strings together

Reposted. Author: 行者123. Updated: 2023-12-04 01:27:16

What I'm trying to do is group together strings from a fiction site. Post titles usually follow a format like this:

titles = ['Series Name: Part 1 - This is the chapter name',
'[OC] Series Name - Part 2 - Another name with the word chapter and extra oc at the start',
"[OC] Series Name = part 3 = punctuation could be not matching, so we can't always trust common substrings",
'{OC} Another cool story - Part I - This is the chapter name',
'{OC} another cool story: part II: another post title',
'{OC} another cool story part III but the author forgot delimiters',
"this is a one-off story, so it doesn't have any friends"]

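Delimiters and the like aren't always present, and there can be some variation.

I start by normalizing the strings down to alphanumeric characters.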
import re
from pprint import pprint as pp

titles = [] # from above

normalized = []
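for title in titles:
    # Drop the standalone "OC" tag (case-sensitive, so a lowercase "oc" survives).
    title = re.sub(r'\bOC\b', '', title)
    # Collapse every run of characters other than letters, digits and apostrophes into one space.
    title = re.sub(r'[^a-zA-Z0-9\']+', ' ', title)
    title = title.strip()
    normalized.append(title)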

pp(normalized)

This gives:
   ['Series Name Part 1 This is the chapter name',
'Series Name Part 2 Another name with the word chapter and extra oc at the start',
"Series Name part 3 punctuation could be not matching so we can't always trust common substrings",
'Another cool story Part I This is the chapter name',
'another cool story part II another post title',
'another cool story part III but the author forgot delimiters',
"this is a one off story so it doesn't have any friends"]

The output I'm hoping for is:
['Series Name', 
'Another cool story',
"this is a one-off story, so it doesn't have any friends"] # last element optional

I know of a few different ways to compare strings...

difflib.SequenceMatcher.ratio()

Levenshtein edit distance

I've also heard of Jaro-Winkler and FuzzyWuzzy.

But all that really matters is that we can get a number expressing how similar two strings are.
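As a minimal sketch of "getting a number", the standard library's difflib.SequenceMatcher.ratio() returns a score between 0 and 1. The helper name similarity and the choice to lowercase both inputs are assumptions made for illustration, not part of the original question.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means the lowercased strings are identical."""
    # Lowercasing is an assumption: it makes "Part" and "part" compare equal.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


same_series = similarity("Series Name Part 1", "Series Name part 2")
different_series = similarity("Series Name Part 1", "another cool story part II")

print(f"same series:      {same_series:.2f}")
print(f"different series: {different_series:.2f}")
```

Any of the other metrics mentioned (Levenshtein, Jaro-Winkler) could be swapped in behind the same function signature.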

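I think I need to come up with (most of) a 2D matrix comparing every string against every other. But once I have that, I can't work out how to actually split them into groups.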
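The 2D matrix described here could be built roughly as follows. This is a sketch that assumes difflib.SequenceMatcher as the metric and compares the lowercased, normalized titles; similarity_matrix is a hypothetical helper name. Each pair is scored once and mirrored, which is why only "most of" the matrix needs computing.

```python
from difflib import SequenceMatcher

# Normalized titles from the question.
normalized = [
    "Series Name Part 1 This is the chapter name",
    "Series Name Part 2 Another name with the word chapter and extra oc at the start",
    "Series Name part 3 punctuation could be not matching so we can't always trust common substrings",
    "Another cool story Part I This is the chapter name",
    "another cool story part II another post title",
    "another cool story part III but the author forgot delimiters",
    "this is a one off story so it doesn't have any friends",
]


def similarity_matrix(strings):
    """Return an N x N list of lists holding pairwise SequenceMatcher ratios."""
    lowered = [s.lower() for s in strings]
    n = len(lowered)
    matrix = [[1.0] * n for _ in range(n)]  # a string is identical to itself
    for i in range(n):
        for j in range(i + 1, n):
            score = SequenceMatcher(None, lowered[i], lowered[j]).ratio()
            # Score each pair once and mirror it so the matrix is symmetric.
            matrix[i][j] = matrix[j][i] = score
    return matrix


matrix = similarity_matrix(normalized)
for row in matrix:
    print(" ".join(f"{score:.2f}" for score in row))
```

Because the chapter names differ, whole-title scores within one series may not be very high, so a grouping step would likely need either a tuned threshold or a comparison restricted to the start of each title.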

I found another post that seems to have done the first part... but I'm not sure how to continue from there.

scipy.cluster looked promising at first... but then I got a bit overwhelmed.

Another thought was to somehow combine itertools.combinations() and functools.reduce() with one of the distance metrics above.
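One way the itertools.combinations() idea might look, as a sketch rather than a definitive solution: score every pair, and merge any two titles whose score clears a threshold into the same group using a small union-find. It rests on assumptions that are not in the question: only the first three words of each title are compared (leading_key, n_words), the threshold is 0.8, and difflib is the metric. functools.reduce() is not needed in this version.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Normalized titles from the question.
normalized = [
    "Series Name Part 1 This is the chapter name",
    "Series Name Part 2 Another name with the word chapter and extra oc at the start",
    "Series Name part 3 punctuation could be not matching so we can't always trust common substrings",
    "Another cool story Part I This is the chapter name",
    "another cool story part II another post title",
    "another cool story part III but the author forgot delimiters",
    "this is a one off story so it doesn't have any friends",
]


def leading_key(title, n_words=3):
    """Hypothetical helper: the first n_words words of a title, lowercased."""
    return " ".join(title.lower().split()[:n_words])


def group_similar(titles, threshold=0.8, n_words=3):
    """Group titles whose leading words score at least `threshold` with each other."""
    keys = [leading_key(t, n_words) for t in titles]
    parent = list(range(len(titles)))  # union-find: each title starts in its own group

    def find(i):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(titles)), 2):
        if SequenceMatcher(None, keys[i], keys[j]).ratio() >= threshold:
            parent[find(j)] = find(i)  # merge j's group into i's group

    groups = {}
    for i, title in enumerate(titles):
        groups.setdefault(find(i), []).append(title)
    return list(groups.values())


groups = group_similar(normalized)
for group in groups:
    print(group)
```

Unlike k-means, this does not need the number of groups up front, and a one-off title simply ends up in a group of its own. The fixed three-word window is the weak point: series names of very different lengths would need a different key.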

Am I overthinking this? It seems like it should be simple, but it just isn't making sense in my head.

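Best answer

This is an implementation of the idea proposed in CKM's answer: https://stackoverflow.com/a/61671971/42346

First strip out the punctuation, which isn't important for your purposes, using this answer: https://stackoverflow.com/a/15555162/42346

Then we'll cluster similar sentences using one of the techniques described here: https://blog.eduonix.com/artificial-intelligence/clustering-similar-sentences-together-using-machine-learning/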

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+') # only alphanumeric characters

lol_tokenized = []
for title in titles:
    # Each title becomes a list of word tokens with punctuation removed.
    lol_tokenized.append(tokenizer.tokenize(title))

Then get a numeric representation of the titles:
import numpy as np 
from gensim.models import Word2Vec

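# gensim 4+ names this parameter vector_size (gensim 3.x called it size).
m = Word2Vec(lol_tokenized, vector_size=50, min_count=1, cbow_mean=1)

def vectorizer(sent, m):
    """Average the word vectors of one tokenized title."""
    vec = []
    numw = 0
    for w in sent:
        try:
            if numw == 0:
                vec = m.wv[w]  # word vectors are looked up through m.wv
            else:
                vec = np.add(vec, m.wv[w])
            numw += 1
        except Exception as e:
            # A word missing from the vocabulary is reported and skipped.
            print(e)
    return np.asarray(vec) / numw

# One averaged vector per title.
l = []
for i in lol_tokenized:
    l.append(vectorizer(i, m))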

X = np.array(l)

Whoo-boy, that was a lot.
Now you have to do the clustering.
from sklearn.cluster import KMeans

clf = KMeans(n_clusters=2,init='k-means++',n_init=100,random_state=0)
labels = clf.fit_predict(X)
print(labels)
for index, sentence in enumerate(lol_tokenized):
    print(str(labels[index]) + ":" + str(sentence))

[1 1 0 1 0 0 0]
1:['Series', 'Name', 'Part', '1', 'This', 'is', 'the', 'chapter', 'name']
1:['OC', 'Series', 'Name', 'Part', '2', 'Another', 'name', 'with', 'the', 'word', 'chapter', 'and', 'extra', 'oc', 'at', 'the', 'start']
0:['OC', 'Series', 'Name', 'part', '3', 'punctuation', 'could', 'be', 'not', 'matching', 'so', 'we', 'can', 't', 'always', 'trust', 'common', 'substrings']
1:['OC', 'Another', 'cool', 'story', 'Part', 'I', 'This', 'is', 'the', 'chapter', 'name']
0:['OC', 'another', 'cool', 'story', 'part', 'II', 'another', 'post', 'title']
0:['OC', 'another', 'cool', 'story', 'part', 'III', 'but', 'the', 'author', 'forgot', 'delimiters']
0:['this', 'is', 'a', 'one', 'off', 'story', 'so', 'it', 'doesn', 't', 'have', 'any', 'friends']

Then you can pull out the ones whose label is 1:
for index, sentence in enumerate(lol_tokenized):
    if labels[index] == 1:
        print(sentence)

['Series', 'Name', 'Part', '1', 'This', 'is', 'the', 'chapter', 'name']
['OC', 'Series', 'Name', 'Part', '2', 'Another', 'name', 'with', 'the', 'word', 'chapter', 'and', 'extra', 'oc', 'at', 'the', 'start']
['OC', 'Another', 'cool', 'story', 'Part', 'I', 'This', 'is', 'the', 'chapter', 'name']
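Once titles have been grouped by series, the series name asked for in the question could be recovered from each group. The sketch below is one possible approach and not part of the original answer: it keeps the leading words that all titles in a group share (ignoring case) and stops at the first chapter marker. The function series_name and the MARKERS set are hypothetical, and the groups are written out by hand from the question's normalized titles.

```python
# Assumption: words that introduce a chapter number and so end the series name.
MARKERS = {"part", "chapter"}


def series_name(titles):
    """Return the leading words shared by every title, stopping at a chapter marker."""
    shared = []
    # zip walks the titles word by word in parallel and stops at the shortest title.
    for words in zip(*(title.split() for title in titles)):
        lowered = {word.lower() for word in words}
        if len(lowered) != 1 or lowered & MARKERS:
            break
        shared.append(words[0])  # keep the first title's capitalization
    return " ".join(shared)


series_a = [
    "Series Name Part 1 This is the chapter name",
    "Series Name Part 2 Another name with the word chapter and extra oc at the start",
    "Series Name part 3 punctuation could be not matching so we can't always trust common substrings",
]
series_b = [
    "Another cool story Part I This is the chapter name",
    "another cool story part II another post title",
    "another cool story part III but the author forgot delimiters",
]

print(series_name(series_a))
print(series_name(series_b))
```

A group holding a single one-off title comes back unchanged, unless the title itself contains one of the marker words.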

This post on grouping similar strings in Python3 is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/61671722/
