
python - Finding matching words using ngrams

Reposted. Author: 太空宇宙. Updated: 2023-11-04 02:43:02

Dataset:

from nltk import word_tokenize
from nltk.util import ngrams
df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df[['Id', 'bigram']]

Id bigram
1952043 [(Swimming,Pool),(Pool,in),(in,the),(the,roof),(roof,top),
1918916 [(Luxury,Apartments),(Apartments,consisting),(consisting,11),
1645751 [(Flat,available),(available,sale),(sale,Medavakkam),
1270503 [(Toddler,Pool),(Pool,with),(with,Jogging),(Jogging,Tracks),
1495638 [(near,medavakkam),(medavakkam,junction),(junction,calm),
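
For readers without NLTK at hand, the bigram column above can be pictured as pairs of adjacent tokens. The following is a minimal sketch, assuming plain whitespace splitting instead of word_tokenize; the helper name bigrams is hypothetical and not part of the original question.

```python
def bigrams(text):
    """Return adjacent word pairs from text, split on whitespace."""
    tokens = text.split()
    # Pair each token with its successor; a single token yields no pairs.
    return list(zip(tokens, tokens[1:]))


pairs = bigrams("Toddler Pool with Jogging Tracks")
print(pairs)
```

Each element is a tuple of two strings, which matters for the matching step later on: a tuple has to be joined back into one string before it can be compared with a phrase such as 'Toddler Pool'.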

I have a Python file (Categories.py) that contains an unsupervised categorization of property/land features.

category = [('Luxury Apartments', 'IN', 'Recreation_Ammenities'),
('Swimming Pool', 'IN','Recreation_Ammenities'),
('Toddler Pool', 'IN', 'Recreation_Ammenities'),
('Jogging Tracks', 'IN', 'Recreation_Ammenities')]
Recreation = [e1 for (e1, rel, e2) in category if e2=='Recreation_Ammenities']

Finding the matching words between the bigram column and the category list:

tokens=pd.Series(df["bigram"])
Lid=pd.Series(df["Id"])
matches = tokens.apply(lambda x: pd.Series(x).str.extractall("|".join(["({})".format(cat) for cat in Categories.Recreation])))

Running the code above raises this error:

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

I need help with this.

The output I want is:

 Id       bigram                                  Recreation_Amenities
1952043 [(Swimming,Pool),(Pool,in),(in,the),.. Swimming Pool
1918916 [(Luxury,Apartments),(Apartments,.. Luxury Apartments
1645751 [(Flat,available),(available,sale)..
1270503 [(Toddler,Pool),(Jogging,Tracks).. Toddler Pool,Jogging Tracks
1495638 [(near,medavakkam),..

Best answer

Something along these lines should work for you:

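def match_bigrams(row):
    # Collect every bigram in this row that appears in the Recreation list.
    categories = []

    for bigram in row.bigram:
        # Join the two tokens with a space: ('Swimming', 'Pool') -> 'Swimming Pool'.
        joined = ' '.join(bigram)
        if joined in Recreation:
            categories.append(joined)

    return categories

# axis=1 passes each row to match_bigrams in turn.
df['Recreation_Amenities'] = df.apply(match_bigrams, axis=1)
print(df)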


Id bigram Recreation_Amenities
0 1952043 [(Swimming, Pool), (Pool, in), (in, the), (the... [Swimming Pool]
1 1918916 [(Luxury, Apartments), (Apartments, consisting... [Luxury Apartments]
2 1645751 [(Flat, available), (available, sale), (sale, ... []
3 1270503 [(Toddler, Pool), (Pool, with), (with, Jogging... [Toddler Pool, Jogging Tracks]
4 1495638 [(near, medavakkam), (medavakkam, junction), (... []

Each bigram is joined with a single space so that it can be tested for membership in your category list (i.e. if joined in Recreation).
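
The join-and-test idea can also be tried without a DataFrame. The sketch below is an illustration under stated assumptions, not the answer's exact code: it stores the phrases in a set rather than a list (a swapped-in choice for faster membership tests), uses the hypothetical names recreation, rows, and matched, and takes a bigram list plus a phrase collection as arguments instead of a DataFrame row. The two sample rows reuse bigrams shown in the question.

```python
# Category triples as given in the question (identifier spelling kept as-is).
category = [('Luxury Apartments', 'IN', 'Recreation_Ammenities'),
            ('Swimming Pool', 'IN', 'Recreation_Ammenities'),
            ('Toddler Pool', 'IN', 'Recreation_Ammenities'),
            ('Jogging Tracks', 'IN', 'Recreation_Ammenities')]

# Assumption: a set instead of the original list, for constant-time lookups.
recreation = {e1 for (e1, rel, e2) in category if e2 == 'Recreation_Ammenities'}


def match_bigrams(bigrams, phrases):
    """Return the space-joined bigrams found in phrases, in their original order."""
    return [' '.join(b) for b in bigrams if ' '.join(b) in phrases]


# Hypothetical stand-in for two DataFrame rows, keyed by Id.
rows = {
    1270503: [('Toddler', 'Pool'), ('Pool', 'with'),
              ('with', 'Jogging'), ('Jogging', 'Tracks')],
    1645751: [('Flat', 'available'), ('available', 'sale'),
              ('sale', 'Medavakkam')],
}

matched = {row_id: match_bigrams(b, recreation) for row_id, b in rows.items()}
print(matched)
# {1270503: ['Toddler Pool', 'Jogging Tracks'], 1645751: []}
```

With a DataFrame, the same function could presumably be applied column-wise, for example df['bigram'].apply(lambda b: match_bigrams(b, recreation)), which avoids passing whole rows through df.apply.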

Regarding "python - Finding matching words using ngrams", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45902146/
