
python - Run a function for each element in two lists in Pandas DataFrame columns

Repost. Author: 行者123. Updated: 2023-12-03 14:42:00

df :

col1
['aa', 'bb', 'cc', 'dd']
['this', 'is', 'a', 'list', '2']
['this', 'list', '3']

col2
[['ee', 'ff', 'gg', 'hh'], ['qq', 'ww', 'ee', 'rr']]
[['list', 'a', 'not', '1'], ['not', 'is', 'this', '2']]
[['this', 'is', 'list', 'not'], ['a', 'not', 'list', '2']]
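What I'm trying to do:
I am trying to run the code below on each element (word) in col1 against each corresponding element in each sublist in col2, and put the scores in a new column.
So for the first row in col1, run the get_top_matches function on: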
`col1` "aa" and `col2` "ee" and "qq"
`col1` "bb" and `col2` "ff" and "ww"
`col1` "cc" and `col2` "gg" and "ee"
`col1` "dd" and `col2` "hh" and "rr"
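The pairing listed above can be sketched in plain Python. This is only an illustration of the position-wise grouping, not part of the original post: `col1_row`, `col2_row`, and `pairs` are hypothetical names, and no scoring is done here.

```python
# Sample values copied from the first row of the question's DataFrame.
col1_row = ['aa', 'bb', 'cc', 'dd']
col2_row = [['ee', 'ff', 'gg', 'hh'], ['qq', 'ww', 'ee', 'rr']]

# zip(*col2_row) transposes the sublists, so position i holds the i-th word
# of every sublist. zip() stops at the shortest input, so a trailing word
# without a counterpart (as in rows 2 and 3 of the question) is dropped.
pairs = [(word, list(candidates))
         for word, candidates in zip(col1_row, zip(*col2_row))]

for word, candidates in pairs:
    print(word, candidates)
```

Each `(word, candidates)` pair has the shape that `get_top_matches(reference, value_list)` takes, so under this reading the function would be called once per pair.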
What the new column should look like:
I am not sure what the scores for rows 2 and 3 should be.
score_col
[1.0, 1.0, 1.0, 1.0]
[.34, .33, .27, .24, .23] #not sure
[.23, .13, .26] #not sure
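What I tried before:
I did this when col1 was just a string scored against each list element in col2, as shown below, but I don't know how to run it for list elements against the corresponding sublist elements: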
df.agg(lambda x: get_top_matches(*x), axis=1)
.
.
.
.
Function code
Here is the get_top_matches function. Just run the whole thing; I only call the last function for this question:
# jaro version
import math
import re


def sort_token_alphabetically(word):
    # Split on commas, periods, and spaces, then rejoin the tokens in sorted order.
    token = re.split('[,. ]', word)
    sorted_token = sorted(token)
    return ' '.join(sorted_token)


def get_jaro_distance(first, second, winkler=True, winkler_ajustment=True,
                      scaling=0.1, sort_tokens=True):
    """
    :param first: word to calculate distance for
    :param second: word to calculate distance with
    :param winkler: same as winkler_ajustment
    :param winkler_ajustment: add an adjustment factor to the Jaro of the distance
    :param scaling: scaling factor for the Winkler adjustment
    :param sort_tokens: sort the tokens of both words alphabetically first
    :return: Jaro distance adjusted (or not)
    """
    # Validate before sorting, so None inputs raise the intended exception.
    if not first or not second:
        raise JaroDistanceException(
            "Cannot calculate distance from NoneType ({0}, {1})".format(
                first.__class__.__name__,
                second.__class__.__name__))

    if sort_tokens:
        first = sort_token_alphabetically(first)
        second = sort_token_alphabetically(second)

    jaro = _score(first, second)
    # The common prefix length is capped at 4 characters.
    cl = min(len(_get_prefix(first, second)), 4)

    if all([winkler, winkler_ajustment]):  # 0.1 as scaling factor
        return round((jaro + (scaling * cl * (1.0 - jaro))) * 100.0) / 100.0

    return jaro


def _score(first, second):
    shorter, longer = first.lower(), second.lower()

    if len(first) > len(second):
        longer, shorter = shorter, longer

    m1 = _get_matching_characters(shorter, longer)
    m2 = _get_matching_characters(longer, shorter)

    if len(m1) == 0 or len(m2) == 0:
        return 0.0

    return (float(len(m1)) / len(shorter) +
            float(len(m2)) / len(longer) +
            float(len(m1) - _transpositions(m1, m2)) / len(m1)) / 3.0


def _get_diff_index(first, second):
    # -1 signals identical strings; _get_prefix checks for this value.
    if first == second:
        return -1

    if not first or not second:
        return 0

    max_len = min(len(first), len(second))
    for i in range(0, max_len):
        if not first[i] == second[i]:
            return i

    return max_len


def _get_prefix(first, second):
    if not first or not second:
        return ""

    index = _get_diff_index(first, second)
    if index == -1:
        return first

    elif index == 0:
        return ""

    else:
        return first[0:index]


def _get_matching_characters(first, second):
    common = []
    limit = math.floor(min(len(first), len(second)) / 2)

    for i, char in enumerate(first):
        left, right = int(max(0, i - limit)), int(
            min(i + limit + 1, len(second)))
        if char in second[left:right]:
            common.append(char)
            # Mask the matched character so it cannot be matched twice.
            second = second[0:second.index(char)] + '*' + second[
                second.index(char) + 1:]

    return ''.join(common)


def _transpositions(first, second):
    return math.floor(
        len([(f, s) for f, s in zip(first, second) if not f == s]) / 2.0)


def get_top_matches(reference, value_list, max_results=None):
    scores = []
    if not max_results:
        max_results = len(value_list)
    for val in value_list:
        # Keep the better of the token-sorted and unsorted scores.
        score_sorted = get_jaro_distance(reference, val)
        score_unsorted = get_jaro_distance(reference, val, sort_tokens=False)
        scores.append((val, max(score_sorted, score_unsorted)))
    scores.sort(key=lambda x: x[1], reverse=True)

    return scores[:max_results]


class JaroDistanceException(Exception):
    def __init__(self, message):
        super().__init__(message)
.
.
.
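To make the Winkler adjustment inside get_jaro_distance easier to follow, here is the same arithmetic isolated in a small sketch. `winkler_adjust` is a hypothetical helper that is not part of the posted code, and the Jaro score of 0.9444 with a 3-character shared prefix is just an illustrative input.

```python
def winkler_adjust(jaro, prefix_len, scaling=0.1):
    """Hypothetical helper mirroring the return expression in get_jaro_distance."""
    # The shared-prefix length is capped at 4, as in the posted code.
    cl = min(prefix_len, 4)
    # Boost the score in proportion to the prefix length, then round to 2 decimals.
    return round((jaro + scaling * cl * (1.0 - jaro)) * 100.0) / 100.0


adjusted = winkler_adjust(0.9444, 3)
print(adjusted)  # 0.9444 + 0.1 * 3 * 0.0556 = 0.96108, which rounds to 0.96
```

Because of the final rounding, scores returned by get_top_matches have at most two decimal places, which matches the values shown in the attempts below.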

Attempt 1
Just trying to get it to compare against each word in the lists rather than each letter:
[[[df1.agg(lambda x: get_top_matches(u, w), axis=1) for u, w in zip(x, v)] for v in y] for x, y in zip(df1['parent_org_name_list'], df1['children_org_name_sublists'])]
Results
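Attempt 2
Changing the get_top_matches function to say for val in value_list.split(): gave the result below. It grabs the first word and compares it with the first word in each sublist in col2, 5 times (not sure why 5 times):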
[
[0 [(myalyk, 0.73)]1 [(myalyk, 0.73)]2 [(myalyk, 0.73)]3 [(myalyk, 0.73)]4 [(myalyk, 0.73)]dtype: object]
, [0 [(myliu, 0.79)]1 [(myliu, 0.79)]2 [(myliu, 0.79)]3 [(myliu, 0.79)]4 [(myliu, 0.79)]dtype: object]
, [0 [(myllc, 0.97)]1 [(myllc, 0.97)]2 [(myllc, 0.97)]3 [(myllc, 0.97)]4 [(myllc, 0.97)]dtype: object]
, [0 [(myloc, 0.88)]1 [(myloc, 0.88)]2 [(myloc, 0.88)]3 [(myloc, 0.88)]4 [(myloc, 0.88)]dtype: object]
]
I just need the function to run on each word in the sublists.
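One likely explanation for the fivefold repetition, assuming df1 has five rows: the lambda passed to df1.agg(..., axis=1) ignores its row argument, so it is evaluated once per row and returns the same value each time. The sketch below reproduces that effect with a made-up five-row frame. `df_demo`, `u`, and `w` are hypothetical, and df.apply is used here on the assumption that df.agg with a callable and axis=1 behaves the same way.

```python
import pandas as pd

# Hypothetical five-row frame standing in for df1.
df_demo = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e']})

# Values captured from an outer loop, as in the list comprehension of attempt 1.
u, w = 'myalyk', 'myalyk llc'

# The function never looks at `row`, so every row gets the same result.
out = df_demo.apply(lambda row: f'{u} vs {w}', axis=1)
print(out)
```

If this is what happened, dropping the inner df1.agg call and calling get_top_matches directly inside the comprehension should remove the repetition.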
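Attempt 3
I removed the attempt 2 code from get_top_matches and changed the attempt 1 list comprehension to the version below. It grabs the first word in each of the first 3 sublists in col2; I need to compare each word in the col1 list against each word in the col2 sublists: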
[[[df.agg(lambda x: get_top_matches(u, v), axis=1) for u in x]
  for v in zip(*y)]
 for x, y in zip(df['col1'], df['col2'])
]
Attempt 3 results
[[0    [(myllc, 0.97), (myloc, 0.88), (myliu, 0.79), 
...1 [(myllc, 0.97), (myloc, 0.88), (myliu, 0.79),
...2 [(myllc, 0.97), (myloc, 0.88), (myliu, 0.79),
...3 [(myllc, 0.97), (myloc, 0.88), (myliu, 0.79),
...4 [(myllc, 0.97), (myloc, 0.88), (myliu, 0.79),
...dtype: object]]
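Expected
(In this example, row 1 has 4 sublists and row 2 has 2 sublists. The function runs each word in column 1 against each word in each sublist in column 2 and puts the results in sublists in the new column.)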
[[['myalyk',.97], ['oleksandr',.54], ['nychyporovych',.3], ['pp',0]], [['myliu',.88], ['srl',.43]], [['myllc',1.0]], [['myloc',1.0], ['manag',.45], ['IT',.1], ['ag',0]]], 
[[['ltd',.34], ['yuriapharm',.76]], [['yuriypra',.65], ['law',.54], ['offic',.45], ['pc',.34]]],
...

Best answer

This works:

import pandas as pd

# Generate DataFrame (data holds the question's rows; it is not defined in the answer)
df = pd.DataFrame(data, columns=['col1', 'col2'])

# Clean Data (strip out trailing commas on some words)
df['col1'] = df['col1'].map(lambda lst: [x.rstrip(',') for x in lst])

# 1. List comprehension Technique
# zip provides pairs of col1, col2 rows
result = [[get_top_matches(u, [v]) for u in x for w in y for v in w] for x, y in zip(df['col1'], df['col2'])]

# 2. DataFrame Apply Technique
def func(x, y):
    return [get_top_matches(u, [v]) for u in x for w in y for v in w]

df['func_scores'] = df.apply(lambda row: func(row['col1'], row['col2']), axis=1)

# Verify two methods are equal
print(df['func_scores'].equals(pd.Series(result))) # True

print(df['func_scores'].to_string(index=False))
Thanks to everyone who helped.
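Note that the comprehension in the answer scores every word in col1 against every word of every sublist and returns one flat list per row. If the position-wise pairing described in the question is what is wanted, a variant along the following lines might be closer. This is a sketch under assumptions: `stand_in_top_matches` is a hypothetical substitute built on difflib so the example runs on its own (the question's get_top_matches could be swapped in), and `score_by_position` and `score_col` are illustrative names.

```python
from difflib import SequenceMatcher

import pandas as pd


def stand_in_top_matches(reference, value_list):
    """Hypothetical stand-in for get_top_matches, using difflib instead of Jaro-Winkler."""
    scores = [(val, round(SequenceMatcher(None, reference, val).ratio(), 2))
              for val in value_list]
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores


def score_by_position(words, sublists):
    # zip(*sublists) groups the i-th word of every sublist together;
    # zip() then pairs that group with the i-th word of col1.
    return [stand_in_top_matches(word, list(candidates))
            for word, candidates in zip(words, zip(*sublists))]


# Rows copied from the question's sample DataFrame.
rows = [
    [['aa', 'bb', 'cc', 'dd'],
     [['ee', 'ff', 'gg', 'hh'], ['qq', 'ww', 'ee', 'rr']]],
    [['this', 'is', 'a', 'list', '2'],
     [['list', 'a', 'not', '1'], ['not', 'is', 'this', '2']]],
    [['this', 'list', '3'],
     [['this', 'is', 'list', 'not'], ['a', 'not', 'list', '2']]],
]
df = pd.DataFrame(rows, columns=['col1', 'col2'])

df['score_col'] = df.apply(
    lambda row: score_by_position(row['col1'], row['col2']), axis=1)
print(df['score_col'].to_string(index=False))
```

Because zip() stops at the shortest input, the fifth word in row 2 of col1 has no counterpart and is skipped. Whether that is the desired behavior depends on how mismatched lengths should be handled, which the question leaves open.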

Regarding python - Run a function for each element in two lists in Pandas DataFrame columns, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/63836145/
