
python - How do I binary search a Pandas DataFrame for a combination of column values?

Reposted. Author: 行者123. Updated: 2023-12-03 18:45:24

Sorry if this is a simple question that the Pandas documentation already answers, but I have tried to find out how to do this and have had no luck.

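I have a Pandas DataFrame with multiple columns, and I want to be able to find specific rows using binary search, because my dataset is large and I will be doing a lot of searches.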

My data looks like this:

Name           Course   Week  Grade
-------------  -------  ----  -----
Homer Simpson  MATH001     1     97
Homer Simpson  MATH001     3     85
Homer Simpson  CSCI100     1     89
John McGuirk   MATH001     2     78
John McGuirk   CSCI100     1    100
John McGuirk   CSCI100     2     96

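I want to be able to quickly search my data for a specific combination of name, course, and week. Each distinct combination of name, course, and week has either zero rows or one row in the dataset. If the combination I am searching for is missing, I want the search to return 0.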

For example, I want to search for the value (John McGuirk, CSCI100, 1).
Is there a built-in way to do this, or do I have to write my own binary search?

Update:

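I tried doing this with the built-in approach suggested by one of the commenters below. I also tried a custom binary search written for my specific data, and another custom binary search that uses recursion so it can handle a different number of columns than my specific example.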

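The DataFrame used for these tests contains 10,000 rows. The timings are listed below. Both binary searches performed better than using [...] to fetch the row. I am far from a Python expert, so I am not sure how well optimized my code is.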
# Load data
import math
import time

import pandas as pd

file = 'grades.xlsx'
df = pd.read_excel(file)

# This was suggested by one of the commenters below
def get_grade(name, course, week):
    # Build a boolean mask over every row, then take the first match (if any)
    mask = (df.name.values == name) & (df.course.values == course) & (df.week.values == week)
    row = df[mask]
    if not row.empty:
        return row.grade.values[0]
    else:
        return 0

# Binary search that is specific to my particular data
# (assumes df is sorted by name, then course, then week)
def get_grade_binary_search(name, course, week):
    lower = 0
    upper = len(df.index) - 1

    while lower <= upper:
        mid = (lower + upper) // 2

        # Compare column by column: name first, then course, then week
        row_name = df.iat[mid, 0]
        if name < row_name:
            upper = mid - 1
        elif name > row_name:
            lower = mid + 1
        else:
            row_course = df.iat[mid, 1]
            if course < row_course:
                upper = mid - 1
            elif course > row_course:
                lower = mid + 1
            else:
                row_week = df.iat[mid, 2]
                if week < row_week:
                    upper = mid - 1
                elif week > row_week:
                    lower = mid + 1
                else:
                    return df.iat[mid, 3]

    # No row matches this combination
    return 0

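# General purpose binary search
# (assumes df is sorted by its leading columns, in order)
def get_grade_binary_search_recursive(search_value):
    lower = 0
    upper = len(df.index) - 1

    while lower <= upper:
        mid = (lower + upper) // 2

        comparison = compare(search_value, 0, mid)

        if comparison < 0:
            upper = mid - 1
        elif comparison > 0:
            lower = mid + 1
        else:
            # The value sits in the column right after the key columns
            return df.iat[mid, len(search_value)]

    # No row matches; return 0 like the other search functions
    return 0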

# Utility method: compares search_value with row df_value_index, one column
# at a time, returning -1, 0, or 1
def compare(search_value, search_column_index, df_value_index):
    # Every key column matched
    if search_column_index >= len(search_value):
        return 0

    if search_value[search_column_index] < df.iat[df_value_index, search_column_index]:
        return -1
    elif search_value[search_column_index] > df.iat[df_value_index, search_column_index]:
        return 1
    else:
        # This column ties, so move on to the next one
        return compare(search_value, search_column_index + 1, df_value_index)
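
The column-by-column comparison in compare is the same lexicographic ordering that Python applies when it compares tuples, so the general-purpose search could in principle lean on the standard library's bisect module. The following is only a sketch under assumptions: the rows are the six sample rows from the table above, held as plain tuples rather than in a DataFrame, and the names rows, keys, grades, and get_grade_bisect are hypothetical.

```python
from bisect import bisect_left

# Sample rows from the question as (name, course, week, grade) tuples.
# sorted() orders them lexicographically, which binary search requires.
rows = sorted([
    ("Homer Simpson", "MATH001", 1, 97),
    ("Homer Simpson", "MATH001", 3, 85),
    ("Homer Simpson", "CSCI100", 1, 89),
    ("John McGuirk", "MATH001", 2, 78),
    ("John McGuirk", "CSCI100", 1, 100),
    ("John McGuirk", "CSCI100", 2, 96),
])

# Split each row into its search key and its value (hypothetical names).
keys = [row[:3] for row in rows]
grades = [row[3] for row in rows]


def get_grade_bisect(search_value):
    """Return the grade for (name, course, week), or 0 if it is missing."""
    key = tuple(search_value)
    # bisect_left finds the leftmost position where key could be inserted
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return grades[i]
    return 0


print(get_grade_bisect(["John McGuirk", "CSCI100", 1]))
print(get_grade_bisect(["Homer Simpson", "MATH001", 2]))
```

Tuple comparison stops at the first column that differs, which is what the recursive compare does by hand.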

Here are the timings. I also printed the sum of the values returned by each search to verify that the same rows were returned.
# Non binary search
sum_of_grades = 0
start = time.time()
for week in range(first_week, last_week + 1):
    for name in names:
        for course in courses:
            val = get_grade(name, course, week)
            sum_of_grades += val
end = time.time()
print('elapsed time: ', end - start)
print('sum of grades: ', sum_of_grades)

elapsed time: 26.130020141601562
sum of grades: 498724

# Binary search specific to this data
sum_of_grades = 0
start = time.time()
for week in range(first_week, last_week + 1):
    for name in names:
        for course in courses:
            val = get_grade_binary_search(name, course, week)
            sum_of_grades += val

end = time.time()
print('elapsed time: ', end - start)
print('sum of grades: ', sum_of_grades)

elapsed time: 4.4506165981292725
sum of grades: 498724

# Binary search with recursion
sum_of_grades = 0
start = time.time()
for week in range(first_week, last_week + 1):
    for name in names:
        for course in courses:
            val = get_grade_binary_search_recursive([name, course, week])
            sum_of_grades += val
end = time.time()
print('elapsed time: ', end - start)
print('sum_of_grades: ', sum_of_grades)

elapsed time: 7.559535264968872
sum_of_grades: 498724
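
One likely reason the non-binary search is the slowest is that get_grade compares every row against the key on every call. For comparison, here is a minimal sketch of a different built-in route that was not part of the tests above: move the three key columns into a MultiIndex once, then look keys up through that index. The sample data is the six rows from the question, the lowercase column names follow the question's code, and grades_by_key and get_grade_indexed are hypothetical names.

```python
import pandas as pd

# Sample rows from the question; column names follow the question's code.
df = pd.DataFrame({
    "name": ["Homer Simpson", "Homer Simpson", "Homer Simpson",
             "John McGuirk", "John McGuirk", "John McGuirk"],
    "course": ["MATH001", "MATH001", "CSCI100",
               "MATH001", "CSCI100", "CSCI100"],
    "week": [1, 3, 1, 2, 1, 2],
    "grade": [97, 85, 89, 78, 100, 96],
})

# Build the index once, up front; sorting keeps the MultiIndex ordered.
grades_by_key = df.set_index(["name", "course", "week"])["grade"].sort_index()


def get_grade_indexed(name, course, week):
    """Return the grade for the key, or 0 when the combination is absent."""
    # Series.get returns the default instead of raising on a missing key
    return grades_by_key.get((name, course, week), 0)


print(get_grade_indexed("John McGuirk", "CSCI100", 1))
print(get_grade_indexed("Homer Simpson", "MATH001", 2))
```

This relies on each combination appearing at most once, as stated in the question. How it would compare with the timings above is untested here.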

Best answer

Pandas has searchsorted.

From the Notes:

Binary search is used to find the required insertion points.
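
The answer does not show how searchsorted applies to a combination of columns, so what follows is only one possible sketch, not the answerer's method. It assumes the frame is first sorted by name, course, and week, and then narrows a row range one column at a time with numpy.searchsorted. The sample data is taken from the question; key_columns, grades, and lookup_grade are hypothetical names.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; column names follow the question's code.
df = pd.DataFrame({
    "name": ["Homer Simpson", "Homer Simpson", "Homer Simpson",
             "John McGuirk", "John McGuirk", "John McGuirk"],
    "course": ["MATH001", "MATH001", "CSCI100",
               "MATH001", "CSCI100", "CSCI100"],
    "week": [1, 3, 1, 2, 1, 2],
    "grade": [97, 85, 89, 78, 100, 96],
})

# searchsorted only gives meaningful results on sorted data.
df = df.sort_values(["name", "course", "week"]).reset_index(drop=True)
key_columns = [df[c].to_numpy() for c in ("name", "course", "week")]
grades = df["grade"].to_numpy()


def lookup_grade(name, course, week):
    """Return the grade for the key, or 0 when the combination is absent."""
    lo, hi = 0, len(grades)
    for column, value in zip(key_columns, (name, course, week)):
        # Within rows [lo, hi) this column is sorted, because every
        # earlier key column is constant over that range.
        window = column[lo:hi]
        left = np.searchsorted(window, value, side="left")
        right = np.searchsorted(window, value, side="right")
        lo, hi = lo + left, lo + right
        if lo == hi:  # empty range: no row has this value
            return 0
    return grades[lo]


print(lookup_grade("John McGuirk", "CSCI100", 1))
print(lookup_grade("Homer Simpson", "MATH001", 2))
```

Each column costs two binary searches, so a lookup stays logarithmic in the number of rows, and the sort is paid once up front.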

Regarding "python - How do I binary search a Pandas DataFrame for a combination of column values?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59490045/
