
python - How to filter a large dataset for records that match a particular key while iterating over every element of the dataset

Reposted. Author: 太空宇宙. Updated: 2023-11-04 01:35:38

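I have a large amount of data that is accessible through a generator/iterator. While processing the dataset, I need to determine whether any other record in it has the same value for some attribute as the record currently being processed. One way to do this is with nested for loops. For example, when processing a student database, I could do something like this: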

    def fillStudentList():
        # TODO: Add some code here to fill
        # a student list
        return []  # placeholder so the loops below run

    students1 = fillStudentList()
    sameLastNames = list()
    for student1 in students1:
        # Re-create the inner iterable on each pass; a generator can be consumed only once
        students2 = fillStudentList()
        for student2 in students2:
            if student1.lastName == student2.lastName:
                sameLastNames.append((student1, student2))

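Of course, the snippet above could be improved a great deal. Its only purpose is to show the nested for loop pattern.

Now suppose we have a Student class, a Students class that is an iterator, and a Source class that provides access to the data in a memory-efficient way (say, through another iterator)...

Below I have sketched out roughly what this code might look like. Does anyone have ideas on how to improve this implementation? The goal is to be able to find records that share an attribute value in a very large dataset, so that the filtered set can then be processed.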

    #!/usr/bin/env python3
    # Ported to Python 3: the built-in filter() is lazy (it replaces
    # itertools.ifilter), iterators define __next__, and print is a function.

    class Student(object):
        """
        A class that represents the first name, last name, and
        grade of a student.
        """
        def __init__(self, firstName, lastName, grade='K'):
            """
            Initializes a Student object
            """
            self.firstName = firstName
            self.lastName = lastName
            self.grade = grade

    class Students(object):
        """
        An iterator for a collection of students
        """
        def __init__(self, source):
            """
            Wraps a Source and starts iterating from its first record
            """
            self._source = source
            self._source_iter = source.get_iter()
            self._reset = False

        def __iter__(self):
            return self

        def __next__(self):
            try:
                if self._reset:
                    # The previous pass ran to the end, so start over
                    self._source_iter = self._source.get_iter()
                    self._reset = False
                return next(self._source_iter)
            except StopIteration:
                self._reset = True
                raise

        def select(self, attr, val):
            """
            Return all of the Students with a given
            attribute
            """
            #select_iter = self._source.get_iter()
            select_iter = self._source.filter(attr, val)
            for selection in select_iter:
                # if (getattr(selection, attr) == val):
                #     yield selection
                yield selection

    class Source(object):
        """
        A source of data that can provide an iterator to
        all of the data or provide an iterator to the
        data based on some attribute
        """
        def __init__(self, data):
            self._data = data

        def get_iter(self):
            """
            Return an iterator to the data
            """
            return iter(self._data)

        def filter(self, attr, val):
            """
            Return an iterator to the data filtered by some
            attribute
            """
            return filter(lambda rec: getattr(rec, attr) == val, self._data)

    def test_it():
        """
        Print each student, followed by the other students who
        share that student's first name
        """
        studentList = [Student("James", "Smith", "6"),
                       Student("Jill", "Jones", "6"),
                       Student("Bill", "Deep", "5"),
                       Student("Bill", "Sun", "5")]
        source = Source(studentList)
        students = Students(source)
        for student in students:
            print(student.firstName)

            for same_names in students.select('firstName', student.firstName):
                if same_names.lastName == student.lastName:
                    continue
                else:
                    print(" %s %s in grade %s has your same first name" %
                          (same_names.firstName, same_names.lastName, same_names.grade))

    if __name__ == '__main__':
        test_it()

Best answer

The nested loops are O(n**2). You can instead sort the records and use itertools.groupby for O(n log n) performance:

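    import operator
    from itertools import groupby

    by_last_name = operator.attrgetter('lastName')  # same key for sorting and grouping
    students = fill_student_list()
    same_last_names = [list(group) for lastname, group in
                       groupby(sorted(students, key=by_last_name), key=by_last_name)]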
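
To make the fragment above concrete, here is a minimal, self-contained sketch for Python 3. The namedtuple Student is only a stand-in for the class in the question, the four records are the sample data from test_it, and groups_sharing is a hypothetical helper name that does not appear in the answer. Note that sorted() pulls the whole dataset into memory, so this assumes the records fit in RAM.

```python
from collections import namedtuple
from itertools import groupby
from operator import attrgetter

# Stand-in for the question's Student class (assumption: same three fields).
Student = namedtuple("Student", ["firstName", "lastName", "grade"])


def groups_sharing(records, attr):
    """Return lists of records that share the same value of `attr`.

    Hypothetical helper: sorts by the attribute (O(n log n)), then lets
    groupby collect adjacent records with equal keys. Groups of size one
    are dropped because those records have no match.
    """
    key = attrgetter(attr)
    grouped = (list(group) for _, group in groupby(sorted(records, key=key), key=key))
    return [group for group in grouped if len(group) > 1]


students = [
    Student("James", "Smith", "6"),
    Student("Jill", "Jones", "6"),
    Student("Bill", "Deep", "5"),
    Student("Bill", "Sun", "5"),
]

# Any iterable works as input, including a one-shot iterator or generator.
same_first_names = groups_sharing(iter(students), "firstName")
for group in same_first_names:
    print([s.firstName + " " + s.lastName for s in group])
# prints ['Bill Deep', 'Bill Sun']
```

Because the sort and the grouping use the same key function, every record with a given value ends up in exactly one group, which is what replaces the inner loop of the nested version.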

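More generally, it looks like you are trying to do what a database-backed ORM does. Rather than rolling your own, use one of the many existing ORMs. See "What are some good Python ORM solutions?" for a list. They will be better optimized and more robust than code you write yourself.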
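
As a rough illustration of pushing the matching into a database, here is a sketch that uses the standard library's sqlite3 module directly rather than an ORM (an ORM would issue comparable SQL behind its own API). The table name students, its column names, and the helper find_shared_first_names are assumptions made for illustration; the rows are the sample data from the question. The sketch uses an in-memory database; passing a file path to sqlite3.connect instead would keep a large dataset on disk.

```python
import sqlite3


def find_shared_first_names(rows):
    """Load (first, last, grade) rows into SQLite and self-join on first name.

    Hypothetical helper. executemany accepts any iterable, so a generator
    can be passed in without building a list first.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE students (first_name TEXT, last_name TEXT, grade TEXT)"
        )
        conn.executemany("INSERT INTO students VALUES (?, ?, ?)", rows)
        # An index on the join column can let SQLite look up matches
        # rather than scan the whole table for every row.
        conn.execute("CREATE INDEX idx_first_name ON students (first_name)")
        cursor = conn.execute(
            """
            SELECT a.first_name, a.last_name, b.last_name, b.grade
            FROM students AS a
            JOIN students AS b
              ON a.first_name = b.first_name AND a.rowid <> b.rowid
            ORDER BY a.rowid, b.rowid
            """
        )
        return cursor.fetchall()
    finally:
        conn.close()


# A generator stands in for the large, iterator-backed data source.
rows = (r for r in [
    ("James", "Smith", "6"),
    ("Jill", "Jones", "6"),
    ("Bill", "Deep", "5"),
    ("Bill", "Sun", "5"),
])

matches = find_shared_first_names(rows)
for first, last, other_last, other_grade in matches:
    print("%s %s: %s %s in grade %s has the same first name"
          % (first, last, first, other_last, other_grade))
```

The self-join pairs each student with every other student who has the same first name, which is the same result the nested loops compute, but the database decides how to execute it.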

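For this topic (python - How to filter a large dataset for records that match a particular key while iterating over every element of the dataset), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/10099928/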
