
python - Find all unique attribute sets of a given length in a CSV whose support exceeds a given threshold

Reposted. Author: 行者123. Updated: 2023-12-01 07:18:40

So I have this dataset: https://s3.amazonaws.com/istreet-questions-us-east-1/443605/census.csv

age=Middle-aged,sex=Male,education=Bachelors,native-country=United-States,race=White,marital-status=Never-married,workclass=State-gov,occupation=Adm-clerical,hours-per-week=Full-time,income=Small,capital-gain=Low,capital-loss=None
age=Senior,sex=Male,education=Bachelors,native-country=United-States,race=White,marital-status=Married-civ-spouse,workclass=Self-emp-not-inc,occupation=Exec-managerial,hours-per-week=Part-time,income=Small,capital-gain=None,capital-loss=None
age=Middle-aged,sex=Male,education=HS-grad,native-country=United-States,race=White,marital-status=Divorced,workclass=Private,occupation=Handlers-cleaners,hours-per-week=Full-time,income=Small,capital-gain=None,capital-loss=None

30,000 rows

There are 12 variables, and I want to write a function that takes 2 inputs (NumberOfAttributes, SupportThreshold).

For example, for the input (4, .6), I want every combination of 4 attribute values that accounts for at least 60% of the whole dataset.
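To make the threshold concrete, here is a minimal sketch of the check applied to the row count of a single combination. The helper name `meets_support` and the two counts are hypothetical; only the 30,000 row total comes from the question.

```python
def meets_support(count, total_rows, support_threshold):
    """Return True if `count` rows make up at least `support_threshold` of the data."""
    return count >= support_threshold * total_rows

total_rows = 30000  # dataset size stated in the question

# With a 0.6 threshold, a combination has to cover at least 18,000 rows.
print(meets_support(18500, total_rows, 0.6))  # True
print(meets_support(17000, total_rows, 0.6))  # False
```

Note that the code in the question uses a strict `>` comparison, so a combination that lands exactly on the threshold would be excluded there.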

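I found a solution, but it is too resource-intensive. When I try to submit it, it reports that it exceeds the allowed computation time.

Here is my code: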

def attributesSet(numberOfAttributes, supportThreshold):
    import csv
    import pandas as pd
    import itertools
    import math

    names = ['age','sex','education','country','race','status','workclass','occupation','hours-per-week','income','capital-gain','capital-loss']
    combinations = []
    final = []
    for comb in itertools.combinations(names,numberOfAttributes):
        combinations.append(list(comb))
    url = "https://s3.amazonaws.com/istreet-questions-us-east-1/443605/census.csv"
    c = pd.read_csv(url)
    c.columns= names
    total = len(c.index)
    required = math.ceil(supportThreshold*total)

    for i in combinations:
        g = c.groupby(i).size().sort_values(ascending=False)
        g
        groups = g[g>required].index
        satisfied = list(groups)
        for j in satisfied:
            final.append(','.join(j))

    return final

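Basically, it builds a list of all attribute combinations of the given length and, for each one, creates a pandas Series that shows every combination of attribute values together with its count.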
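The approach described above can be sketched on a toy table without pandas. The rows, the column names, and the function name `frequent_sets` are made up for illustration, and `collections.Counter` stands in for `groupby(...).size()`; this is a sketch of the idea, not the asker's exact code.

```python
from collections import Counter
from itertools import combinations

# Toy rows (hypothetical data, not taken from census.csv).
rows = [
    {"sex": "sex=Male", "race": "race=White", "income": "income=Small"},
    {"sex": "sex=Male", "race": "race=White", "income": "income=Large"},
    {"sex": "sex=Female", "race": "race=White", "income": "income=Small"},
    {"sex": "sex=Male", "race": "race=Black", "income": "income=Small"},
]

def frequent_sets(rows, number_of_attributes, support_threshold):
    names = list(rows[0])
    required = support_threshold * len(rows)
    result = []
    # Try every combination of columns of the requested length.
    for comb in combinations(names, number_of_attributes):
        # Count how often each tuple of values occurs in those columns.
        counts = Counter(tuple(row[name] for name in comb) for row in rows)
        for values, count in counts.items():
            if count > required:  # strict comparison, as in the question's code
                result.append(",".join(values))
    return result

print(frequent_sets(rows, 2, 0.4))
# ['sex=Male,race=White', 'sex=Male,income=Small', 'race=White,income=Small']
```

Every column combination triggers a full pass over the data, which is why the cost grows quickly with 12 columns and 30,000 rows.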

Sample input:

2
0.8

Sample output:

race=White,capital-loss=None

native-country=United-States,race=White

native-country=United-States,capital-loss=None

native-country=United-States,capital-gain=None

capital-gain=None,capital-loss=None

All combinations of 2 attribute values that make up more than 80% of the dataset.

There must be a less resource-intensive way that I am not seeing.

Best answer

There are two problems in the code.

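  1. The computation time is exceeded because you are loading the dataset from a URL. Instead, read census.csv from the current directory.
  2. When the tuple has only one element (a single group name), the ','.join(j) in your code adds stray ',' characters.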
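The join problem can be reproduced in isolation. The sketch below assumes that a group label for a single grouping column arrives as a plain string rather than a tuple; `label_to_row` is a hypothetical helper, not something from the original answer.

```python
# A label for several grouping columns is a tuple of strings.
print(",".join(("race=White", "capital-loss=None")))  # race=White,capital-loss=None

# A label for a single column is a plain string, so join() walks its characters.
print(",".join("race=White"))  # r,a,c,e,=,W,h,i,t,e

def label_to_row(label):
    """Format a group label whether it is a tuple of strings or one string."""
    return label if isinstance(label, str) else ",".join(label)

print(label_to_row("race=White"))  # race=White
```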

Here is a working example.

def attributesSet(numberOfAttributes, supportThreshold):
    import pandas as pd
    import itertools

    names = ['age','sex','education','country','race','status','workclass','occupation','hours-per-week','income','capital-gain','capital-loss']
    final = []
    # Read the local copy of the dataset instead of downloading it on every call.
    c = pd.read_csv('census.csv')
    c.columns = names
    total = len(c.index)
    # Number of rows a combination has to exceed.
    required = supportThreshold*total

    for comb in itertools.combinations(names, numberOfAttributes):
        # Count the rows for each combination of values in these columns.
        g = c.groupby(list(comb)).size().sort_values(ascending=False)
        for j in g[g > required].index:
            # With a single grouping column the label is a plain string,
            # not a tuple, so join only when there are several values.
            final.append(j if isinstance(j, str) else ','.join(j))
    return final
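Beyond reading the file locally, the counting itself can be reduced. The following is a rough sketch of Apriori-style pruning, which is not part of the original answer: a set of values can only exceed the threshold if each of its individual values does, so infrequent values and columns can be dropped before any combinations are counted. The function name `frequent_sets_pruned` and the toy rows are hypothetical, and rows are assumed to be tuples of strings rather than a DataFrame.

```python
from collections import Counter
from itertools import combinations

def frequent_sets_pruned(rows, number_of_attributes, support_threshold):
    """rows: list of tuples holding one string value per column."""
    required = support_threshold * len(rows)
    n_columns = len(rows[0])

    # Pass 1: find, per column, the single values that exceed the threshold.
    frequent_values = []
    for i in range(n_columns):
        counts = Counter(row[i] for row in rows)
        frequent_values.append({v for v, c in counts.items() if c > required})

    # A column without any frequent value cannot be part of a frequent set.
    candidate_columns = [i for i in range(n_columns) if frequent_values[i]]

    result = []
    for comb in combinations(candidate_columns, number_of_attributes):
        counts = Counter()
        for row in rows:
            values = tuple(row[i] for i in comb)
            # Ignore rows that contain a value already known to be infrequent.
            if all(v in frequent_values[i] for i, v in zip(comb, values)):
                counts[values] += 1
        result.extend(",".join(v) for v, c in counts.items() if c > required)
    return result

# Toy rows (hypothetical data, not taken from census.csv).
toy_rows = [
    ("sex=Male", "race=White", "income=Small"),
    ("sex=Male", "race=White", "income=Large"),
    ("sex=Female", "race=White", "income=Small"),
    ("sex=Male", "race=Black", "income=Small"),
]
print(frequent_sets_pruned(toy_rows, 2, 0.4))
# ['sex=Male,race=White', 'sex=Male,income=Small', 'race=White,income=Small']
```

With a high threshold such as 0.8, most columns are likely to have no value that frequent, so far fewer column combinations would need to be counted. Whether this is enough to meet the time limit has not been verified here.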

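Regarding "python - Find all unique attribute sets of a given length in a CSV whose support exceeds a given threshold", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57817074/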
