python - KeyError (in pandas.index.IndexEngine.get_loc) when oversampling a dataset with the UnbalancedDataset package

Reposted. Author: 太空宇宙. Updated: 2023-11-03 16:33:54
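I am trying to oversample my data with UnbalancedDataset. Following the sklearn convention, I have X and y as the feature matrix and target vector. They are of type pandas.core.frame.DataFrame, with shapes (200000, 17) and (200000,) respectively.

I first split the data with sklearn's train_test_split. I then apply the SMOTE method to oversample the training set, which raises the following error: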

---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
C:\Users\...\Anaconda3\lib\site-packages\pandas\indexes\base.py in get_loc(self, key, method, tolerance)
1944 try:
-> 1945 return self._engine.get_loc(key)
1946 except KeyError:

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)()

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)()

pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)()

pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)()

KeyError: 1143

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
<ipython-input-99-1c5830417b3f> in <module>()
6 # 'SMOTE'
7 SM = SMOTE(ratio=ratio, verbose=verbose, kind='regular')
----> 8 smx, smy = SM.fit_transform(Xtrain, ytrain)

C:\Users\...\Anaconda3\lib\site-packages\unbalanceddataset-0.1-py3.5.egg\unbalanced_dataset\unbalanced_dataset.py in fit_transform(self, x, y)
274 return self.out_x, self.out_y, self.out_idx
275 else:
--> 276 self.out_x, self.out_y = self.resample()
277
278 return self.out_x, self.out_y

C:\Users\...\Anaconda3\lib\site-packages\unbalanceddataset-0.1-py3.5.egg\unbalanced_dataset\over_sampling.py in resample(self)
358 step_size=1.0,
359 random_state=self.rs,
--> 360 verbose=self.verbose)
361
362 if self.verbose:

C:\Users\...\Anaconda3\lib\site-packages\unbalanceddataset-0.1-py3.5.egg\unbalanced_dataset\unbalanced_dataset.py in make_samples(x, nn_data, y_type, nn_num, n_samples, step_size, random_state, verbose)
388
389 # Construct synthetic sample
--> 390 new[i] = x[row] - step * (x[row] - nn_data[nn_num[row, col]])
391
392 # The returned target vector is simply a repetition of the

C:\Users\...\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
1995 return self._getitem_multilevel(key)
1996 else:
-> 1997 return self._getitem_column(key)
1998
1999 def _getitem_column(self, key):

C:\Users\...\Anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
2002 # get column
2003 if self.columns.is_unique:
-> 2004 return self._get_item_cache(key)
2005
2006 # duplicate columns & possible reduce dimensionality

C:\Users\...\Anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
1348 res = cache.get(item)
1349 if res is None:
-> 1350 values = self._data.get(item)
1351 res = self._box_item_values(item, values)
1352 cache[item] = res

C:\Users\...\Anaconda3\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
3288
3289 if not isnull(item):
-> 3290 loc = self.items.get_loc(item)
3291 else:
3292 indexer = np.arange(len(self.items))[isnull(self.items)]

C:\Users\...\Anaconda3\lib\site-packages\pandas\indexes\base.py in get_loc(self, key, method, tolerance)
1945 return self._engine.get_loc(key)
1946 except KeyError:
-> 1947 return self._engine.get_loc(self._maybe_cast_indexer(key))
1948
1949 indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4154)()

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4018)()

pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12368)()

pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)()

KeyError: 1143

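I get this error even though all of UnbalancedDataset's under-sampling methods work fine on the same data. Any suggestions for handling the over-sampling problem?

Update:

As glemaitre pointed out, the pandas DataFrame needs to be converted to a NumPy array to fix this. The following conversion resolves the problem: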

Xc = Xtrain.as_matrix()
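
Note for readers on newer pandas: `DataFrame.as_matrix()` was deprecated in pandas 0.23 and removed in 1.0, so the call above only works on older installations. A minimal sketch of the same conversion with `to_numpy()` follows. The small `Xtrain` and `ytrain` frames are hypothetical stand-ins for the real training split, not the data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the training split. The shuffled index mimics
# what train_test_split leaves behind on a DataFrame.
Xtrain = pd.DataFrame(
    {"f0": [0.1, 0.2, 0.3], "f1": [1.0, 2.0, 3.0]},
    index=[7, 2, 5],
)
ytrain = pd.Series([0, 1, 0], index=[7, 2, 5])

# to_numpy() is available from pandas 0.24 onward; on older versions the
# .values attribute returns the same underlying array.
Xc = Xtrain.to_numpy()
yc = ytrain.to_numpy()

print(type(Xc), Xc.shape, yc.shape)
```

The resulting `Xc` and `yc` are plain arrays addressed by position, so the pandas index labels no longer play any role when they are passed to the sampler.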

Best answer

UnbalancedDataset requires NumPy arrays. Try passing one to the function and see if it works.

Cheers
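
The traceback suggests why a DataFrame fails here: `make_samples` evaluates `x[row]` with an integer `row`, and on a DataFrame that expression goes through `__getitem__` and is treated as a column lookup, not a row lookup. The sketch below illustrates the difference on a small made-up frame. The names `X_df`, `X_np`, and `row` are illustrative only and do not come from the library.

```python
import numpy as np
import pandas as pd

# Made-up 4x3 feature matrix with string column labels.
X_df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=["a", "b", "c"])
X_np = X_df.to_numpy()

row = 2

# On a DataFrame, X_df[row] looks for a column labelled 2, which does not
# exist here, so pandas raises KeyError.
try:
    X_df[row]
    raised = False
except KeyError:
    raised = True

# On a NumPy array, the same expression selects the row at position 2.
third_row = X_np[row]

print(raised, third_row)
```

This would also explain why the error appears only at the point where synthetic samples are built from individual rows.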

For "python - KeyError (in pandas.index.IndexEngine.get_loc) when oversampling a dataset with the UnbalancedDataset package", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/37380831/