Balancing Data for Multiple-Instance Learning with Unbalanced Classes
Problem Statement (Simplified):
I have a CSV file in which each row is labeled as either class A or class B. Class A has 906 instances and class B has 255. I want to classify the data with this multiple-instance learning (MIL) classifier: https://github.com/garydoranjr/misvm. However, the data is clearly imbalanced (roughly 3.6 to 1).
Additional Details:
I'm analyzing time-series patterns of specific activities, in particular brain activity. Each row in the CSV file represents one 5-second window of a single instance. The experiment lasts n seconds in total, which gives approximately n/5 5-second windows per instance, with a 1-second shift between consecutive windows (ignore this detail if the concept is unfamiliar). The total number of rows in the CSV file is therefore roughly:
Total Rows = 906 * (n/5) + 255 * (n/5)
Question:
I'm considering duplicating the class B rows a fixed number of times (e.g., 3 times) to balance the dataset. Is this a valid approach? Are there other ways to tackle this kind of problem? Thanks in advance!
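To make the duplication idea concrete, here is a minimal sketch of random oversampling in plain Python. It is only an illustration under stated assumptions: the `oversample_minority` helper and the toy `rows` are hypothetical, the real features would come from the CSV file, and the sketch works on individual rows rather than on MIL bags. If something like this were used, it would normally be applied to the training split only, so that copies of the same row do not end up in both training and test data.

```python
import random
from collections import Counter


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]

    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        if label == majority_label:
            continue
        # All original rows belonging to this minority class.
        pool = [r for r, l in zip(rows, labels) if l == label]
        # Sample with replacement to fill the gap to the majority count.
        extra = rng.choices(pool, k=majority_count - count)
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels


# Toy stand-in data using the class counts from the question.
labels = ["A"] * 906 + ["B"] * 255
rows = [[float(i)] for i in range(len(labels))]

balanced_rows, balanced_labels = oversample_minority(rows, labels)
print(Counter(balanced_labels))
```

Sampling with replacement up to the majority count is used here instead of a fixed "3 times" copy, because 255 × 3 = 765 would still leave class B smaller than class A.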