
pandas - Formatting SurveyMonkey data with Pandas

Reposted. Author: 行者123. Last updated: 2023-12-04 01:57:30

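I have a survey to analyze that participants completed through SurveyMonkey. Unfortunately, the data is not organized in an ideal way: every categorical response to every question gets its own column.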

For example, here are the first few rows of one of the responses in the DataFrame:

     How long have you been participating in the Garden Awards Program?  \
0 One year
1 NaN
2 NaN
3 NaN
4 NaN

Unnamed: 10 Unnamed: 11 Unnamed: 12 \
0 2-3 years 4-5 years 5 or more years
1 NaN NaN NaN
2 NaN 4-5 years NaN
3 2-3 years NaN NaN
4 NaN NaN 5 or more years

How did you initially learn of the Garden Awards Program? \
0 I nominated my garden to be evaluated
1 NaN
2 I nominated my garden to be evaluated
3 NaN
4 NaN

Unnamed: 14 etc...
0 A friend or family member nominated my garden ...
1 A friend or family member nominated my garden ...
2 NaN
3 NaN
4 NaN

The question How long have you been participating in the Garden Awards Program? has the valid responses One year, 2-3 years, and so on, all of which appear in the first row as a key to which column holds which value. That is the first problem. (The same goes for How did you initially learn of the Garden Awards Program?, whose valid responses are I nominated my garden to be evaluated, A friend or family member nominated my garden, and so on.)

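The second problem is that the additional columns for each categorical response are named Unnamed: N, where N runs over as many columns as there are categories across all questions.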

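Before I start remapping and flattening/collapsing the columns for each question, I was wondering whether there is another way to handle survey data presented like this with Pandas. All of my searches point to the SurveyMonkey API, but I don't see how that would help.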

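I'm guessing I will need to flatten the columns, so if anyone could suggest a method, that would be great. I think there is a way to grab all the columns belonging to a categorical response by taking adjacent columns until Unnamed no longer appears in the column name, but I don't know how to do that.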

Best answer

I will use the following DataFrame (which can be downloaded as a CSV file from here):

     Q1 Unnamed: 2 Unnamed: 3    Q2 Unnamed: 5 Unnamed: 6    Q3 Unnamed: 7 Unnamed: 8
0 A1-A A1-B A1-C A2-A A2-B A2-C A3-A A4-B A3-C
1 A1-A NaN NaN NaN A2-B NaN NaN NaN A3-C
2 NaN A1-B NaN A2-A NaN NaN NaN A4-B NaN
3 NaN NaN A1-C NaN A2-B NaN A3-A NaN NaN
4 NaN A1-B NaN NaN NaN A2-C NaN NaN A3-C
5 A1-A NaN NaN NaN A2-B NaN A3-A NaN NaN

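Key assumptions:
  • Every column whose name does not start with Unnamed is actually the title of a question
  • The columns between two question titles hold the options for the question at the left end of that column interval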
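
To illustrate where the Unnamed columns in the first assumption come from, here is a minimal sketch, assuming the export is loaded with pd.read_csv. The CSV text and the names raw and demo are made up for illustration; they are not the real SurveyMonkey file. pandas labels each blank header cell Unnamed: N, where N is the zero-based column position, so the cells that keep a real name are the question titles.

```python
import io

import pandas as pd

# Hypothetical miniature export: the question text sits in the first header
# cell of each group, and the other header cells of that group are blank.
raw = (
    "Q1,,,Q2,,\n"
    "A1-A,A1-B,A1-C,A2-A,A2-B,A2-C\n"
    "A1-A,,,,A2-B,\n"
    ",A1-B,,A2-A,,\n"
)

demo = pd.read_csv(io.StringIO(raw))

# Blank header cells come back as "Unnamed: <position>".
columns = list(demo.columns)
# Anything that is not an "Unnamed" placeholder is treated as a question title.
question_headers = [c for c in columns if not c.startswith("Unnamed")]

print(columns)
print(question_headers)
```

If a real question title happened to start with the word Unnamed, this check would misclassify it, so it is worth eyeballing question_headers before going further.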

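Solution outline:
  • Find the indices where each question starts and ends
  • Flatten each question into a single column (pd.Series)
  • Merge the question columns back together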

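Implementation (Part 1):
    # Positions and names of the columns that are question titles
    indices = [i for i, c in enumerate(df.columns) if not c.startswith('Unnamed')]
    questions = [c for c in df.columns if not c.startswith('Unnamed')]
    # One slice per question; the last one runs to the end of the columns
    slices = [slice(i, j) for i, j in zip(indices, indices[1:] + [None])]

As you can see by iterating over the slices as below, you get one DataFrame per question:
    for q in slices:
        print(df.iloc[:, q])  # Use `display` if using Jupyter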
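
As a quick worked check of the slicing step, the sketch below applies the same two comprehensions to a plain list holding the column names of the example DataFrame, so it runs without the CSV file. The variable grouped is introduced here only for illustration.

```python
# Column names copied from the example DataFrame above.
columns = [
    "Q1", "Unnamed: 2", "Unnamed: 3",
    "Q2", "Unnamed: 5", "Unnamed: 6",
    "Q3", "Unnamed: 7", "Unnamed: 8",
]

# Positions of the question titles.
indices = [i for i, c in enumerate(columns) if not c.startswith("Unnamed")]

# Pair each start with the next start; the last question runs to the end (None).
slices = [slice(i, j) for i, j in zip(indices, indices[1:] + [None])]

# Apply the slices to the names to see which columns each question owns.
grouped = [columns[s] for s in slices]

print(indices)
for group in grouped:
    print(group)
```

Each slice starts at a question title and stops just before the next one, which is why appending None to the shifted index list is enough to capture the final question.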

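Implementation (Parts 2-3):
    import numpy as np
    import pandas as pd

    def parse_response(s):
        # Return the first non-null value in the row, or NaN if every option is empty
        try:
            return s[~s.isnull()].iloc[0]
        except IndexError:
            return np.nan

    # Flatten each question row by row; [1:] drops row 0, which only lists the options
    data = [df.iloc[:, q].apply(parse_response, axis=1)[1:] for q in slices]
    df = pd.concat(data, axis=1)
    df.columns = questions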

Output:
         Q1    Q2    Q3
    1 A1-A A2-B A3-C
    2 A1-B A2-A A4-B
    3 A1-C A2-B A3-A
    4 A1-B A2-C A3-C
    5 A1-A A2-B A3-A
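
For readers who want to try this without the CSV file, here is a self-contained sketch that rebuilds the example DataFrame by hand and flattens it. It is a variant, not the code from the answer: instead of apply with parse_response, it back-fills along each row with DataFrame.bfill(axis=1) and takes the first column. The helper name flatten_questions and the variables sample and tidy are hypothetical.

```python
import numpy as np
import pandas as pd

N = np.nan  # shorthand for an empty cell

columns = [
    "Q1", "Unnamed: 2", "Unnamed: 3",
    "Q2", "Unnamed: 5", "Unnamed: 6",
    "Q3", "Unnamed: 7", "Unnamed: 8",
]
rows = [
    ["A1-A", "A1-B", "A1-C", "A2-A", "A2-B", "A2-C", "A3-A", "A4-B", "A3-C"],
    ["A1-A", N, N, N, "A2-B", N, N, N, "A3-C"],
    [N, "A1-B", N, "A2-A", N, N, N, "A4-B", N],
    [N, N, "A1-C", N, "A2-B", N, "A3-A", N, N],
    [N, "A1-B", N, N, N, "A2-C", N, N, "A3-C"],
    ["A1-A", N, N, N, "A2-B", N, "A3-A", N, N],
]
sample = pd.DataFrame(rows, columns=columns)


def flatten_questions(frame):
    """Hypothetical helper: collapse each question's option columns into one."""
    starts = [i for i, c in enumerate(frame.columns) if not c.startswith("Unnamed")]
    flat = {}
    for i, j in zip(starts, starts[1:] + [None]):
        # Skip row 0, which only holds the option labels for each column.
        block = frame.iloc[1:, i:j]
        # Back-fill along each row so the first column holds the first
        # non-null answer (or stays NaN when the respondent skipped the question).
        flat[frame.columns[i]] = block.bfill(axis=1).iloc[:, 0]
    return pd.DataFrame(flat)


tidy = flatten_questions(sample)
print(tidy)
```

Like the answer's version, this keeps only the first non-null value per question, so it assumes single-choice questions; a respondent who ticked several options would lose all but the leftmost one.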

Regarding pandas - Formatting SurveyMonkey data with Pandas, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49580883/
