apache-spark - Searching for keywords in a DataFrame


I have a Spark DataFrame and a list of "keywords".
For 4 columns I need to check whether the value is in the list and, if it is, fill a new column "result" with a specific label (not necessarily the column name).
Then I need to search all remaining columns; when one of those matches, the result should be "other".

Example DataFrame:

df = spark.createDataFrame([
["apple", "Null","Null","alcatel","Aalst","123","01-01-2016","blu"],
["apple", "apple","Lorem ipsum dolor sit amet","Null","Excepteur sint occaecat","543","07-12-2010","cat"],
["asus","apple","nisi ut aliquid ex ea commodi consequatur?","","Null","578","06-04-2020","htc"],
["samsung","fugiat quo voluptas nulla pariatur","apple","Null","Antwerp","285","04-08-2018","asus"],
["sony","magni dolores","Null","asus","quis nostrud exercitation","386","06-06-2009","huawei"],
["vivo","laborum","Null","Veriatis","adipisci ","389","23-12-2005","oppo"],
["alcatel","laboriosam","Contains Apple","Null","Asus","104","02-03-2018","zte"],
["sharp","null","null","apple","Asus","333","07-09-2017","alcatel"]
]).toDF("a-val","b-val","c-val","d-val","e-val","f-val","g-val","h-val")



from pyspark.sql.functions import col, when, concat, lit

keywords = ['apple', 'asus', 'alcatel']
df.withColumn('result', when(col('a-val').isin(keywords), concat(lit('a'), col('result'))))
df.withColumn('result', when(col('b-val').isin(keywords), concat(lit('b'), col('result'))))
df.withColumn('result', when(col('c-val').isin(keywords), concat(lit('c'), col('result'))))
df.withColumn('result', when(col('d-val').isin(keywords), concat(lit('d'), col('result'))))

Possible results:
    result
-------
a
b
c
d
a;b
b;d
a;c;d
a;other
c;d;other
...

I'm not sure whether concat is the ideal way to do this, or whether it would be better to build a list first and append to it.
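For reference, a rough sketch of how the "build a list first" idea could look (just an illustration, untested; the names label_map, matched_labels and df_labelled are made up here, and it assumes PySpark's array/array_join functions). It only covers the four labelled columns, not the "other" handling for the remaining ones:

import pyspark.sql.functions as F

keywords = ['apple', 'asus', 'alcatel']
label_map = {'a-val': 'a', 'b-val': 'b', 'c-val': 'c', 'd-val': 'd'}

# Collect the labels into an array column: each entry is the label
# when its column matches a keyword, otherwise null.
matched_labels = F.array(*[
    F.when(F.col(c).isin(keywords), F.lit(label))
    for c, label in label_map.items()
])

# array_join drops null entries, so only the matched labels are joined.
df_labelled = df.withColumn('result', F.array_join(matched_labels, ';'))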

Searching column by column works, but I can't manage to merge the results and search the remaining columns.

I would really appreciate any help!

Best answer

IIUC this can be done as follows:

  • You can create a dictionary mapping each column to its result value:
  • evalCol={i:i[0] if i.startswith(('a','b','c','d')) else 'other' for i in df.columns}

    {'a-val': 'a',
    'b-val': 'b',
    'c-val': 'c',
    'd-val': 'd',
    'e-val': 'other',
    'f-val': 'other',
    'g-val': 'other',
    'h-val': 'other'}
  • Then use it to filter the values and concatenate the columns:
  • keywords = ['apple', 'asus', 'alcatel']

    import pyspark.sql.functions as f

    df.withColumn('result', f.concat_ws(';', *[
        f.when(f.col(k).isin(keywords), v).otherwise(None)
        for k, v in evalCol.items()
    ])).show(10, False)

    +-------+----------------------------------+------------------------------------------+--------+-------------------------+-----+----------+-------+-------+
    |a-val  |b-val                             |c-val                                     |d-val   |e-val                    |f-val|g-val     |h-val  |result |
    +-------+----------------------------------+------------------------------------------+--------+-------------------------+-----+----------+-------+-------+
    |apple  |Null                              |Null                                      |alcatel |Aalst                    |123  |01-01-2016|blu    |a;d    |
    |apple  |apple                             |Lorem ipsum dolor sit amet                |Null    |Excepteur sint occaecat  |543  |07-12-2010|cat    |a;b    |
    |asus   |apple                             |nisi ut aliquid ex ea commodi consequatur?|        |Null                     |578  |06-04-2020|htc    |a;b    |
    |samsung|fugiat quo voluptas nulla pariatur|apple                                     |Null    |Antwerp                  |285  |04-08-2018|asus   |c;other|
    |sony   |magni dolores                     |Null                                      |asus    |quis nostrud exercitation|386  |06-06-2009|huawei |d      |
    |vivo   |laborum                           |Null                                      |Veriatis|adipisci                 |389  |23-12-2005|oppo   |       |
    |alcatel|laboriosam                        |Contains Apple                            |Null    |Asus                     |104  |02-03-2018|zte    |a      |
    |sharp  |null                              |null                                      |apple   |Asus                     |333  |07-09-2017|alcatel|d;other|
    +-------+----------------------------------+------------------------------------------+--------+-------------------------+-----+----------+-------+-------+
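The reason the non-matching columns simply drop out of result (instead of leaving empty ; segments) is that concat_ws ignores null inputs, so the .otherwise(None) branches contribute nothing. A quick check, just for illustration and assuming an active spark session:

import pyspark.sql.functions as f

# concat_ws skips nulls, so 'a', NULL, 'd' joins to 'a;d'
spark.range(1).select(
    f.concat_ws(';', f.lit('a'), f.lit(None).cast('string'), f.lit('d')).alias('joined')
).show()
# the single output row shows joined = 'a;d'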


Hope this helps.

Regarding apache-spark - searching for keywords in a DataFrame, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62256225/
