
java - How do I flatMap one DataFrame from another DataFrame in Java?

Reposted. Author: 行者123. Updated: 2023-12-02 10:07:17

I have a DataFrame that looks like this:

+------+----------------+
| id   | document       |
+------+----------------+
| doc1 | "word1, word2" |
| doc2 | "word3 word4"  |
+------+----------------+

I want to create another DataFrame with the following structure:

+------+----------------+-------+
| id   | document       | word  |
+------+----------------+-------+
| doc1 | "word1, word2" | word1 |
| doc1 | "word1 word2"  | word2 |
| doc2 | "word3 word4"  | word3 |
| doc2 | "word3 word4"  | word4 |
+------+----------------+-------+

I tried the following:

public static Dataset<Row> buildInvertIndex(Dataset<Row> inputRaw, SQLContext sqlContext, String id) {
    JavaRDD<Row> inputInvertedIndex = inputRaw.javaRDD();

    // Emit one (id, document, word) triple per space-separated token.
    JavaRDD<Tuple3<String, String, String>> d = inputInvertedIndex.flatMap(x -> {
        List<Tuple3<String, String, String>> k = new ArrayList<>();
        String data2 = x.getString(0);
        String[] field2 = x.getString(1).split(" ", -1);
        for (String s : field2) {
            k.add(new Tuple3<>(data2, x.getString(1), s));
        }
        return k.iterator();
    });

    // Re-key by word: (word, (id, document)).
    JavaPairRDD<String, Tuple2<String, String>> d2 = d.mapToPair(x ->
        new Tuple2<>(x._3(), new Tuple2<>(x._1(), x._2())));

    Dataset<Row> d3 = sqlContext.createDataset(
        JavaPairRDD.toRDD(d2),
        Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(), Encoders.STRING()))).toDF();

    return d3;
}

But it gives:

+-------+------------------------+
| _1    | _2                     |
+-------+------------------------+
| word1 | [doc1,"word1, word2"]  |
| word2 | [doc1,"word1 word2"]   |
| word3 | [doc2, "word3, word4"] |
| word4 | [doc2, "word3, word4"] |
+-------+------------------------+

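I am new to Spark in Java, so any help would be appreciated. Also, suppose that in the second DataFrame above I want to compute a string similarity measure (e.g. Jaccard) between the two columns document and word and add the result as a new column. How can I do that?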

Best answer

You can use explode together with split:

import static org.apache.spark.sql.functions.expr;
inputRaw.withColumn("word", expr("explode(split(document, '[, ]+'))"));
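To see what that expression does to each row, here is a minimal plain-Java sketch that needs no Spark. The class `ExplodeSketch` and the helper `explodeRow` are hypothetical names used only for illustration. It assumes the same regex `[, ]+` as the answer, and that a split limit of -1 matches how Spark's `split` treats trailing empty strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java illustration (no Spark) of explode(split(document, '[, ]+')) for one row.
class ExplodeSketch {
    // Same regex as in the answer: a run of commas and/or spaces is one delimiter.
    private static final Pattern DELIMITER = Pattern.compile("[, ]+");

    // Hypothetical helper: returns one {id, document, word} triple per token.
    static List<String[]> explodeRow(String id, String document) {
        List<String[]> rows = new ArrayList<>();
        // Limit -1 keeps trailing empty strings (assumed to match Spark's split).
        for (String word : DELIMITER.split(document, -1)) {
            // The original columns are repeated; only the word column varies.
            rows.add(new String[] {id, document, word});
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : explodeRow("doc1", "word1, word2")) {
            System.out.println(String.join(" | ", row));
        }
        for (String[] row : explodeRow("doc2", "word3 word4")) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Because the delimiter is a character class with `+`, both "word1, word2" and "word3 word4" split into two words, which is why the original attempt with `split(" ", -1)` would leave a trailing comma on `word1,`.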

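The answer does not cover the follow-up about a similarity measure. One possible sketch, under stated assumptions, is a token-level Jaccard index: both strings are split on the same `[, ]+` delimiter, and the score is the size of the intersection of the two token sets divided by the size of their union. The class `JaccardSketch` and its methods are hypothetical, and treating two empty token sets as similarity 1.0 is a convention chosen here, not something the question specifies.

```java
import java.util.HashSet;
import java.util.Set;

// Plain-Java sketch of a token-level Jaccard similarity between two strings.
class JaccardSketch {
    // Hypothetical helper: split on commas/spaces and drop empty tokens.
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.split("[, ]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Jaccard index: |A intersect B| / |A union B| over the two token sets.
    static double jaccard(String a, String b) {
        Set<String> left = tokens(a);
        Set<String> right = tokens(b);
        if (left.isEmpty() && right.isEmpty()) {
            return 1.0; // Assumed convention for two empty inputs.
        }
        Set<String> intersection = new HashSet<>(left);
        intersection.retainAll(right);
        Set<String> union = new HashSet<>(left);
        union.addAll(right);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(jaccard("word1, word2", "word1"));
        System.out.println(jaccard("word3 word4", "word5"));
    }
}
```

In Spark, a function like this could probably be registered as a UDF (for example through `sqlContext.udf().register` with a `UDF2<String, String, Double>` and `DataTypes.DoubleType`) and then applied with `withColumn` and `callUDF` on the document and word columns. That wiring is not shown here and has not been checked against the asker's setup.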
Regarding "java - How do I flatMap one DataFrame from another DataFrame in Java?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55254289/
