
java - How to apply string operations to a Spark DataFrame in Java

Repost. Author: 行者123. Updated: 2023-12-02 11:28:24

I have a Spark DataFrame that looks like this:

+--------------------+------+----------------+-----+--------+
| Name | Sex| Ticket |Cabin|Embarked|
+--------------------+------+----------------+-----+--------+
|Braund, Mr. Owen ...| male| A/5 21171| null| S|
|Cumings, Mrs. Joh...|female| PC 17599| C85| C|
|Heikkinen, Miss. ...|female|STON/O2. 3101282| null| S|
|Futrelle, Mrs. Ja...|female| 113803| C123| S|
|Palsson, Master. ...| male| 349909| null| S|
+--------------------+------+----------------+-----+--------+

Now I need to transform the "Name" column so that it contains only the title, i.e. Mr., Mrs., Miss., or Master. The resulting column would be:

+--------------------+------+----------------+-----+--------+
| Name | Sex| Ticket |Cabin|Embarked|
+--------------------+------+----------------+-----+--------+
|Mr. | male| A/5 21171| null| S|
|Mrs. |female| PC 17599| C85| C|
|Miss. |female|STON/O2. 3101282| null| S|
|Mrs. |female| 113803| C123| S|
|Master. | male| 349909| null| S|
+--------------------+------+----------------+-----+--------+

I tried applying a substring operation:

List<String> list = Arrays.asList("Mr.", "Mrs.", "Miss.", "Master.");
Dataset<Row> categoricalDF2 = categoricalDF.filter(col("Name").isin(list.stream().toArray(String[]::new)));

But it does not seem to be that easy in Java. How can this be done in Java? Note that I am using Spark 2.2.0.
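As background: `Column.isin` compares the whole cell value for equality, so a full name such as "Braund, Mr. Owen Harris" never equals "Mr."; a substring or regular-expression match is needed instead. The title pattern itself can be sanity-checked with plain `java.util.regex` before wiring it into Spark (a minimal sketch; the `TitleCheck` class name and `extractTitle` method are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the first title token ("Mr.", "Mrs.", "Miss.",
// "Master.") out of a passenger name, or returns "Untitled" if none is found.
public class TitleCheck {
    // A title is one of the known words followed by a literal period.
    private static final Pattern TITLE = Pattern.compile("\\b(Mr|Mrs|Miss|Master)\\.");

    public static String extractTitle(String name) {
        Matcher m = TITLE.matcher(name);
        return m.find() ? m.group() : "Untitled";
    }

    public static void main(String[] args) {
        System.out.println(extractTitle("Braund, Mr. Owen Harris"));  // Mr.
        System.out.println(extractTitle("Heikkinen, Miss. Laina"));   // Miss.
        System.out.println(extractTitle("No title here"));            // Untitled
    }
}
```

In Spark 2.2 the same pattern could also be fed to the built-in `regexp_extract` function from `org.apache.spark.sql.functions`, which would avoid a UDF entirely.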

Best Answer

I finally solved the problem and can answer my own question. I extended Mohit's answer with a UDF:

import org.apache.spark.sql.api.java.UDF1;
import scala.Option;
import scala.Some;

private static final UDF1<String, Option<String>> getTitle = (String name) -> {
    if (name.contains("Mr.")) {            // If it has Mr.
        return Some.apply("Mr.");
    } else if (name.contains("Mrs.")) {    // Or if it has Mrs.
        return Some.apply("Mrs.");
    } else if (name.contains("Miss.")) {   // Or if it has Miss.
        return Some.apply("Miss.");
    } else if (name.contains("Master.")) { // Or if it has Master.
        return Some.apply("Master.");
    } else {                               // None of them.
        return Some.apply("Untitled");
    }
};

Then I had to register the preceding UDF as follows:

SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .config("spark.sql.warehouse.dir", "/home/martin/")
        .appName("Titanic")
        .getOrCreate();
Dataset<Row> df = ....
// col and callUDF come from: import static org.apache.spark.sql.functions.*;
spark.sqlContext().udf().register("getTitle", getTitle, DataTypes.StringType);
Dataset<Row> categoricalDF = df.select(
        callUDF("getTitle", col("Name")).alias("Name"),
        col("Sex"), col("Ticket"), col("Cabin"), col("Embarked"));
categoricalDF.show();

The preceding code produces the following output:

+-----+------+----------------+-----+--------+
| Name| Sex| Ticket|Cabin|Embarked|
+-----+------+----------------+-----+--------+
| Mr.| male| A/5 21171| null| S|
| Mrs.|female| PC 17599| C85| C|
|Miss.|female|STON/O2. 3101282| null| S|
| Mrs.|female| 113803| C123| S|
| Mr.| male| 373450| null| S|
+-----+------+----------------+-----+--------+
only showing top 5 rows
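As a side note, since the UDF is registered with `DataTypes.StringType`, the `scala.Option` wrapper is not strictly needed; a `UDF1<String, String>` that returns the title directly keeps the Java code free of Scala types. A minimal sketch of the same branching logic as a plain static method (the `TitleUdf` class name is hypothetical):

```java
// Hypothetical standalone version of the UDF body: the same if/else chain,
// but returning String directly instead of scala.Option<String>.
public class TitleUdf {
    public static String getTitle(String name) {
        if (name.contains("Mr.")) {
            return "Mr.";
        } else if (name.contains("Mrs.")) {
            return "Mrs.";
        } else if (name.contains("Miss.")) {
            return "Miss.";
        } else if (name.contains("Master.")) {
            return "Master.";
        } else {
            return "Untitled";
        }
    }

    public static void main(String[] args) {
        System.out.println(getTitle("Braund, Mr. Owen Harris")); // Mr.
        System.out.println(getTitle("Cumings, Mrs. John"));      // Mrs.
    }
}
```

Note that the order of the checks matters only superficially here: "Mrs." does not contain the substring "Mr." (the character after "Mr" is "s", not "."), so the branches do not shadow each other for these four titles.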

Regarding "java - How to apply string operations to a Spark DataFrame in Java", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49464706/
