
python - Counting bigrams in PySpark


I am trying to put together a bigram-counting program in PySpark that takes a text file and outputs the frequency of each proper bigram (two consecutive words within a sentence).

from pyspark.ml.feature import NGram

with use_spark_session("Bigrams") as spark:
    text_file = spark.sparkContext.textFile(text_path)
    # Split each line into sentences on ".", drop empty ones, and tokenize
    # on spaces; the constant 0 is just a dummy id column for toDF.
    sentences = text_file.flatMap(lambda line: line.split(".")) \
        .filter(lambda line: len(line) > 0) \
        .map(lambda line: (0, line.strip().split(" ")))
    sentences_df = sentences.toDF(schema=["id", "words"])
    ngram_df = NGram(n=2, inputCol="words", outputCol="bigrams").transform(sentences_df)

ngram_df.select("bigrams") now contains:

+--------------------+
| bigrams|
+--------------------+
|[April is, is the...|
|[It is, is one, o...|
|[April always, al...|
|[April always, al...|
|[April's flowers,...|
|[Its birthstone, ...|
|[The meaning, mea...|
|[April comes, com...|
|[It also, also co...|
|[April begins, be...|
|[April ends, ends...|
|[In common, commo...|
|[In common, commo...|
|[In common, commo...|
|[In years, years ...|
|[In years, years ...|
+--------------------+

So each sentence now has a list of bigrams. What remains is counting the distinct bigrams. How? Also, the whole thing still looks unnecessarily verbose, so I would be glad to see a more concise solution.

Best Answer

If you are already using the RDD API, you can just do:

# zip(xs, xs[1:]) pairs each word with its successor, producing the
# bigrams of one sentence; flatMap flattens them across sentences.
bigrams = text_file.flatMap(lambda line: line.split(".")) \
    .map(lambda line: line.strip().split(" ")) \
    .flatMap(lambda xs: (tuple(x) for x in zip(xs, xs[1:])))

bigrams.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
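
If you then want to look at the most frequent pairs, the counted RDD can be ordered; a minimal sketch (the variable name counts and the takeOrdered call are illustrative additions, not part of the original answer):

counts = bigrams.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)

# Pull the 10 most frequent bigrams to the driver for inspection.
# takeOrdered collects locally, so use it on small result sets only.
print(counts.takeOrdered(10, key=lambda kv: -kv[1]))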

Otherwise:

from pyspark.sql.functions import explode

ngram_df.select(explode("bigrams").alias("bigram")).groupBy("bigram").count()
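
To display the counts sorted by frequency, the same pipeline can be extended with an orderBy; a sketch using standard DataFrame functions (the desc import and the show call are additions for illustration):

from pyspark.sql.functions import desc, explode

# Explode each sentence's bigram list into one row per bigram,
# count occurrences, and show the most frequent ones first.
(ngram_df
    .select(explode("bigrams").alias("bigram"))
    .groupBy("bigram")
    .count()
    .orderBy(desc("count"))
    .show(truncate=False))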

Regarding python - Counting bigrams in PySpark, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50572592/
