
hadoop - Counting word frequency in a text variable with Hive

Reposted. Author: 行者123. Updated: 2023-12-02 18:57:28

I have a variable in which each row is a sentence.
Example:

-Row1 "Hey, how are you?"
-Row2 "Hey, Who is there?"

I want the output to be a count grouped by word.

Example:
Hey 2
How 1
are 1
...

I am using the split function, but I am a bit stuck. Any ideas?

Thanks!

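Best answer

This is possible in Hive. Split on non-alphabetic characters, use lateral view + explode to turn each word into its own row, then count the rows per word: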

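-- Sample data: two sentences, one per row
with your_data as(
select stack(2,
'Hey, how are you?',
'Hey, Who is there?'
) as initial_string
)

-- Lowercase each sentence, split it on runs of non-letters,
-- explode the resulting array into one row per word, then count
select w.word, count(*) cnt
from
(
select split(lower(initial_string),'[^a-zA-Z]+') words from your_data
)s lateral view explode(words) w as word
-- drop empty strings that split can produce around punctuation
where w.word!=''
group by w.word;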

Result:
word    cnt
are 1
hey 2
how 1
is 1
there 1
who 1
you 1
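
The where w.word!='' filter in the query above is needed because splitting on punctuation can leave empty strings in the array, for example after a trailing "?". One way to check this on your own Hive version is to look at the raw array for a single sentence. This is a minimal sketch, assuming a Hive version that allows select without a from clause:

```sql
-- Inspect the array that split() produces for one sentence.
-- Any empty-string elements here are what the where clause removes.
select split(lower('Hey, how are you?'), '[^a-zA-Z]+') as words;
```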

Another approach uses the sentences function, which returns an array of tokenized sentences (each sentence is an array of words):
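-- Sample data: two sentences, one per row
with your_data as(
select stack(2,
'Hey, how are you?',
'Hey, Who is there?'
) as initial_string
)

-- sentences() returns an array of sentences, each an array of words,
-- so two explodes are needed: first per sentence, then per word
select w.word, count(*) cnt
from
(
select sentences(lower(initial_string)) sentences from your_data
)d lateral view explode(sentences) s as sentence
lateral view explode(s.sentence) w as word
group by w.word;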

Result:
word    cnt
are 1
hey 2
how 1
is 1
there 1
who 1
you 1
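
Both queries above read from inline stack() data. Against a real table the same pattern applies directly. The sketch below uses the split approach and sorts the most frequent words first; the table name my_table and the column name text_col are hypothetical placeholders, not names from the original question:

```sql
-- Word counts from a real table, most frequent words first.
-- my_table and text_col are hypothetical names; substitute your own.
select w.word as word, count(*) as cnt
from my_table t
lateral view explode(split(lower(t.text_col), '[^a-zA-Z]+')) w as word
where w.word != ''
group by w.word
order by cnt desc, word;
```

The secondary sort on word only makes the order of ties deterministic; it can be dropped if that does not matter.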

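The sentences(string str, string lang, string locale) function tokenizes a string of natural-language text into words and sentences, where each sentence is broken at the appropriate sentence boundary and returned as an array of words. The lang and locale arguments are optional. For example, sentences('Hello there! How are you?') returns ( ("Hello", "there"), ("How", "are", "you") ).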
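
To look at the nested array directly, sentences can be called on a string literal. This is a minimal sketch: the 'en' and 'US' values in the second call are assumed values for the optional lang and locale arguments and do not come from the original answer.

```sql
-- Tokenize a literal: the result is an array of sentences,
-- each of which is an array of words.
select sentences('Hello there! How are you?') as tokens;

-- The same call with the optional lang and locale arguments given
-- explicitly ('en' and 'US' are assumed values for illustration).
select sentences('Hello there! How are you?', 'en', 'US') as tokens;
```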

Regarding "hadoop - Counting word frequency in a text variable with Hive", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59855489/
