
nlp - Generating ngrams with Julia


To generate word bigrams in Julia, I can simply zip the original list with the list that drops the first element, e.g.:

julia> s = split("the lazy fox jumps over the brown dog")
8-element Array{SubString{String},1}:
"the"
"lazy"
"fox"
"jumps"
"over"
"the"
"brown"
"dog"

julia> collect(zip(s, drop(s,1)))
7-element Array{Tuple{SubString{String},SubString{String}},1}:
("the","lazy")
("lazy","fox")
("fox","jumps")
("jumps","over")
("over","the")
("the","brown")
("brown","dog")

To generate trigrams, I can use the same collect(zip(...)) idiom:

julia> collect(zip(s, drop(s,1), drop(s,2)))
6-element Array{Tuple{SubString{String},SubString{String},SubString{String}},1}:
("the","lazy","fox")
("lazy","fox","jumps")
("fox","jumps","over")
("jumps","over","the")
("over","the","brown")
("the","brown","dog")

But I have to manually add the third list to zip. Is there an idiomatic way to do this for n-grams of any order?

For example, I would like to avoid having to do this to extract 5-grams:

julia> collect(zip(s, drop(s,1), drop(s,2), drop(s,3), drop(s,4)))
4-element Array{Tuple{SubString{String},SubString{String},SubString{String},SubString{String},SubString{String}},1}:
("the","lazy","fox","jumps","over")
("lazy","fox","jumps","over","the")
("fox","jumps","over","the","brown")
("jumps","over","the","brown","dog")

Best Answer

By changing the output slightly and using SubArrays instead of Tuples, very little is lost, but allocation and memory copying are avoided. If the underlying word list is static this is fine, and it is also faster (it was in my benchmarks). The code:

ngram(s,n) = [view(s,i:i+n-1) for i=1:length(s)-n+1]

And the output:

julia> ngram(s,5)
SubString{String}["the","lazy","fox","jumps","over"]
SubString{String}["lazy","fox","jumps","over","the"]
SubString{String}["fox","jumps","over","the","brown"]
SubString{String}["jumps","over","the","brown","dog"]

julia> ngram(s,5)[1][3]
"fox"

For larger word lists, the memory requirements are also much lower.

Also note that using a generator lets you process the ngrams one by one, even faster and with less memory, which may well be enough for the processing code you need (counting something, or feeding a hash). For example, use @Gnimuc's solution without the collect, i.e. just partition(s, n, 1) (see the sketch below).
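
A hedged sketch of both generator-based options, assuming the IterTools package (whose partition accepts a step argument); ngram_gen is a hypothetical name:

using IterTools: partition

# Lazy version of the accepted answer: yields views one at a time.
ngram_gen(s, n) = (view(s, i:i+n-1) for i in 1:length(s)-n+1)

for g in ngram_gen(s, 5)
    println(join(g, " "))   # e.g. count, hash, or otherwise consume each gram
end

# partition without collect: a lazy iterator of overlapping 5-tuples.
for g in partition(s, 5, 1)
    println(g)
end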

Regarding nlp - generating ngrams with Julia, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42360957/
