
Clojure frequency dictionary from large data

Reposted. Author: 行者123. Updated: 2023-12-02 17:10:17

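I want to write my own naive Bayes classifier. I have a file like this:

(This is a database of spam and ham messages. The first word of each line is the label, spam or ham, and the rest of the line up to the end of line is the message. Size: 0.5 MB. It comes from http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/)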

ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham Ok lar... Joking wif u oni...
spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham U dun say so early hor... U c already then say...
ham Nah I don't think he goes to usf, he lives around here though
spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

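I want to build a hash map like this: {"spam" {"go" 1, "until" 100, ...}, "ham" {...}}, where each value is a map of word frequencies (kept separately for ham and spam).

I know how to do this in Python or C++. I did it in Clojure too, but my solution fails on large data with a StackOverflowError.

My solution: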

(defn read_data_from_file [fname]
  (map #(split % #"\s")
       (map lower-case
            (with-open [rdr (reader fname)]
              (doall (line-seq rdr))))))

(defn do-to-map [amap keyseq f]
  (reduce #(assoc %1 %2 (f (%1 %2))) amap keyseq))

(defn dicts_from_data [raw_data]
  (let [data (group-by #(first %) raw_data)]
    (do-to-map
      data (keys data)
      (fn [x] (frequencies (reduce concat (map #(rest %) x)))))))

I tried to find where the error occurs and wrote this:

(def raw_data (read_data_from_file (first args)))
(def d (group-by #(first %) raw_data))
(def f (map frequencies raw_data))
(def d1 (reduce concat (d "spam")))
(println (reduce concat (d "ham")))

Error:

Exception in thread "main" java.lang.RuntimeException: java.lang.StackOverflowError
at clojure.lang.Util.runtimeException(Util.java:165)
at clojure.lang.Compiler.eval(Compiler.java:6476)
at clojure.lang.Compiler.eval(Compiler.java:6455)
at clojure.lang.Compiler.eval(Compiler.java:6431)
at clojure.core$eval.invoke(core.clj:2795)
at clojure.main$eval_opt.invoke(main.clj:296)
at clojure.main$initialize.invoke(main.clj:315)
.....

Can anyone help me make it better or more efficient? PS: Sorry for my writing mistakes. English is not my native language.

Best answer

Using apply instead of reduce in the anonymous function avoids the StackOverflowError. Use (fn [x] (frequencies (apply concat (map #(rest %) x)))) instead of (fn [x] (frequencies (reduce concat (map #(rest %) x)))).
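
As a rough illustration of why the two forms differ, the sketch below uses a made-up collection named word-lists (not part of the original question) standing in for the result of (map rest rows). With reduce, every step wraps the previous result in another lazy concat, so realizing the first element has to descend through one layer per list. With apply, all lists go to a single concat call and the result is one flat lazy sequence.

```clojure
;; Hypothetical sample data: many small word lists, similar in shape to
;; what (map rest rows) produces for a large file.
(def word-lists (repeat 100000 ["free" "entry"]))

;; (reduce concat word-lists) builds nested lazy seqs of the form
;;   (concat (concat (concat a b) c) d) ...
;; Realizing it must unwind every layer, which can exhaust the stack,
;; so the call is left commented out here.
;; (first (reduce concat word-lists))

;; (apply concat word-lists) hands all lists to one concat call,
;; so the resulting lazy seq is flat and safe to walk.
(def flat-words (apply concat word-lists))

(println (count flat-words))
(println (frequencies flat-words))
```

mapcat is built on the same idea, so (mapcat rest x) would be another way to flatten the word lists without nesting.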

Below is the same code with some refactoring but exactly the same logic. read-data-from-file was changed to avoid mapping over the line sequence twice.

(use 'clojure.string)
(use 'clojure.java.io)

(defn read-data-from-file [fname]
  (let [lines (with-open [rdr (reader fname)]
                (doall (line-seq rdr)))]
    (map #(-> % lower-case (split #"\s")) lines)))

(defn do-to-map [m keyseq f]
  (reduce #(assoc %1 %2 (f (%1 %2))) m keyseq))

(defn process-words [x]
  (->> x
       (map #(rest %))
       (apply concat) ; This is the only real change from the
                      ; original code, it used to be (reduce concat).
       frequencies))

(defn dicts-from-data [raw_data]
  (let [data (group-by first raw_data)]
    (do-to-map data
               (keys data)
               process-words)))

(-> "SMSSpamCollection.txt" read-data-from-file dicts-from-data keys)
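
The same nested map can also be built without any concat, by folding each line directly into the result. The following is only a sketch of that alternative, not the answer's method: the names add-line, dicts-from-lines, and dicts-from-file are made up for illustration, and it splits on runs of whitespace (#"\s+") rather than the single-character #"\s" used in the code above, so it does not count empty strings as words.

```clojure
(require '[clojure.string :as str]
         '[clojure.java.io :as io])

(defn add-line
  "Adds the words of one line to the nested frequency map.
  The first token of the line is treated as the label (ham or spam)."
  [freqs line]
  (let [[label & words] (str/split (str/lower-case line) #"\s+")]
    (reduce (fn [m w] (update-in m [label w] (fnil inc 0)))
            freqs
            words)))

(defn dicts-from-lines
  "Folds a sequence of labeled lines into {label {word count}}."
  [lines]
  (reduce add-line {} lines))

(defn dicts-from-file
  "Reads fname line by line. reduce is eager, so the whole file is
  consumed before with-open closes the reader."
  [fname]
  (with-open [rdr (io/reader fname)]
    (dicts-from-lines (line-seq rdr))))

(def sample
  (dicts-from-lines ["ham Ok lar... Joking wif u oni..."
                     "spam Free entry in 2 a wkly comp"
                     "ham U dun say so early hor..."]))

(println (get-in sample ["ham" "u"]))
(println (keys sample))
```

Because each line is folded into the map as it is read, this version never holds the full list of lines or a concatenated word sequence in memory, only the frequency maps themselves.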

Regarding the Clojure frequency dictionary from large data, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/17320274/
