clojure - Improving the use of Clojure lazy sequences for iterative text parsing

Reposted. Author: 行者123. Updated: 2023-12-01 01:33:37

I'm writing a Clojure implementation of this coding challenge, which asks for the average length of sequence records in Fasta format:

>1
GATCGA
GTC
>2
GCA
>3
AAAAA

For more background, see this related StackOverflow post about an Erlang solution.

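My beginner Clojure attempt uses a lazy sequence to read the file one record at a time, so that it can scale to large files. However, it is fairly memory-hungry and slow, so I suspect it is not implemented optimally. Here is a solution that uses the BioJava library to abstract away the record parsing: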
(import '(org.biojava.bio.seq.io SeqIOTools))
(use '[clojure.contrib.duck-streams :only (reader)])

(defn seq-lengths
  "Produce a lazy collection of sequence lengths given a BioJava StreamReader"
  [seq-iter]
  (lazy-seq
    (if (.hasNext seq-iter)
      (cons (.length (.nextSequence seq-iter)) (seq-lengths seq-iter)))))

(defn fasta-to-lengths
  "Use BioJava to read a Fasta input file as a StreamReader of sequences"
  [in-file seq-type]
  (seq-lengths (SeqIOTools/fileToBiojava "fasta" seq-type (reader in-file))))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths *command-line-args*))))

And an equivalent approach without external libraries:
(use '[clojure.contrib.duck-streams :only (read-lines)])

(defn seq-lengths
  "Retrieve lengths of sequences in the file using line lengths"
  [lines cur-length]
  (lazy-seq
    (let [cur-line (first lines)
          remain-lines (rest lines)]
      (if (= nil cur-line) [cur-length]
        (if (= \> (first cur-line))
          (cons cur-length (seq-lengths remain-lines 0))
          (seq-lengths remain-lines (+ cur-length (.length cur-line))))))))

(defn fasta-to-lengths-bland [in-file seq-type]
  ;; pop off first item since it will be everything up to the first >
  (rest (seq-lengths (read-lines in-file) 0)))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths-bland *command-line-args*))))

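The current implementation takes 44 seconds on a large file, compared with 7 seconds for a Python implementation. Can you offer any suggestions for speeding up the code and making it more intuitive? Is my use of lazy-seq parsing the file record by record, as intended?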

Best answer

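It probably doesn't matter, but average holds on to the head of the sequence of lengths: coll is traversed once by reduce and again by count, so every realized element stays reachable in between.
Here is a completely untested, but lazier, way to do what I think you want.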

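(use 'clojure.java.io) ; clojure.java.io is available since Clojure 1.2

(defn lazy-avg [coll]
  ;; Accumulate sum and count in a single pass, so the head of coll is not retained.
  (let [f (fn [[v c] val] [(+ v val) (inc c)])
        [sum cnt] (reduce f [0 0] coll)]
    (if (zero? cnt) 0 (/ sum cnt))))

(defn fasta-avg [f]
  (->> (reader f)
       line-seq
       ;; ^String hints avoid reflective method calls on every line
       (filter #(not (.startsWith ^String % ">")))
       (map #(.length ^String %))
       lazy-avg))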
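
As written, fasta-avg appears to average the lengths of the individual sequence lines rather than of whole records, which differs whenever a record spans several lines. The following is a minimal sketch of a per-record variant, not the answerer's code. The names header?, record-lengths, mean and fasta-record-avg are hypothetical, and the sketch assumes every record has at least one sequence line (consecutive headers would be merged and the empty record skipped rather than counted as 0).

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn header?
  "True when a FASTA line starts a new record. (Hypothetical helper.)"
  [^String line]
  (.startsWith line ">"))

(defn record-lengths
  "Lazy seq of per-record sequence lengths, given a seq of FASTA lines.
  Assumes each header is followed by at least one sequence line."
  [lines]
  (->> lines
       (remove str/blank?)
       (partition-by header?)          ; alternating runs: header line(s), then sequence lines
       (remove #(header? (first %)))   ; keep only the runs of sequence lines
       (map #(reduce + (map count %))))) ; total length of each record

(defn mean
  "Single-pass mean that tracks sum and count together; 0 for an empty coll."
  [coll]
  (let [[sum cnt] (reduce (fn [[s c] x] [(+ s x) (inc c)]) [0 0] coll)]
    (if (zero? cnt) 0 (/ sum cnt))))

(defn fasta-record-avg
  "Average record length for the FASTA file at path.
  mean consumes the lazy seq eagerly, so the reader can be closed safely."
  [path]
  (with-open [rdr (io/reader path)]
    (mean (record-lengths (line-seq rdr)))))
```

For the sample file in the question, (record-lengths [">1" "GATCGA" "GTC" ">2" "GCA" ">3" "AAAAA"]) yields (9 3 5), and mean of that is the ratio 17/3. Only one record's lines are held in memory at a time, and with-open closes the reader that the threaded version above leaves open.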

Regarding "clojure - Improving the use of Clojure lazy sequences for iterative text parsing", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/3303848/
