clojure - Comparing two large files in Clojure (i.e. finding unmapped reads in a tophat alignment)

Reposted by: 行者123 Last updated: 2023-12-01 11:04:40

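Problem: find the ids that are in one file but not in another. Each file is about 6.5 GB. Specifically (for those in the bioinformatics field), one file is a fastq file of sequencing reads and the other is a sam alignment file from a tophat run. I want to determine which reads in the fastq file are not in the sam alignment file.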

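I am getting java.lang.OutOfMemoryError: Java heap space errors. As suggested (ref1, ref2), I am using lazy sequences. However, I am still running out of memory. I have looked at this tutorial, but I do not quite understand it yet. So I am posting my less sophisticated attempt at a solution, in the hope that I am only making a minor mistake.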

My attempt:

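Since neither file fits into memory, read the sam file one chunk at a time and put the id of each line in the chunk into a set. Then filter the lazy list of fastq ids against the sam ids in the set, keeping only those that are not in the set. Repeat this with the next chunk of sam lines and the remaining fastq ids.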

(defn ids-not-in-sam 
  [ids samlines chunk-size]
  (lazy-seq
    (if (seq samlines)
      (ids-not-in-sam (not-in (into #{} (qnames (take chunk-size samlines))) ids)
                      (drop chunk-size samlines) chunk-size)
      ids)))

not-in determines which ids are not in the set.

(defn not-in
  "Return the elements x of xs which are not in the set s."
  [s xs]
  (filter (complement s) xs))
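As a small illustration of why not-in works: a Clojure set can be called as a function that returns the element when it is a member and nil otherwise, so complement turns the set into a "not a member" predicate. The read names below are made up for the example.

```clojure
;; A set used as a function: (s x) returns x if x is in s, otherwise nil.
;; (complement s) therefore returns true exactly for non-members.
(defn not-in
  "Return the elements x of xs which are not in the set s."
  [s xs]
  (filter (complement s) xs))

;; Hypothetical ids, for illustration only.
(def example-result
  (doall (not-in #{"read1" "read3"} ["read1" "read2" "read3" "read4"])))
;; example-result holds "read2" and "read4"
```

This relies on the ids being strings; a set containing nil or false would not behave as a reliable predicate.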

qnames gets the id field from each line of the sam file.

(defn qnames [samlines]
  (map #(first (.split #"\t" %)) samlines))
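To make the chunked approach concrete, here is a small worked run of the three functions above on in-memory toy data. The ids and sam lines are invented for the example, and the chunk size of 2 is only there to force more than one pass.

```clojure
(defn not-in [s xs]
  (filter (complement s) xs))

(defn qnames [samlines]
  ;; The first tab-separated field of a sam line is the read id (QNAME).
  (map #(first (.split #"\t" %)) samlines))

(defn ids-not-in-sam [ids samlines chunk-size]
  (lazy-seq
    (if (seq samlines)
      (ids-not-in-sam (not-in (into #{} (qnames (take chunk-size samlines))) ids)
                      (drop chunk-size samlines) chunk-size)
      ids)))

;; Toy data (assumed, not taken from real files).
(def toy-ids ["r1" "r2" "r3" "r4" "r5"])
(def toy-samlines ["r1\t0\tchr1" "r4\t16\tchr2" "r2\t0\tchr1"])

;; Chunk 1 removes r1 and r4, chunk 2 removes r2, leaving r3 and r5.
(def unmapped (doall (ids-not-in-sam toy-ids toy-samlines 2)))
```

Note that each sam chunk wraps the id sequence in one more filter, so on real data every fastq id passes through as many nested filters as there are chunks.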

Finally, it is all put together with io (using read-lines and write-lines from clojure.contrib.io).

(defn write-fq-not-in-sam [fqfile samfile fout chunk-size] 
    (io/write-lines fout (ids-not-in-sam (map fq-id (read-fastq fqfile))
                                         (read-sam samfile) chunk-size)))
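Since read-fastq and fq-id are not shown, the following is only a rough sketch of what a lazy fastq id reader could look like, not the code from this post. It uses clojure.java.io instead of clojure.contrib.io, assumes every fastq record is exactly four lines with the id on the first line after the leading @, and the names fastq-header->id and process-fastq-ids are hypothetical.

```clojure
(require '[clojure.java.io :as jio]
         '[clojure.string :as str])

(defn fastq-header->id
  "Hypothetical helper: drop the leading @ and anything after the first whitespace."
  [header]
  (first (str/split (subs header 1) #"\s+")))

(defn process-fastq-ids
  "Hypothetical helper: open the file at path and call (f ids), where ids is a
  lazy seq of read ids. f must finish consuming ids before returning, because
  the reader is closed when with-open exits."
  [path f]
  (with-open [rdr (jio/reader path)]
    ;; Assumes strict 4-line records, so every 4th line is a header.
    (f (map fastq-header->id (take-nth 4 (line-seq rdr))))))
```

A call such as (process-fastq-ids "reads.fq" count) would walk the file once; because nothing outside f keeps a reference to the start of the sequence, lines already processed can be garbage collected.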

I am pretty sure that I am doing everything lazily. But I could be holding on to the head of a sequence somewhere without noticing it.

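Is there a bug in the code above that is causing the heap to fill up? More importantly, is my approach to the problem all wrong? Is this an appropriate use of lazy sequences, or am I expecting too much?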

(The error could be in the read-sam and read-fastq functions, but my post is already a bit long. I can show those later if needed.)

Best answer