609. Find Duplicate File in System

Reposted by: 大佬之路 | Updated: 2024-01-31 14:18:36


Problem link: https://leetcode.com/problems/find-duplicate-file-in-system/description/

Problem Description

Given a list of directory info including directory path, and all the files with contents in this directory, you need to find out all the groups of duplicate files in the file system in terms of their paths.

A group of duplicate files consists of at least two files that have exactly the same content.

A single directory info string in the input list has the following format:

"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

It means there are n files (f1.txt, f2.txt ... fn.txt with content f1_content, f2_content ... fn_content, respectively) in directory root/d1/d2/.../dm. Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.

The output is a list of groups of duplicate file paths. Each group contains the paths of all the files that have the same content. A file path is a string that has the following format:

"directory_path/file_name.txt"

Example 1:

Input:
["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
Output:  
[["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]
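To make the input format concrete, here is a minimal sketch that splits one directory info string into (file path, content) pairs. The helper name `parse_directory_info` is hypothetical and not part of the problem statement; the sketch relies on the guarantee in the notes that names and contents contain only letters and digits.

```python
def parse_directory_info(info):
    """Split one directory info string into (file_path, content) pairs."""
    # The first token is the directory; every other token is "name(content)".
    directory, *entries = info.split(" ")
    pairs = []
    for entry in entries:
        # "1.txt(abcd)" -> name "1.txt", rest "abcd)"
        name, _, rest = entry.partition("(")
        pairs.append((directory + "/" + name, rest[:-1]))  # drop the trailing ")"
    return pairs


print(parse_directory_info("root/a 1.txt(abcd) 2.txt(efgh)"))
# [('root/a/1.txt', 'abcd'), ('root/a/2.txt', 'efgh')]
```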

Note:

1. No order is required for the final output.
2. You may assume the directory name, file name and file content only have letters and digits, and the length of file content is in the range of [1,50].
3. The number of files given is in the range of [1,20000].
4. You may assume no files or directories share the same name in the same directory.
5. You may assume each given directory info represents a unique directory. Directory path and file info are separated by a single blank space.

Follow-up beyond contest:

1. Imagine you are given a real file system, how will you search files? DFS or BFS?
2. If the file content is very large (GB level), how will you modify your solution?
3. If you can only read the file by 1kb each time, how will you modify your solution?
4. What is the time complexity of your modified solution? What is the most time-consuming part and memory consuming part of it? How to optimize?
5. How to make sure the duplicated files you find are not false positive?
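The post does not answer these follow-up questions. One possible approach, sketched here under my own assumptions, works in three steps: walk the tree with `os.walk` and bucket files by size, hash only the files that share a size while reading 1 KB at a time, then compare byte for byte so that a hash collision cannot produce a false positive. The names `find_duplicates_on_disk`, `file_digest`, `same_bytes`, and `CHUNK_SIZE` are illustrative, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 1024  # assumption: read 1 KB at a time, as in follow-up 3


def file_digest(path, chunk_size=CHUNK_SIZE):
    """Hash a file incrementally so it never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b, chunk_size=CHUNK_SIZE):
    """Compare two files chunk by chunk to rule out hash collisions."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a, chunk_b = a.read(chunk_size), b.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files ended at the same point
                return True


def find_duplicates_on_disk(root):
    """Return groups (lists of paths) of files under root with equal bytes."""
    # Step 1: bucket files by size, which needs metadata only.
    by_size = defaultdict(list)
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(directory, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for candidates in by_size.values():
        if len(candidates) < 2:
            continue  # a unique size cannot have a duplicate
        # Step 2: hash only the files that share a size.
        by_hash = defaultdict(list)
        for path in candidates:
            by_hash[file_digest(path)].append(path)
        # Step 3: confirm byte for byte against each group's first file.
        for bucket in by_hash.values():
            verified = []
            for path in bucket:
                for group in verified:
                    if same_bytes(group[0], path):
                        group.append(path)
                        break
                else:
                    verified.append([path])
            groups.extend(g for g in verified if len(g) > 1)
    return groups
```

In this sketch memory holds only paths, sizes, and digests, never whole files. The running time is dominated by reading file contents, which the size filter reduces by skipping files that cannot have a duplicate.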

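Problem Summary

Group together all files, across the different directories, whose contents are identical.

Solution Approach

This problem is simple: use a dictionary that maps content ==> file paths. Because we want the list of paths that share the same content, the content is the key and the list of full file paths is the value. A list belongs in the final result only if its length is greater than 1.

Python code: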

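import collections


class Solution(object):
    def findDuplicate(self, paths):
        """
        :type paths: List[str]
        :rtype: List[List[str]]
        """
        # Map each file content to the full paths of the files that hold it.
        filemap = collections.defaultdict(list)
        for path in paths:
            # First token is the directory; the rest are "name(content)" entries.
            parts = path.split()
            directory, files = parts[0], parts[1:]
            for file in files:
                name, content = file.split('(')
                content = content[:-1]  # drop the trailing ')'
                filemap[content].append(directory + '/' + name)
        # Only contents shared by at least two files form a duplicate group.
        return [group for group in filemap.values() if len(group) > 1]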


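The C++ solution to this problem taught me how to use istringstream, a handy C++ class for stream-style input from a string.

C++ provides three classes, ostringstream, istringstream, and stringstream; to create objects of these types you must include the <sstream> header. The istringstream class performs C++-style stream input from a string. The ostringstream class performs C++-style stream output into a string. The stringstream class supports both input and output.

They work much like the familiar iostream, so you can understand them by analogy.

C++ code: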

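#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
public:
    vector<vector<string>> findDuplicate(vector<string>& paths) {
        // Map each file content to the full paths of the files that hold it.
        unordered_map<string, vector<string>> m;
        vector<vector<string>> res;
        for (const string& path : paths) {
            istringstream is(path);
            string pre, t;
            is >> pre;  // first token: the directory path
            while (is >> t) {  // remaining tokens: "name(content)"
                size_t idx = t.find_last_of('(');
                string file = pre + "/" + t.substr(0, idx);
                string content = t.substr(idx + 1, t.size() - idx - 2);
                m[content].push_back(file);
            }
        }
        // Keep only contents shared by at least two files.
        for (const auto& a : m)
            if (a.second.size() > 1)
                res.push_back(a.second);
        return res;
    }
};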


References:

http://www.cnblogs.com/grandyang/p/7007974.html
https://blog.csdn.net/longzaitianya1989/article/details/52909786


Published with the authorization of the author, 负雪明烛. No organization or individual may repost it without the author's permission.
