python - Locating multiple files in a large dataset in Python


Reposted by: 行者123  Updated: 2023-12-04 07:16:28


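I have a large repository of image files (about 2 million, .jpg) in which the images for a single ID are spread across multiple subdirectories, and I am trying to locate and copy every image whose ID appears on a list containing a subset of roughly 1,000 of these IDs.
I am still very new to Python, so my first thought was to use os.walk to iterate over every file and check it against the 1k subset to see whether any ID in the subset matches. This works, at least in theory, but it seems extremely slow, at around 3-5 images per second. Finding all the files for one ID at a time seems to be just as slow.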

import shutil
import os
import csv

# Wander to Folder, Identify Files
for root, dirs, files in os.walk(ImgFolder):
    for file in files:
        fileName = ImgFolder + str(file)
# For each file, check dictionary for match
        with open(DictFolder, 'r') as data1:
            csv_dict_reader = csv.DictReader(data1)
            for row in csv.DictReader(data1):
                img_id_line = row['id_line']
                isIdentified = (img_id_line in fileName) and ('.jpg' in fileName)
# If id_line == file ID, copy file
                if isIdentified:
                    src = fileName + '.jpg'
                    dst = dstFolder + '.jpg'
                    shutil.copyfile(src,dst)
                else:
                    continue

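I have considered trying to automate a query search, but the data sits on a NAS and I have no easy way to index the files to speed up queries. The machine I am running the code on is Windows 10, so I can't use the Ubuntu find approach, which I gather is considerably better at this kind of task.
Any way to speed up the process would be greatly appreciated!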

Best answer

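Here are a couple of scripts that should do what you need.

index.py

This script uses pathlib to walk the directory tree and search for files with a given extension. It writes a TSV file with two columns: file name and file path.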

import argparse
from pathlib import Path


def main(args):
    for arg, val in vars(args).items():
        print(f"{arg} = {val}")

    ext = "*." + args.ext
    with open(args.output, "w") as fh:
        # rglob walks args.input recursively and yields every matching file
        for file in Path(args.input).rglob(ext):
            # One row per file: file name, then absolute path
            fh.write(f"{file.name}\t{file.resolve()}\n")


if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument(
        "input",
        help="Top level folder which will be recursively "
        "searched for files ending with the value "
        "provided to `--ext`",
    )
    p.add_argument("output", help="Output file name for the index tsv file")
    p.add_argument(
        "--ext",
        default="jpg",
        help="Extension to search for. Don't include `*` or `.`",
    )
    main(p.parse_args())

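search.py

This script loads the index (the output of index.py) into a dictionary, then loads the CSV file into a dictionary, and then, for each id_line, looks up the file name in the index and tries to copy it to the output folder.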

import argparse
import csv
import shutil
from collections import defaultdict
from pathlib import Path


def main(args):
    for arg, val in vars(args).items():
        print(f"{arg} = {val}")

    # Create the destination folder (and any parents) if it is missing
    Path(args.dest).mkdir(parents=True, exist_ok=True)

    with open(args.index) as fh:
        index = dict(line.strip().split("\t", 1) for line in fh)
    print(f"Loaded {len(index):,} records")

    csv_dict = defaultdict(list)

    with open(args.csv) as fh:
        reader = csv.DictReader(fh)
        for row in reader:
            for (k, v) in row.items():
                csv_dict[k].append(v)

    print(f"Searching for {len(csv_dict['id_line']):,} files")
    copied = 0
    for file in csv_dict["id_line"]:
        if file in index:
            shutil.copy2(index[file], args.dest)
            copied += 1
        else:
            print(f"!! File {file!r} not found in index")
    print(f"Copied {copied} files to {args.dest}")


if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("index", help="Index file from `index.py`")
    p.add_argument("csv", help="CSV file with target filenames")
    p.add_argument("dest", help="Target folder to copy files to")
    main(p.parse_args())
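
One caveat with the index as loaded above: a plain dict keeps a single path per file name, so if the same name occurs in several subdirectories only the last line read survives, and the lookup is an exact file-name match. The sketch below is one hedged way around both. It assumes that id_line holds an ID without the .jpg extension and that the ID equals the file stem; neither is confirmed by the question. `load_index_by_stem` is a hypothetical helper, not part of the original answer.

```python
from collections import defaultdict
from pathlib import PurePath


def load_index_by_stem(lines):
    """Map each file stem to every path recorded for it in the index TSV.

    Hypothetical helper: assumes each line looks like "name<TAB>full_path",
    which is the layout index.py writes.
    """
    index = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, path = line.split("\t", 1)
        # PurePath only parses the string; it never touches the disk
        index[PurePath(name).stem].append(path)
    return index


# Made-up sample rows standing in for index.tsv
sample = [
    "a1.jpg\tC:\\imgs\\x\\a1.jpg\n",
    "a1.jpg\tC:\\imgs\\y\\a1.jpg\n",
    "b2.jpg\tC:\\imgs\\x\\b2.jpg\n",
]
by_stem = load_index_by_stem(sample)
print(len(by_stem["a1"]))  # number of paths recorded for the stem "a1"
```

If several paths share a name, copying them all into one destination folder would overwrite earlier copies, so the destination names would need to be made unique in some way.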

How to run this:

python index.py --ext "jpg" "C:\path\to\image\folder" "index.tsv"
python search.py "index.tsv" "targets.csv" "C:\path\to\output\folder"

I would try this on one or two folders first to check that it gives the expected results.
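
One possible way to do that check without copying anything is a dry run that only reports which targets are present in the index. This is a minimal sketch, not part of the original answer: `split_targets` is a hypothetical helper, and the in-memory data below stands in for the real index.tsv and targets.csv.

```python
import csv
import io


def split_targets(index, csv_file, column="id_line"):
    """Return (found, missing) lists for the values in `column`."""
    found, missing = [], []
    for row in csv.DictReader(csv_file):
        target = row[column]
        # Dictionary membership is a constant-time lookup on average
        (found if target in index else missing).append(target)
    return found, missing


# Tiny in-memory stand-ins for index.tsv and targets.csv (made-up data)
index = {"a1.jpg": r"C:\imgs\x\a1.jpg", "b2.jpg": r"C:\imgs\y\b2.jpg"}
targets = io.StringIO("id_line\na1.jpg\nc3.jpg\n")

found, missing = split_targets(index, targets)
print(f"{len(found)} found, {len(missing)} missing: {missing}")
```

A long list of missing entries would suggest that the values in id_line do not match the file names stored in the index exactly.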

Regarding python - Locating multiple files in a large dataset in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/68728414/
