
python-3.x - Convert multiple .txt files into a single .csv file (Python)

Reposted. Author: 行者123. Updated: 2023-12-05 08:08:15

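I need to convert a folder containing about 4,000 .txt files into a single .csv with two columns: (1) column 1: "FileName" (the name the file has in the original folder); (2) column 2: "Content" (all of the text in the corresponding .txt file).

Here you can see some of the files I am working with.

The question most similar to mine is this one (Combine a folder of text files into a CSV with each content in a cell), but I could not get any of the solutions proposed there to work.

The last thing I tried was the Python code proposed by Nathaniel Verhaaren in that question, but I got exactly the same error as the author of that question (even after applying some of the suggestions):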

import os
import csv

dirpath = 'path_of_directory'
output = 'output_file.csv'
with open(output, 'w') as outfile:
    csvout = csv.writer(outfile)
    csvout.writerow(['FileName', 'Content'])

    files = os.listdir(dirpath)

    for filename in files:
        with open(dirpath + '/' + filename) as afile:
            csvout.writerow([filename, afile.read()])
            afile.close()

    outfile.close()

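Other questions that look similar to mine (for example "Python: Parsing Multiple .txt Files into a Single .csv File?", "Merging multiple .txt files into a csv", and "Converting 1000 text files into a single csv file") do not address this exact problem, and I was unable to adapt the solutions proposed there to my case.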
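For reference, the direct approach attempted above can be sketched as follows. The question does not say which error was raised, so this is only a sketch under assumptions: that the files are UTF-8 text, that only `*.txt` files in the top-level folder matter, and that undecodable bytes may be replaced rather than reported. The function name `folder_to_csv` is hypothetical.

```python
import csv
from pathlib import Path


def folder_to_csv(dirpath: str, output: str, encoding: str = 'utf-8') -> int:
    """Write one (file name, content) row per .txt file in dirpath; return the row count."""
    rows = 0
    # newline='' is what the csv module documentation recommends for files given to csv.writer.
    with open(output, 'w', newline='', encoding=encoding) as outfile:
        csvout = csv.writer(outfile)
        csvout.writerow(['FileName', 'Content'])
        # sorted() only makes the row order deterministic.
        for path in sorted(Path(dirpath).glob('*.txt')):
            if not path.is_file():
                continue
            # Assumption: errors='replace' swaps undecodable bytes for U+FFFD instead of raising.
            content = path.read_text(encoding=encoding, errors='replace')
            csvout.writerow([path.name, content])
            rows += 1
    return rows
```

The `with` blocks close both files on exit, so no explicit `close()` calls are needed.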

Best answer

I had a similar need, so I wrote the following class:

import os
import pathlib
import glob
import csv
from collections import defaultdict

class FileCsvExport:
    """Generate a CSV file containing the name and contents of all files found"""
    def __init__(self, directory: str, output: str, header = None, file_mask = None, walk_sub_dirs = True, remove_file_extension = True):
        self.directory = directory
        self.output = output
        self.header = header
        # '**/*' also matches files in sub-directories when glob runs with recursive=True
        self.pattern = '**/*' if walk_sub_dirs else '*'
        if isinstance(file_mask, str):
            # e.g. '.txt' turns the pattern into '**/*.txt'
            self.pattern = self.pattern + file_mask
        self.remove_file_extension = remove_file_extension
        self.rows = 0

    def export(self) -> bool:
        """Return True if the CSV was created"""
        return self.__make(self.__generate_dict())

    def __generate_dict(self) -> defaultdict:
        """Finds all files recursively based on the specified parameters and returns a defaultdict"""
        csv_data = defaultdict(list)
        for file_path in glob.glob(os.path.join(self.directory, self.pattern), recursive = True):
            path = pathlib.Path(file_path)
            if not path.is_file():
                continue  # skip directories matched by the pattern
            content = self.__get_content(path)
            name = path.stem if self.remove_file_extension else path.name
            # Files with the same name in different directories share one key
            csv_data[name].append(content)
        return csv_data

    @staticmethod
    def __get_content(file_path: pathlib.Path) -> str:
        # Uses the platform default encoding, as no encoding is specified
        with open(file_path) as file_object:
            return file_object.read()

    def __make(self, csv_data: defaultdict) -> bool:
        """
        Takes a defaultdict of {k, [v]} where k is the file name and v is a list of file contents.
        Writes out these values to a CSV and returns True when complete.
        """
        with open(self.output, 'w', newline = '') as csv_file:
            writer = csv.writer(csv_file, quoting = csv.QUOTE_ALL)
            if isinstance(self.header, list):
                writer.writerow(self.header)
            for key, values in csv_data.items():
                # One row per file, so duplicate names produce several rows
                for duplicate in values:
                    writer.writerow([key, duplicate])
                    self.rows = self.rows + 1
        return True

It can be used like this:

...
myFiles = r'path/to/files/'
outputFile = r'path/to/output.csv'

exporter = FileCsvExport(directory = myFiles, output = outputFile, header = ['File Name', 'Content'], file_mask = '.txt')
if exporter.export():
    print(f"Export complete. Total rows: {exporter.rows}.")

In my sample directory, this returns:

Export complete. Total rows: 6.

Note: the row count does not include the header (if present).

This generated the following CSV file:

"File Name","Content"
"Test1","This is from Test1"
"Test2","This is from Test2"
"Test3","This is from Test3"
"Test4","This is from Test4"
"Test5","This is from Test5"
"Test5","This is in a sub-directory"

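To check a file like the one above, it can be read back with `csv.reader`. The following is only a sketch: the helper names `read_export` and `demo` are hypothetical, and the demo writes its own tiny two-column file to a temporary directory instead of using the real export. It also shows that a cell containing a line break survives the round trip when the file is opened with `newline=''`.

```python
import csv
import os
import tempfile


def read_export(csv_path: str) -> list:
    """Return all rows of the CSV (header included) as lists of strings."""
    # newline='' lets the csv module handle line breaks embedded in quoted cells.
    with open(csv_path, newline='', encoding='utf-8') as csv_file:
        return list(csv.reader(csv_file))


def demo() -> list:
    """Write a small two-column CSV to a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        csv_path = os.path.join(tmp_dir, 'output.csv')
        with open(csv_path, 'w', newline='', encoding='utf-8') as csv_file:
            writer = csv.writer(csv_file, quoting=csv.QUOTE_ALL)
            writer.writerow(['File Name', 'Content'])
            # Hypothetical multi-line file content stored in a single cell.
            writer.writerow(['Test1', 'line one\nline two'])
        return read_export(csv_path)
```

Subtracting one from `len(read_export(path))` when a header is present gives a number that can be compared with the reported row count.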
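Optional parameters:

  • header: takes a list of strings that will be written as the first row of the CSV. Default: None
  • file_mask: takes a string that can be used to specify the file type; for example, .txt makes it match only .txt files. Default: None
  • walk_sub_dirs: if set to False, sub-directories are not searched. Default: True
  • remove_file_extension: if set to False, the file name is written with its extension; for example, File.txt rather than just File. Default: True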
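How `file_mask` and `walk_sub_dirs` combine into a glob pattern can be tried in isolation. This is a sketch that mirrors the pattern logic of the class; the helper names `build_pattern` and `demo`, and the sample file names, are hypothetical.

```python
import glob
import os
import tempfile


def build_pattern(file_mask=None, walk_sub_dirs=True) -> str:
    """Assemble the glob pattern the same way the class constructor does."""
    pattern = '**/*' if walk_sub_dirs else '*'
    if isinstance(file_mask, str):
        pattern = pattern + file_mask
    return pattern


def demo() -> tuple:
    """Compare a recursive and a non-recursive search over a throwaway directory tree."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, 'sub'))
        for relative in ('a.txt', 'b.log', os.path.join('sub', 'c.txt')):
            with open(os.path.join(root, relative), 'w') as handle:
                handle.write('x')

        def find(pattern: str) -> list:
            # Return sorted paths relative to root so the result is easy to compare.
            matches = glob.glob(os.path.join(root, pattern), recursive=True)
            return sorted(os.path.relpath(match, root) for match in matches)

        return find(build_pattern('.txt', True)), find(build_pattern('.txt', False))
```

With `walk_sub_dirs` enabled the pattern becomes `**/*.txt`, which matches .txt files both in the top-level directory and in sub-directories; with it disabled the pattern is `*.txt` and only the top level is searched.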

Regarding python-3.x - Convert multiple .txt files into a single .csv file (Python), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50532578/
