
python - Writing a series of strings (plus numbers) to a single line of a csv


This isn't pretty code, but I have some code that pulls a series of strings out of an HTML file: author, title, date, length, text. I have 2000+ HTML files and I want to go through all of them and write this data out to a single csv file. I know all of this will eventually have to be wrapped in a for loop, but before that I'm having a hard time understanding how to get from having these values to writing them out to a csv file. My thinking was to build a list or tuple first and then write that to a line in the csv file:

the_file = "/Users/john/Code/tedtalks/test/transcript?language=en.0"
holding = soup(open(the_file).read(), "lxml")
at = holding.find("title").text
author = at[0:at.find(':')]
title = at[at.find(":")+1 : at.find("|") ]
date = re.sub('[^a-zA-Z0-9]',' ', holding.select_one("span.meta__val").text)
length_data = holding.find_all('data', {'class' : 'talk-transcript__para__time'})
(m, s) = ([x.get_text().strip("\n\r")
           for x in length_data if re.search(r"(?s)\d{2}:\d{2}",
                                             x.get_text().strip("\n\r"))][-1]).split(':')
length = int(m) * 60 + int(s)
firstpass = re.sub(r'\([^)]*\)', '', holding.find('div', class_ = 'talk-transcript__body').text)
text = re.sub('[^a-zA-Z\.\']',' ', firstpass)
data = ([author].join() + [title] + [date] + [length] + [text])
with open("./output.csv", "w") as csv_file:
writer = csv.writer(csv_file, delimiter=',')
for line in data:
writer.writerow(line)

For the life of me I can't figure out how to get Python to respect the fact that these are strings and should be stored as strings rather than as lists of letters. (The .join() above is my attempt at working that out.)
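
For reference, csv.writer.writerow expects one sequence of fields per row; calling it on a single string makes the writer iterate over the string character by character, which is exactly the "list of letters" behaviour described above. A minimal sketch of the difference, using made-up placeholder values:

import csv

author, title, date, length, text = "Jane Doe", "A Talk Title", "Jun 2016", 754, "Some text."

with open("./output.csv", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    # one call per row, passing the whole list of fields
    writer.writerow([author, title, date, length, text])
    # by contrast, writer.writerow(author) would treat the string as a
    # sequence and write each character into its own column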

Looking ahead: is it better/more efficient to process the 2000 files this way, stripping each one down to what I want and writing one line of the CSV at a time, or is it better to build a data frame in pandas and then write that to CSV? (All 2000 files come to 160MB, so stripped down the final data can't be more than 100MB; size isn't a big issue here, but looking ahead it may eventually become a problem.)
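
For comparison, a pandas version of the write step might look like the sketch below; parse_file here is a hypothetical stand-in for the scraping code above, assumed to return one (author, title, date, length, text) tuple per file:

import pandas as pd
from glob import iglob

def parse_file(path):
    # hypothetical placeholder for the BeautifulSoup scraping shown above
    ...

rows = [parse_file(p) for p in iglob("./test/*.html")]
df = pd.DataFrame(rows, columns=["author", "title", "date", "length", "text"])
df.to_csv("output.csv", index=False)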

Best Answer

This will scrape all the files and put the data into a csv; you just need to pass the path to the folder containing the html files and the name of the output file:

import re
import csv
import os
from bs4 import BeautifulSoup
from glob import iglob


def parse(soup):
    # both title and author can be parsed from separate tags.
    author = soup.select_one("h4.h12.talk-link__speaker").text
    title = soup.select_one("h4.h9.m5").text
    # just need to strip the text from the date string, no regex needed.
    date = soup.select_one("span.meta__val").text.strip()
    # we want the last time, i.e. the talk-transcript__para__time previous to the footer.
    mn, sec = map(int, soup.select_one("footer.footer").find_previous("data", {
        "class": "talk-transcript__para__time"}).text.split(":"))
    length = (mn * 60 + sec)
    # to ignore times etc. we can just pull the actual text fragments and remove noise, i.e. (Applause).
    text = re.sub(r'\([^)]*\)', "", " ".join(d.text for d in soup.select("span.talk-transcript__fragment")))
    return author.strip(), title.strip(), date, length, re.sub(r'[^a-zA-Z\.\']', ' ', text)


def to_csv(patt, out):
    # open the file to write to.
    with open(out, "w") as out:
        # create the csv.writer.
        wr = csv.writer(out)
        # write our headers.
        wr.writerow(["author", "title", "date", "length", "text"])
        # get all our html files.
        for html in iglob(patt):
            with open(html) as f:
                # parse the file and write the data to a row.
                wr.writerow(parse(BeautifulSoup(f, "lxml")))

to_csv("./test/*.html","output.csv")
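
One small detail when running this on Python 3: the csv module documentation recommends opening the output file with newline='' so the writer controls the row terminators itself (otherwise blank lines can appear between rows on Windows). A sketch of the adjusted open() call, with an explicit encoding added as an assumption:

import csv

with open("output.csv", "w", newline="", encoding="utf-8") as out_file:
    wr = csv.writer(out_file)
    wr.writerow(["author", "title", "date", "length", "text"])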

Regarding python - writing a series of strings (plus numbers) to a single line of a csv, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37534849/
