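I'm working with a text file (ClassTest.txt) and pandas. The file has three tab-separated columns: Title, Description, and Category. Title and Description are ordinary strings; Category is a (non-zero) integer.
I load the data like this: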
data = pd.read_table('ClassTest.txt')
feature_names = ['Title', 'Description']
X = data[feature_names]
y = data['Category']
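However, values in the Description column can themselves contain newlines, and most of them do span several lines, so y ends up with far too many rows. I tried to work around this by rewriting the file with '|' as the line terminator and reading it with: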
data = pd.read_table('ClassTest.txt', lineterminator='|')
X = data[feature_names]
y = data['Category']
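This time I get the error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 20, saw 5
Can anyone help me solve this?
Edit: adding the earlier code that writes the file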
con = lite.connect('JobDetails.db')
cur = con.cursor()
cur.execute('''SELECT Title, Description, Category FROM ReviewJobs''')
results = [list(each) for each in cur.fetchall()]
cur.execute('''SELECT Title, Description, Category FROM Jobs''')
for each in cur.fetchall():
    results.append(list(each))
a = open('ClassTest.txt', 'ab')
newLine = "|"
a.write(u''.join(c for c in 'Title\tDescription\tCategory' + newLine).encode('utf-8'))
for r in results:
    toWrite = "".encode('utf-8')
    title = u''.join(c for c in r[0].replace("\n", " ")).encode('utf-8') + "\t".encode('utf-8')
    description = u''.join(c for c in r[1]).encode('utf-8') + "\t".encode('utf-8')
    toWrite += title + description
    toWrite += str(r[2]).encode('utf-8') + newLine.encode('utf-8')
    a.write(toWrite)
a.close()
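pandas.read_table() was deprecated in pandas 0.24 (the deprecation was reverted in 0.25); read_csv() with a tab delimiter does the same job. More importantly, actually use the CSV format instead of writing a lot of code that produces something similar but cannot cope with record or field separators inside a field. The Python standard library has the csv module for this: it quotes any field that contains a delimiter or a newline, so a multi-line Description stays a single field.
Opening the file as a text file and passing the encoding to open() saves you from encoding everything yourself in different places.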
#!/usr/bin/env python3
from contextlib import closing
import csv
import sqlite3


def main():
    with sqlite3.connect("JobDetails.db") as connection:
        with closing(connection.cursor()) as cursor:
            #
            # TODO Having two tables with the same columns for essentially
            # the same kind of records smells like a broken DB design.
            #
            rows = list()
            for table_name in ["reviewjobs", "jobs"]:
                cursor.execute(
                    f"SELECT title, description, category FROM {table_name}"
                )
                rows.extend(cursor.fetchall())

    # newline="" lets the csv module control the line endings itself.
    with open("ClassTest.txt", "a", encoding="utf8", newline="") as csv_file:
        writer = csv.writer(csv_file, delimiter="\t")
        # writerow() writes a single record; the field values are quoted as needed.
        writer.writerow(["Title", "Description", "Category"])
        for title, description, category in rows:
            writer.writerow([title.replace("\n", " "), description, category])


if __name__ == "__main__":
    main()
In the other program:
data = pd.read_csv("ClassTest.txt", delimiter="\t")
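To see why the csv module avoids the original problem, here is a minimal sketch that round-trips a record whose description contains both a newline and a tab. The sample rows and the helper names write_tsv and read_tsv are made up for illustration, and the data goes through an in-memory buffer rather than ClassTest.txt. It uses only the standard library.

```python
import csv
import io

# Hypothetical sample rows: the first description contains a newline and a tab.
rows = [
    ("Developer", "Line one\nLine two\twith a tab", 3),
    ("Tester", "Single line", 7),
]


def write_tsv(records):
    """Write a header plus records as tab-separated CSV and return the text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter="\t")
    writer.writerow(["Title", "Description", "Category"])
    writer.writerows(records)
    return buffer.getvalue()


def read_tsv(text):
    """Parse tab-separated CSV text into a header and a list of records."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t")
    header = next(reader)
    return header, list(reader)


text = write_tsv(rows)
header, records = read_tsv(text)

# Splitting on line breaks sees more lines than there are records ...
print(len(text.splitlines()))
# ... while the csv reader reassembles the quoted multi-line field.
print(len(records))
print(repr(records[0][1]))
```

The csv reader returns every value as a string, so Category comes back as "3" rather than 3. pandas.read_csv() honours the same double-quote quoting by default, which is why the read_csv() call above can parse a file written this way.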