- html - 出于某种原因,IE8 对我的 Sass 文件中继承的 html5 CSS 不友好?
- JMeter 在响应断言中使用 span 标签的问题
- html - 在 :hover and :active? 上具有不同效果的 CSS 动画
- html - 相对于居中的 html 内容固定的 CSS 重复背景?
我在 Pandas 中创建了一个大型数据库,大约有 600 万行文本数据。我想将其保存为 SQL 数据库文件,但当我尝试保存它时,出现内存不足 RAM 错误。我什至将卡盘尺寸减小到 100,但它仍然崩溃。
但是,如果我只有 100,000 行的较小版本的数据框,并将其保存到未指定 chucksize 的数据库中,则保存数据框没有问题。
这是我的代码
from sqlalchemy import create_engine
engine = sqlalchemy.create_engine("sqlite:///databasefile.db")
dataframe.to_sql("CS_table", engine, chunksize = 100)
我的理解是,由于它一次只处理 100 行,因此 RAM 使用量应该反射(reflect)出保存 100 行的情况。幕后还有其他事情发生吗?也许是多线程?
在我运行此代码之前,我使用的是 4.8 GB RAM,而 Google Colab 中可用的 12.8 GB RAM 除外。运行上面的代码会耗尽所有 RAM,直到环境崩溃。
我希望能够在我的环境不崩溃的情况下将我的 pandas 数据框保存到 SQL 文件中。我所在的环境是Google Colab。 pandas 数据名有 2 列,约 600 万行。每个单元格包含大约这么多文本:
"The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data."
编辑:
我在不同阶段进行了键盘中断。这是 RAM 中第一次跳转后键盘中断的结果
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-22-51b6e444f80d> in <module>()
----> 1 dfAllT.to_sql("CS_table23", engine, chunksize = 100)
12 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in to_sql(self, name, con, schema, if_exists, index, index_label, chunksize, dtype, method)
2529 sql.to_sql(self, name, con, schema=schema, if_exists=if_exists,
2530 index=index, index_label=index_label, chunksize=chunksize,
-> 2531 dtype=dtype, method=method)
2532
2533 def to_pickle(self, path, compression='infer',
/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py in to_sql(frame, name, con, schema, if_exists, index, index_label, chunksize, dtype, method)
458 pandas_sql.to_sql(frame, name, if_exists=if_exists, index=index,
459 index_label=index_label, schema=schema,
--> 460 chunksize=chunksize, dtype=dtype, method=method)
461
462
/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py in to_sql(self, frame, name, if_exists, index, index_label, schema, chunksize, dtype, method)
1172 schema=schema, dtype=dtype)
1173 table.create()
-> 1174 table.insert(chunksize, method=method)
1175 if (not name.isdigit() and not name.islower()):
1176 # check for potentially case sensitivity issues (GH7815)
/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py in insert(self, chunksize, method)
684
685 chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
--> 686 exec_insert(conn, keys, chunk_iter)
687
688 def _query_iterator(self, result, chunksize, columns, coerce_float=True,
/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py in _execute_insert(self, conn, keys, data_iter)
597 """
598 data = [dict(zip(keys, row)) for row in data_iter]
--> 599 conn.execute(self.table.insert(), data)
600
601 def _execute_insert_multi(self, conn, keys, data_iter):
/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in execute(self, object_, *multiparams, **params)
986 raise exc.ObjectNotExecutableError(object_)
987 else:
--> 988 return meth(self, multiparams, params)
989
990 def _execute_function(self, func, multiparams, params):
/usr/local/lib/python3.6/dist-packages/sqlalchemy/sql/elements.py in _execute_on_connection(self, connection, multiparams, params)
285 def _execute_on_connection(self, connection, multiparams, params):
286 if self.supports_execution:
--> 287 return connection._execute_clauseelement(self, multiparams, params)
288 else:
289 raise exc.ObjectNotExecutableError(self)
/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in _execute_clauseelement(self, elem, multiparams, params)
1105 distilled_params,
1106 compiled_sql,
-> 1107 distilled_params,
1108 )
1109 if self._has_events or self.engine._has_events:
/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
1246 except BaseException as e:
1247 self._handle_dbapi_exception(
-> 1248 e, statement, parameters, cursor, context
1249 )
1250
/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in _handle_dbapi_exception(self, e, statement, parameters, cursor, context)
1466 util.raise_from_cause(sqlalchemy_exception, exc_info)
1467 else:
-> 1468 util.reraise(*exc_info)
1469
1470 finally:
/usr/local/lib/python3.6/dist-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
127 if value.__traceback__ is not tb:
128 raise value.with_traceback(tb)
--> 129 raise value
130
131 def u(s):
/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
1222 if not evt_handled:
1223 self.dialect.do_executemany(
-> 1224 cursor, statement, parameters, context
1225 )
1226 elif not parameters and context.no_parameters:
/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/default.py in do_executemany(self, cursor, statement, parameters, context)
545
546 def do_executemany(self, cursor, statement, parameters, context=None):
--> 547 cursor.executemany(statement, parameters)
548
549 def do_execute(self, cursor, statement, parameters, context=None):
KeyboardInterrupt:
如果我在它崩溃之前进行键盘中断,这是结果
ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-24-68b60fe221fe>", line 1, in <module>
dfAllT.to_sql("CS_table22", engine, chunksize = 100)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 2531, in to_sql
dtype=dtype, method=method)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 460, in to_sql
chunksize=chunksize, dtype=dtype, method=method)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 1174, in to_sql
table.insert(chunksize, method=method)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 686, in insert
exec_insert(conn, keys, chunk_iter)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 599, in _execute_insert
conn.execute(self.table.insert(), data)
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 988, in execute
return meth(self, multiparams, params)
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
distilled_params,
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
e, statement, parameters, cursor, context
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1468, in _handle_dbapi_exception
util.reraise(*exc_info)
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/util/compat.py", line 129, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1224, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/default.py", line 547, in do_executemany
cursor.executemany(statement, parameters)
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 1823, in showtraceback
stb = value._render_traceback_()
AttributeError: 'KeyboardInterrupt' object has no attribute '_render_traceback_'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 1132, in get_records
return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 313, in wrapped
return f(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 358, in _fixed_getinnerframes
records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
File "/usr/lib/python3.6/inspect.py", line 1488, in getinnerframes
frameinfo = (tb.tb_frame,) + getframeinfo(tb, context)
File "/usr/lib/python3.6/inspect.py", line 1446, in getframeinfo
filename = getsourcefile(frame) or getfile(frame)
File "/usr/lib/python3.6/inspect.py", line 696, in getsourcefile
if getattr(getmodule(object, filename), '__loader__', None) is not None:
File "/usr/lib/python3.6/inspect.py", line 739, in getmodule
f = getabsfile(module)
File "/usr/lib/python3.6/inspect.py", line 708, in getabsfile
_filename = getsourcefile(object) or getfile(object)
File "/usr/lib/python3.6/inspect.py", line 693, in getsourcefile
if os.path.exists(filename):
File "/usr/lib/python3.6/genericpath.py", line 19, in exists
os.stat(path)
KeyboardInterrupt
我在它崩溃之前又运行了一次,这似乎给出了另一个不同的结果
ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-28-f18004debe33>", line 1, in <module>
dfAllT.to_sql("CS_table25", engine, chunksize = 100)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 2531, in to_sql
dtype=dtype, method=method)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 460, in to_sql
chunksize=chunksize, dtype=dtype, method=method)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 1174, in to_sql
table.insert(chunksize, method=method)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 686, in insert
exec_insert(conn, keys, chunk_iter)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 598, in _execute_insert
data = [dict(zip(keys, row)) for row in data_iter]
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 598, in <listcomp>
data = [dict(zip(keys, row)) for row in data_iter]
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 1823, in showtraceback
stb = value._render_traceback_()
AttributeError: 'KeyboardInterrupt' object has no attribute '_render_traceback_'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 1132, in get_records
return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 313, in wrapped
return f(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 358, in _fixed_getinnerframes
records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
File "/usr/lib/python3.6/inspect.py", line 1488, in getinnerframes
frameinfo = (tb.tb_frame,) + getframeinfo(tb, context)
File "/usr/lib/python3.6/inspect.py", line 1446, in getframeinfo
filename = getsourcefile(frame) or getfile(frame)
File "/usr/lib/python3.6/inspect.py", line 696, in getsourcefile
if getattr(getmodule(object, filename), '__loader__', None) is not None:
File "/usr/lib/python3.6/inspect.py", line 742, in getmodule
os.path.realpath(f)] = module.__name__
File "/usr/lib/python3.6/posixpath.py", line 388, in realpath
path, ok = _joinrealpath(filename[:0], filename, {})
File "/usr/lib/python3.6/posixpath.py", line 421, in _joinrealpath
newpath = join(path, name)
KeyboardInterrupt
---------------------------------------------------------------------------
我尝试过的其他事情:
使用 dropna 删除所有 none/nan 值
dfAllT = dfAllT.applymap(str) 确保我所有的值都是字符串
dfAllT.reset_index(drop=True, inplace=True) 以确保索引未对齐。
编辑:
就像评论中提到的那样,我现在尝试在循环中使用 to_sql。
for i in range(586147):
print(i)
dfAllT.iloc[i*10000:(i+1)*10000].to_sql('CS_table', engine, if_exists= 'append')
此操作最终会占用我的 RAM,并最终导致大约中途崩溃。我想知道这是否表明 sqlite 将所有内容保存在内存中,以及是否有解决方法。
编辑:
我尝试了更多的东西,更短的夹头,在每一步之后处理引擎并创建一个新的。最终仍然吃掉了所有 RAM 并崩溃了。
for i in range(586147):
print(i)
engine = sqlalchemy.create_engine("sqlite:///CSTitlesSummariesData.db")
dfAllT.iloc[i*10:(i+1)*10].to_sql('CS_table', engine, index = False, if_exists= 'append')
engine.dispose()
gc.collect
我的想法:
所以看起来整个数据库都以某种方式保存在事件内存中。
制作它的 pandas 数据框是 5 GB(或者至少这是我尝试将其转换为 sqlite 之前的 RAM 量)。我的系统在大约 12.72 演出时崩溃。我想 sqlite 数据库占用的 RAM 比 pandas 数据框少。
最佳答案
我已经使用 df.to_sql 一年了,现在我正在努力解决我运行大量资源但它不起作用的事实。我意识到 chucksize 会使你的内存过载,pandas 加载到内存中,然后由 chuncks 发送。只好直接用sql来控制了。 (这里是我找到解决方案的地方 -> https://github.com/pandas-dev/pandas/issues/12265 我真的鼓励你读到最后。)
如果您需要从数据库中读取数据而不会使内存过载,请检查这段代码:
def get_data_by_chunks(cls, table, chunksize: int) -> iter:
with MysqlClient.get_engine().begin() as conn:
query_count = "select COUNT(*) from my_query"
row_count = conn.execute(query_count, where).fetchone()[0]
for i in range(math.ceil(row_count / chunksize)):
query = """
SELECT * FROM my_table
WHERE my_filiters
LIMIT {offset}, {row_count};
"""
yield pd.read_sql(query, conn)
for df in get_data_by_chunks(cls, table, chunksize: int):
print(df.shape)
关于python - 当 chunksize = 100 时,大(600 万行)pandas df 导致内存错误 `to_sql `,但可以轻松保存 100,000 个没有 chunksize 的文件,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/56369565/
我已经阅读了有关不向数据库添加重复记录的 Pandas to_sql 解决方案。我正在处理日志的 csv 文件,每次我上传一个新的日志文件时,我都会读取数据并使用 Pandas 创建一个新的数据框进行
我正在将PANDAS与SQLAlchemy一起使用DataFrame.to_sql写入MYSQL DB。我喜欢打开'append' --> df.to_sql(con=con, name='table
我有一个看起来像这样的数据框 id_1 id_2 id_3 ... date_1 1 3 4 date_2 4
我想将数据帧写入现有的 sqlite(或 mysql)表,有时数据帧将包含数据库中尚不存在的新列。我需要做什么才能避免抛出错误?有没有办法告诉 pandas 或 sqlalchemy 使用潜在的新列自
我正在尝试使用Django的db连接变量将pandas数据帧插入Postgres数据库。我使用的代码是 df.to_sql('forecast',connection,if_exists='appen
我目前正在尝试稍微调整一些脚本的性能,似乎瓶颈始终是使用 pandas to_sql 函数实际插入数据库(=MSSQL)。 造成这种情况的一个因素是 mssql 的参数限制为 2100。 我与 sql
有人经历过这种情况吗? 我有一个包含“int”和“varchar”列的表 - 一个报告时间表。 我正在尝试使用 python 程序将扩展名为“.xls”的 Excel 文件导入到该表中。我正在使用 p
我正在尝试将 pandas 数据框保存为 SQL 文件 我按照文档进行了尝试 from sqlalchemy import create_engine engine = create_engine('
Panda 的 to_sql() 和 if_exists='replace' 为我的表设置了错误的字符集。为了将多个 csv 复制到 mysql 并忽略行上的错误(如重复错误),我首先将 csv 读取
我有一个如下所示的数据框: df = pd.DataFrame(index= pd.date_range('2014-01-01', periods=10)) df['date'] = df.inde
我正在尝试使用 Pandas to_sql 将 .csv 文件中的数据插入到 mssql 数据库中。不管我怎么看,我都会遇到这个错误: pyodbc.DataError: ('String data,
我想不断将数据帧行添加到 MySQL 数据库中,避免任何重复的条目进入 MySQL。 我目前通过使用 df.apply() 遍历每一行并调用 MySQL insert ignore(duplicate
如何使用 df.to_sql(if_exists = 'append') 仅附加数据框和数据库之间的唯一值。换句话说,我想评估 DF 和 DB 之间的重复项,并在写入数据库之前删除这些重复项。 这个有
我有一个订购的 Pandas Dataframe。 a0 b0 c0 d0 370025442 370020440 370020436
我正在尝试使用 Python 的 pandas to_sql 命令将月度数据发送到 MySQL 数据库。我的程序一次运行一个月的数据,我想将新数据附加到现有数据库中。然而,Python 给我一个错误:
我正在使用 df.to_sql(con=con_mysql, name='testdata', if_exists='replace', flavor='mysql') 将数据框导出到 mysql。但
我有一个 1,000,000 x 50 Pandas DataFrame,我目前正在使用以下方法写入 SQL 表: df.to_sql('my_table', con, index=False) 这需
我有一个字符串格式的数字列,我想将它作为 float 发送到 PostresSQL。如何确保 SQLAlchemy 将此列设置为 float ? (请注意,列中可能是 NaN)。这是代码 import
我想在我创建的表格中插入一些数据。我有一个如下所示的数据框: 我创建了一个表: create table online.ds_attribution_probabilities ( attributi
我正在使用 Pandas 0.18.1,在摆弄这段代码时, import pd def getIndividualDf(item): var1 = [] # ... populate
我是一名优秀的程序员,十分优秀!