
python - Large (6 million row) pandas df causes a memory error in `to_sql` with chunksize = 100, but can easily save a 100,000-row frame with no chunksize


I have created a large dataframe in Pandas, about 6 million rows of text data. I want to save it as a SQL database file, but when I try to save it, I get an out-of-memory RAM error. I even reduced the chunksize to 100 and it still crashes.

However, if I take a smaller version of the dataframe with only 100,000 rows and save it to a database with no chunksize specified, I have no problem saving the dataframe.

This is my code:

from sqlalchemy import create_engine
engine = create_engine("sqlite:///databasefile.db")
dataframe.to_sql("CS_table", engine, chunksize=100)

My understanding is that since only 100 rows are processed at a time, RAM usage should reflect the cost of saving 100 rows. Is something else happening behind the scenes? Perhaps multithreading?
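For reference, what `to_sql` does per chunk (per the tracebacks further down) is build a list of row records and hand them to `cursor.executemany`. A minimal sketch of that chunked-insert pattern against an in-memory sqlite database, using only the stdlib `sqlite3` driver; the table name and the stand-in data here are made up for illustration:

```python
import sqlite3

# Hypothetical stand-in data; the real frame has ~6M rows of long text.
rows = [("title %d" % i, "abstract %d" % i) for i in range(1000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cs (title TEXT, summary TEXT)")

chunksize = 100
for start in range(0, len(rows), chunksize):
    chunk = rows[start:start + chunksize]
    # pandas similarly calls cursor.executemany() once per chunk
    conn.executemany("INSERT INTO cs VALUES (?, ?)", chunk)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM cs").fetchone()[0])  # 1000
```

Per-chunk memory here really is bounded by `chunksize`, which is what makes the observed RAM growth in the question surprising.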

Before I run this code, I am using 4.8 GB of the 12.8 GB of RAM available in Google Colab. Running the code above eats up all the RAM until the environment crashes.

I would like to be able to save my pandas dataframe to a SQL file without my environment crashing. The environment I am in is Google Colab. The pandas dataframe has 2 columns and about 6 million rows. Each cell contains roughly this much text:

"The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data."

Edit:

I performed keyboard interrupts at different stages. Here is the result of a keyboard interrupt right after the first jump in RAM:

---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-22-51b6e444f80d> in <module>()
----> 1 dfAllT.to_sql("CS_table23", engine, chunksize = 100)

12 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in to_sql(self, name, con, schema, if_exists, index, index_label, chunksize, dtype, method)
2529 sql.to_sql(self, name, con, schema=schema, if_exists=if_exists,
2530 index=index, index_label=index_label, chunksize=chunksize,
-> 2531 dtype=dtype, method=method)
2532
2533 def to_pickle(self, path, compression='infer',

/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py in to_sql(frame, name, con, schema, if_exists, index, index_label, chunksize, dtype, method)
458 pandas_sql.to_sql(frame, name, if_exists=if_exists, index=index,
459 index_label=index_label, schema=schema,
--> 460 chunksize=chunksize, dtype=dtype, method=method)
461
462

/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py in to_sql(self, frame, name, if_exists, index, index_label, schema, chunksize, dtype, method)
1172 schema=schema, dtype=dtype)
1173 table.create()
-> 1174 table.insert(chunksize, method=method)
1175 if (not name.isdigit() and not name.islower()):
1176 # check for potentially case sensitivity issues (GH7815)

/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py in insert(self, chunksize, method)
684
685 chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
--> 686 exec_insert(conn, keys, chunk_iter)
687
688 def _query_iterator(self, result, chunksize, columns, coerce_float=True,

/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py in _execute_insert(self, conn, keys, data_iter)
597 """
598 data = [dict(zip(keys, row)) for row in data_iter]
--> 599 conn.execute(self.table.insert(), data)
600
601 def _execute_insert_multi(self, conn, keys, data_iter):

/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in execute(self, object_, *multiparams, **params)
986 raise exc.ObjectNotExecutableError(object_)
987 else:
--> 988 return meth(self, multiparams, params)
989
990 def _execute_function(self, func, multiparams, params):

/usr/local/lib/python3.6/dist-packages/sqlalchemy/sql/elements.py in _execute_on_connection(self, connection, multiparams, params)
285 def _execute_on_connection(self, connection, multiparams, params):
286 if self.supports_execution:
--> 287 return connection._execute_clauseelement(self, multiparams, params)
288 else:
289 raise exc.ObjectNotExecutableError(self)

/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in _execute_clauseelement(self, elem, multiparams, params)
1105 distilled_params,
1106 compiled_sql,
-> 1107 distilled_params,
1108 )
1109 if self._has_events or self.engine._has_events:

/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
1246 except BaseException as e:
1247 self._handle_dbapi_exception(
-> 1248 e, statement, parameters, cursor, context
1249 )
1250

/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in _handle_dbapi_exception(self, e, statement, parameters, cursor, context)
1466 util.raise_from_cause(sqlalchemy_exception, exc_info)
1467 else:
-> 1468 util.reraise(*exc_info)
1469
1470 finally:

/usr/local/lib/python3.6/dist-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
127 if value.__traceback__ is not tb:
128 raise value.with_traceback(tb)
--> 129 raise value
130
131 def u(s):

/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
1222 if not evt_handled:
1223 self.dialect.do_executemany(
-> 1224 cursor, statement, parameters, context
1225 )
1226 elif not parameters and context.no_parameters:

/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/default.py in do_executemany(self, cursor, statement, parameters, context)
545
546 def do_executemany(self, cursor, statement, parameters, context=None):
--> 547 cursor.executemany(statement, parameters)
548
549 def do_execute(self, cursor, statement, parameters, context=None):

KeyboardInterrupt:

If I do a keyboard interrupt just before it crashes, this is the result:

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-24-68b60fe221fe>", line 1, in <module>
dfAllT.to_sql("CS_table22", engine, chunksize = 100)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 2531, in to_sql
dtype=dtype, method=method)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 460, in to_sql
chunksize=chunksize, dtype=dtype, method=method)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 1174, in to_sql
table.insert(chunksize, method=method)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 686, in insert
exec_insert(conn, keys, chunk_iter)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 599, in _execute_insert
conn.execute(self.table.insert(), data)
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 988, in execute
return meth(self, multiparams, params)
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
distilled_params,
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
e, statement, parameters, cursor, context
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1468, in _handle_dbapi_exception
util.reraise(*exc_info)
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/util/compat.py", line 129, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1224, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/default.py", line 547, in do_executemany
cursor.executemany(statement, parameters)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 1823, in showtraceback
stb = value._render_traceback_()
AttributeError: 'KeyboardInterrupt' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 1132, in get_records
return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 313, in wrapped
return f(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 358, in _fixed_getinnerframes
records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
File "/usr/lib/python3.6/inspect.py", line 1488, in getinnerframes
frameinfo = (tb.tb_frame,) + getframeinfo(tb, context)
File "/usr/lib/python3.6/inspect.py", line 1446, in getframeinfo
filename = getsourcefile(frame) or getfile(frame)
File "/usr/lib/python3.6/inspect.py", line 696, in getsourcefile
if getattr(getmodule(object, filename), '__loader__', None) is not None:
File "/usr/lib/python3.6/inspect.py", line 739, in getmodule
f = getabsfile(module)
File "/usr/lib/python3.6/inspect.py", line 708, in getabsfile
_filename = getsourcefile(object) or getfile(object)
File "/usr/lib/python3.6/inspect.py", line 693, in getsourcefile
if os.path.exists(filename):
File "/usr/lib/python3.6/genericpath.py", line 19, in exists
os.stat(path)
KeyboardInterrupt

I ran it one more time before it crashed, and that seems to give yet another different result:

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-28-f18004debe33>", line 1, in <module>
dfAllT.to_sql("CS_table25", engine, chunksize = 100)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 2531, in to_sql
dtype=dtype, method=method)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 460, in to_sql
chunksize=chunksize, dtype=dtype, method=method)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 1174, in to_sql
table.insert(chunksize, method=method)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 686, in insert
exec_insert(conn, keys, chunk_iter)
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 598, in _execute_insert
data = [dict(zip(keys, row)) for row in data_iter]
File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 598, in <listcomp>
data = [dict(zip(keys, row)) for row in data_iter]
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 1823, in showtraceback
stb = value._render_traceback_()
AttributeError: 'KeyboardInterrupt' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 1132, in get_records
return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 313, in wrapped
return f(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 358, in _fixed_getinnerframes
records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
File "/usr/lib/python3.6/inspect.py", line 1488, in getinnerframes
frameinfo = (tb.tb_frame,) + getframeinfo(tb, context)
File "/usr/lib/python3.6/inspect.py", line 1446, in getframeinfo
filename = getsourcefile(frame) or getfile(frame)
File "/usr/lib/python3.6/inspect.py", line 696, in getsourcefile
if getattr(getmodule(object, filename), '__loader__', None) is not None:
File "/usr/lib/python3.6/inspect.py", line 742, in getmodule
os.path.realpath(f)] = module.__name__
File "/usr/lib/python3.6/posixpath.py", line 388, in realpath
path, ok = _joinrealpath(filename[:0], filename, {})
File "/usr/lib/python3.6/posixpath.py", line 421, in _joinrealpath
newpath = join(path, name)
KeyboardInterrupt
---------------------------------------------------------------------------

Other things I have tried:

Using dropna to remove all none/nan values

dfAllT = dfAllT.applymap(str) to make sure all my values are strings

dfAllT.reset_index(drop=True, inplace=True) to make sure the index is not misaligned.

Edit:

As mentioned in the comments, I have now tried calling to_sql in a loop:

for i in range(586147):
    print(i)
    dfAllT.iloc[i*10000:(i+1)*10000].to_sql('CS_table', engine, if_exists='append')

This still ends up eating my RAM and eventually crashes about halfway through. I wonder whether this suggests that sqlite is holding everything in memory, and whether there is a workaround.
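One possible workaround, if the SQLAlchemy layer is suspected: bypass pandas' `to_sql` entirely and insert through the stdlib `sqlite3` driver, committing after each chunk so sqlite never accumulates one giant pending transaction. A sketch under those assumptions; the tiny `dfAllT` here is a made-up stand-in, and `connect()` would point at the real .db file:

```python
import sqlite3
import pandas as pd

# Made-up stand-in for dfAllT; the real frame has ~6M rows of long text.
dfAllT = pd.DataFrame({"title": [f"t{i}" for i in range(500)],
                       "summary": [f"s{i}" for i in range(500)]})

conn = sqlite3.connect(":memory:")  # use "databasefile.db" for a file on disk
conn.execute("CREATE TABLE IF NOT EXISTS CS_table (title TEXT, summary TEXT)")

chunksize = 100
for start in range(0, len(dfAllT), chunksize):
    chunk = dfAllT.iloc[start:start + chunksize]
    # itertuples(index=False, name=None) yields plain tuples for executemany
    conn.executemany("INSERT INTO CS_table VALUES (?, ?)",
                     chunk.itertuples(index=False, name=None))
    conn.commit()  # per-chunk commit keeps the pending transaction small

print(conn.execute("SELECT COUNT(*) FROM CS_table").fetchone()[0])  # 500
```

This isolates whether the growth comes from sqlite itself or from the pandas/SQLAlchemy path above it.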

Edit:

I tried a few more things: smaller chunks, and disposing of the engine and creating a new one after every step. It still eventually ate all the RAM and crashed.


for i in range(586147):
    print(i)
    engine = sqlalchemy.create_engine("sqlite:///CSTitlesSummariesData.db")
    dfAllT.iloc[i*10:(i+1)*10].to_sql('CS_table', engine, index=False, if_exists='append')
    engine.dispose()
    gc.collect()

My thoughts:

So it looks like the whole database is somehow being kept in live memory.

The pandas dataframe it came from is 5 GB (or at least that is how much RAM was in use before I tried converting it to sqlite). My system crashes at around 12.72 GB. I would guess the sqlite database takes up less RAM than the pandas dataframe.

Best Answer

I had been using df.to_sql for a year, and then struggled with the fact that I was running jobs with a lot of data and it was not working. I realized that chunksize overloads your memory: pandas loads everything into memory and then sends it out in chunks. I had to control it directly with SQL. (This is where I found the solution -> https://github.com/pandas-dev/pandas/issues/12265 I really encourage you to read it to the end.)

If you need to read data from the database without overloading memory, check this code:

import math
import pandas as pd

def get_data_by_chunks(table: str, chunksize: int):
    # MysqlClient is the answerer's own connection helper.
    with MysqlClient.get_engine().begin() as conn:
        row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        for i in range(math.ceil(row_count / chunksize)):
            query = f"""
                SELECT * FROM {table}
                LIMIT {chunksize} OFFSET {i * chunksize};
            """
            yield pd.read_sql(query, conn)

for df in get_data_by_chunks("my_table", 1000):
    print(df.shape)
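For reading, pandas can also do this natively: passing `chunksize` to `read_sql` returns an iterator of DataFrames, which avoids the manual COUNT/LIMIT loop entirely. A quick sketch against a throwaway sqlite table (the table and data are invented for the demo):

```python
import sqlite3
import pandas as pd

# Throwaway table for the demo
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(250)])
conn.commit()

total = 0
# chunksize=100 makes read_sql yield DataFrames of at most 100 rows
for df in pd.read_sql("SELECT * FROM t", conn, chunksize=100):
    total += len(df)
print(total)  # 250
```

Whether the rows are actually streamed from the server or just sliced client-side depends on the driver, but the frames handed to your code stay small either way.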

Regarding python - Large (6 million row) pandas df causes a memory error in `to_sql` with chunksize = 100, but can easily save a 100,000-row frame with no chunksize, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56369565/
