Possible memory leak bug with to_sql with chunksize? #26569
Comments
Do you have any way of profiling? |
I am not very well versed in profiling methods but I am more than happy to figure it out. Do you have any recommended methods? If it makes a difference, I am using Google Colab. Also, I would be happy to provide the notebook I am using. It is self-contained, and downloads the data directly from my Google Drive, so all you have to do is press run and then go to the final cell to check the SQL code. |
Here is the notebook I am using: https://drive.google.com/open?id=1j7JnDlmIP5QE9lwLuQo-z3jABElaesZN The code needs my data from my Google Drive folder. I could not find a way to let a new user directly download all the contents of my folder, so I am just going to share my folder with this link: https://drive.google.com/open?id=1m6JfoIEIcX74CFSIQArZmSd0A8d0IRG8 Import this into your drive (it should not take up any room in your drive, since I am hosting it), then run all the cells in the notebook. Warning: this will ask for your Google Drive credentials in the first block of code. It's possible for me to make a minimal example, but it's very tricky because I need to make the dataframe big enough for it to fail. Let me know if there's anything else I can do to make looking into this issue easier. |
Here's an example of profiling if you can isolate the issue in your code: |
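For instance, a minimal sketch along these lines (table, column, and file names are placeholders, not from the thread), wrapping the to_sql call with the standard-library tracemalloc module to see what the call itself allocates:

```python
# Generic sketch with placeholder names: measure allocations made by a
# single to_sql call using the standard-library tracemalloc module.
import tracemalloc

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///profile_test.db")  # placeholder path
df = pd.DataFrame({"text": ["some text " * 100] * 10_000})  # placeholder data

tracemalloc.start()
df.to_sql("my_table", engine, if_exists="append", index=False, chunksize=100)
current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
tracemalloc.stop()
```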
Not quite sure how to use that. I awkwardly tried
and
but neither produced any output; my environment crashed without outputting anything. |
https://pypi.org/project/memory-profiler/ for memory profiling.
http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports for making minimal examples.
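A sketch of how memory-profiler might be applied here (the function, table, and file names are placeholders, not from the thread):

```python
# Hypothetical memory-profiler usage (pip install memory_profiler): decorate
# the function doing the write and run the script normally; a line-by-line
# memory report is printed when the function returns.
import pandas as pd
from memory_profiler import profile
from sqlalchemy import create_engine


@profile
def write_to_sqlite(df):
    engine = create_engine("sqlite:///profile_test.db")  # placeholder path
    df.to_sql("my_table", engine, if_exists="append", index=False, chunksize=100)


if __name__ == "__main__":
    write_to_sqlite(pd.DataFrame({"text": ["some text " * 100] * 100_000}))
```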
|
I believe the issue is not with pandas but with sqlalchemy. I used something like the SQLAlchemy-only insert loop sketched below to check. Here's a Stack Overflow link for anyone interested in continuing to monitor this issue, and a minimal example for anyone wanting to replicate it.
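A rough reconstruction of that kind of check (placeholder table and file names, not the exact code from this comment): drive repeated batch inserts through SQLAlchemy Core directly, with pandas out of the picture, and watch whether memory still climbs.

```python
# Hypothetical reconstruction, not the original snippet: insert batches of
# rows through SQLAlchemy Core (no pandas) into SQLite, so any memory growth
# can be attributed to the SQLAlchemy/SQLite side.
from sqlalchemy import Column, MetaData, Table, Text, create_engine

engine = create_engine("sqlite:///leak_test.db")  # placeholder path
metadata = MetaData()
table = Table("my_table", metadata, Column("text", Text))
metadata.create_all(engine)

rows = [{"text": "some reasonably long sentence of text " * 50}] * 100
with engine.begin() as conn:
    for _ in range(1000):
        # executemany-style batch insert of 100 rows per iteration
        conn.execute(table.insert(), rows)
```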
|
I mentioned this on SQLAlchemy and they said it's unlikely the leak is happening there, so the issue might be in the SQLite driver or in pandas. It seems that, under certain conditions, there is a memory leak with repeated insertions to SQLite via SQLAlchemy. I had a hard time replicating, in a minimal example, the memory leak that occurred when converting my data, but this gets pretty close. With my data, it would result in a memory crash no matter what I did, but when I tried to create a self-contained minimal example, I had to play around with parameters to cause the leak. The following code, when run in Google Colab CPU mode, has a leak, although it's not severe enough to cause a crash.
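A hedged reconstruction of that kind of repro (names, sizes, and iteration counts are placeholders, not the exact code from the comment): append a text-heavy DataFrame to SQLite in a loop and print the process RSS after each pass.

```python
# Hypothetical reconstruction of a minimal repro: repeatedly append a
# text-heavy DataFrame to SQLite with to_sql(chunksize=...) and print the
# process resident set size after each pass.
import os

import pandas as pd
import psutil
from sqlalchemy import create_engine

engine = create_engine("sqlite:///leak_test.db")  # placeholder path
cell = "some reasonably long sentence of text " * 50  # roughly 2 KB per cell
df = pd.DataFrame({"text": [cell] * 50_000})

process = psutil.Process(os.getpid())
for i in range(20):
    df.to_sql("my_table", engine, if_exists="append", index=False, chunksize=100)
    rss_mb = process.memory_info().rss / 1e6
    print(f"pass {i}: rss = {rss_mb:.0f} MB")
```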
This was run in the Google Colab CPU environment. The database itself isn't causing the memory leak, because I can restart my environment, the previously inserted data is still there, and connecting to that database doesn't cause an increase in memory. The issue seems to be, under certain conditions, repeated insertions via looping to_sql, or a single to_sql call with chunksize specified. |
Code Sample, a copy-pastable example if possible
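Roughly the shape of the call being reported (table, path, and column names are placeholder guesses; the real dataframe had two text columns and ~6 million rows):

```python
# Sketch of the reported scenario with placeholder names and a tiny frame;
# with the real ~6-million-row dataframe, the call below exhausted RAM even
# with a small chunksize.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///my_database.db")  # placeholder path
df = pd.DataFrame({
    "col1": ["a sentence of text"] * 1_000,      # stand-in: real data ~6M rows
    "col2": ["another sentence of text"] * 1_000,
})

df.to_sql("my_table", engine, if_exists="append", index=False, chunksize=100)
```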
Problem description
I created a large dataframe in pandas, about 6 million rows of text data. I wanted to save this as a SQL database file, but when I try to save it, I get an out-of-memory error. I even reduced the chunk size to 100 and it still crashes.
However, if I take a smaller version of that dataframe, with 100,000 rows, and save it to a database with no chunksize specified, I have no issues saving that dataframe.
My understanding was that since it's only processing 100 rows at a time, the RAM usage should reflect that of a save of 100 rows. Is there something else happening behind the scenes? Perhaps multi-threading?
Before I run this code, I am using 4.8 GB of RAM out of the 12.8 GB available in Google Colab. Running the above code eats up all the RAM until the environment crashes.
I would like to be able to save my pandas dataframe to a SQL file without my environment crashing. The environment I am in is Google Colab. The pandas dataframe is 2 columns, ~6 million rows. Each cell contains about this much text:
I posted an SO question here:
https://stackoverflow.com/questions/56369565/large-6-million-rows-pandas-df-causes-memory-error-with-to-sql-when-chunksi
And one of the users said:
From stepping through the code, I think it's this line, which creates a bunch of DataFrames:
pandas/pandas/io/sql.py
Line 676 in a91da0c
chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
This looks like it's probably a bug. Specifically, it happens prior to database insertion, during preparation.
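For context, a toy illustration (made-up data, not the pandas internals themselves) of what that line does: data_list holds one sequence of values per column, the comprehension slices each column to the current chunk, and zip(*...) turns the column slices into per-row tuples for insertion.

```python
# Toy illustration of the quoted line: each element of data_list is one
# column's values (numpy arrays in pandas itself); slicing picks out the
# current chunk and zip(*) re-assembles the slices into row tuples.
data_list = [["a", "b", "c", "d"], [1, 2, 3, 4]]  # two columns, four rows
start_i, end_i = 0, 2                             # bounds of the current chunk

chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
print(list(chunk_iter))  # [('a', 1), ('b', 2)]
```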
Expected Output
The code should execute without a crash.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.79+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.2
pytest: 3.6.4
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.3.0
pyarrow: 0.13.0
xarray: 0.11.3
IPython: 5.5.0
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.5.3
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: 0.4.0
matplotlib: 3.0.3
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.3.3
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10.1
s3fs: 0.2.1
fastparquet: None
pandas_gbq: 0.4.1
pandas_datareader: 0.7.0
gcsfs: None