Possible memory leak bug with to_sql with chunksize? #26569
Comments
Do you have any way of profiling? |
I am not very well versed in profiling methods but I am more than happy to figure it out. Do you have any recommended methods? If it makes a difference, I am using Google Colab. Also, I would be happy to provide the notebook I am using. It is self-contained, and downloads the data directly from my Google Drive, so all you have to do is press run and then go to the final cell to check the SQL code. |
Here is the notebook I am using: https://drive.google.com/open?id=1j7JnDlmIP5QE9lwLuQo-z3jABElaesZN The code needs my data from my Google Drive folder. I could not find a way to let a new user directly download all the contents of my folder, so I am just going to share my folder with this link: https://drive.google.com/open?id=1m6JfoIEIcX74CFSIQArZmSd0A8d0IRG8 Import this into your drive (it should not take up any room in your drive, since I am hosting it), then run all the cells in the notebook. Warning: this will ask for your Google Drive credentials in the first block of code. It's possible for me to make a minimal example, but it's very tricky because I need to make the dataframe big enough for it to fail. Let me know if there's anything else I can do to make looking into this issue easier. |
Here's an example of profiling if you can isolate the issue in your code: |
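For instance, a minimal sketch along these lines (table, column, and file names are placeholders, not from the thread), wrapping the to_sql call with the standard-library tracemalloc module to see what the call itself allocates:

```python
# Generic sketch with placeholder names: measure allocations made by a
# single to_sql call using the standard-library tracemalloc module.
import tracemalloc

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///profile_test.db")  # placeholder path
df = pd.DataFrame({"text": ["some text " * 100] * 10_000})  # placeholder data

tracemalloc.start()
df.to_sql("my_table", engine, if_exists="append", index=False, chunksize=100)
current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
tracemalloc.stop()
```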
Not quite sure how to use that. I awkwardly tried
and
but neither produced any output; my environment crashed without outputting anything. |
https://pypi.org/project/memory-profiler/ for memory profiling.
http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports for making minimal examples.
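A sketch of how memory-profiler might be applied here (the function, table, and file names are placeholders, not from the thread):

```python
# Hypothetical memory-profiler usage (pip install memory_profiler): decorate
# the function doing the write and run the script normally; a line-by-line
# memory report is printed when the function returns.
import pandas as pd
from memory_profiler import profile
from sqlalchemy import create_engine


@profile
def write_to_sqlite(df):
    engine = create_engine("sqlite:///profile_test.db")  # placeholder path
    df.to_sql("my_table", engine, if_exists="append", index=False, chunksize=100)


if __name__ == "__main__":
    write_to_sqlite(pd.DataFrame({"text": ["some text " * 100] * 100_000}))
```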
|
I believe the issue is not with pandas but with sqlalchemy. I used something like the SQLAlchemy-only insert loop sketched below to check. Here's a Stack Overflow link for anyone interested in continuing to monitor this issue, and a minimal example for anyone wanting to replicate it.
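A rough reconstruction of that kind of check (placeholder table and file names, not the exact code from this comment): drive repeated batch inserts through SQLAlchemy Core directly, with pandas out of the picture, and watch whether memory still climbs.

```python
# Hypothetical reconstruction, not the original snippet: insert batches of
# rows through SQLAlchemy Core (no pandas) into SQLite, so any memory growth
# can be attributed to the SQLAlchemy/SQLite side.
from sqlalchemy import Column, MetaData, Table, Text, create_engine

engine = create_engine("sqlite:///leak_test.db")  # placeholder path
metadata = MetaData()
table = Table("my_table", metadata, Column("text", Text))
metadata.create_all(engine)

rows = [{"text": "some reasonably long sentence of text " * 50}] * 100
with engine.begin() as conn:
    for _ in range(1000):
        # executemany-style batch insert of 100 rows per iteration
        conn.execute(table.insert(), rows)
```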
|
I mentioned this on SQLAlchemy and they said it's unlikely the leak is happening there, so the issue might be in the SQLite driver or in pandas. It seems that, under certain conditions, there is a memory leak with repeated insertions to SQLite via SQLAlchemy. I had a hard time replicating, in a minimal example, the memory leak that occurred when converting my data, but this gets pretty close. With my data, it would result in a memory crash no matter what I did, but when I tried to create a self-contained minimal example, I had to play around with parameters to cause the leak. The following code, when run in Google Colab CPU mode, has a leak, although it's not severe enough to cause a crash.
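A hedged reconstruction of that kind of repro (names, sizes, and iteration counts are placeholders, not the exact code from the comment): append a text-heavy DataFrame to SQLite in a loop and print the process RSS after each pass.

```python
# Hypothetical reconstruction of a minimal repro: repeatedly append a
# text-heavy DataFrame to SQLite with to_sql(chunksize=...) and print the
# process resident set size after each pass.
import os

import pandas as pd
import psutil
from sqlalchemy import create_engine

engine = create_engine("sqlite:///leak_test.db")  # placeholder path
cell = "some reasonably long sentence of text " * 50  # roughly 2 KB per cell
df = pd.DataFrame({"text": [cell] * 50_000})

process = psutil.Process(os.getpid())
for i in range(20):
    df.to_sql("my_table", engine, if_exists="append", index=False, chunksize=100)
    rss_mb = process.memory_info().rss / 1e6
    print(f"pass {i}: rss = {rss_mb:.0f} MB")
```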
This was run in the Google Colab CPU environment. The database itself isn't causing the memory leak, because I can restart my environment, the previously inserted data is still there, and connecting to that database doesn't cause an increase in memory. The issue seems to be, under certain conditions, repeated insertions via looping to_sql, or a single to_sql call with chunksize specified. |
Code Sample, a copy-pastable example if possible
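Roughly the shape of the call being reported (table, path, and column names are placeholder guesses; the real dataframe had two text columns and ~6 million rows):

```python
# Sketch of the reported scenario with placeholder names and a tiny frame;
# with the real ~6-million-row dataframe, the call below exhausted RAM even
# with a small chunksize.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///my_database.db")  # placeholder path
df = pd.DataFrame({
    "col1": ["a sentence of text"] * 1_000,      # stand-in: real data ~6M rows
    "col2": ["another sentence of text"] * 1_000,
})

df.to_sql("my_table", engine, if_exists="append", index=False, chunksize=100)
```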
Problem description
I created a large dataframe in pandas, about 6 million rows of text data. I wanted to save this as a SQL database file, but when I try to save it, I get an out-of-memory error. I even reduced the chunk size to 100 and it still crashes.
However, if I take a smaller version of that dataframe, with 100,000 rows, and save it to a database with no chunksize specified, I have no issues saving that dataframe.
My understanding was that since it's only processing 100 rows at a time, the RAM usage should reflect that of a save of 100 rows. Is there something else happening behind the scenes? Perhaps multi-threading?
Before I run this code, I am using 4.8 GB of RAM out of the 12.8 GB available in Google Colab. Running the above code eats up all the RAM until the environment crashes.
I would like to be able to save my pandas dataframe to a SQL file without my environment crashing. The environment I am in is Google Colab. The pandas dataframe is 2 columns, ~6 million rows. Each cell contains about this much text:
I posted an SO question here:
https://stackoverflow.com/questions/56369565/large-6-million-rows-pandas-df-causes-memory-error-with-to-sql-when-chunksi
And one of the users said:
From stepping through the code, I think it's this line, which creates a bunch of DataFrames:
pandas/pandas/io/sql.py
Line 676 in a91da0c
chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
This looks like it's probably a bug. Specifically, it happens prior to database insertion, during preparation.
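For context, a toy illustration (made-up data, not the pandas internals themselves) of what that line does: data_list holds one sequence of values per column, the comprehension slices each column to the current chunk, and zip(*...) turns the column slices into per-row tuples for insertion.

```python
# Toy illustration of the quoted line: each element of data_list is one
# column's values (numpy arrays in pandas itself); slicing picks out the
# current chunk and zip(*) re-assembles the slices into row tuples.
data_list = [["a", "b", "c", "d"], [1, 2, 3, 4]]  # two columns, four rows
start_i, end_i = 0, 2                             # bounds of the current chunk

chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
print(list(chunk_iter))  # [('a', 1), ('b', 2)]
```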
Expected Output
The code should execute without a crash.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.79+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.2
pytest: 3.6.4
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.3.0
pyarrow: 0.13.0
xarray: 0.11.3
IPython: 5.5.0
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.5.3
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: 0.4.0
matplotlib: 3.0.3
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.3.3
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10.1
s3fs: 0.2.1
fastparquet: None
pandas_gbq: 0.4.1
pandas_datareader: 0.7.0
gcsfs: None