Skip to content

DOC: warning section on memory overflow when joining/merging dataframes on index with duplicate keys #14736

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
xgdgsc opened this issue Nov 25, 2016 · 8 comments · Fixed by #14788
Labels
Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question
Milestone

Comments

@xgdgsc
Copy link
Contributor

xgdgsc commented Nov 25, 2016

Code Sample, a copy-pastable example if possible

http://stackoverflow.com/questions/32750970/python-pandas-merge-causing-memory-overflow

# coding: utf-8
import pandas as pd
data = pd.read_csv('https://gist.githubusercontent.com/xgdgsc/8671a22136e1da937f1046a5f211c0ff/raw/d261706a6e7d1d7014e45e47122ead71e7159ef4/small.csv', index_col='<Date>')
print(data.shape)
another = data[[ ' <Open>']]
joined = data.join([another])
print(joined.shape)

Problem description

Currently having index with duplicate keys when joining dataframes would cause severe memory overflow, sometimes freezes the computer and user has to hard reboot, which can be disastrous for unsaved work.

Expected Output

Adding a simple checking before joining/merging , stop the operation and warn the user would be enough.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.5.2.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Nov 25, 2016

can you show a mini example?

@xgdgsc
Copy link
Contributor Author

xgdgsc commented Nov 25, 2016

https://gist.github.com/xgdgsc/8671a22136e1da937f1046a5f211c0ff

You can see the example above. When joining on dup indexes, the joined frame size grows by square (400) of the size of the original frames (20). It would overflow memory quite easily without caution.

@jreback
Copy link
Contributor

jreback commented Nov 25, 2016

@xgdgsc pls add a small copy-pastable example.

@xgdgsc
Copy link
Contributor Author

xgdgsc commented Nov 25, 2016

Thanks. Added in description.

@jreback
Copy link
Contributor

jreback commented Nov 25, 2016

@xgdgsc can you construct an example that is just code? (and does not pull a remote url)

small example, it doesn't have to blow up. I simply want to see the structure.

@xgdgsc
Copy link
Contributor Author

xgdgsc commented Nov 25, 2016

The code in description is just code which loads a tiny csv. It doesn' t blow up. It just shows the size of the joined frame to showcase the problem.

@jreback
Copy link
Contributor

jreback commented Nov 25, 2016

@xgdgsc

so I am not sure I understand the problem. you are merging non-unique with non-unique. That's what you asked pandas to do.

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question labels Nov 25, 2016
@jreback
Copy link
Contributor

jreback commented Nov 25, 2016

you can easily see the issue here as well.

In [24]: pd.merge(data, another, left_index=True, right_index=True, indicator=True)
Out[24]: 
            <Time>   <Open>_x   <High>   <Low>   <Close>   <Volume>   <FastAvg>   <MedAvg>   <SlowAvg>   <Open>_y _merge
<Date>                                                                                                                  
2013/1/4  09:01:00       6306     6306    6256      6259      13278           0          0           0       6306   both
2013/1/4  09:01:00       6306     6306    6256      6259      13278           0          0           0       6260   both
2013/1/4  09:01:00       6306     6306    6256      6259      13278           0          0           0       6261   both
2013/1/4  09:01:00       6306     6306    6256      6259      13278           0          0           0       6212   both
2013/1/4  09:01:00       6306     6306    6256      6259      13278           0          0           0       6206   both
2013/1/4  09:01:00       6306     6306    6256      6259      13278           0          0           0       6212   both
...            ...        ...      ...     ...       ...        ...         ...        ...         ...        ...    ...
2013/1/4  09:20:00       6224     6227    6221      6221       1142           0          0           0       6232   both
2013/1/4  09:20:00       6224     6227    6221      6221       1142           0          0           0       6231   both
2013/1/4  09:20:00       6224     6227    6221      6221       1142           0          0           0       6220   both
2013/1/4  09:20:00       6224     6227    6221      6221       1142           0          0           0       6222   both
2013/1/4  09:20:00       6224     6227    6221      6221       1142           0          0           0       6216   both
2013/1/4  09:20:00       6224     6227    6221      6221       1142           0          0           0       6224   both

[400 rows x 11 columns]

In [25]: (pd.merge(data, another, left_index=True, right_index=True, indicator=True)._merge=='both').all()
Out[25]: True

I suppose have a nice doc example would be the best for now. So if you want to add a warning section in the reshape docs would be great. http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging

I am not sure when / how to even show a warning / error message. This IS a valid use case.

@jreback jreback added this to the Next Major Release milestone Nov 25, 2016
@jreback jreback changed the title memory overflow when joining/merging dataframes on index with duplicate keys DOC: warning section on memory overflow when joining/merging dataframes on index with duplicate keys Nov 25, 2016
xgdgsc added a commit to xgdgsc/pandas that referenced this issue Dec 2, 2016
@jreback jreback modified the milestones: 0.19.2, Next Major Release Dec 4, 2016
jorisvandenbossche pushed a commit that referenced this issue Dec 11, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.20.0, 0.19.2 Dec 11, 2016
ischurov pushed a commit to ischurov/pandas that referenced this issue Dec 19, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question
Projects
None yet
3 participants