DOC: warning section on memory overflow when joining/merging dataframes on index with duplicate keys #14736

xgdgsc · 2016-11-25T05:51:00Z

Code Sample, a copy-pastable example if possible

http://stackoverflow.com/questions/32750970/python-pandas-merge-causing-memory-overflow

# coding: utf-8
import pandas as pd
data = pd.read_csv('https://gist.githubusercontent.com/xgdgsc/8671a22136e1da937f1046a5f211c0ff/raw/d261706a6e7d1d7014e45e47122ead71e7159ef4/small.csv', index_col='<Date>')
print(data.shape)
another = data[[ ' <Open>']]
joined = data.join([another])
print(joined.shape)

Problem description

Currently having index with duplicate keys when joining dataframes would cause severe memory overflow, sometimes freezes the computer and user has to hard reboot, which can be disastrous for unsaved work.

Expected Output

Adding a simple checking before joining/merging , stop the operation and warn the user would be enough.

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 3.5.2.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jreback · 2016-11-25T12:52:59Z

can you show a mini example?

xgdgsc · 2016-11-25T14:06:50Z

https://gist.github.com/xgdgsc/8671a22136e1da937f1046a5f211c0ff

You can see the example above. When joining on dup indexes, the joined frame size grows by square (400) of the size of the original frames (20). It would overflow memory quite easily without caution.

jreback · 2016-11-25T14:22:14Z

@xgdgsc pls add a small copy-pastable example.

xgdgsc · 2016-11-25T14:33:33Z

Thanks. Added in description.

jreback · 2016-11-25T14:34:23Z

@xgdgsc can you construct an example that is just code? (and does not pull a remote url)

small example, it doesn't have to blow up. I simply want to see the structure.

xgdgsc · 2016-11-25T14:36:28Z

The code in description is just code which loads a tiny csv. It doesn' t blow up. It just shows the size of the joined frame to showcase the problem.

jreback · 2016-11-25T14:47:08Z

@xgdgsc

so I am not sure I understand the problem. you are merging non-unique with non-unique. That's what you asked pandas to do.

jreback · 2016-11-25T14:57:49Z

you can easily see the issue here as well.

In [24]: pd.merge(data, another, left_index=True, right_index=True, indicator=True)
Out[24]: 
            <Time>   <Open>_x   <High>   <Low>   <Close>   <Volume>   <FastAvg>   <MedAvg>   <SlowAvg>   <Open>_y _merge
<Date>                                                                                                                  
2013/1/4  09:01:00       6306     6306    6256      6259      13278           0          0           0       6306   both
2013/1/4  09:01:00       6306     6306    6256      6259      13278           0          0           0       6260   both
2013/1/4  09:01:00       6306     6306    6256      6259      13278           0          0           0       6261   both
2013/1/4  09:01:00       6306     6306    6256      6259      13278           0          0           0       6212   both
2013/1/4  09:01:00       6306     6306    6256      6259      13278           0          0           0       6206   both
2013/1/4  09:01:00       6306     6306    6256      6259      13278           0          0           0       6212   both
...            ...        ...      ...     ...       ...        ...         ...        ...         ...        ...    ...
2013/1/4  09:20:00       6224     6227    6221      6221       1142           0          0           0       6232   both
2013/1/4  09:20:00       6224     6227    6221      6221       1142           0          0           0       6231   both
2013/1/4  09:20:00       6224     6227    6221      6221       1142           0          0           0       6220   both
2013/1/4  09:20:00       6224     6227    6221      6221       1142           0          0           0       6222   both
2013/1/4  09:20:00       6224     6227    6221      6221       1142           0          0           0       6216   both
2013/1/4  09:20:00       6224     6227    6221      6221       1142           0          0           0       6224   both

[400 rows x 11 columns]

In [25]: (pd.merge(data, another, left_index=True, right_index=True, indicator=True)._merge=='both').all()
Out[25]: True

I suppose have a nice doc example would be the best for now. So if you want to add a warning section in the reshape docs would be great. http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging

I am not sure when / how to even show a warning / error message. This IS a valid use case.

…es on index with duplicate keys (#14788) closes #14736

…es on index with duplicate keys (pandas-dev#14788) closes pandas-dev#14736

jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question labels Nov 25, 2016

jreback added Docs Difficulty Novice labels Nov 25, 2016

jreback added this to the Next Major Release milestone Nov 25, 2016

jreback changed the title ~~memory overflow when joining/merging dataframes on index with duplicate keys~~ DOC: warning section on memory overflow when joining/merging dataframes on index with duplicate keys Nov 25, 2016

xgdgsc added a commit to xgdgsc/pandas that referenced this issue Dec 2, 2016

add doc regarding pandas-dev#14736

93a63a5

xgdgsc mentioned this issue Dec 2, 2016

DOC: warning section on memory overflow when joining/merging dataframes on index with duplicate keys #14788

Merged

1 task

jreback modified the milestones: 0.19.2, Next Major Release Dec 4, 2016

jorisvandenbossche closed this as completed in #14788 Dec 11, 2016

jorisvandenbossche pushed a commit that referenced this issue Dec 11, 2016

DOC: warning section on memory overflow when joining/merging datafram…

602cc46

…es on index with duplicate keys (#14788) closes #14736

jorisvandenbossche modified the milestones: 0.20.0, 0.19.2 Dec 11, 2016

ischurov pushed a commit to ischurov/pandas that referenced this issue Dec 19, 2016

DOC: warning section on memory overflow when joining/merging datafram…

923cb8d

…es on index with duplicate keys (pandas-dev#14788) closes pandas-dev#14736

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DOC: warning section on memory overflow when joining/merging dataframes on index with duplicate keys #14736

DOC: warning section on memory overflow when joining/merging dataframes on index with duplicate keys #14736

xgdgsc commented Nov 25, 2016 •

edited

Loading

jreback commented Nov 25, 2016

xgdgsc commented Nov 25, 2016

jreback commented Nov 25, 2016

xgdgsc commented Nov 25, 2016

jreback commented Nov 25, 2016 •

edited

Loading

xgdgsc commented Nov 25, 2016

jreback commented Nov 25, 2016

jreback commented Nov 25, 2016

DOC: warning section on memory overflow when joining/merging dataframes on index with duplicate keys #14736

DOC: warning section on memory overflow when joining/merging dataframes on index with duplicate keys #14736

Comments

xgdgsc commented Nov 25, 2016 • edited Loading

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

jreback commented Nov 25, 2016

xgdgsc commented Nov 25, 2016

jreback commented Nov 25, 2016

xgdgsc commented Nov 25, 2016

jreback commented Nov 25, 2016 • edited Loading

xgdgsc commented Nov 25, 2016

jreback commented Nov 25, 2016

jreback commented Nov 25, 2016

xgdgsc commented Nov 25, 2016 •

edited

Loading

Output of `pd.show_versions()`

jreback commented Nov 25, 2016 •

edited

Loading