DOC: warning section on memory overflow when joining/merging dataframes on index with duplicate keys #14736
Comments
Can you show a mini example?
https://gist.github.com/xgdgsc/8671a22136e1da937f1046a5f211c0ff You can see the example above. When joining on duplicate indexes, the joined frame grows to the square of the original frame size: two frames of 20 rows each produce 400 rows. Without caution, this can exhaust memory quite easily.
@xgdgsc pls add a small copy-pastable example.
Thanks. Added in description.
@xgdgsc can you construct an example that is just code (and does not pull a remote url)? A small example; it doesn't have to blow up. I simply want to see the structure.
The code in the description is just code that loads a tiny csv. It doesn't blow up. It just shows the size of the joined frame to showcase the problem.
So I am not sure I understand the problem. You are merging non-unique with non-unique. That's what you asked pandas to do.
you can easily see the issue here as well.
I suppose having a nice doc example would be the best for now. So if you want to add a I am not sure when / how to even show a warning / error message. This IS a valid use case.
…es on index with duplicate keys (pandas-dev#14788) closes pandas-dev#14736
Code Sample, a copy-pastable example if possible
http://stackoverflow.com/questions/32750970/python-pandas-merge-causing-memory-overflow
Problem description
Currently, joining dataframes whose indexes contain duplicate keys can cause severe memory overflow. It sometimes freezes the computer so that the user has to hard reboot, which can be disastrous for unsaved work.
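A minimal sketch of the growth described above, using the 20-row and 400-row figures mentioned in the comments. The frame contents, column names, and the key `"k"` are made up for illustration and are not taken from the linked gist.

```python
import pandas as pd

# Hypothetical data: two 20-row frames whose index is one repeated key.
left = pd.DataFrame({"a": range(20)}, index=["k"] * 20)
right = pd.DataFrame({"b": range(20)}, index=["k"] * 20)

# Joining on the index pairs every left row with every right row
# that shares the same key, so 20 x 20 rows come back.
joined = left.join(right)

print(len(left), len(right), len(joined))
```

With larger inputs the same pattern scales quadratically per duplicated key, which is where the memory pressure comes from.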
Expected Output
Adding a simple check before joining/merging that stops the operation and warns the user would be enough.
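One way such a check could look in user code is sketched below. This is not pandas' own behavior: `estimated_join_rows`, `checked_join`, and the `max_rows` threshold are hypothetical names and values chosen for illustration, and the estimate covers only an inner join on the index.

```python
import pandas as pd


def estimated_join_rows(left, right):
    """Estimate the row count of an inner join on the index (hypothetical helper)."""
    left_counts = left.index.value_counts()
    right_counts = right.index.value_counts()
    # Each key contributes (left count) * (right count) rows.
    # Keys present on only one side are filled with 0 and drop out.
    return int(left_counts.mul(right_counts, fill_value=0).sum())


def checked_join(left, right, max_rows=1_000_000):
    """Join on the index, refusing if the result would exceed max_rows (assumed limit)."""
    expected = estimated_join_rows(left, right)
    if expected > max_rows:
        raise ValueError(
            f"join would produce about {expected} rows (limit {max_rows}); "
            "check the index for duplicate keys"
        )
    return left.join(right, how="inner")


# Hypothetical data: duplicate keys on both sides.
left = pd.DataFrame({"a": range(20)}, index=["k"] * 20)
right = pd.DataFrame({"b": range(20)}, index=["k"] * 20)

print(estimated_join_rows(left, right))
```

The estimate only touches the key counts, so it stays cheap even when the join itself would not fit in memory. Newer pandas releases also accept a `validate` argument on `merge` (for example `validate="one_to_one"`), which raises when keys are not unique; that option is not available in the 0.18.1 version reported here.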
Output of pd.show_versions()
pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None