DOC: warning section on memory overflow when joining/merging dataframes on index with duplicate keys #14788

xgdgsc · 2016-12-02T09:41:58Z

closes DOC: warning section on memory overflow when joining/merging dataframes on index with duplicate keys #14736

jorisvandenbossche · 2016-12-02T22:38:30Z

doc/source/merging.rst

+
+.. warning::
+
+ * Joining on index with duplicate keys when joining large dataframes would cause severe memory overflow, sometimes freezes the            computer and user has to hard reboot, which can be dangerous for unsaved work. Please make sure no duplicate keys in index              before joining.


It's not needed to start with * (as this will give a one-bullet list). You only need an indentation so the text belongs to the warning.
Further, you have some wonky spaces in the text.

On the content: I don't think this is specific to joining on the index? But just on joining on a key with duplicate values?

jreback · 2016-12-04T17:25:50Z

doc/source/merging.rst

+
+.. warning::
+
+   Joining on keys with duplicate values when joining large dataframes would cause severe memory overflow, sometimes freezes the


This is not very useful. I think better is a simple statement that joining / merging on duplicate keys can cause a returned frame that is the multiplication of the row dimensions. Please show a small example as well.

I think it is strange to include example in a warning section. Is there any example in existing warning sections? Why do you think an example is necessary?

@xgdgsc well, reading what you just wrote, it is completely non-obvious what you mean. An example is worth 1000 words. And when I say example, I mean about 2 lines of code.

    In [20]: left = pd.DataFrame({'A' : [1,2], 'B' : [1, 2]})

    In [21]: left
    Out[21]:
       A  B
    0  1  1
    1  2  2

    In [22]: right = pd.DataFrame({'A' : [4,5,6], 'B': [2,2,2]})

    In [23]: right
    Out[23]:
       A  B
    0  4  2
    1  5  2
    2  6  2

    In [24]: pd.merge(left, right, on='B', how='outer')
    Out[24]:
       A_x  B  A_y
    0    1  1  NaN
    1    2  2  4.0
    2    2  2  5.0
    3    2  2  6.0

    In [25]: left_dups = pd.DataFrame({'A' : [1,2], 'B' : [2, 2]})

    In [26]: left_dups
    Out[26]:
       A  B
    0  1  2
    1  2  2

    In [27]: pd.merge(left_dups, right, on='B', how='outer')
    Out[27]:
       A_x  B  A_y
    0    1  2    4
    1    1  2    5
    2    1  2    6
    3    2  2    4
    4    2  2    5
    5    2  2    6

I would actually show this example and put it in a sub-section with a nice warning. The idea is that having duplicates on BOTH inputs blows up the result.
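To make the "multiplication of the row dimensions" concrete, here is a minimal sketch (not part of the PR itself) that predicts the size of the merged frame from the per-key counts before merging. It reuses `left_dups` and `right` from the session above; the name `expected_rows` is only illustrative.

```python
import pandas as pd

left_dups = pd.DataFrame({'A': [1, 2], 'B': [2, 2]})
right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 2, 2]})

# How often each key value occurs on each side.
left_counts = left_dups['B'].value_counts()
right_counts = right['B'].value_counts()

# For an inner merge, each key value shared by both frames contributes
# (occurrences in left) * (occurrences in right) rows to the result.
# Keys present on only one side become NaN in the product and are dropped.
expected_rows = int((left_counts * right_counts).dropna().sum())

merged = pd.merge(left_dups, right, on='B', how='inner')
print(expected_rows, len(merged))
```

With two matching rows on the left and three on the right, the predicted and actual row counts agree. On large frames with heavily duplicated keys the same product is what can exhaust memory.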

Yeah, the current description reads like a pandas problem, though it is actually a usage issue.

jreback · 2016-12-11T15:32:58Z

doc/source/merging.rst

+.. warning::
+
+  Joining / merging on duplicate keys can cause a returned frame that is the multiplication of the row dimensions, 
+  may result in memory overflow, which can be dangerous for unsaved work. 


remove 'which can be dangerou...'

The last sentence should be something like: it is the user's responsibility to manage duplicate values in keys before join......
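As a rough sketch of what "managing duplicate values in keys" could look like on the user's side (an illustration, not text proposed for the docs): check the key columns with `Series.duplicated` before merging. The `validate` argument shown here is an assumption about the reader's environment, since it exists only in later pandas releases than the one this PR targets.

```python
import pandas as pd

left_dups = pd.DataFrame({'A': [1, 2], 'B': [2, 2]})
right = pd.DataFrame({'A': [4, 5, 6], 'B': [2, 2, 2]})

# Explicit check: does the key column repeat any value?
has_dups_left = left_dups['B'].duplicated().any()
has_dups_right = right['B'].duplicated().any()
print(has_dups_left, has_dups_right)

# Assumption: a pandas version that supports `validate` on merge.
# 'one_to_one' asks pandas to raise instead of silently multiplying rows.
rejected = False
try:
    pd.merge(left_dups, right, on='B', how='outer', validate='one_to_one')
except pd.errors.MergeError as exc:
    rejected = True
    print('merge rejected:', exc)
```

Either approach surfaces the duplicates before the merge allocates the blown-up result; which one fits depends on whether duplicates are an error or merely something to deduplicate first.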

jreback · 2016-12-11T15:33:10Z

doc/source/merging.rst

+.. ipython:: python
+   :suppress:
+
+   @savefig merging_merge_on_key_multiple.png


Can you paste the picture in the PR for us to see?

jreback · 2016-12-11T17:11:03Z

lgtm. @jorisvandenbossche

jorisvandenbossche · 2016-12-11T20:26:17Z

yes, thanks @xgdgsc

* origin/master: (22 commits)
  BUG: astype falsely converts inf to integer (GH14265) (pandas-dev#14343)
  BUG: Apply min_itemsize to index even when not appending
  DOC: warning section on memory overflow when joining/merging dataframes on index with duplicate keys (pandas-dev#14788)
  BLD: missing - on secure
  BLD: new access token on pandas-dev
  TST: Test DatetimeIndex weekend offset (pandas-dev#14853)
  BLD: escape GH_TOKEN in build_docs
  TST: Correct results with np.size and crosstab (pandas-dev#4003) (pandas-dev#14755)
  Frame benchmarking sum instead of mean (pandas-dev#14824)
  CLN: lint of test_base.py
  BUG: Allow TZ-aware DatetimeIndex in merge_asof() (pandas-dev#14844)
  BUG: GH11847 Unstack with mixed dtypes coerces everything to object
  TST: skip testing on windows for specific formatting which sometimes hangs (pandas-dev#14851)
  BLD: try new gh token for pandas-docs
  CLN/PERF: clean-up of the benchmarks (pandas-dev#14099)
  ENH: add timedelta as valid type for interpolate with method='time' (pandas-dev#14799)
  DOC: add section on groupby().rolling/expanding/resample (pandas-dev#14801)
  TST: add test to confirm GH14606 (specify category dtype for empty) (pandas-dev#14752)
  BLD: use org name in build-docs.sh
  BF(TST): use = (native) instead of < (little endian) for target data types (pandas-dev#14832)
  ...

…es on index with duplicate keys (pandas-dev#14788) closes pandas-dev#14736

xgdgsc added 2 commits December 2, 2016 17:37

add doc regarding pandas-dev#14736

93a63a5

remove blank line

795a369

jorisvandenbossche reviewed Dec 2, 2016

jorisvandenbossche added the Docs label Dec 2, 2016

xgdgsc added 2 commits December 3, 2016 09:29

change index to keys , fix spaces and format

06d4430

fix format

fb021fb

jreback modified the milestone: 0.19.2 Dec 4, 2016

jreback reviewed Dec 4, 2016

jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Dec 4, 2016

xgdgsc added 2 commits December 11, 2016 22:46

add example , change warning position

403b099

fix format

ee9fc60

jreback reviewed Dec 11, 2016

xgdgsc added 2 commits December 11, 2016 23:52

change fig name, change sentence

7581ce0

remove blank line

48e5322

jreback added this to the 0.19.2 milestone Dec 11, 2016

jorisvandenbossche merged commit 602cc46 into pandas-dev:master Dec 11, 2016

jorisvandenbossche modified the milestones: 0.20.0, 0.19.2 Dec 11, 2016

ischurov pushed a commit to ischurov/pandas that referenced this pull request Dec 19, 2016

DOC: warning section on memory overflow when joining/merging datafram…

923cb8d

…es on index with duplicate keys (pandas-dev#14788) closes pandas-dev#14736


xgdgsc commented Dec 2, 2016

jorisvandenbossche Dec 2, 2016

jreback Dec 4, 2016

xgdgsc Dec 4, 2016

jreback Dec 4, 2016

jreback Dec 4, 2016

sinhrks Dec 4, 2016

jreback Dec 11, 2016

jreback Dec 11, 2016

xgdgsc Dec 11, 2016

jreback commented Dec 11, 2016

jorisvandenbossche commented Dec 11, 2016

