Skip to content

DOC: DataFrame.merge behaviour for suffix=(False, False) #22141

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Sep 4, 2018

Conversation

elmq0022
Copy link
Contributor

@elmq0022 elmq0022 commented Jul 31, 2018

@pep8speaks
Copy link

pep8speaks commented Jul 31, 2018

Hello @elmq0022! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 24, 2018 at 23:57 Hours UTC

@elmq0022 elmq0022 changed the title documented DataFrame.merge behaviour for suffix=(False, False) DOC: DataFrame.merge behaviour for suffix=(False, False) Jul 31, 2018
@codecov
Copy link

codecov bot commented Jul 31, 2018

Codecov Report

Merging #22141 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #22141   +/-   ##
=======================================
  Coverage   92.04%   92.04%           
=======================================
  Files         169      169           
  Lines       50776    50776           
=======================================
  Hits        46737    46737           
  Misses       4039     4039
Flag Coverage Δ
#multiple 90.45% <ø> (ø) ⬆️
#single 42.23% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/frame.py 97.2% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 55d176d...304af4f. Read the comment docs.

@gfyoung gfyoung added the Docs label Aug 2, 2018
@gfyoung gfyoung requested a review from jreback August 2, 2018 04:53
@gfyoung gfyoung added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Aug 2, 2018
@@ -183,7 +183,8 @@
the order of the join keys depends on the join type (how keyword).
suffixes : 2-length sequence (tuple, list, ...)
Suffix to apply to overlapping column names in the left and right
side, respectively.
side, respectively. If (False, False), overlapping
column names raise an error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add an example in the doc-string to make this even more clear? (we don't even have a suffix example at all......)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback, the doc string example does illustrate the behavior for the default suffix arguments.

Also, there is a more detailed suffix example here: https://pandas.pydata.org/pandas-docs/stable/merging.html#overlapping-value-columns

I suggest adding a specific note or example to the link above for suffix=(False, False) instead of the doc string.

I feel a more experienced user would likely know what to do just from the parameter argument description while a more novice user will probably need to look at the longer documentation with copy/paste examples that work with ipython or jupyter.

I also suggest the merge, join, and concat methods all have a reference to the more complete documentation in the see also section. I've found this documentation very helpful in the past and still refer to if from time to time.

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Also, I'm happy to just update the doc string. Just wanted to offer a different perspective. Let me know what you prefer. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elmq0022 yeah an updated doc-string would be great

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the type, I'd have suffixes : tuple of (str, str), default ('_x', '_y'). Small detail, but I'd rephrase you addition to sound like To raise an exception on overlapping columns, use (False, False). Being able to use (False, False) to raise looks like a feature to me, and I think it's better to present it this way. May be we could also have it in the Notes section.

But it's just an idea, I'm happy with the PR the way it is. It's a good addition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elmq0022 can you make the changes suggested in the previous comment?

@elmq0022
Copy link
Contributor Author

@jreback, I updated the docstring with a couple suffix examples.

@elmq0022
Copy link
Contributor Author

@jreback any additional feedback? Thanks!

@@ -254,6 +255,23 @@
3 foo 5 foo 8
4 bar 2 bar 6
5 baz 3 baz 7

>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment for both of these example of what they are showing

@jreback jreback requested a review from datapythonista August 20, 2018 22:39
@elmq0022
Copy link
Contributor Author

elmq0022 commented Aug 20, 2018 via email

@elmq0022 elmq0022 force-pushed the GH22045 branch 2 times, most recently from 4252f79 to 57fdba5 Compare August 21, 2018 04:02
@elmq0022
Copy link
Contributor Author

@jreback and @datapythonista, I added comments to the examples.

@elmq0022
Copy link
Contributor Author

@datapythonista, I made the updates. This should be good to go now. Thanks!

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to merge after fixing the typo and the variable names.

But if you want to work on making the docstring follow best practices and our standards, you can also:

  • In parameters replace string by str, boolean by bool and integer by int
  • In the returns add a description (indented in the next line after the type.
  • Make the short summary (first line) shorter to fit in a single line
  • run scripts/validate_docstrings pandas.DataFrame.merge to see if there is any other issue

Thanks!

Suffix to apply to overlapping column names in the left and right
side, respectively.
side, respectively. To raise an exption on overlapping columns use
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: s/exption/exception

Merge DataFrames A and B. Specify the left and right suffix
to append to the name of any overlapping columns.

>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I didn't see before that in the original docstring the DataFrame names were A and B. Do you mind changing them to df1 and df2. That's the standard we use, and A and B in Python should be used for classes, not instances.

Also, we can take care of that in a different PR if you prefer, but it'd be great to have a more real world example (it makes it easier to understand what's going on), and in this case I find the how='outer' misleading, as it doesn't add value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can make those updates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also thanks for the location of the validation scripts and such. This is very helpful for my first documentation pull request.

3 foo 5 foo 8
4 bar 2 bar 6
5 baz 3 baz 7

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this part of the doc-tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was able to verify this using pytest --doctest-modules.

@elmq0022
Copy link
Contributor Author

@datapythonista I made the updates.

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some comments on minor things. For the rest looks good to me.

@@ -212,7 +212,8 @@

Returns
-------
DataFrame
df: DataFrame
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just leave DataFrame here?

@@ -197,7 +197,7 @@
"right_only" for observations whose merge key only appears in 'right'
DataFrame, and "both" if the observation's merge key is found in both.

validate : string, default None
validate : str, default None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you replace the default None by optional

... 'value': [1, 2, 3, 5]})
>>> B = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
... 'value': [5, 6, 7, 8]})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indentation of the continuation lines seems wrong.

@elmq0022
Copy link
Contributor Author

@jreback, @datapythonista two items to address:

First, the following Travis job is failing.

43166.3
Python: 3.5
JOB="2.7, locale, slow, old NumPy" ENV_FILE="ci/travis-27-locale.yaml" LOCALE_OVERRIDE="zh_CN.UTF-8" SLOW=true
3 min 51 sec

The error message is:
ImportError: libgfortran.so.1: cannot open shared object file: No such file or directory.

I don't think this error is on my end, but let me know if I have to do something differently. I did just rebase and rebuild all the C extensions.

Second, I think I get what @jreback was referring to prior. I am not sure if the pandas.DataFram.merge doctests are necessarily run during CI. In doctests.sh on line 23 there is no merge argument mentioned.

pytest --doctest-modules -v pandas/core/frame.py \
        -k"-assign -axes -combine -isin -itertuples -join -nlargest -nsmallest -nunique -pivot_table -quantile -query -reindex -reindex_axis -replace -round -set_index -stack -to_dict -to_stata -transform"

Should we add it here between -join and -nlargest? Not sure where else this would go or what else would kick off the tests.

@datapythonista
Copy link
Member

@elmq0022 the CI error seems indeed unrelated to your changes. About the doctests, the - in -join -nlargest ... means that the tests are being skipped. So, that list is the list of tests that are failing now. If -merge is not there means that the examples are already correct, so as far as they continue to be correct, nothing needs to be done (if you have a example that breaks or with output different than what presented, the CI should report it).

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Later on, we can think on providing more examples about the behavior described in the extended summary, but I'd address that in a separated PR. Thanks @elmq0022

@elmq0022
Copy link
Contributor Author

@datapythonista @jreback, thanks for all your help. Will need to rebase and resubmit to pass the CI?

@datapythonista
Copy link
Member

Yes, if you can merge master into your branch and push, that should trigger the CI again.

@elmq0022
Copy link
Contributor Author

elmq0022 commented Aug 25, 2018

@jreback and @datapythonista all green.

@elmq0022
Copy link
Contributor Author

@jreback friendly reminder. Thanks!

@datapythonista datapythonista merged commit 3285bdc into pandas-dev:master Sep 4, 2018
@datapythonista
Copy link
Member

Thanks for the work on this @elmq0022

@elmq0022
Copy link
Contributor Author

elmq0022 commented Sep 4, 2018

@datapythonista no problem. Happy to help. Thanks for feedback and merging the request.

aeltanawy pushed a commit to aeltanawy/pandas that referenced this pull request Sep 20, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Docs
Projects
None yet
5 participants