Skip to content

API: add top-level functions as method #15513

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 0 commits into from

Conversation

ResidentMario
Copy link
Contributor

@ResidentMario ResidentMario commented Feb 26, 2017

@ResidentMario
Copy link
Contributor Author

Note: this conversation is continued from #12640.

This is a PR which aliases certain top-level functions into DataFrame and/or Series object methods, for reasons of syntax and method chaining. Still a WIP atm.

The top-level method bodies remain in the tooling libraries in which they were defined before, but the docstrings have been moved to _shared_docs dictionary objects in core/frame.py and/or core/series.py. _shared_docs is then back-ported into the tooling libraries and attached to the original method using the Appender decorator.

This is done so that the docstrings between the top-level and object-method differ in two respects:

  1. The new object method gets a new in 0.20.0 marker in the docs.
  2. The new object method links to the old top-level function in See also, and vice versa.

This reuses an earlier-established convention used elsewhere throughout the library.

Have not written nor attempted to pass tests just yet.

That being said, in the conversation on the issue @jreback mentions tweaking the pattern by possible e.g. using another newly created decorator. I agree that this could be a smart change, because the use of Appender is starting to become a little fritzy (see my notes on pd.get_dummies below).

In this PR: pd.crosstab <-> pd.DataFrame.crosstab, pd.melt <-> pd.DataFrame.melt, pd.cut <-> pd.Series.cut, pd.qcut <-> pd.Series.qcut.

Not in this PR: pd.pivot_table <-> pd.DataFrame.pivot_table, which was already done long ago but needs a refactor; and pd.get_dummies <-> pd.DataFrame.get_dummies/pd.Series.get_dummies, which is a bit problematic. The issue with the latter is the question as to where the docstring should go: should it be hosted in series.py or in frame.py? Whichever of those files doesn't host it would have to import it from its sibling.

Finally, Appender is used somewhat haphazardously right now. What I mean by that is that the docstrings are sometimes imported as e.g. from ... import base; base._shared_docs and sometimes directly as from ... import _shared_docs. I think it'd be useful to establish a convention here: from base import _shared_docs as _base_shared_docs and from frame import _shared_docs as _frame_shared_docs, for example.

@jreback
Copy link
Contributor

jreback commented Feb 26, 2017

@ResidentMario to address a point that you made.

When we have a shared method that is needed for both Series & DataFrame, we actually put it in pandas.core.generic and put the shared doc-string there. So that's a good place for .get_dummies (note that there is a section near the end so that the methods are only for Series/DataFrames, and not all inherited objects).

@jreback
Copy link
Contributor

jreback commented Feb 26, 2017

alternatively, since some of our files (e.g. frame.py) are getting very large. We could do something like:

pandas/core/docs.py where only doc-strings exist (or maybe pandas/core/frame/docs and similar). A bit odd to separate code from the doc-strings, but it might be appropriate in this case.

@jorisvandenbossche

@jreback jreback added API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Feb 26, 2017
@ResidentMario
Copy link
Contributor Author

Re: generic, gotcha, can-do then. It looks like the two remaining functions can also be touched up using Appender alone, then.

Re: globing off the docs, I don't have a strong opinion on this. Having a file for docstrings is weird, but then again, so is having docstrings floating somewhere about their method bodies :). It would certainly make things significantly cleaner, architecturally.

I wouldn't mind implementing either/or. The latter kind of sounds like a different PR, though.

@jreback
Copy link
Contributor

jreback commented Feb 26, 2017

@ResidentMario yep go ahead and make changes to move things to generic.py as appropriate. The generaized move to something like a docs.py would be a different PR (looking more some community thought whether this makes sense).

@jorisvandenbossche
Copy link
Member

Sorry that I didn't raise my concert earlier on the issue, and @ResidentMario I find it difficult to give this backlash given the very nice effort you are putting in this issue.
But, I am personally not sure I find it a good idea to add all of these top-level functions as methods on a dataframe/series. I don't have much time this evening to look more into detail, but we already have a large amount of methods on DataFrame/Series, and I think adding new ones should really be worth it.

@jorisvandenbossche
Copy link
Member

Regarding the docstrings, that's something I also already thought about once. Our current infrastructure of shared_docs is a bit of a mess for contributors, and can use some rethinking. But that indeed sounds like another issue/PR.

In case of melt, you could also do it like pivot_table, which is added to DataFrame in pivot.py

@ResidentMario
Copy link
Contributor Author

ResidentMario commented Feb 26, 2017

So long as you agree with the method chaining philosophy, the ability to use df.melt(...) instead of df.pipe(lambda df: pd.melt(df, ...)) seems to absolutely be worth the maintenance burden.

The other consideration is a broader one about API design. Where does pandas want to try to keep its algorithms? Should they be object methods, or top-level functions? Having things available in both was once convenient (pd.pivot_table(data) beats pd.DataFrame(data).pivot_table()), but the API is so large now that having some kind of strict differential—e.g. IO tasks only in the top-level, everything else in object mehtods—seems like a necessary way forward.

Otherwise, pandas risks confusing newcomers by providing them too many options: to do x you can use either route, to do y you have to use an object, z is only available as a top-level...if there could be a simple and memorable rule controlling which is where, that would be really helpful for users!

It sounds like pandas has already been moving towards object methods for some time. Looking at the docs, here are the current top-level methods (excluding file I/O):

  • bdate_range
  • concat
  • crosstab
  • cut
  • date_range
  • eval
  • factorize
  • get_dummies
  • get_option
  • isnull
  • melt
  • merge
  • merge_asof
  • merge_ordered
  • notnull
  • option_context
  • period_range
  • piviot
  • pivot_table
  • qcut
  • reset_option
  • set_option
  • test
  • to_datetime
  • to_numeric
  • to_timedelta
  • wide_to_long

Maybe the agreement to be reached here is on which of these "ought" to be where?

cc: @TomAugspurger

@jreback
Copy link
Contributor

jreback commented Feb 26, 2017

@ResidentMario

so most of the functions on the list, should certainly be top-level, and yes we should have a clear policy.

One easy differentiator is a function which only takes a single argument, that is a Series OR a DataFrame only

  • This eliminates things like concat/merge.pd.concat can take multiple objects, as for example does pd.merge. Now one could make the case that DataFrame.merge should exist (though we have .join which by-definition does auto alignment).

  • A function like .cut/.qcut does fall into this category, but OTOH, it can take multiple types of arguments (e.g. ndarray/Index/Series), so I would put about +0 for adding these (include also the .to_datetime/.to_numeric/.to_timedelta in this bucket).).

  • Something that I am +1 on converting is .melt, as this completes the reshaping ops, with .unstack/.stack/.pivot already methods.

In fact I would go as far as deprecating pd.pivot/pd.pivot_table, which are only useful as methods.

So I agree there is much confusion here on what should be where. Another example, we used to have only pd.null/isnull. But these make much more sense as methods, esp for having clean-concise code.

@jorisvandenbossche is right that we have too many methods as it is, BUT, I still think we need to clean this up and have as much consistency as possible.

@codecov-io
Copy link

codecov-io commented Feb 27, 2017

Codecov Report

Merging #15513 into master will decrease coverage by -0.01%.
The diff coverage is 76.47%.

@@            Coverage Diff             @@
##           master   #15513      +/-   ##
==========================================
- Coverage   90.36%   90.36%   -0.01%     
==========================================
  Files         136      136              
  Lines       49553    49587      +34     
==========================================
+ Hits        44780    44807      +27     
- Misses       4773     4780       +7
Impacted Files Coverage Δ
pandas/tools/pivot.py 95.1% <100%> (+0.06%)
pandas/core/reshape.py 99.26% <100%> (ø)
pandas/tools/tile.py 96.95% <100%> (+0.09%)
pandas/core/frame.py 97.64% <60%> (-0.19%)
pandas/core/series.py 94.64% <63.63%> (-0.33%)
pandas/core/generic.py 96.31% <0%> (ø)
pandas/core/common.py 91.36% <0%> (+0.33%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b3ae4c7...643d54b. Read the comment docs.

@shoyer
Copy link
Member

shoyer commented Feb 27, 2017

So long as you agree with the method chaining philosophy, the ability to use df.melt(...) instead of df.pipe(lambda df: pd.melt(df, ...)) seems to absolutely be worth the maintenance burden.

I think the alternative of df.pipe(pd.melt, ...) is not too bad.

@ResidentMario
Copy link
Contributor Author

ResidentMario commented Feb 27, 2017

Oof, my bad there with the method signature. I still don't want to have to use pipe though, if I can help it, for a built-in.

@ResidentMario
Copy link
Contributor Author

ResidentMario commented Feb 27, 2017

Alright, I've added commits for each of the functions. Whichever ones we decide to transfer can be cherry-picked off the pull request.

get_dummies was easily the most difficult of the set, since it required transferring some things between files. Hopefully I got the implementation right.

Tests still TBD, pending further discussion (?).

@jorisvandenbossche
Copy link
Member

@jreback I think you are missing a few methods on DataFrame :-)
Both pivot_table and merge already exist also as DataFrame methods (one can discuss whether it is good that they are duplicated, but in any case they already are).

A bit more specific concerns:

  • crosstab:
    • the current signature is not suitable for adding it as a DataFrame method. As a consequence, your current implementation in this PR will also not work.
    • The result of crosstab can already be obtained with DataFrame.pivot_table using aggfunc='count', so a bit less essential although a convenient shortcut. But the convenient shortcut can also be a top-level function
  • melt:

I would also not move the get_dummies implementation, but keep it here consistent with the approach for the other reshaping methods: functionality in a separate module, import from there for adding it to NDFrame. We don't have any other public functions (apart from methods) defined in generic.py

@jreback
Copy link
Contributor

jreback commented Feb 27, 2017

@jorisvandenbossche you can see how much I use DataFrame.merge !

yeah I think main method that needs inclusion is .melt (though ackowledge the signature confusion), but that's the case either way.

and agree on .get_dummies. I would argue that we should call this .dummies for consistency, but that's another issue.

ok @ResidentMario can you strip out just .melt for now (into a new PR) or update this one, either way.

@ResidentMario
Copy link
Contributor Author

I'll open a new PR with just the melt changes, then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: add top-level functions as method
5 participants