API: add top-level functions as method #15513

ResidentMario · 2017-02-26T19:03:52Z

closes API: add top-level functions as method #12640
- pd.crosstab (DataFrame)
- pd.melt (DataFrame) (xref API: DataFrame.melt(), DataFrame.wide_to_long() #14876)
- pd.pivot_table (might need a new name as would be confusing with DataFrame.pivot)
- pd.get_dummies (both)
- pd.cut/qcut (only on Series)
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

ResidentMario · 2017-02-26T19:25:10Z

Note: this conversation is continued from #12640.

This is a PR which aliases certain top-level functions into DataFrame and/or Series object methods, for reasons of syntax and method chaining. Still a WIP atm.

The top-level method bodies remain in the tooling libraries in which they were defined before, but the docstrings have been moved to _shared_docs dictionary objects in core/frame.py and/or core/series.py. _shared_docs is then back-ported into the tooling libraries and attached to the original method using the Appender decorator.

This is done so that the docstrings between the top-level and object-method differ in two respects:

The new object method gets a new in 0.20.0 marker in the docs.
The new object method links to the old top-level function in See also, and vice versa.

This reuses an earlier-established convention used elsewhere throughout the library.

Have not written nor attempted to pass tests just yet.

That being said, in the conversation on the issue @jreback mentions tweaking the pattern by possible e.g. using another newly created decorator. I agree that this could be a smart change, because the use of Appender is starting to become a little fritzy (see my notes on pd.get_dummies below).

In this PR: pd.crosstab <-> pd.DataFrame.crosstab, pd.melt <-> pd.DataFrame.melt, pd.cut <-> pd.Series.cut, pd.qcut <-> pd.Series.qcut.

Not in this PR: pd.pivot_table <-> pd.DataFrame.pivot_table, which was already done long ago but needs a refactor; and pd.get_dummies <-> pd.DataFrame.get_dummies/pd.Series.get_dummies, which is a bit problematic. The issue with the latter is the question as to where the docstring should go: should it be hosted in series.py or in frame.py? Whichever of those files doesn't host it would have to import it from its sibling.

Finally, Appender is used somewhat haphazardously right now. What I mean by that is that the docstrings are sometimes imported as e.g. from ... import base; base._shared_docs and sometimes directly as from ... import _shared_docs. I think it'd be useful to establish a convention here: from base import _shared_docs as _base_shared_docs and from frame import _shared_docs as _frame_shared_docs, for example.

jreback · 2017-02-26T19:31:05Z

@ResidentMario to address a point that you made.

When we have a shared method that is needed for both Series & DataFrame, we actually put it in pandas.core.generic and put the shared doc-string there. So that's a good place for .get_dummies (note that there is a section near the end so that the methods are only for Series/DataFrames, and not all inherited objects).

jreback · 2017-02-26T19:32:37Z

alternatively, since some of our files (e.g. frame.py) are getting very large. We could do something like:

pandas/core/docs.py where only doc-strings exist (or maybe pandas/core/frame/docs and similar). A bit odd to separate code from the doc-strings, but it might be appropriate in this case.

@jorisvandenbossche

ResidentMario · 2017-02-26T19:42:53Z

Re: generic, gotcha, can-do then. It looks like the two remaining functions can also be touched up using Appender alone, then.

Re: globing off the docs, I don't have a strong opinion on this. Having a file for docstrings is weird, but then again, so is having docstrings floating somewhere about their method bodies :). It would certainly make things significantly cleaner, architecturally.

I wouldn't mind implementing either/or. The latter kind of sounds like a different PR, though.

jreback · 2017-02-26T20:22:35Z

@ResidentMario yep go ahead and make changes to move things to generic.py as appropriate. The generaized move to something like a docs.py would be a different PR (looking more some community thought whether this makes sense).

jorisvandenbossche · 2017-02-26T20:35:56Z

Sorry that I didn't raise my concert earlier on the issue, and @ResidentMario I find it difficult to give this backlash given the very nice effort you are putting in this issue.
But, I am personally not sure I find it a good idea to add all of these top-level functions as methods on a dataframe/series. I don't have much time this evening to look more into detail, but we already have a large amount of methods on DataFrame/Series, and I think adding new ones should really be worth it.

jorisvandenbossche · 2017-02-26T20:39:25Z

Regarding the docstrings, that's something I also already thought about once. Our current infrastructure of shared_docs is a bit of a mess for contributors, and can use some rethinking. But that indeed sounds like another issue/PR.

In case of melt, you could also do it like pivot_table, which is added to DataFrame in pivot.py

ResidentMario · 2017-02-26T21:22:40Z

So long as you agree with the method chaining philosophy, the ability to use df.melt(...) instead of df.pipe(lambda df: pd.melt(df, ...)) seems to absolutely be worth the maintenance burden.

The other consideration is a broader one about API design. Where does pandas want to try to keep its algorithms? Should they be object methods, or top-level functions? Having things available in both was once convenient (pd.pivot_table(data) beats pd.DataFrame(data).pivot_table()), but the API is so large now that having some kind of strict differential—e.g. IO tasks only in the top-level, everything else in object mehtods—seems like a necessary way forward.

Otherwise, pandas risks confusing newcomers by providing them too many options: to do x you can use either route, to do y you have to use an object, z is only available as a top-level...if there could be a simple and memorable rule controlling which is where, that would be really helpful for users!

It sounds like pandas has already been moving towards object methods for some time. Looking at the docs, here are the current top-level methods (excluding file I/O):

bdate_range
concat
crosstab
cut
date_range
eval
factorize
get_dummies
get_option
isnull
melt
merge
merge_asof
merge_ordered
notnull
option_context
period_range
piviot
pivot_table
qcut
reset_option
set_option
test
to_datetime
to_numeric
to_timedelta
wide_to_long

Maybe the agreement to be reached here is on which of these "ought" to be where?

cc: @TomAugspurger

jreback · 2017-02-26T23:47:24Z

@ResidentMario

so most of the functions on the list, should certainly be top-level, and yes we should have a clear policy.

One easy differentiator is a function which only takes a single argument, that is a Series OR a DataFrame only

This eliminates things like concat/merge.pd.concat can take multiple objects, as for example does pd.merge. Now one could make the case that DataFrame.merge should exist (though we have .join which by-definition does auto alignment).
A function like .cut/.qcut does fall into this category, but OTOH, it can take multiple types of arguments (e.g. ndarray/Index/Series), so I would put about +0 for adding these (include also the .to_datetime/.to_numeric/.to_timedelta in this bucket).).
Something that I am +1 on converting is .melt, as this completes the reshaping ops, with .unstack/.stack/.pivot already methods.

In fact I would go as far as deprecating pd.pivot/pd.pivot_table, which are only useful as methods.

So I agree there is much confusion here on what should be where. Another example, we used to have only pd.null/isnull. But these make much more sense as methods, esp for having clean-concise code.

@jorisvandenbossche is right that we have too many methods as it is, BUT, I still think we need to clean this up and have as much consistency as possible.

codecov-io · 2017-02-27T01:14:34Z

Codecov Report

Merging #15513 into master will decrease coverage by -0.01%.
The diff coverage is 76.47%.

@@            Coverage Diff             @@
##           master   #15513      +/-   ##
==========================================
- Coverage   90.36%   90.36%   -0.01%     
==========================================
  Files         136      136              
  Lines       49553    49587      +34     
==========================================
+ Hits        44780    44807      +27     
- Misses       4773     4780       +7

Impacted Files	Coverage Δ
pandas/tools/pivot.py	`95.1% <100%> (+0.06%)`	✅
pandas/core/reshape.py	`99.26% <100%> (ø)`	✅
pandas/tools/tile.py	`96.95% <100%> (+0.09%)`	✅
pandas/core/frame.py	`97.64% <60%> (-0.19%)`	❌
pandas/core/series.py	`94.64% <63.63%> (-0.33%)`	❌
pandas/core/generic.py	`96.31% <0%> (ø)`	✅
pandas/core/common.py	`91.36% <0%> (+0.33%)`	✅

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b3ae4c7...643d54b. Read the comment docs.

shoyer · 2017-02-27T02:19:51Z

So long as you agree with the method chaining philosophy, the ability to use df.melt(...) instead of df.pipe(lambda df: pd.melt(df, ...)) seems to absolutely be worth the maintenance burden.

I think the alternative of df.pipe(pd.melt, ...) is not too bad.

ResidentMario · 2017-02-27T02:21:51Z

Oof, my bad there with the method signature. I still don't want to have to use pipe though, if I can help it, for a built-in.

ResidentMario · 2017-02-27T02:33:25Z

Alright, I've added commits for each of the functions. Whichever ones we decide to transfer can be cherry-picked off the pull request.

get_dummies was easily the most difficult of the set, since it required transferring some things between files. Hopefully I got the implementation right.

Tests still TBD, pending further discussion (?).

jorisvandenbossche · 2017-02-27T08:23:01Z

@jreback I think you are missing a few methods on DataFrame :-)
Both pivot_table and merge already exist also as DataFrame methods (one can discuss whether it is good that they are duplicated, but in any case they already are).

A bit more specific concerns:

crosstab:
- the current signature is not suitable for adding it as a DataFrame method. As a consequence, your current implementation in this PR will also not work.
- The result of crosstab can already be obtained with DataFrame.pivot_table using aggfunc='count', so a bit less essential although a convenient shortcut. But the convenient shortcut can also be a top-level function
melt:
- I agree this one would be nice to complement the existing reshaping dataframe method pivot
- But, there have also some issues been raised about the current melt API (API: Gather and other inspiration from tidyr #10109, API: DataFrame.melt(), DataFrame.wide_to_long() #14876 (comment)), so we should make sure that the current melt is what we actually want to add as a method

I would also not move the get_dummies implementation, but keep it here consistent with the approach for the other reshaping methods: functionality in a separate module, import from there for adding it to NDFrame. We don't have any other public functions (apart from methods) defined in generic.py

jreback · 2017-02-27T14:38:09Z

@jorisvandenbossche you can see how much I use DataFrame.merge !

yeah I think main method that needs inclusion is .melt (though ackowledge the signature confusion), but that's the case either way.

and agree on .get_dummies. I would argue that we should call this .dummies for consistency, but that's another issue.

ok @ResidentMario can you strip out just .melt for now (into a new PR) or update this one, either way.

ResidentMario · 2017-02-27T16:32:29Z

I'll open a new PR with just the melt changes, then.

jreback added API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Feb 26, 2017

ResidentMario closed this Feb 27, 2017

ResidentMario force-pushed the master branch from 643d54b to d80275d Compare February 27, 2017 16:31

ResidentMario mentioned this pull request Feb 27, 2017

API: add top-level melt function as method #15521

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: add top-level functions as method #15513

API: add top-level functions as method #15513

ResidentMario commented Feb 26, 2017 •

edited

Loading

ResidentMario commented Feb 26, 2017

jreback commented Feb 26, 2017

jreback commented Feb 26, 2017

ResidentMario commented Feb 26, 2017

jreback commented Feb 26, 2017

jorisvandenbossche commented Feb 26, 2017

jorisvandenbossche commented Feb 26, 2017

ResidentMario commented Feb 26, 2017 •

edited

Loading

jreback commented Feb 26, 2017

codecov-io commented Feb 27, 2017 •

edited

Loading

shoyer commented Feb 27, 2017

ResidentMario commented Feb 27, 2017 •

edited

Loading

ResidentMario commented Feb 27, 2017 •

edited

Loading

jorisvandenbossche commented Feb 27, 2017

jreback commented Feb 27, 2017

ResidentMario commented Feb 27, 2017

API: add top-level functions as method #15513

API: add top-level functions as method #15513

Conversation

ResidentMario commented Feb 26, 2017 • edited Loading

ResidentMario commented Feb 26, 2017

jreback commented Feb 26, 2017

jreback commented Feb 26, 2017

ResidentMario commented Feb 26, 2017

jreback commented Feb 26, 2017

jorisvandenbossche commented Feb 26, 2017

jorisvandenbossche commented Feb 26, 2017

ResidentMario commented Feb 26, 2017 • edited Loading

jreback commented Feb 26, 2017

codecov-io commented Feb 27, 2017 • edited Loading

Codecov Report

shoyer commented Feb 27, 2017

ResidentMario commented Feb 27, 2017 • edited Loading

ResidentMario commented Feb 27, 2017 • edited Loading

jorisvandenbossche commented Feb 27, 2017

jreback commented Feb 27, 2017

ResidentMario commented Feb 27, 2017

ResidentMario commented Feb 26, 2017 •

edited

Loading

ResidentMario commented Feb 26, 2017 •

edited

Loading

codecov-io commented Feb 27, 2017 •

edited

Loading

ResidentMario commented Feb 27, 2017 •

edited

Loading

ResidentMario commented Feb 27, 2017 •

edited

Loading