BUG/ENH: categorical returned during a transform #8065


Closed
jreback opened this issue Aug 18, 2014 · 10 comments
Labels
Bug · Categorical (Categorical Data Type) · Groupby

Comments

@jreback
Contributor

jreback commented Aug 18, 2014

http://stackoverflow.com/questions/25372877/python-pandas-groupby-and-qcut-doesnt-work-in-0-14-1

df = pd.DataFrame({'x': np.random.rand(20), 'grp': ['a'] * 10 + ['b'] * 10})

This worked in 0.13 (but returned a 2-element Series containing lists):

df.groupby('grp')['x'].transform(pd.qcut, 3)

This is much cleaner:

df.groupby('grp')['x'].apply(lambda x: pd.Series(pd.qcut(x,3), index=x.index))
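For completeness, a runnable version of that work-around (a sketch assuming a recent pandas; the exact index and dtype of the result have varied across versions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.rand(20), 'grp': ['a'] * 10 + ['b'] * 10})

# Wrap each group's qcut result in a Series indexed by the group's own
# labels, so the per-group pieces can be aligned back together.
out = df.groupby('grp')['x'].apply(
    lambda x: pd.Series(pd.qcut(x, 3), index=x.index)
)
print(len(out))  # one binned value per input row
```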
@jreback jreback added this to the 0.15.0 milestone Aug 18, 2014
@jreback
Contributor Author

jreback commented Aug 18, 2014

cc @JanSchulz maybe need some inference on the returned categoricals

@jankatins
Contributor

Ok, here are multiple issues:

  • qcut should return an ordered categorical
  • df.groupby('grp')['x'].transform(pd.qcut, 3) should (almost) never work, as the returned categoricals have different levels and so can't be concatenated into one Series
  • The error message is wrong: it should say that categoricals with different levels can't be merged into one Series, not that something can't be converted to float.

One problem is that in the categorical case the astype(common_type) step doesn't work: it never sees a categorical, because the common type is constructed from np.asarray(res) and is therefore no longer a categorical. On the other hand, this goes wrong a lot earlier: result = self._selected_obj.values.copy() leads to trouble, as the result type has nothing to do with the outputs, so categorical must be special-cased here :-(

IMO the best way would be to construct a Series when we get the first res, but with the index of the input. Not sure if that actually works (different lengths: the index covers everything, res covers only one group).
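To illustrate the second bullet above (a sketch of the symptom, not the pandas internals): qcut derives its bin edges from each group's own quantiles, so the per-group Categoricals end up with different categories:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # seeded so the result is deterministic
df = pd.DataFrame({'x': rng.rand(20), 'grp': ['a'] * 10 + ['b'] * 10})

# Bin each group on its own; the bin edges (and hence the categories)
# are computed from that group's quantiles.
cats = {name: pd.qcut(grp, 3) for name, grp in df.groupby('grp')['x']}

same = cats['a'].cat.categories.equals(cats['b'].cat.categories)
print(same)  # False: the two groups' interval bins do not match
```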

@jreback: can you merge the fixup PR? Then I can submit a new PR for this one without having to constantly rebase the whole PR.

@jreback
Contributor Author

jreback commented Aug 19, 2014

@JanSchulz ok, fixups merged.

yeah, the transform(pd.qcut, 3) is a little bogus; I'm not saying it should work, just that it should give a better error message

@jreback
Contributor Author

jreback commented Sep 26, 2014

@JanSchulz if you get to this before 0.15 ok, otherwise will push it (I may take a look as well).

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 26, 2014
@jankatins
Contributor

Let's see if I find time tomorrow; if not, it will have to wait until after my holidays :-/ The first point is addressed (qcut now returns an ordered categorical).

@jankatins jankatins mentioned this issue Sep 27, 2014
@jankatins
Contributor

This is not easy; this is the current code:

dtype = self._selected_obj.dtype
result = self._selected_obj.values.copy()

wrapper = lambda x: func(x, *args, **kwargs)
for i, (name, group) in enumerate(self):

    object.__setattr__(group, 'name', name)
    res = wrapper(group)

    if hasattr(res, 'values'):
        res = res.values

    # may need to astype
    try:
        common_type = np.common_type(np.array(res), result)
        if common_type != result.dtype:
            result = result.astype(common_type)
    except:
        pass

    indexer = self._get_index(name)
    result[indexer] = res

result = _possibly_downcast_to_dtype(result, dtype)

But this is not going to work with categoricals, because it stuffs the new values together with the old ones, which will always convert the categoricals to a numpy type. There are three possible ways:

  • Rewrite the function to use concat if res is Categorical (which will probably not be performant...)
  • simply throw a warning (categorical converted to string/number) or an error (categoricals can't be used in transform/....)
  • document that returning categorical data from transforms is "undefined behaviour" and probably not going to work

The "throw an error" variant is not always "correct", because one could return a categorical with the same categories for all groups, and that should actually succeed.
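That counter-case can be sketched with pd.cut and fixed edges (an illustration assuming a modern pandas, not the transform code path itself): when every group shares the same categories, the pieces do combine into one categorical Series:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame({'x': rng.rand(20), 'grp': ['a'] * 10 + ['b'] * 10})

# Fixed edges shared by all groups, unlike qcut's per-group quantile edges.
bins = [0.0, 0.25, 0.5, 0.75, 1.0]
pieces = [pd.cut(grp, bins) for _, grp in df.groupby('grp')['x']]

# Every piece has identical categories, so concat keeps the categorical dtype.
combined = pd.concat(pieces).sort_index()
print(combined.dtype)  # category
```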

@jankatins
Contributor

@jreback can you comment here?

@jreback
Contributor Author

jreback commented Sep 29, 2014

@JanSchulz let me take a look.

@jreback
Contributor Author

jreback commented Sep 29, 2014

@JanSchulz this is related to #7883

The issue is that I took a short-cut to make Series transforms fast. But that didn't deal with multiple dtypes well.

So I have to defer this (you can use the work-around I suggested in this issue in the meantime). We can fix it for 0.15.1.

@jreback jreback added the Bug label Sep 29, 2014
@jreback jreback modified the milestones: 0.15.1, 0.16.0, 0.15.2 Nov 2, 2014
@jreback jreback modified the milestones: 0.16.0, 0.15.2 Nov 29, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@TomAugspurger TomAugspurger modified the milestones: Contributions Welcome, No action Jul 6, 2018
@TomAugspurger
Contributor

TomAugspurger commented Jul 6, 2018

This now returns an object-dtype Series of Intervals (soon to be an IntervalArray).

I think the original issue is fixed.

In [140]: df.groupby('grp')['x'].transform(pd.qcut, 3)
Out[140]:
0       (0.796, 0.945]
1       (0.571, 0.796]
2       (0.571, 0.796]

3 participants