Skip to content

BUG: Default to stat axis for SparseDataFrame when axis=None #13066

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented May 3, 2016

Title is self-explanatory. Closes #13048.

@gfyoung
Copy link
Member Author

gfyoung commented May 3, 2016

Surprisingly, there aren't a lot of places where axis is used when calling a cumulative function except for SparseDataFrame because that is the only (non-deprecated) multi-dimensional sparse object.

@@ -74,3 +74,4 @@ Performance Improvements

Bug Fixes
~~~~~~~~~
- Bug in ``SparseDataFrame`` in which ``axis=None`` did not default to ``self._stat_axis_number`` (:issue:`13048`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say axis=0, users have no idea what self._stat_axis_number is

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@jreback jreback added Sparse Sparse Data Type Compat pandas objects compatability with Numpy or Python functions labels May 3, 2016
@gfyoung gfyoung force-pushed the stats-axis-default branch 2 times, most recently from a671f55 to ad81154 Compare May 3, 2016 18:30
# no exception should be raised even though
# numpy passes in 'axis=None'
funcs = ['sum', 'cumsum', 'var', 'mean', 'prod',
'cumprod', 'std']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add kurt, skew, sem, max, min, idxmax, idxmin

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.kurt, np.skew, np.sem, np.idxmin, and np.idxmax do not exist. Will add for np.max and np.min though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, so do for np.argsort/argmin as well then.

Copy link
Member Author

@gfyoung gfyoung May 3, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.argmin or np.argmax breaks because __array_wrap__ for SparseDataFrame expects a 2-D input, which is not what you get when you call either of those functions. I think this is a bug (or lacking feature) on the pandas side because argmin and argmax should return scalars, meaning __array_wrap__ should have to accept them. @jreback thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arg mom and max are aliases for idxmin/max which are correct

Copy link
Member Author

@gfyoung gfyoung May 3, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really? I think that's true only for Series. Doesn't exist for DataFrame. Unless you want to add them for compat purposes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh u r testing frames

need same tests for series

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, okay. But what about the aliasing for DataFrame?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is aliasing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Series, you alias idxmin to argmin and likewise idxmax to argmax for ndarray compat. However, that aliasing doesn't exist for DataFrame.

@gfyoung gfyoung force-pushed the stats-axis-default branch from ad81154 to 678beab Compare May 5, 2016 15:52
@gfyoung
Copy link
Member Author

gfyoung commented May 5, 2016

@jreback : would be great to get some feedback regarding the argmin/argmax failure (due to the SparseDataFrame not accepting scalar input for __array_wrap__) as well the potential aliasing of idxmin to argmin (and idxmax to argmax) for SparseDataFrame

@gfyoung gfyoung force-pushed the stats-axis-default branch from 678beab to 4dddda2 Compare May 5, 2016 18:26
@jreback
Copy link
Contributor

jreback commented May 5, 2016

@gfyoung can you a copy-pastable example here of issue with argmax/min we generally don't support them AFAICT on non Series.

@gfyoung
Copy link
Member Author

gfyoung commented May 5, 2016

>>> from pandas import SparseDataFrame
>>> import numpy as np
>>> sdf = SparseDataFrame([[1, 2, 3], [4, 5, 6]])
>>> np.argmin(sdf)
...
ValueError: Must pass 2-d input

I am not advocating that it be added to the list of supported methods, but I see no reason to test it per se in the list of functions for test_numpy_func_call (for SparseDataFrame that is)...that's all. Or rather, we should test it that raises like this? Not sure...

@gfyoung gfyoung force-pushed the stats-axis-default branch from 4dddda2 to 7cd9244 Compare May 6, 2016 12:41
@jreback
Copy link
Contributor

jreback commented May 6, 2016

doesn't work on dense either. I suppose you could make a separate issue about this. Its not high on priority list though.

In [5]: np.argmin(sdf.to_dense())
ValueError: Must pass 2-d input

# numpy passes in 'axis=None' or `axis=-1'
funcs = ['sum', 'cumsum', 'var', 'mean',
'prod', 'cumprod', 'std', 'argsort',
'argmin', 'argmax', 'sort', 'min', 'max']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so let's take out argmin/max

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@gfyoung
Copy link
Member Author

gfyoung commented May 6, 2016

@jreback : I don't consider to be an issue, as the documentation makes it clear that you should call the DataFrame.idxmin or DataFrame.idxmax instead of the np.argmin or np.argmax. I just wanted to double check that I could take it out of the list.

@jreback
Copy link
Contributor

jreback commented May 6, 2016

@gfyoung yep, take em out. ping when green.

@jreback jreback added this to the 0.18.2 milestone May 6, 2016
@gfyoung gfyoung force-pushed the stats-axis-default branch from 7cd9244 to da30e21 Compare May 6, 2016 18:41
@gfyoung
Copy link
Member Author

gfyoung commented May 6, 2016

@jreback : Travis is all green. Ready to merge if there is nothing else.

@jreback
Copy link
Contributor

jreback commented May 7, 2016

thanks!

@jreback jreback closed this in c089110 May 7, 2016
@gfyoung gfyoung deleted the stats-axis-default branch May 7, 2016 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants