-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH add bins argument to value_counts #4502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
# don't sort | ||
hist = s.value_counts(sort=False) | ||
expected = Series([3, 1, 4, 2], index=list('acbd')) | ||
assert_series_equal(hist, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one fails on python 3, if result is platform dep then not sure what I can test for here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sort the results and expected, e.g. ``hist.sort(); expected.sort()`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback so just tests it's giving correct Series, but don't care whether or not it has sorted. seems reasonable. thanks!
side issue...why is |
@jreback You can throw anything at it (e.g. arrays) |
@hayd oh...makes sense |
@jreback it's pretty much completely untested though (from what I could find), hence adding these few tests, should probably come up with some more... |
maybe throw different dtypes at it, maybe number that look like each other e.g. |
maybe should even raise on |
Passing in a mixture of ints and floats [1, 1.] is ok (and think it should be), it raises when passing ['1'. 1]. will add test for those. (These should probably be caught better in cut, get weird TypeError, will make an issue/fix later.) |
@hayd I was thinking of this: which I guess works, so ok
|
Ag (behaviour in master):
:s kind of solved if we use dtype=object for unknown data, but I know you dislike that... one of the first lines is:
eeew
|
actually string hashing is just fine (but slower), so for unknown data that is ok; though maybe should just raise if it is really mixed dtype (e.g. strings and ints)....hmmm |
I don't like it:
dtype=object solves everything :p |
sorry...accid closed |
@jreback any thoughts what would be good behaviour on this one. I propose passing in dtype (default None, which feeds |
why would u pass in dtype? |
only reason is if you pass in something not an array. Thinking about it, maybe easier just to use Series rather than asarray... then don't need it. |
ENH Series method accepts Categorical Also, adds arguments present in top-level function to the Series method (sort and ascending).
@jreback removed dtype arg, and use Series constructor instead, also made Series constructor accept a Categorical (since this appeared to be a supported use case of |
@@ -148,7 +148,7 @@ def factorize(values, sort=False, order=None, na_sentinel=-1): | |||
return labels, uniques | |||
|
|||
|
|||
def value_counts(values, sort=True, ascending=False, normalize=False): | |||
def value_counts(values, sort=True, ascending=False, normalize=False, bins=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe? add raise_on_error=False
, which you could automatically just exclude non-numeric
or just do it by definition; I think we do this is for example describe (it should just work);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Categorical was the thing that throws.
Is there a neat way to just include only numerics from a Series?
Maybe call numeric_only
like groupby: dda2363. Not sure it should be default though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going to push somethign to enable _get_numeric_data/_get_bool_data
in all NDFrame shortly
see #4675 |
@hayd _get_numeric_data is now available (OT, should we add get_numeric_data as a 'user' method?) |
Am I being incredibly stupid:
|
of course the other option is we just smash it with convert_numeric, but that seems like something a user should do. |
hmm |
#4677 fixes |
ah so that essentially gets numeric columns only, I was thinking it would return the numeric things:
In the value_counts case this means thowing away everything if it's dtype=object, which seems a little weird. |
@hayd give another try I fixed empty Series so that the dtype propogates |
_get_numeric_data is essentially just saying give me a numeric dtype (no conversions are done) You maybe also do this:
|
Yeah, but what I mean is, I'm confused why you would want |
@hayd its not consistent with anything else, you would end up special casing a Series (so what about a single-dtyped Frame)? I think you need to be explicit about the conversion |
contrast with
|
|
I'm not really sure what is best behaviour here. I don't think dropping the entire Series (if there is some bad data) will ever be what user wants... I think better to raise and then they do the convert_numeric/remove bad data. Maybe |
you always have a Series right? then just do |
So let's give it a |
sure, maybe (I am actually lamenting hmm....maybe an option to or is that what you are suggesting? I am confusing myself |
no, because it is actually a conversion. So I think convert_numeric is good (though perhaps it's more a to_numeric... but consistency is probably good here) numeric looks like it just includes numeric values (which isn't what this is doing). Earlier I was advocating some kind of is_numeric on values (maybe this would be a mask)... :s |
@hayd ok...are you suggesting this for in essence |
I think I was talking about value_counts. I kinda think it should just raise (as is current implementation), think we should merge as is, going round in circles (and I'm confused!)! |
I'm fine with mergine too...go ahead |
one sec |
nvm...go ahead....value_counts is only a Series method so only 1 dtype is possible... |
ENH add bins argument to value_counts
fixes #3945
Also, adds arguments present in top-level function to the Series
method (sort and ascending).
Note: this only works with numeric data (it raises a TypeError if there was a TypeError problem in cut, however perhaps cut should raise a more descirptive error...)
Note: Have a test failing on python3 (topically) a sorting one.