
ENH add bins argument to value_counts #4502


Merged: 1 commit merged into pandas-dev:master on Aug 27, 2013
Conversation

hayd (Contributor) commented Aug 7, 2013

fixes #3945

Also adds the arguments present in the top-level function (sort and ascending) to the Series method.

Note: this only works with numeric data (it raises a TypeError if cut raised a TypeError; perhaps cut should raise a more descriptive error...).

In [1]: s = pd.Series(np.random.randn(100))

In [2]: s.value_counts(bins=4)
Out[2]:
 0.017354    44
-1.154431    35
 1.189140    14
-2.330904     7
dtype: int64

In [3]: s.value_counts(bins=4, normalize=True)  # also works with sort and ascending arguments.
Out[3]:
 0.017354    0.44
-1.154431    0.35
 1.189140    0.14
-2.330904    0.07
dtype: float64

In [11]: s = pd.Series(list('abc'))

In [12]: s.value_counts(bins=1)
TypeError: bins argument only works with numeric data.
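For illustration, roughly what the bins path amounts to: a sketch using pd.cut (which is used under the hood, per the note above about cut's TypeError); the actual implementation in this PR may label and order the bins differently.

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(100))

# Cut into 4 equal-width intervals and count how many values land in each;
# this is roughly what s.value_counts(bins=4) does.
counts = pd.cut(s, bins=4).value_counts()

# normalize=True just divides the counts by the number of observations.
fractions = counts / float(len(s))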

Note: I have a test failing on Python 3 (typically), a sorting one:

# don't sort
hist = s.value_counts(sort=False)
expected = Series([3, 1, 4, 2], index=list('acbd'))
assert_series_equal(hist, expected)
hayd (Contributor Author) commented:

This one fails on Python 3; if the result is platform-dependent then I'm not sure what I can test for here.

jreback (Contributor) commented:

Sort the results and expected, e.g. `hist.sort(); expected.sort()`.
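For illustration, a sketch of that order-insensitive comparison using sort_values (the non-mutating modern spelling of the in-place Series.sort suggested above); assert_series_equal is assumed here to come from pandas' testing utilities.

import pandas as pd
from pandas.testing import assert_series_equal

# Stand-ins for the actual and expected results of s.value_counts(sort=False),
# whose ordering can differ across platforms/Python versions.
hist = pd.Series([3, 1, 4, 2], index=list('acbd'))
expected = pd.Series([4, 3, 1, 2], index=list('bacd'))

# Sort both the same way so the comparison ignores ordering.
assert_series_equal(hist.sort_values(), expected.sort_values())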

hayd (Contributor Author) commented:

@jreback so it just tests that it's giving the correct Series, but doesn't care whether or not it has been sorted. Seems reasonable, thanks!

jreback (Contributor) commented Aug 7, 2013

side issue...why is pd.value_counts a top-level method?

hayd (Contributor Author) commented Aug 7, 2013

@jreback You can throw anything at it (e.g. arrays)

jreback (Contributor) commented Aug 7, 2013

@hayd oh...makes sense

hayd (Contributor Author) commented Aug 7, 2013

@jreback it's pretty much completely untested though (from what I could find), hence adding these few tests; should probably come up with some more...

jreback (Contributor) commented Aug 7, 2013

Maybe throw different dtypes at it, maybe numbers that look like each other, e.g. '1', 1.0, 1?

jreback (Contributor) commented Aug 7, 2013

Maybe it should even raise on floats?

hayd (Contributor Author) commented Aug 7, 2013

Passing in a mixture of ints and floats [1, 1.] is ok (and I think it should be); it raises when passing ['1', 1].

Will add tests for those.

(These should probably be caught better in cut; you get a weird TypeError. Will make an issue/fix later.)

jreback (Contributor) commented Aug 7, 2013

@hayd I was thinking of this:

which I guess works, so ok

In [2]: Series([1.,1.5,1.,1.5,3.,3.3333,3.3333,7/3.,7/3.]).value_counts()
Out[2]: 
2.333333    2
3.333300    2
1.000000    2
1.500000    2
3.000000    1
dtype: int64

hayd (Contributor Author) commented Aug 7, 2013

Agh (behaviour in master):

In [11]: pd.value_counts(['1', 1])
Out[11]:
1    2
dtype: int64

In [12]: pd.Series(['1', 1]).value_counts()
Out[12]:
1    1
1    1
dtype: int64

:s This is kind of solved if we use dtype=object for unknown data, but I know you dislike that... one of the first lines is:

values = np.asarray(values)

eeew

In [21]: np.asarray([1, '1', 31])
Out[21]:
array(['1', '1', '31'],
      dtype='|S2')
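For contrast, a quick sketch of how an explicit object dtype avoids that coercion; illustrative only.

import numpy as np

# Default coercion: everything becomes a fixed-width string dtype,
# so the int 1 and the string '1' collapse into the same value.
np.asarray([1, '1', 31])

# With dtype=object the original Python objects are kept distinct.
np.asarray([1, '1', 31], dtype=object)  # array([1, '1', 31], dtype=object)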

jreback (Contributor) commented Aug 7, 2013

Actually string hashing is just fine (but slower), so that's ok for unknown data; though maybe it should just raise if it is really mixed dtype (e.g. strings and ints)... hmmm

hayd (Contributor Author) commented Aug 7, 2013

I don't like it:

In [39]: pd.value_counts([1., 1., 1, '1'])
Out[39]:
1      2
1.0    2
dtype: int64

In [40]: pd.value_counts([1., 1., 1])
Out[40]:
1    3
dtype: int64

dtype=object solves everything :p

jreback (Contributor) commented Aug 7, 2013

sorry... accidentally closed

hayd (Contributor Author) commented Aug 25, 2013

@jreback any thoughts on what would be good behaviour for this one?

I propose passing in dtype (default None, which feeds asarray). I suppose we don't really care anyway, as it's a pretty edge case (and the user shouldn't be inputting it). This is only an issue in the top-level function.

jreback (Contributor) commented Aug 26, 2013

Why would you pass in dtype? Just use the dtype of the passed-in data.

hayd (Contributor Author) commented Aug 26, 2013

The only reason is if you pass in something that's not an array. Thinking about it, maybe it's easier just to use Series rather than asarray... then we don't need it.

ENH Series method accepts Categorical

Also, adds arguments present in top-level function to the Series
method (sort and ascending).
hayd (Contributor Author) commented Aug 26, 2013

@jreback removed the dtype arg and used the Series constructor instead; also made the Series constructor accept a Categorical (since this appeared to be a supported use case of value_counts).

@@ -148,7 +148,7 @@ def factorize(values, sort=False, order=None, na_sentinel=-1):
     return labels, uniques


-def value_counts(values, sort=True, ascending=False, normalize=False):
+def value_counts(values, sort=True, ascending=False, normalize=False, bins=None):
jreback (Contributor) commented:

Maybe add raise_on_error=False, with which you could automatically just exclude non-numeric data?

Or just do it by definition; I think we do this in, for example, describe (it should just work).

hayd (Contributor Author) commented:

I think Categorical was the thing that throws.

Is there a neat way to include only the numeric values from a Series?

Maybe call it numeric_only like groupby: dda2363. Not sure it should be the default though.

jreback (Contributor) commented:

Going to push something to enable _get_numeric_data/_get_bool_data on all NDFrames shortly.

jreback (Contributor) commented Aug 26, 2013

see #4675

jreback (Contributor) commented Aug 26, 2013

@hayd _get_numeric_data is now available (OT, should we add get_numeric_data as a 'user' method?)

hayd (Contributor Author) commented Aug 26, 2013

Am I being incredibly stupid:

In [6]: s = pd.Series(['1', 1, 1.])

In [7]: s._get_numeric_data()
Out[7]: Series([], dtype: float64)

hayd (Contributor Author) commented Aug 26, 2013

Of course the other option is that we just smash it with convert_numeric, but that seems like something a user should do.

jreback (Contributor) commented Aug 26, 2013

Hmm, it should preserve the original dtype. Let me fix that.

jreback (Contributor) commented Aug 26, 2013

#4677 fixes this.

hayd (Contributor Author) commented Aug 26, 2013

Ah, so that essentially gets numeric columns only; I was thinking it would return the numeric values:

s = pd.Series(['1', 1, 1.])
s._get_numeric_data()
# would be
pd.Series([1, 1.], index=[1, 2])
# or maybe
pd.Series([nan, 1, 1.])

In the value_counts case this means throwing away everything if it's dtype=object, which seems a little weird.
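To illustrate the column-wise behaviour being described here, a small sketch with a DataFrame; _get_numeric_data is a private method, so this is purely illustrative.

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z'], 'c': [1.0, 2.0, 3.0]})

# Selects whole numeric columns ('a' and 'c'); it never inspects individual
# values, which is why an object-dtype Series comes back empty.
df._get_numeric_data()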

jreback (Contributor) commented Aug 26, 2013

@hayd give it another try; I fixed empty Series so that the dtype propagates.

jreback (Contributor) commented Aug 26, 2013

In [1]: s = pd.Series(['1', 1, 1.])

In [2]: s._get_numeric_data()
Out[2]: Series([], dtype: object)

_get_numeric_data is essentially just saying give me a numeric dtype (no conversions are done)

You could maybe also do this:

In [7]: pd.Series(['1', 1, 1.]).convert_objects()
Out[7]: 
0    1
1    1
2    1
dtype: object

In [8]: pd.Series(['1', 1, 1.]).convert_objects(convert_numeric=True)
Out[8]: 
0    1
1    1
2    1
dtype: float64
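(convert_objects was later deprecated; in more recent pandas versions the same element-wise coercion can be sketched with pd.to_numeric:)

import pandas as pd

s = pd.Series(['1', 1, 1.0])

# errors='coerce' converts what it can and turns the rest into NaN,
# much like convert_objects(convert_numeric=True) did.
pd.to_numeric(s, errors='coerce')  # dtype: float64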

hayd (Contributor Author) commented Aug 26, 2013

Yeah, but what I mean is, I'm confused why you would want _get_numeric_data to return empty in this case (and not either of the above).

jreback (Contributor) commented Aug 26, 2013

@hayd it's not consistent with anything else; you would end up special-casing a Series (so what about a single-dtyped Frame?). I think you need to be explicit about the conversion.

jreback (Contributor) commented Aug 26, 2013

contrast with

In [9]: pd.Series(['foo',1.,1])._get_numeric_data()
Out[9]: Series([], dtype: object)

In [10]: pd.Series(['foo',1.,1]).convert_objects()
Out[10]: 
0    foo
1      1
2      1
dtype: object

In [11]: pd.Series(['foo',1.,1]).convert_objects(convert_numeric=True)
Out[11]: 
0   NaN
1     1
2     1
dtype: float64

jreback (Contributor) commented Aug 26, 2013

convert_objects is basically a safe astype; _get_numeric_data is just a way of telling you what is numeric (but 'item'-wise)... I suppose we could make it work on Series that way though (as the 'item's are the index elements)... and that's probably more useful... but then you have the problem of it 'dropping' the non-numeric elements, which may or may not be what you want.

hayd (Contributor Author) commented Aug 26, 2013

I'm not really sure what the best behaviour is here.

I don't think dropping the entire Series (if there is some bad data) will ever be what the user wants... I think it's better to raise, and then they can do the convert_numeric/remove the bad data themselves.

Maybe _get_numeric_columns is a better function name; it would have avoided my confusion! :s

jreback (Contributor) commented Aug 26, 2013

You always have a Series, right? Then just do convert_objects(convert_numeric=True) (as you ignore NaN anyhow, right?).

hayd (Contributor Author) commented Aug 26, 2013

So let's give it a convert_numeric arg?

jreback (Contributor) commented Aug 26, 2013

Sure, though maybe numeric= is better? (Unfortunately you will have to if/then this, because if numeric=False then you need _get_numeric_data, else you need convert_numeric.)

(I am actually lamenting convert_numeric.)

Hmm... maybe an option on _get_numeric_data to actually do convert_objects(convert_numeric=True)?

Or is that what you are suggesting? I am confusing myself.

hayd (Contributor Author) commented Aug 26, 2013

No, because it is actually a conversion. So I think convert_numeric is good (though perhaps it's more of a to_numeric... but consistency is probably good here).

numeric looks like it just includes numeric values (which isn't what this is doing). Earlier I was advocating some kind of is_numeric on values (maybe this would be a mask)...

:s

jreback (Contributor) commented Aug 26, 2013

@hayd ok... are you suggesting this for value_counts or _get_numeric_data?

In essence _get_numeric_data(convert_numeric=True) is tantamount to .convert_objects(convert_numeric=True)._get_numeric_data() (because it will still not return, say, a string column that is not convertible).

hayd (Contributor Author) commented Aug 26, 2013

I think I was talking about value_counts. I kinda think it should just raise (as the current implementation does). I think we should merge as is; we're going round in circles (and I'm confused)!

jreback (Contributor) commented Aug 26, 2013

I'm fine with merging too... go ahead

jreback (Contributor) commented Aug 26, 2013

one sec

jreback (Contributor) commented Aug 26, 2013

nvm... go ahead... value_counts is only a Series method, so only one dtype is possible...

hayd added a commit that referenced this pull request Aug 27, 2013
ENH add bins argument to value_counts
hayd merged commit 4226afe into pandas-dev:master on Aug 27, 2013
Successfully merging this pull request may close these issues.

ENH: Add Series.histogram()