
ENH add bins argument to value_counts #4502


Merged: 1 commit merged into pandas-dev:master on Aug 27, 2013
Conversation

hayd (Contributor) commented Aug 7, 2013

fixes #3945

Also adds the arguments present in the top-level function (sort and ascending) to the Series method.

Note: this only works with numeric data (it raises a TypeError if cut raised a TypeError; perhaps cut should raise a more descriptive error...).

In [1]: s = pd.Series(np.random.randn(100))

In [2]: s.value_counts(bins=4)
Out[2]:
 0.017354    44
-1.154431    35
 1.189140    14
-2.330904     7
dtype: int64

In [3]: s.value_counts(bins=4, normalize=True)  # also works with sort and ascending arguments.
Out[3]:
 0.017354    0.44
-1.154431    0.35
 1.189140    0.14
-2.330904    0.07
dtype: float64

In [11]: s = pd.Series(list('abc'))

In [12]: s.value_counts(bins=1)
TypeError: bins argument only works with numeric data.
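For illustration, roughly what the bins path amounts to: a sketch using pd.cut (which is used under the hood, per the note above about cut's TypeError); the actual implementation in this PR may label and order the bins differently.

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(100))

# Cut into 4 equal-width intervals and count how many values land in each;
# this is roughly what s.value_counts(bins=4) does.
counts = pd.cut(s, bins=4).value_counts()

# normalize=True just divides the counts by the number of observations.
fractions = counts / float(len(s))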

Note: I have a test failing on Python 3 (typically), a sorting one:

# don't sort
hist = s.value_counts(sort=False)
expected = Series([3, 1, 4, 2], index=list('acbd'))
assert_series_equal(hist, expected)
hayd (Contributor Author) commented:

This one fails on Python 3; if the result is platform-dependent then I'm not sure what I can test for here.

jreback (Contributor) commented:

Sort the results and expected, e.g. `hist.sort(); expected.sort()`.
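For illustration, a sketch of that order-insensitive comparison using sort_values (the non-mutating modern spelling of the in-place Series.sort suggested above); assert_series_equal is assumed here to come from pandas' testing utilities.

import pandas as pd
from pandas.testing import assert_series_equal

# Stand-ins for the actual and expected results of s.value_counts(sort=False),
# whose ordering can differ across platforms/Python versions.
hist = pd.Series([3, 1, 4, 2], index=list('acbd'))
expected = pd.Series([4, 3, 1, 2], index=list('bacd'))

# Sort both the same way so the comparison ignores ordering.
assert_series_equal(hist.sort_values(), expected.sort_values())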

hayd (Contributor Author) commented:

@jreback so it just tests that it's giving the correct Series, but doesn't care whether or not it has been sorted. Seems reasonable, thanks!

jreback (Contributor) commented Aug 7, 2013

side issue...why is pd.value_counts a top-level method?

hayd (Contributor Author) commented Aug 7, 2013

@jreback You can throw anything at it (e.g. arrays)

jreback (Contributor) commented Aug 7, 2013

@hayd oh...makes sense

hayd (Contributor Author) commented Aug 7, 2013

@jreback it's pretty much completely untested though (from what I could find), hence adding these few tests; should probably come up with some more...

jreback (Contributor) commented Aug 7, 2013

Maybe throw different dtypes at it, maybe numbers that look like each other, e.g. '1', 1.0, 1?

jreback (Contributor) commented Aug 7, 2013

Maybe it should even raise on floats?

hayd (Contributor Author) commented Aug 7, 2013

Passing in a mixture of ints and floats [1, 1.] is ok (and I think it should be); it raises when passing ['1', 1].

Will add tests for those.

(These should probably be caught better in cut; you get a weird TypeError. Will make an issue/fix later.)

jreback (Contributor) commented Aug 7, 2013

@hayd I was thinking of this:

which I guess works, so ok

In [2]: Series([1.,1.5,1.,1.5,3.,3.3333,3.3333,7/3.,7/3.]).value_counts()
Out[2]: 
2.333333    2
3.333300    2
1.000000    2
1.500000    2
3.000000    1
dtype: int64

hayd (Contributor Author) commented Aug 7, 2013

Agh (behaviour in master):

In [11]: pd.value_counts(['1', 1])
Out[11]:
1    2
dtype: int64

In [12]: pd.Series(['1', 1]).value_counts()
Out[12]:
1    1
1    1
dtype: int64

:s This is kind of solved if we use dtype=object for unknown data, but I know you dislike that... one of the first lines is:

values = np.asarray(values)

eeew

In [21]: np.asarray([1, '1', 31])
Out[21]:
array(['1', '1', '31'],
      dtype='|S2')
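For contrast, a quick sketch of how an explicit object dtype avoids that coercion; illustrative only.

import numpy as np

# Default coercion: everything becomes a fixed-width string dtype,
# so the int 1 and the string '1' collapse into the same value.
np.asarray([1, '1', 31])

# With dtype=object the original Python objects are kept distinct.
np.asarray([1, '1', 31], dtype=object)  # array([1, '1', 31], dtype=object)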

jreback (Contributor) commented Aug 7, 2013

Actually string hashing is just fine (but slower), so that's ok for unknown data; though maybe it should just raise if it is really mixed dtype (e.g. strings and ints)... hmmm

hayd (Contributor Author) commented Aug 7, 2013

I don't like it:

In [39]: pd.value_counts([1., 1., 1, '1'])
Out[39]:
1      2
1.0    2
dtype: int64

In [40]: pd.value_counts([1., 1., 1])
Out[40]:
1    3
dtype: int64

dtype=object solves everything :p

jreback (Contributor) commented Aug 7, 2013

sorry... accidentally closed

hayd (Contributor Author) commented Aug 25, 2013

@jreback any thoughts on what would be good behaviour for this one?

I propose passing in dtype (default None, which feeds asarray). I suppose we don't really care anyway, as it's a pretty edge case (and the user shouldn't be inputting it). This is only an issue in the top-level function.

jreback (Contributor) commented Aug 26, 2013

Why would you pass in dtype? Just use the dtype of the passed-in data.

hayd (Contributor Author) commented Aug 26, 2013

The only reason is if you pass in something that's not an array. Thinking about it, maybe it's easier just to use Series rather than asarray... then we don't need it.

ENH Series method accepts Categorical

Also, adds arguments present in top-level function to the Series
method (sort and ascending).
hayd (Contributor Author) commented Aug 26, 2013

@jreback removed the dtype arg and used the Series constructor instead; also made the Series constructor accept a Categorical (since this appeared to be a supported use case of value_counts).

@@ -148,7 +148,7 @@ def factorize(values, sort=False, order=None, na_sentinel=-1):
     return labels, uniques


-def value_counts(values, sort=True, ascending=False, normalize=False):
+def value_counts(values, sort=True, ascending=False, normalize=False, bins=None):
jreback (Contributor) commented:

Maybe add raise_on_error=False, with which you could automatically just exclude non-numeric data?

Or just do it by definition; I think we do this in, for example, describe (it should just work).

hayd (Contributor Author) commented:

I think Categorical was the thing that throws.

Is there a neat way to include only the numeric values from a Series?

Maybe call it numeric_only like groupby: dda2363. Not sure it should be the default though.

jreback (Contributor) commented:

Going to push something to enable _get_numeric_data/_get_bool_data on all NDFrames shortly.

jreback (Contributor) commented Aug 26, 2013

see #4675

jreback (Contributor) commented Aug 26, 2013

@hayd _get_numeric_data is now available (OT, should we add get_numeric_data as a 'user' method?)

hayd (Contributor Author) commented Aug 26, 2013

Am I being incredibly stupid:

In [6]: s = pd.Series(['1', 1, 1.])

In [7]: s._get_numeric_data()
Out[7]: Series([], dtype: float64)

hayd (Contributor Author) commented Aug 26, 2013

Of course the other option is that we just smash it with convert_numeric, but that seems like something a user should do.

jreback (Contributor) commented Aug 26, 2013

Hmm, it should preserve the original dtype. Let me fix that.

jreback (Contributor) commented Aug 26, 2013

#4677 fixes this.

hayd (Contributor Author) commented Aug 26, 2013

Ah, so that essentially gets numeric columns only; I was thinking it would return the numeric values:

s = pd.Series(['1', 1, 1.])
s._get_numeric_data()
# would be
pd.Series([1, 1.], index=[1, 2])
# or maybe
pd.Series([nan, 1, 1.])

In the value_counts case this means throwing away everything if it's dtype=object, which seems a little weird.
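To illustrate the column-wise behaviour being described here, a small sketch with a DataFrame; _get_numeric_data is a private method, so this is purely illustrative.

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z'], 'c': [1.0, 2.0, 3.0]})

# Selects whole numeric columns ('a' and 'c'); it never inspects individual
# values, which is why an object-dtype Series comes back empty.
df._get_numeric_data()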

jreback (Contributor) commented Aug 26, 2013

@hayd give it another try; I fixed empty Series so that the dtype propagates.

jreback (Contributor) commented Aug 26, 2013

In [1]: s = pd.Series(['1', 1, 1.])

In [2]: s._get_numeric_data()
Out[2]: Series([], dtype: object)

_get_numeric_data is essentially just saying give me a numeric dtype (no conversions are done)

You could maybe also do this:

In [7]: pd.Series(['1', 1, 1.]).convert_objects()
Out[7]: 
0    1
1    1
2    1
dtype: object

In [8]: pd.Series(['1', 1, 1.]).convert_objects(convert_numeric=True)
Out[8]: 
0    1
1    1
2    1
dtype: float64
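(convert_objects was later deprecated; in more recent pandas versions the same element-wise coercion can be sketched with pd.to_numeric:)

import pandas as pd

s = pd.Series(['1', 1, 1.0])

# errors='coerce' converts what it can and turns the rest into NaN,
# much like convert_objects(convert_numeric=True) did.
pd.to_numeric(s, errors='coerce')  # dtype: float64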

hayd (Contributor Author) commented Aug 26, 2013

Yeah, but what I mean is, I'm confused why you would want _get_numeric_data to return empty in this case (and not either of the above).

jreback (Contributor) commented Aug 26, 2013

@hayd it's not consistent with anything else; you would end up special-casing a Series (so what about a single-dtyped Frame?). I think you need to be explicit about the conversion.

jreback (Contributor) commented Aug 26, 2013

contrast with

In [9]: pd.Series(['foo',1.,1])._get_numeric_data()
Out[9]: Series([], dtype: object)

In [10]: pd.Series(['foo',1.,1]).convert_objects()
Out[10]: 
0    foo
1      1
2      1
dtype: object

In [11]: pd.Series(['foo',1.,1]).convert_objects(convert_numeric=True)
Out[11]: 
0   NaN
1     1
2     1
dtype: float64

jreback (Contributor) commented Aug 26, 2013

convert_objects is basically a safe astype; _get_numeric_data is just a way of telling you what is numeric (but 'item'-wise)... I suppose we could make it work on Series that way though (as the 'item's are the index elements)... and that's probably more useful... but then you have the problem of it 'dropping' the non-numeric elements, which may or may not be what you want.

hayd (Contributor Author) commented Aug 26, 2013

I'm not really sure what the best behaviour is here.

I don't think dropping the entire Series (if there is some bad data) will ever be what the user wants... I think it's better to raise, and then they can do the convert_numeric/remove the bad data themselves.

Maybe _get_numeric_columns is a better function name; it would have avoided my confusion! :s

jreback (Contributor) commented Aug 26, 2013

You always have a Series, right? Then just do convert_objects(convert_numeric=True) (as you ignore NaN anyhow, right?).

hayd (Contributor Author) commented Aug 26, 2013

So let's give it a convert_numeric arg?

jreback (Contributor) commented Aug 26, 2013

Sure, though maybe numeric= is better? (Unfortunately you will have to if/then this, because if numeric=False then you need _get_numeric_data, else you need convert_numeric.)

(I am actually lamenting convert_numeric.)

Hmm... maybe an option on _get_numeric_data to actually do convert_objects(convert_numeric=True)?

Or is that what you are suggesting? I am confusing myself.

hayd (Contributor Author) commented Aug 26, 2013

No, because it is actually a conversion. So I think convert_numeric is good (though perhaps it's more of a to_numeric... but consistency is probably good here).

numeric looks like it just includes numeric values (which isn't what this is doing). Earlier I was advocating some kind of is_numeric on values (maybe this would be a mask)...

:s

jreback (Contributor) commented Aug 26, 2013

@hayd ok... are you suggesting this for value_counts or _get_numeric_data?

In essence _get_numeric_data(convert_numeric=True) is tantamount to .convert_objects(convert_numeric=True)._get_numeric_data() (because it will still not return, say, a string column that is not convertible).

hayd (Contributor Author) commented Aug 26, 2013

I think I was talking about value_counts. I kinda think it should just raise (as the current implementation does). I think we should merge as is; we're going round in circles (and I'm confused)!

jreback (Contributor) commented Aug 26, 2013

I'm fine with merging too... go ahead

jreback (Contributor) commented Aug 26, 2013

one sec

jreback (Contributor) commented Aug 26, 2013

nvm... go ahead... value_counts is only a Series method, so only one dtype is possible...

hayd added a commit that referenced this pull request Aug 27, 2013
ENH add bins argument to value_counts
hayd merged commit 4226afe into pandas-dev:master on Aug 27, 2013
Successfully merging this pull request may close these issues.

ENH: Add Series.histogram()