Behavior of Series.values when dtype is "category" is surprising #9580

Closed
mwaskom opened this issue Mar 3, 2015 · 10 comments
Labels
Categorical Categorical Data Type Docs

Comments

@mwaskom
Contributor

mwaskom commented Mar 3, 2015

Say I make a category-type Series:

s = pd.Series(["a", "b", "c"], dtype="category")

I want to pass this to a function that expects a numpy array, so I use the values property. According to the documentation:

s.values?
Type:        property
String form: <property object at 0x106d84050>
Docstring:
Return Series as ndarray

Returns
-------
arr : numpy.ndarray

However:

s.values
[a, b, c]
Categories (3, object): [a < b < c]
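
The mismatch described above can be checked directly by inspecting the type of what `.values` returns. This is a minimal sketch, assuming a pandas version with categorical support; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype="category")
values = s.values

# For a category-dtype Series, .values is a pandas Categorical,
# not a numpy.ndarray, despite what the old docstring said.
is_ndarray = isinstance(values, np.ndarray)
is_categorical = isinstance(values, pd.Categorical)
print(type(values).__name__, is_ndarray, is_categorical)
```
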
@mwaskom
Contributor Author

mwaskom commented Mar 3, 2015

Aside from the documentation issue, I thought that using .values was the canonical way to represent a Series as an ndarray, so this is potentially a problematic API change.

@jreback
Contributor

jreback commented Mar 3, 2015

s.values gets you an object that has the __array__ protocol defined, so you can pass it to a function expecting an ndarray. This is not an API change; FYI, the same is true of Sparse objects.

I suppose the doc-string could be updated.

In [10]: s = pd.Series(["a", "b", "c"], dtype="category")

In [11]: s.values.__array__
Out[11]:
<bound method Categorical.__array__ of [a, b, c]
Categories (3, object): [a < b < c]>

In [12]: np.array(s.values)
Out[12]: array(['a', 'b', 'c'], dtype=object)
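
To illustrate what "has the `__array__` protocol defined" means in general, here is a rough sketch using a made-up class. `TwoPartArray` is hypothetical and is not how pandas implements `Categorical`; it only shows that NumPy will call `__array__` to obtain a real ndarray, while the object itself still lacks ndarray attributes.

```python
import numpy as np


class TwoPartArray:
    """Hypothetical container: integer codes plus a lookup table of labels."""

    def __init__(self, codes, categories):
        self.codes = np.asarray(codes)
        self.categories = np.asarray(categories, dtype=object)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this hook whenever it needs a real ndarray.
        out = self.categories[self.codes]
        return out if dtype is None else out.astype(dtype)


obj = TwoPartArray([0, 1, 2, 0], ["a", "b", "c"])
arr = np.asarray(obj)  # triggers obj.__array__()

print(type(arr).__name__, arr.tolist())
print(hasattr(obj, "size"))  # the wrapper itself has no ndarray attributes
```

The conversion works, but code that touches ndarray attributes on the unconverted object would still fail.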

@jreback jreback added Docs Categorical Categorical Data Type labels Mar 3, 2015
@jreback jreback added this to the Next Major Release milestone Mar 3, 2015
@mwaskom
Contributor Author

mwaskom commented Mar 3, 2015

Well, ok, but that's not quite the same as giving you an object that has all the methods on an ndarray. I currently have code failing because it is trying to use the .size attribute on the object that comes out of Series.values.

@jreback
Contributor

jreback commented Mar 3, 2015

A categorical is NOT an ndarray, though it acts like one. It is simply not representable by a single ndarray (it's more like 2 ndarrays).
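
The "2 ndarrays" remark can be made concrete with the `.cat` accessor: one array of integer codes and one array of category labels. This is a small sketch; the sample data and variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "a", "c"], dtype="category")

codes = np.asarray(s.cat.codes)            # integer position of each value
categories = np.asarray(s.cat.categories)  # the unique labels

# Indexing the labels by the codes recovers the original values.
rebuilt = categories[codes]
print(codes.tolist(), categories.tolist(), rebuilt.tolist())
```
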

@mwaskom
Contributor Author

mwaskom commented Mar 3, 2015

I completely agree. Which is why I'm saying that from a user perspective, the behavior of Series.values when the dtype of that series is category is unexpected, and breaks an existing API.

@shoyer
Member

shoyer commented Mar 3, 2015

I agree, the documentation here is broken. We should point users toward using np.asarray(s) if they might have categorical or sparse values and want to be sure they get a numpy.ndarray object.
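
A short sketch of the `np.asarray(s)` suggestion, for library code that needs ndarray attributes such as `.size`. The helper name `as_ndarray` is hypothetical and only wraps the NumPy call.

```python
import numpy as np
import pandas as pd


def as_ndarray(obj):
    """Hypothetical helper: coerce array-like input to a plain numpy.ndarray."""
    return np.asarray(obj)


s = pd.Series(["a", "b", "c"], dtype="category")
arr = as_ndarray(s)

# arr is a real ndarray, so ndarray attributes are safe to use.
n = arr.size
print(type(arr).__name__, n)
```
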

@jreback
Contributor

jreback commented Mar 3, 2015

@mwaskom you can simply use .get_values(), which is an 'internal-ish' method that always returns an ndarray (it may coerce, as you see below). .values has actually always behaved this way, in that there are other dtypes that are backed by a non-ndarray.

In [3]: s.get_values()
Out[3]: array(['a', 'b', 'c'], dtype=object)

In [4]: s.astype('object').get_values()
Out[4]: array(['a', 'b', 'c'], dtype=object)

In [5]: pd.Series([1, 2, 3]).get_values()
Out[5]: array([1, 2, 3])
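
Note for readers on newer pandas: `get_values()` was later deprecated and removed, and `Series.to_numpy()` (added in pandas 0.24) fills the same role. This is a minimal sketch, assuming pandas 0.24 or later; it is not part of the original discussion.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype="category")

# to_numpy() returns a numpy.ndarray, coercing if necessary.
arr = s.to_numpy()
ints = pd.Series([1, 2, 3]).to_numpy()

print(type(arr).__name__, arr.tolist())
print(type(ints).__name__, ints.tolist())
```
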

@mwaskom
Contributor Author

mwaskom commented Mar 3, 2015

Thanks @jreback, though I will probably not be using an "internal-ish" method in library code.

@jreback
Contributor

jreback commented Mar 3, 2015

I'll clarify: I meant unadvertised. It's in the public API.

@jorisvandenbossche jorisvandenbossche modified the milestones: No action, Next Major Release Feb 17, 2017
@jorisvandenbossche
Member

The docstring of .values has been updated in the meantime (https://github.com/pandas-dev/pandas/pull/10477/files#diff-150fd3c5a732ae915ec47bc54a933c41R328), so closing this.
The different output type is of course still confusing, but that is not something we are going to solve before pandas 2.
