Skip to content

BUG-19214 int categoricals are formatted as ints #24494

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Jan 5, 2019

Conversation

JustinZhengBC
Copy link
Contributor

This PR alters the ExtensionArrayFormatter such that if it receives a categorical series with only integers and NaN, it will print the integers as integers instead of converting to floats.

>>> # previous behaviour
>>> pd.Series([1,2, np.nan], dtype="object").astype("category")
0    1.0
1    2.0
2    NaN
dtype: category
Categories (2, int64): [1, 2]
>>> # new behaviour
>>> pd.Series([1,2, np.nan], dtype="object").astype("category")
0      1
1      2
2    NaN
dtype: category
Categories (2, int64): [1, 2]

@codecov
Copy link

codecov bot commented Dec 30, 2018

Codecov Report

Merging #24494 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24494      +/-   ##
==========================================
+ Coverage   92.31%   92.31%   +<.01%     
==========================================
  Files         166      166              
  Lines       52412    52414       +2     
==========================================
+ Hits        48382    48384       +2     
  Misses       4030     4030
Flag Coverage Δ
#multiple 90.73% <100%> (ø) ⬆️
#single 43.05% <50%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/formats/format.py 97.98% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1ebfd8a...a2593f0. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 30, 2018

Codecov Report

Merging #24494 into master will increase coverage by 0.05%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24494      +/-   ##
==========================================
+ Coverage   92.32%   92.38%   +0.05%     
==========================================
  Files         166      166              
  Lines       52440    52379      -61     
==========================================
- Hits        48417    48388      -29     
+ Misses       4023     3991      -32
Flag Coverage Δ
#multiple 90.8% <ø> (+0.05%) ⬆️
#single 43.01% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.43% <ø> (-0.05%) ⬇️
pandas/core/internals/construction.py 95.93% <0%> (-0.73%) ⬇️
pandas/core/nanops.py 94.36% <0%> (-0.55%) ⬇️
pandas/core/dtypes/concat.py 96.6% <0%> (-0.46%) ⬇️
pandas/core/internals/concat.py 96.45% <0%> (-0.39%) ⬇️
pandas/core/algorithms.py 94.77% <0%> (-0.34%) ⬇️
pandas/core/indexes/period.py 92.06% <0%> (-0.32%) ⬇️
pandas/core/indexes/timedeltas.py 90.22% <0%> (-0.26%) ⬇️
pandas/core/arrays/datetimes.py 97.84% <0%> (-0.08%) ⬇️
pandas/core/indexes/base.py 96.29% <0%> (-0.07%) ⬇️
... and 44 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cf92230...7cc02c1. Read the comment docs.

@jreback jreback added Output-Formatting __repr__ of pandas objects, to_string Categorical Categorical Data Type labels Dec 30, 2018
@JustinZhengBC
Copy link
Contributor Author

@jreback I looked into it and the CategoricalFormatter is used for printing categoricals like this:

[1.0, 2.0, NaN]
Categories (2, int64): [1, 2]

instead of like a series. The PR now modifies the CategoricalFormatter in addition to the ExtensionArrayFormatter.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the fix is in .get_values() do this

n [28]: c.categories.astype(object).take(c.codes, fill_value=np.nan)
Out[28]: Index([1, 2, nan], dtype='object')

if -1 in c.codes and is_integer_dtype(c.categories)

@jreback
Copy link
Contributor

jreback commented Dec 30, 2018

@TomAugspurger any objections here?

@jreback jreback added this to the 0.24.0 milestone Dec 30, 2018
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the warnings are not needed

@@ -1323,6 +1323,7 @@ Categorical
- Bug in many methods of the ``.str``-accessor, which always failed on calling the ``CategoricalIndex.str`` constructor (:issue:`23555`, :issue:`23556`)
- Bug in :meth:`Series.where` losing the categorical dtype for categorical data (:issue:`24077`)
- Bug in :meth:`Categorical.apply` where ``NaN`` values could be handled unpredictably. They now remain unchanged (:issue:`24241`)
- Bug in :meth:`Categorical.get_values` where integers would be formatted as floats if ``NaN`` values were present (:issue:`19214`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not the user facing note though, it is formtting of a Series/Categorical where this happens. .get_values() its just an implementation detail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I'm a bit confused as to what a user-facing note is. Is it a specific section somewhere? Or does it mean to phrase the note differently, like "bug in the Categorical repr"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry. what I mean is a user would want to know that the string repr of Series (or categorical dtype) and Categoricals for integer categories will now not be coerced to float. Its maybe worth making this a sub-section (e.g. show previous / new behavior). you can just in-line it here.

@@ -1520,6 +1520,11 @@ def get_values(self):
# if we are a datetime and period index, return Index to keep metadata
if is_datetimelike(self.categories):
return self.categories.take(self._codes, fill_value=np.nan)
elif is_integer_dtype(self.categories) and -1 in self._codes:
warn("Integer values represented as objects to accomodate NaNs",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for a warning

@@ -240,6 +241,16 @@ def test_categorical_repr_datetime_ordered(self):

assert repr(c) == exp

@pytest.mark.filterwarnings("ignore:Integer values:RuntimeWarning")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no warnings are needed

@@ -1653,6 +1683,7 @@ Reshaping
- :meth:`DataFrame.nlargest` and :meth:`DataFrame.nsmallest` now returns the correct n values when keep != 'all' also when tied on the first columns (:issue:`22752`)
- Constructing a DataFrame with an index argument that wasn't already an instance of :class:`~pandas.core.Index` was broken (:issue:`22227`).
- Bug in :class:`DataFrame` prevented list subclasses to be used to construction (:issue:`21226`)
- Calling :func:`pandas.concat` on a ``Categorical`` of ints with NA values now causes them to be processed as objects (formerly coerced to floats) (:issue:`19214`)
Copy link
Contributor

@TomAugspurger TomAugspurger Jan 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only applies when concating a categorical with a different dtype, right? If I concat two integer cats with the same dtype, it’s still categorical right?

assert repr(c) == c_exp

s = Series([1, 2, np.nan], dtype="object").astype("category")
s_exp = """0 1\n1 2\n2 NaN\ndtype: category\nCategories (2, int64): [1, 2]""" # noqa
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use parenthesis to wrap this line.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, the change in dtype for concat is the larger change. Just from the release notes it looks like the formatting change is larger.

@jreback
Copy link
Contributor

jreback commented Jan 1, 2019

To me, the change in dtype for concat is the larger change. Just from the release notes it looks like the formatting change is larger.

I don't think this is that big of a deal, do you want a sub-section for this?

@jreback
Copy link
Contributor

jreback commented Jan 3, 2019

@TomAugspurger you think need to highlite the change more?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 3, 2019 via email

@jreback
Copy link
Contributor

jreback commented Jan 4, 2019

I'm more concerned about the breaking
change to get_values
with no deprecation warning. Is there any way to deprecate the behavior, or
would that require to

This is a very small case to be honest (and our current behavior for concatting any non-same types, so this is a bug IMHO).

@JustinZhengBC can you make a small sub-section about the changes in Categorical int-like concat, showing previous and current behavior. I think that is fine.

@jreback
Copy link
Contributor

jreback commented Jan 5, 2019

@TomAugspurger better?

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just doc comments.

@jreback
Copy link
Contributor

jreback commented Jan 5, 2019

@TomAugspurger ?

@jreback jreback merged commit b9d2700 into pandas-dev:master Jan 5, 2019
@jreback
Copy link
Contributor

jreback commented Jan 5, 2019

thanks @JustinZhengBC keep em coming!

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

Successfully merging this pull request may close these issues.

categorical series with null converts ints to float
3 participants