Commit 206fb97

FIX: Implement small Categorical fixups
This is a squashed commit which addresses several small issues identified in the latest code review.

* As per discussion in pandas-dev#8074, Categorical.unique() should return an array and not an Index object to make it consistent with other unique() implementations.
* Mark a few more categorical methods as internal.
* Change reorder_levels and remove_unused_levels to inplace=False. As per the discussion in pandas-dev#8074, change the default of both methods to return a copy and only do the change inplace if told otherwise. Discussion in pandas-dev#8074 (comment)
* Make str(cat) more array-like. As Categorical is more like np.ndarray and less similar to pd.Series, change the string representation of Categorical to be more like np.ndarray. The new str(cat) will show at most 10 values. Also remove the "Categorical(" in front of an empty categorical: either it is put in front of every str(cat) or none. It also missed the final ")".
* Test more NaN handling in Categorical.
* Add a testcase for groupby with two columns and unused categories. During the development of pandas' new Categorical support, groupby with two columns (grouping by one column worked) didn't include empty categories, which omitted the rows with NaN. Add a testcase so this is checked. No code changes necessary, this bug was fixed at some other places before. See pandas-dev#8138 for more details.
1 parent ca2e608 commit 206fb97

5 files changed, +183 -57 lines changed

doc/source/categorical.rst

+10 -10
@@ -298,12 +298,13 @@ This is even true for strings and numeric data:
 
 Reordering the levels is possible via the ``Categorical.reorder_levels(new_levels)`` or
 ``Series.cat.reorder_levels(new_levels)`` methods. All old levels must be included in the new
-levels.
+levels. Note that per default, this operation returns a new Series and you need to specify
+``inplace=True`` to do the change inplace!
 
 .. ipython:: python
 
-    s2 = pd.Series(pd.Categorical([1,2,3,1]))
-    s2.cat.reorder_levels([2,3,1])
+    s = pd.Series(pd.Categorical([1,2,3,1]))
+    s2 = s.cat.reorder_levels([2,3,1])
     s2
     s2.sort()
     s2
@@ -322,8 +323,8 @@ old levels:
 
 .. ipython:: python
 
-    s3 = pd.Series(pd.Categorical(["a","b","d"]))
-    s3.cat.reorder_levels(["a","b","c","d"])
+    s = pd.Series(pd.Categorical(["a","b","d"]))
+    s3 = s.cat.reorder_levels(["a","b","c","d"])
     s3
 
 
@@ -582,7 +583,7 @@ relevant columns back to `category` and assign the right levels and level orderi
     # rename the levels
     s.cat.levels = ["very good", "good", "bad"]
     # reorder the levels and add missing levels
-    s.cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
+    s = s.cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
     df = pd.DataFrame({"cats":s, "vals":[1,2,3,4,5,6]})
     csv = StringIO()
     df.to_csv(csv)
@@ -591,7 +592,7 @@ relevant columns back to `category` and assign the right levels and level orderi
     df2["cats"]
     # Redo the category
     df2["cats"] = df2["cats"].astype("category")
-    df2["cats"].cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
+    df2["cats"].cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"], inplace=True)
     df2.dtypes
     df2["cats"]
 
@@ -731,7 +732,7 @@ Series of type ``category`` this means that there is some danger to confuse both
     except Exception as e:
         print("Exception: " + str(e))
     # right
-    s.cat.reorder_levels([4,3,2,1])
+    s = s.cat.reorder_levels([4,3,2,1])
     print(s.cat.levels)
 
 See also the API documentation for :func:`pandas.Series.reorder_levels` and
@@ -742,8 +743,7 @@ Old style constructor usage
 
 I earlier versions, a `Categorical` could be constructed by passing in precomputed `codes`
 (called then `labels`) instead of values with levels. The `codes` are interpreted as pointers
-to the levels with `-1` as `NaN`. This usage is now deprecated and not available unless
-``compat=True`` is passed to the constructor of `Categorical`.
+to the levels with `-1` as `NaN`.
 
 .. ipython:: python
     :okwarning:
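
The documentation changes above all follow from the new default: reorder_levels now returns a modified copy unless inplace=True is passed. A minimal sketch of that pattern in plain Python follows. TinyCategorical is a hypothetical stand-in written for illustration, not the pandas class, and it stores levels and codes as plain lists rather than an Index and an ndarray.

```python
import copy


class TinyCategorical:
    """Hypothetical stand-in for pandas' Categorical (illustration only)."""

    def __init__(self, values, levels):
        self.levels = list(levels)
        # Each value is stored as an integer code pointing into levels.
        self.codes = [self.levels.index(v) for v in values]

    def values(self):
        """Decode the codes back into the original values."""
        return [self.levels[c] for c in self.codes]

    def reorder_levels(self, new_levels, inplace=False):
        new_levels = list(new_levels)
        if set(self.levels) - set(new_levels):
            raise ValueError("Reordered levels must include all original levels")
        # Work on self only when asked to; otherwise modify a copy.
        cat = self if inplace else copy.deepcopy(self)
        values = cat.values()  # decode before the levels change
        cat.codes = [new_levels.index(v) for v in values]
        cat.levels = new_levels
        if not inplace:
            return cat


original = TinyCategorical([1, 2, 3, 1], levels=[1, 2, 3])
reordered = original.reorder_levels([2, 3, 1])  # default: original untouched
in_place_result = original.reorder_levels([3, 2, 1], inplace=True)  # returns None
```

With the default, forgetting to assign the result silently discards the change, which is why the examples in the documentation were rewritten as `s = s.cat.reorder_levels(...)`.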

pandas/core/categorical.py

+66 -26
@@ -299,6 +299,8 @@ def from_array(cls, data):
         """
         Make a Categorical type from a single array-like object.
 
+        For internal compatibility with numpy arrays.
+
         Parameters
         ----------
         data : array-like
@@ -412,7 +414,7 @@ def _get_levels(self):
 
     levels = property(fget=_get_levels, fset=_set_levels, doc=_levels_doc)
 
-    def reorder_levels(self, new_levels, ordered=None):
+    def reorder_levels(self, new_levels, ordered=None, inplace=False):
         """ Reorders levels as specified in new_levels.
 
         `new_levels` must include all old levels but can also include new level items. In
@@ -432,27 +434,50 @@ def reorder_levels(self, new_levels, ordered=None):
         ordered : boolean, optional
             Whether or not the categorical is treated as a ordered categorical. If not given,
             do not change the ordered information.
+        inplace : bool (default: False)
+            Whether or not to reorder the levels inplace or return a copy of this categorical with
+            reordered levels.
+
+        Returns
+        -------
+        cat : Categorical with reordered levels.
         """
         new_levels = self._validate_levels(new_levels)
 
         if len(new_levels) < len(self._levels) or len(self._levels.difference(new_levels)):
             raise ValueError('Reordered levels must include all original levels')
-        values = self.__array__()
-        self._codes = _get_codes_for_values(values, new_levels)
-        self._levels = new_levels
+
+        cat = self if inplace else self.copy()
+        values = cat.__array__()
+        cat._codes = _get_codes_for_values(values, new_levels)
+        cat._levels = new_levels
         if not ordered is None:
-            self.ordered = ordered
+            cat.ordered = ordered
+        if not inplace:
+            return cat
 
-    def remove_unused_levels(self):
+    def remove_unused_levels(self, inplace=False):
         """ Removes levels which are not used.
 
-        The level removal is done inplace.
+        Parameters
+        ----------
+        inplace : bool (default: False)
+            Whether or not to drop unused levels inplace or return a copy of this categorical with
+            unused levels dropped.
+
+        Returns
+        -------
+        cat : Categorical with unused levels dropped.
+
         """
-        _used = sorted(np.unique(self._codes))
-        new_levels = self.levels.take(com._ensure_platform_int(_used))
+        cat = self if inplace else self.copy()
+        _used = sorted(np.unique(cat._codes))
+        new_levels = cat.levels.take(com._ensure_platform_int(_used))
         new_levels = _ensure_index(new_levels)
-        self._codes = _get_codes_for_values(self.__array__(), new_levels)
-        self._levels = new_levels
+        cat._codes = _get_codes_for_values(cat.__array__(), new_levels)
+        cat._levels = new_levels
+        if not inplace:
+            return cat
 
 
     __eq__ = _cat_compare_op('__eq__')
@@ -683,7 +708,14 @@ def view(self):
         return self
 
     def to_dense(self):
-        """ Return my 'dense' repr """
+        """Return my 'dense' representation
+
+        For internal compatibility with numpy arrays.
+
+        Returns
+        -------
+        dense : array
+        """
         return np.asarray(self)
 
     def fillna(self, fill_value=None, method=None, limit=None, **kwargs):
@@ -743,7 +775,10 @@ def fillna(self, fill_value=None, method=None, limit=None, **kwargs):
                            name=self.name, fastpath=True)
 
     def take_nd(self, indexer, allow_fill=True, fill_value=None):
-        """ Take the codes by the indexer, fill with the fill_value. """
+        """ Take the codes by the indexer, fill with the fill_value.
+
+        For internal compatibility with numpy arrays.
+        """
 
         # filling must always be None/nan here
         # but is passed thru internally
@@ -757,7 +792,10 @@ def take_nd(self, indexer, allow_fill=True, fill_value=None):
     take = take_nd
 
     def _slice(self, slicer):
-        """ Return a slice of myself. """
+        """ Return a slice of myself.
+
+        For internal compatibility with numpy arrays.
+        """
 
         # only allow 1 dimensional slicing, but can
         # in a 2-d case be passd (slice(None),....)
@@ -771,19 +809,21 @@ def _slice(self, slicer):
                            name=self.name, fastpath=True)
 
     def __len__(self):
+        """The length of this Categorical."""
         return len(self._codes)
 
     def __iter__(self):
+        """Returns an Iterator over the values of this Categorical."""
         return iter(np.array(self))
 
-    def _tidy_repr(self, max_vals=20):
+    def _tidy_repr(self, max_vals=10):
         num = max_vals // 2
         head = self[:num]._get_repr(length=False, name=False, footer=False)
         tail = self[-(max_vals - num):]._get_repr(length=False,
                                                   name=False,
                                                   footer=False)
 
-        result = '%s\n...\n%s' % (head, tail)
+        result = '%s, ..., %s' % (head[:-1], tail[1:])
         result = '%s\n%s' % (result, self._repr_footer())
 
         return compat.text_type(result)
@@ -840,17 +880,14 @@ def _get_repr(self, name=False, length=True, na_rep='NaN', footer=True):
 
     def __unicode__(self):
         """ Unicode representation. """
-        width, height = get_terminal_size()
-        max_rows = (height if get_option("display.max_rows") == 0
-                    else get_option("display.max_rows"))
-
-        if len(self._codes) > (max_rows or 1000):
-            result = self._tidy_repr(min(30, max_rows) - 4)
+        _maxlen = 10
+        if len(self._codes) > _maxlen:
+            result = self._tidy_repr(_maxlen)
         elif len(self._codes) > 0:
-            result = self._get_repr(length=len(self) > 50,
+            result = self._get_repr(length=len(self) > _maxlen,
                                     name=True)
         else:
-            result = 'Categorical([], %s' % self._get_repr(name=True,
+            result = '[], %s' % self._get_repr(name=True,
                                                            length=False,
                                                            footer=True,
                                                            ).replace("\n",", ")
@@ -1025,7 +1062,7 @@ def unique(self):
         -------
         unique values : array
         """
-        return self.levels
+        return np.asarray(self.levels)
 
     def equals(self, other):
         """
@@ -1113,8 +1150,11 @@ def _delegate_property_set(self, name, new_values):
         return setattr(self.categorical, name, new_values)
 
     def _delegate_method(self, name, *args, **kwargs):
+        from pandas import Series
         method = getattr(self.categorical, name)
-        return method(*args, **kwargs)
+        res = method(*args, **kwargs)
+        if not res is None:
+            return Series(res, index=self.index)
 
 CategoricalProperties._add_delegate_accessors(delegate=Categorical,
                                               accessors=["levels", "ordered"],
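
The remove_unused_levels change above keeps only the levels that at least one code points to, then recomputes the codes against the shrunken level list. The following is a rough sketch of that recoding in plain Python. It works on lists instead of numpy arrays, and skipping the `-1` NaN marker is a choice made for this sketch, not something taken from the diff.

```python
def remove_unused_levels(codes, levels):
    """Return (new_codes, new_levels) with levels no code refers to dropped.

    Hypothetical helper for illustration; the diff does this with
    np.unique, Index.take and _get_codes_for_values on the Categorical.
    """
    # Codes in use, in level order. -1 marks NaN and is not a real level.
    used = sorted({c for c in codes if c >= 0})
    new_levels = [levels[c] for c in used]
    # Map every old code to its position in the shrunken level list.
    remap = {old: new for new, old in enumerate(used)}
    new_codes = [remap.get(c, -1) for c in codes]
    return new_codes, new_levels


codes = [0, 2, -1, 0]              # values: a, c, NaN, a
levels = ["a", "b", "c", "d"]      # "b" and "d" are never used
new_codes, new_levels = remove_unused_levels(codes, levels)
# new_levels is ["a", "c"] and new_codes is [0, 1, -1, 0]
```

Returning new objects instead of mutating the inputs mirrors the inplace=False default introduced in this commit.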

pandas/core/format.py

+3
@@ -116,6 +116,9 @@ def to_string(self):
         fmt_values = self._get_formatted_values()
 
         result = ['%s' % i for i in fmt_values]
+        result = [i.strip() for i in result]
+        result = u(', ').join(result)
+        result = [u('[')+result+u(']')]
         if self.footer:
             footer = self._get_footer()
             if footer:
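
Taken together, the format.py change and the new _tidy_repr produce a one-line, array-like string: values are stripped, joined with ", " and wrapped in brackets, and long inputs keep only the head and tail with ", ..., " in between. A small sketch of that string handling follows. The function names are made up for illustration, and the footer line that the real repr appends is left out.

```python
def format_values(values):
    """Join the stripped string form of each value into one bracketed line."""
    return "[" + ", ".join(str(v).strip() for v in values) + "]"


def tidy_repr(values, max_vals=10):
    """Show at most max_vals values, eliding the middle of longer inputs."""
    if len(values) <= max_vals:
        return format_values(values)
    num = max_vals // 2
    head = format_values(values[:num])
    tail = format_values(values[-(max_vals - num):])
    # Drop head's closing bracket and tail's opening bracket, then join.
    return "%s, ..., %s" % (head[:-1], tail[1:])


short = tidy_repr(["a", "b", "c"])
long_ = tidy_repr(list(range(12)))
# short is "[a, b, c]"
# long_ is "[0, 1, 2, 3, 4, ..., 7, 8, 9, 10, 11]"
```

Slicing off one bracket from each half is the same trick the diff uses with `head[:-1]` and `tail[1:]`.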
