BUG: Preserve categorical dtypes when melting (#15853) #23671

dsm054 · 2018-11-13T17:37:24Z

This addresses loss of Categorical status for columns used as id_vars by following Jeff's suggestion to introduce tiling, so Index, Series, and Categorical all grew a .tile method.

closes BUG: melt should preserve Categorical id_vars #15853
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Also add support for tile and not simply repeat.
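
As an illustration of why melt needs tiling rather than repeating, here is a minimal sketch. The frame and column names are made up for the example; it relies only on `DataFrame.melt`, `numpy.tile` and `numpy.repeat`, and uses plain integers so that it makes no claim about dtype preservation in any particular pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one id column ("a") and two value columns ("c", "d").
df = pd.DataFrame({"a": [10, 20, 30], "c": [1, 2, 3], "d": [4, 5, 6]})
melted = df.melt(id_vars=["a"])

# melt stacks the value columns one block after another, so the id column
# comes back as whole copies laid end to end (tiling) ...
tiled = np.tile(df["a"].to_numpy(), 2)

# ... not as each element duplicated in place (repeating).
repeated = np.repeat(df["a"].to_numpy(), 2)

print(melted["a"].tolist())  # same ordering as `tiled`
print(repeated.tolist())
```

The number of tiles equals the number of value columns, which is what the `num` variable computes in the PR's melt test.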

pep8speaks · 2018-11-13T17:37:34Z

Hello @dsm054! Thanks for submitting the PR.

There are no PEP8 issues in the file pandas/compat/numpy/function.py !
There are no PEP8 issues in the file pandas/core/arrays/categorical.py !
There are no PEP8 issues in the file pandas/core/indexes/base.py !
There are no PEP8 issues in the file pandas/core/indexes/datetimelike.py !
There are no PEP8 issues in the file pandas/core/series.py !
There are no PEP8 issues in the file pandas/tests/arrays/categorical/test_analytics.py !
There are no PEP8 issues in the file pandas/tests/indexes/test_base.py !
There are no PEP8 issues in the file pandas/tests/reshape/test_melt.py !
There are no PEP8 issues in the file pandas/tests/series/test_analytics.py !

codecov · 2018-11-13T18:42:16Z

Codecov Report

Merging #23671 into master will increase coverage by <.01%.
The diff coverage is 95.45%.

@@            Coverage Diff             @@
##           master   #23671      +/-   ##
==========================================
+ Coverage   92.24%   92.24%   +<.01%     
==========================================
  Files         161      161              
  Lines       51318    51340      +22     
==========================================
+ Hits        47339    47360      +21     
- Misses       3979     3980       +1

Flag	Coverage Δ
#multiple	`90.63% <95.45%> (ø)`	⬆️
#single	`42.3% <27.27%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/arrays/categorical.py	`95.37% <100%> (+0.02%)`	⬆️
pandas/compat/numpy/function.py	`87.28% <100%> (+0.14%)`	⬆️
pandas/core/indexes/base.py	`96.46% <100%> (ø)`	⬆️
pandas/core/series.py	`93.72% <100%> (+0.04%)`	⬆️
pandas/core/indexes/datetimelike.py	`97.76% <83.33%> (-0.25%)`	⬇️
pandas/core/dtypes/common.py	`94.37% <0%> (ø)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fe52d9f...2f2eba0.

TomAugspurger · 2018-11-13T19:17:15Z

I'm always hesitant to add new methods to Series / DataFrame. Do we think .tile is broadly useful enough to be included? Series.tile seems especially problematic, since you'll by definition end up with duplicate indices.

For fixing the original issue of preserving categorical in melt, I'd rather we define an internal helper function that knows how to tile extension arrays.

For Categorical we can tile the codes, for datetimelike the .asi8, and for other EAs we can convert to object, tile that, then use _from_sequence to reconstruct the original dtype.
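
A rough sketch of what such an internal helper might look like. The name `tile_array` is hypothetical and not part of pandas. The sketch covers only the Categorical branch and the generic object-dtype fallback; the datetimelike `.asi8` branch is left out. The fallback assumes the private `ExtensionArray._from_sequence` constructor hook is what the comment has in mind.

```python
import numpy as np
import pandas as pd


def tile_array(values, reps):
    """Hypothetical helper: tile a 1-D array `reps` times, keeping its dtype."""
    if isinstance(values, pd.Categorical):
        # Tile the integer codes and reattach the original categories,
        # so both the category order and the `ordered` flag survive.
        codes = np.tile(values.codes, reps)
        return pd.Categorical.from_codes(
            codes, categories=values.categories, ordered=values.ordered
        )
    if isinstance(values, pd.api.extensions.ExtensionArray):
        # Generic fallback: round-trip through object dtype, then rebuild.
        # `_from_sequence` is private API; its use here is an assumption.
        tiled = np.tile(np.asarray(values, dtype=object), reps)
        return type(values)._from_sequence(tiled, dtype=values.dtype)
    # Plain NumPy-backed data needs no special handling.
    return np.tile(np.asarray(values), reps)


cat = pd.Categorical([0, 1, 1, 2, 1], categories=[0, 2, 1, 3], ordered=True)
tiled_cat = tile_array(cat, 2)

ints = pd.array([1, None, 3], dtype="Int64")
tiled_ints = tile_array(ints, 2)
```

Keeping the helper internal avoids adding a public `.tile` method, at the cost of an object-dtype round trip for extension arrays that have no cheaper representation.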

dsm054 · 2018-11-13T19:32:45Z

I don't understand your objection about the duplicate indices: while they're always awkward, that's exactly how pd.Series.repeat already works, and that's the pattern I followed.

The objection to adding new methods is reasonable enough.

jreback

allowing this method is ok, it's basically another version of repeat.

jreback · 2018-11-14T13:06:51Z

pandas/core/series.py

+        nv.validate_tile(args, kwargs)
+        new_index = self.index.tile(reps)
+        if is_categorical_dtype(self.dtype):
+            new_values = Categorical.from_codes(np.tile(self.cat.codes, reps),


do this more like repeats, you just directly access the underlying function; then you don't need to type check

jreback · 2018-11-14T13:07:18Z

pandas/tests/indexes/test_base.py

@@ -2485,6 +2485,26 @@ def test_repeat(self):
        result = index.repeat(repeats)
        tm.assert_index_equal(result, expected)

+    def test_tile(self):


ideally you can use the indices fixture here and test all Index

jreback · 2018-11-14T13:07:38Z

pandas/tests/reshape/test_melt.py

+        for column in id_vars:
+            num = len(df.columns) - len(id_vars)
+            expected = df[column].tile(num).reset_index(drop=True)
+            tm.assert_series_equal(result[column], expected)


can you also test result itself

TomAugspurger · 2018-11-14T13:24:02Z

If you want to call it on the underlying values then it has to be part of the EA interface, else it'll break for non-EAs. And we have a pretty high bar for adding new methods to the interface.

On Wed, Nov 14, 2018 at 7:08 AM Jeff Reback <***@***.***> wrote:

***@***.**** requested changes on this pull request.

allowing this method is ok, it's basically another version of repeat.

In pandas/core/series.py:

@@ -1002,6 +1003,29 @@ def repeat(self, repeats, *args, **kwargs):
         return self._constructor(new_values, index=new_index).__finalize__(self)

+    def tile(self, reps, *args, **kwargs):
+        """
+        Tile elements of a Series. Refer to `numpy.tile`
+        for more information about the `reps` argument, although
+        note that we do not support multidimensional tiling of Series.
+
+        See also
+        --------
+        pd.Series.repeat
+        numpy.tile
+        """
+        nv.validate_tile(args, kwargs)
+        new_index = self.index.tile(reps)
+        if is_categorical_dtype(self.dtype):
+            new_values = Categorical.from_codes(np.tile(self.cat.codes, reps),

do this more like repeats, you just directly access the underlying function; then you don't need to type check

In pandas/tests/indexes/test_base.py:

@@ -2485,6 +2485,26 @@ def test_repeat(self):
         result = index.repeat(repeats)
         tm.assert_index_equal(result, expected)

+    def test_tile(self):

ideally you can use the indices fixture here and test all Index

In pandas/tests/reshape/test_melt.py:

+    @pytest.mark.parametrize('id_vars', [['a'], ['b'], ['a', 'b']])
+    def test_categorical_id_vars(self, id_vars):
+        # GH 15853
+        df = DataFrame({"a": pd.Series(["a", "b", "c", "a", "d"],
+                                       dtype="category"),
+                        "b": pd.Series(pd.Categorical([0, 1, 1, 2, 1],
+                                                      categories=[0, 2, 1, 3],
+                                                      ordered=True)),
+                        "c": range(5), "d": np.arange(5.0, 0.0, -1)},
+                       columns=["a", "b", "c", "d"])
+
+        result = df.melt(id_vars=id_vars)
+        for column in id_vars:
+            num = len(df.columns) - len(id_vars)
+            expected = df[column].tile(num).reset_index(drop=True)
+            tm.assert_series_equal(result[column], expected)

can you also test result itself

jreback · 2018-11-14T13:26:22Z

@TomAugspurger well this is actually pretty reasonable to add to be honest (and repeats too). Not adding things just causes more upstream pain.

TomAugspurger · 2018-11-14T13:31:44Z

I'm willing to take on some of that pain, if it means a cleaner API for EA authors.

We could write a repeat for EAs that uses either ExtensionArray.take(np.repeat(np.arange(len(self)), or a similar thing with factorized values.

I really don't think .tile makes sense for EAs, which are currently 1-D by definition.

jreback · 2018-11-14T13:33:47Z

We could write a repeat for EAs that uses either ExtensionArray.take(np.repeat(np.arange(len(self)), or a similar thing with factorized values.

yes these could be implemented in terms of repeat, i agree.
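
The `take`-based idea from this exchange can be sketched as follows. The function names `repeat_via_take` and `tile_via_take` are hypothetical; the only pandas API assumed is the public `take` method that extension arrays such as `Categorical` provide. In one dimension, tiling differs from repeating only in the pattern of positions passed to `take`.

```python
import numpy as np
import pandas as pd


def repeat_via_take(values, repeats):
    """Hypothetical: repeat each element in place, e.g. a a b b c c."""
    positions = np.repeat(np.arange(len(values)), repeats)
    return values.take(positions)


def tile_via_take(values, reps):
    """Hypothetical: lay whole copies end to end, e.g. a b c a b c."""
    positions = np.tile(np.arange(len(values)), reps)
    return values.take(positions)


cat = pd.Categorical(["a", "b", "c"], categories=["c", "b", "a"], ordered=True)
repeated = repeat_via_take(cat, 2)
tiled = tile_via_take(cat, 2)
```

Because `take` returns an array of the same type, neither function needs a dtype check, and no new method has to be added to the extension-array interface.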

jreback · 2018-11-19T02:09:12Z

can you rebase

jreback · 2018-12-03T02:17:33Z

@dsm054 can you merge master and update

jreback · 2018-12-23T23:11:54Z

can you merge master and update

jreback · 2019-01-05T15:40:02Z

closing as stale. pls ping if you want to continue working.

BUG: Preserve categorical dtypes when melting (pandas-dev#15853)

2f2eba0

Also add support for tile and not simply repeat.

TomAugspurger added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Nov 13, 2018

jreback requested changes Nov 14, 2018

jreback closed this Jan 5, 2019
