Skip to content

BUG: Preserve categorical dtypes when melting (#15853) #23671

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

dsm054
Copy link
Contributor

@dsm054 dsm054 commented Nov 13, 2018

This addresses loss of Categorical status for columns used as id_vars by following Jeff's suggestion to introduce tiling, so Index, Series, and Categorical all grew a .tile method.

Also add support for tile and not simply repeat.
@pep8speaks
Copy link

Hello @dsm054! Thanks for submitting the PR.

@codecov
Copy link

codecov bot commented Nov 13, 2018

Codecov Report

Merging #23671 into master will increase coverage by <.01%.
The diff coverage is 95.45%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23671      +/-   ##
==========================================
+ Coverage   92.24%   92.24%   +<.01%     
==========================================
  Files         161      161              
  Lines       51318    51340      +22     
==========================================
+ Hits        47339    47360      +21     
- Misses       3979     3980       +1
Flag Coverage Δ
#multiple 90.63% <95.45%> (ø) ⬆️
#single 42.3% <27.27%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.37% <100%> (+0.02%) ⬆️
pandas/compat/numpy/function.py 87.28% <100%> (+0.14%) ⬆️
pandas/core/indexes/base.py 96.46% <100%> (ø) ⬆️
pandas/core/series.py 93.72% <100%> (+0.04%) ⬆️
pandas/core/indexes/datetimelike.py 97.76% <83.33%> (-0.25%) ⬇️
pandas/core/dtypes/common.py 94.37% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fe52d9f...2f2eba0. Read the comment docs.

@TomAugspurger
Copy link
Contributor

I'm always hesitant to add new methods to Series / DataFrame. Do we think .tile is broadly useful enough to be included? Series.tile seems especially problematic, since you'll by definition end up with duplicate indices.

For fixing the original issue of preserving categorical in melt, I'd rather we define in internal helper function that knows how to tile extension arrays.

For Categorical we can tile the codes, datetimelike the .asi8, and for other EAs, we can convert to object, tile that, then use _construct_from_sequence to reconstruct the original dtype.

@TomAugspurger TomAugspurger added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Nov 13, 2018
@dsm054
Copy link
Contributor Author

dsm054 commented Nov 13, 2018

I don't understand your objection about the duplicate indices: while they're always awkward, that's exactly how pd.Series.repeat works already, and whose pattern I followed.

The objection to adding new methods is reasonable enough.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allowing this method is ok, its basically another version of repeat.

nv.validate_tile(args, kwargs)
new_index = self.index.tile(reps)
if is_categorical_dtype(self.dtype):
new_values = Categorical.from_codes(np.tile(self.cat.codes, reps),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do this more like repeats, you just directly access the underlying function; then you don't need to type check

@@ -2485,6 +2485,26 @@ def test_repeat(self):
result = index.repeat(repeats)
tm.assert_index_equal(result, expected)

def test_tile(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally you can use the indices fixture here and test all Index

for column in id_vars:
num = len(df.columns) - len(id_vars)
expected = df[column].tile(num).reset_index(drop=True)
tm.assert_series_equal(result[column], expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also test result itself

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Nov 14, 2018 via email

@jreback
Copy link
Contributor

jreback commented Nov 14, 2018

@TomAugspurger well this is actually pretty reasonable to add to be honest (and repeats too). Not adding things just causes more upstream pain.

@TomAugspurger
Copy link
Contributor

I'm willing to take on some of that pain, if it means a cleaner API for EA authors.

We could write a repeat for EAs that operates uses either ExtensionArray.take(np.repeat(np.arange(len(self)), or a similar thing with factorized values.

I really don't think .tile makes sense for EAs, which are currently 1-D by definition.

@jreback
Copy link
Contributor

jreback commented Nov 14, 2018

We could write a repeat for EAs that operates uses either ExtensionArray.take(np.repeat(np.arange(len(self)), or a similar thing with factorized values.

yes these could be implemented in terms of repeat, i agree.

@jreback
Copy link
Contributor

jreback commented Nov 19, 2018

can you rebase

@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

@dsm054 can you merge master and update

@jreback
Copy link
Contributor

jreback commented Dec 23, 2018

can you merge master and update

@jreback
Copy link
Contributor

jreback commented Jan 5, 2019

closing as stale. pls ping if you want to continue working.

@jreback jreback closed this Jan 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: melt should preserve Categorical id_vars
4 participants