-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Preserve categorical dtypes when melting (#15853) #23671
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Also add support for tile and not simply repeat.
Hello @dsm054! Thanks for submitting the PR.
|
Codecov Report
@@ Coverage Diff @@
## master #23671 +/- ##
==========================================
+ Coverage 92.24% 92.24% +<.01%
==========================================
Files 161 161
Lines 51318 51340 +22
==========================================
+ Hits 47339 47360 +21
- Misses 3979 3980 +1
Continue to review full report at Codecov.
|
I'm always hesitant to add new methods to Series / DataFrame. Do we think For fixing the original issue of preserving categorical in melt, I'd rather we define in internal helper function that knows how to tile extension arrays. For Categorical we can tile the codes, datetimelike the |
I don't understand your objection about the duplicate indices: while they're always awkward, that's exactly how pd.Series.repeat works already, and whose pattern I followed. The objection to adding new methods is reasonable enough. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allowing this method is ok, its basically another version of repeat.
nv.validate_tile(args, kwargs) | ||
new_index = self.index.tile(reps) | ||
if is_categorical_dtype(self.dtype): | ||
new_values = Categorical.from_codes(np.tile(self.cat.codes, reps), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do this more like repeats
, you just directly access the underlying function; then you don't need to type check
@@ -2485,6 +2485,26 @@ def test_repeat(self): | |||
result = index.repeat(repeats) | |||
tm.assert_index_equal(result, expected) | |||
|
|||
def test_tile(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally you can use the indices fixture here and test all Index
for column in id_vars: | ||
num = len(df.columns) - len(id_vars) | ||
expected = df[column].tile(num).reset_index(drop=True) | ||
tm.assert_series_equal(result[column], expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also test result itself
If you want to call it on the underlying values then it has to be part of
the EA interface, else it'll break for non-EAs.
And we have a pretty high bar for adding new methods to the interface.
…On Wed, Nov 14, 2018 at 7:08 AM Jeff Reback ***@***.***> wrote:
***@***.**** requested changes on this pull request.
allowing this method is ok, its basically another version of repeat.
------------------------------
In pandas/core/series.py
<#23671 (comment)>:
> @@ -1002,6 +1003,29 @@ def repeat(self, repeats, *args, **kwargs):
return self._constructor(new_values,
index=new_index).__finalize__(self)
+ def tile(self, reps, *args, **kwargs):
+ """
+ Tile elements of a Series. Refer to `numpy.tile`
+ for more information about the `reps` argument, although
+ note that we do not support multidimensional tiling of Series.
+
+ See also
+ --------
+ pd.Series.repeat
+ numpy.tile
+ """
+ nv.validate_tile(args, kwargs)
+ new_index = self.index.tile(reps)
+ if is_categorical_dtype(self.dtype):
+ new_values = Categorical.from_codes(np.tile(self.cat.codes, reps),
do this more like repeats, you just directly access the underlying
function; then you don't need to type check
------------------------------
In pandas/tests/indexes/test_base.py
<#23671 (comment)>:
> @@ -2485,6 +2485,26 @@ def test_repeat(self):
result = index.repeat(repeats)
tm.assert_index_equal(result, expected)
+ def test_tile(self):
ideally you can use the indices fixture here and test all Index
------------------------------
In pandas/tests/reshape/test_melt.py
<#23671 (comment)>:
> + @pytest.mark.parametrize('id_vars', [['a'], ['b'], ['a', 'b']])
+ def test_categorical_id_vars(self, id_vars):
+ # GH 15853
+ df = DataFrame({"a": pd.Series(["a", "b", "c", "a", "d"],
+ dtype="category"),
+ "b": pd.Series(pd.Categorical([0, 1, 1, 2, 1],
+ categories=[0, 2, 1, 3],
+ ordered=True)),
+ "c": range(5), "d": np.arange(5.0, 0.0, -1)},
+ columns=["a", "b", "c", "d"])
+
+ result = df.melt(id_vars=id_vars)
+ for column in id_vars:
+ num = len(df.columns) - len(id_vars)
+ expected = df[column].tile(num).reset_index(drop=True)
+ tm.assert_series_equal(result[column], expected)
can you also test result itself
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#23671 (review)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIgeuLCMpEJG6ww7cu-HXbO_zGWbMks5uvBXjgaJpZM4YcKNi>
.
|
@TomAugspurger well this is actually pretty reasonable to add to be honest (and repeats too). Not adding things just causes more upstream pain. |
I'm willing to take on some of that pain, if it means a cleaner API for EA authors. We could write a repeat for EAs that operates uses either I really don't think |
yes these could be implemented in terms of repeat, i agree. |
can you rebase |
@dsm054 can you merge master and update |
can you merge master and update |
closing as stale. pls ping if you want to continue working. |
This addresses loss of Categorical status for columns used as id_vars by following Jeff's suggestion to introduce tiling, so Index, Series, and Categorical all grew a
.tile
method.git diff upstream/master -u -- "*.py" | flake8 --diff