-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Implement DataFrame.astype('category') #18099
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,6 +18,7 @@ | |
is_number, | ||
is_integer, is_bool, | ||
is_bool_dtype, | ||
is_categorical_dtype, | ||
is_numeric_dtype, | ||
is_datetime64_dtype, | ||
is_timedelta64_dtype, | ||
|
@@ -4429,14 +4430,18 @@ def astype(self, dtype, copy=True, errors='raise', **kwargs): | |
if col_name not in self: | ||
raise KeyError('Only a column name can be used for the ' | ||
'key in a dtype mappings argument.') | ||
from pandas import concat | ||
results = [] | ||
for col_name, col in self.iteritems(): | ||
if col_name in dtype: | ||
results.append(col.astype(dtype[col_name], copy=copy)) | ||
else: | ||
results.append(results.append(col.copy() if copy else col)) | ||
return concat(results, axis=1, copy=False) | ||
return pd.concat(results, axis=1, copy=False) | ||
|
||
elif is_categorical_dtype(dtype) and self.ndim > 1: | ||
# GH 18099: columnwise conversion to categorical | ||
results = (self[col].astype(dtype, copy=copy) for col in self) | ||
return pd.concat(results, axis=1, copy=False) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think using concat is the best way here. For example, this will loose information about the columns object (its type, metadata such as its name). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is idiomatic and how things are handled elsewhere.
it will preserve as these are all the same index. of course if you have a counter-example There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am speaking about the columns, not index. So just take an example with CategoricalIndex, RangeIndex, .. or one with a name, and the resulting columns will not be the same. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. hmm, we do this else where in astype. So this is obviously not tested. ok In any event, using concat is vastly more performant. I suppose you just set the columns after. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've opened #19920 to address the (an?) existing occurrence of this bug. A quick check makes it look like @jreback's suggestion of setting the columns afterwards should fix things in both places, e.g. result = pd.concat(results, axis=1, copy=False)
result.columns = self.columns
return result Since this is relatively small, I'm planning to make the fix/add tests in this PR. Can address it in a separate PR if that would be preferable though. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Both (here or in other PR) are fine for me. However, if we also want to fix this for the other cases (not only for categorical), it might be cleaner as a separate PR. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes this is fine, I see you opened another one which is good. |
||
|
||
# else, only a single dtype is given | ||
new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might want to add a ::note section to show how to gain all of the uniques a-priori on a DataFrame to pass to CDT