ENH: accept dict of column:dtype as dtype argument in DataFrame.astype #12086
-------
casted : type of caller
"""
if type(dtype) == dict:
use isinstance here to get subclasses, too
isinstance(dtype, collections.Mapping) if you want to be really precise
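The difference between the three checks discussed above can be sketched as follows. This is only an illustration: the helper names are hypothetical, and it imports `Mapping` from `collections.abc`, since the `collections.Mapping` alias used in the review comment was removed in Python 3.10.

```python
from collections import OrderedDict
from collections.abc import Mapping
from types import MappingProxyType


def is_exact_dict(obj):
    # Matches dict only; subclasses such as OrderedDict are rejected.
    return type(obj) == dict


def is_dict_or_subclass(obj):
    # Matches dict and its subclasses.
    return isinstance(obj, dict)


def is_mapping(obj):
    # Matches anything implementing the Mapping interface,
    # including read-only views that do not inherit from dict.
    return isinstance(obj, Mapping)


ordered = OrderedDict([("a", "int64")])
read_only = MappingProxyType({"a": "int64"})

checks = {
    "ordered": (is_exact_dict(ordered), is_dict_or_subclass(ordered), is_mapping(ordered)),
    "read_only": (is_exact_dict(read_only), is_dict_or_subclass(read_only), is_mapping(read_only)),
}
```

Here `OrderedDict` fails only the exact-type check, while the read-only proxy passes only the `Mapping` check.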
Force-pushed from 68f73e4 to 957e1cb
if isinstance(dtype, collections.Mapping):
    df = DataFrame(data=self, copy=copy, **kwargs)
    for col_key, col_dtype in dtype.items():
        df.ix[:, col_key] = df.ix[:, col_key].astype(col_dtype)
this iterative updating will be quite inefficient as it will potentially copy the entire new dtype each time as it grows; this should change with the internals refactor #11970, but for now do something like:
astyped_cols = pd.concat([df[col_key].astype(col_dtype) for col_key, col_dtype in dtype.items()], axis=1)
return pd.concat([astyped_cols, df[df.columns.difference(dtype.keys())]], axis=1).reindex(columns=df.columns)
which should minimize the copying.
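A self-contained sketch of the concat-based approach suggested above, written against a current pandas release. The function name `astype_columns` is hypothetical (it is not pandas API), and the sketch assumes unique column labels and adds an early return for an empty mapping, which the review comment does not cover.

```python
import pandas as pd


def astype_columns(df, dtype_map):
    """Cast selected columns of df; hypothetical helper, not pandas API."""
    if not dtype_map:
        # pd.concat rejects an empty list, so handle this case up front.
        return df.copy()
    # Cast each requested column once, then join the results in one step.
    astyped = pd.concat(
        [df[col].astype(col_dtype) for col, col_dtype in dtype_map.items()],
        axis=1,
    )
    # Columns not mentioned in the mapping pass through unchanged.
    untouched = df[df.columns.difference(list(dtype_map))]
    # Restore the caller's original column order.
    return pd.concat([astyped, untouched], axis=1).reindex(columns=df.columns)


frame = pd.DataFrame({"a": ["1", "2"], "b": [1.5, 2.5], "c": [1, 0]})
result = astype_columns(frame, {"a": "int64", "c": "bool"})
```

The final `reindex` matters because `Index.difference` may return the untouched columns in sorted order rather than in their original positions.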
astype currently has the related to #5769. I think we should change this to This could be done in another PR (before this one), or in this one is ok too.
further I wouldn't use So pls add some more tests, including some invalid columns.
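A minimal sketch of the kind of check requested here, run against a current pandas release, where `DataFrame.astype` accepts a column-to-dtype mapping. The frame and column names are made up for illustration, and the `KeyError` for an unknown column reflects the behaviour of recent pandas versions as far as I know, not necessarily what this PR implemented at the time.

```python
import pandas as pd

frame = pd.DataFrame(
    {"my_bool": [1, 0], "my_int": ["3", "4"], "other": [0.5, 1.5]}
)

# Valid mapping: only the listed columns are cast.
converted = frame.astype({"my_bool": "bool", "my_int": "int64"})

# Invalid mapping: the key does not name a column of the frame.
try:
    frame.astype({"no_such_column": "int64"})
    raised = False
except KeyError:
    raised = True
```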
        df.ix[:, col_key] = df.ix[:, col_key].astype(col_dtype)
    return df
else:
    return super(DataFrame, self).astype(dtype=dtype, copy=copy,
you can simply return instead of using else here.
@StephenKappel can you rebase/update
@StephenKappel can you rebase / update
Sorry for neglecting this. The semester is finally over, so I'll get this cleaned up today :-)
Force-pushed from 5cb7b25 to 6dc0800
Force-pushed from 6dc0800 to 0aeee8d
@jreback - Rebased and updated based on your earlier feedback. Please let me know if any further updates are needed.
Parameters
----------
dtype : numpy.dtype or Python type (to cast entire DataFrame to the
same type). Alternatively, {col: dtype, ...}, where col is a column
why are you adding a method here? this should all be done in generic.py
…mapping of col to type; GH7271
…s-dev#13288) closes pandas-dev#13104 closes pandas-dev#13288 Author: Piotr Jucha <[email protected]> Closes pandas-dev#13298 from pijucha/bug13104 and squashes the following commits: 9a6bd6e [Piotr Jucha] BUG: Fix describe(): percentiles (pandas-dev#13104), col index (pandas-dev#13288)
Title is self-explanatory. Closes pandas-dev#13304. Author: gfyoung <[email protected]> Closes pandas-dev#13309 from gfyoung/ordereddict-key-ordering-init and squashes the following commits: 4f311cc [gfyoung] ENH: Respect key ordering for OrderedDict list in DataFrame init
closes pandas-dev#13324 Author: Roger Thomas <[email protected]> Closes pandas-dev#13326 from RogerThomas/fix_maybe_convert_numeric_for_unhashable_objects and squashes the following commits: 76a0738 [Roger Thomas] Fix maybe_convert_numeric for unhashable objects
Author: Jeff Reback <[email protected]> Closes pandas-dev#13336 from jreback/is_monotonic and squashes the following commits: 0a50ff9 [Jeff Reback] ENH: Series has gained the properties .is_monotonic, .is_monotonic_increasing, .is_monotonic_decreasing
closes pandas-dev#13338 Author: Jeff Reback <[email protected]> Closes pandas-dev#13339 from jreback/eval and squashes the following commits: b2ee5e8 [Jeff Reback] TST: computation/test_eval.py tests (slow)
Fixes bug in which the Python parser failed to detect trailing `NaN` values in rows Author: gfyoung <[email protected]> Closes pandas-dev#13320 from gfyoung/trailing-nan-conversion and squashes the following commits: 590874d [gfyoung] BUG: Parse trailing NaN values for the Python parser
Small thing I just noticed in the docs (the note on the other version was not updated when the example was changed from cythonmagic -> Cython) Author: Joris Van den Bossche <[email protected]> Closes pandas-dev#13343 from jorisvandenbossche/doc-cythonmagic and squashes the following commits: 902352c [Joris Van den Bossche] DOC: fix comment on previous versions cythonmagic
closes pandas-dev#13306 Author: Uwe Hoffmann <[email protected]> Closes pandas-dev#13313 from uwedeportivo/master and squashes the following commits: be3ed90 [Uwe Hoffmann] whatsnew entry for issue pandas-dev#13306 1f5f7a5 [Uwe Hoffmann] Code Review jreback 82f263a [Uwe Hoffmann] Use vectorized searchsorted and tests. a1ed5a5 [Uwe Hoffmann] Fix pandas-dev#13306: Hour overflow in tz-aware datetime conversions.
Title is self-explanatory. xref pandas-dev#12686 - I don't quite understand why these are marked (if at all) as internal to the C engine only, as the benefits for having these options accepted for the Python engine is quite clear based on the documentation I added as well. Implementation simply just calls the already-written function in `pandas/parsers.pyx` - as it isn't specific to the `TextReader` class, crossing over to grab this function from Cython (instead of duplicating in pure Python) seems reasonable while maintaining that separation between the C and Python engines. Author: gfyoung <[email protected]> Closes pandas-dev#13323 from gfyoung/python-engine-compact-ints and squashes the following commits: 95f7ba8 [gfyoung] ENH: Add support for compact_ints and use_unsigned in Python engine
…y arrays closes pandas-dev#13006 Author: Gábor Lipták <[email protected]> Closes pandas-dev#13307 from gliptak/seriescomp1 and squashes the following commits: 4967db4 [Gábor Lipták] Fix series comparison operators when dealing with zero rank numpy arrays
Author: Chris Warth <[email protected]> Closes pandas-dev#12399 from cswarth/doc/df_filter and squashes the following commits: f48e9ff [Chris Warth] DOC: Add example usage to DataFrame.filter
Author: babakkeyvani <[email protected]> Closes pandas-dev#13366 from bkeyvani/master and squashes the following commits: 029ade7 [babakkeyvani] DOC: Fixed a minor typo
Title is self-explanatory. Author: gfyoung <[email protected]> Closes pandas-dev#13368 from gfyoung/doublequote-doc and squashes the following commits: f3e01fc [gfyoung] DOC: document doublequote in read_csv
`buffer_lines` is not respected, as it is determined internally via a heuristic involving `table_width` (see https://github.com/pydata/pandas/blob/master/pandas/parser.pyx#L527 for how it is computed). Author: gfyoung <[email protected]> Closes pandas-dev#13360 from gfyoung/buffer-lines-depr-doc and squashes the following commits: a72ecbe [gfyoung] DEPR, DOC: Deprecate buffer_lines in read_csv
…ined categorical columns closes pandas-dev#13231 Author: Christian Hudon <[email protected]> Closes pandas-dev#13359 from chrish42/gh13231 and squashes the following commits: e839638 [Christian Hudon] Raise a better exception when the HDF file is empty and kwy=None. 611aa28 [Christian Hudon] Formatting fixes. e7c8313 [Christian Hudon] Add changelog entry. df10016 [Christian Hudon] Make logic that detects if there is only one dataset in a HDF5 file work when storing a dataframe that contains categorical data. 2f41aef [Christian Hudon] Tweak comment to be clearer. b3a5773 [Christian Hudon] Add test that fails for GitHub bug pandas-dev#13231 02f90d5 [Christian Hudon] Use if-expression.
* Typo correction * removed deprecated script
Author: Jeff Reback <[email protected]> Closes pandas-dev#13372 from jreback/skiplist and squashes the following commits: e05ea24 [Jeff Reback] CLN: remove old skiplist code
…mapping of col to type; GH7271
Ahh. Not sure how, but rebasing added everyone else's commits to this diff. Closing this PR and opening a clean one.
you don't need to make a new PR just rebase and force push
I created a clean branch off of prod, cherry-picked my commits, and force pushed to this branch, but for some reason GitHub isn't recognizing it. Perhaps it's because I already closed the PR? But, I don't think I have access to reopen the PR...
Oh. It seems the PR can't be reopened because the branch has been subsequently force pushed :-(
Continued in #13375
closes #7271
By passing a dict of {column name/column index: dtype}, multiple columns can be cast to different data types in a single command. Now users can do:
df = df.astype({'my_bool': 'bool', 'my_int': 'int'})
or:
df = df.astype({0: 'bool', 1: 'int'})
instead of: