ENH: accept dict of column:dtype as dtype argument in DataFrame.astype #12086

Closed

StephenKappel
Contributor

@StephenKappel StephenKappel commented Jan 18, 2016

Continued in #13375


closes #7271

By passing a dict of {column name/column index: dtype}, multiple columns can be cast to different data types in a single command. Now users can do:

df = df.astype({'my_bool': 'bool', 'my_int': 'int'})

or:

df = df.astype({0: 'bool', 1: 'int'})

instead of:

df['my_bool'] = df.my_bool.astype('bool')
df['my_int'] = df.my_int.astype('int')
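For illustration, here is a minimal sketch of the dict form described above, as it behaves in pandas releases that accept a column-to-dtype mapping in `DataFrame.astype`. The frame contents and the extra `other` column are made up for the example, and `int64` is spelled out instead of `int` only to keep the result platform-independent.

```python
import pandas as pd

# Hypothetical data: column names follow the PR description, values are invented.
df = pd.DataFrame({
    "my_bool": [1, 0, 1],
    "my_int": [1.0, 2.0, 3.0],
    "other": ["a", "b", "c"],
})

# Cast two columns to different dtypes in one call; "other" is left alone.
converted = df.astype({"my_bool": "bool", "my_int": "int64"})

print(converted.dtypes)
```

Columns that are not named in the dict keep their existing dtype, and the original `df` is not modified.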

-------
casted : type of caller
"""
if type(dtype) == dict:
Member

use isinstance here to get subclasses, too

Contributor

isinstance(dtype, collections.Mapping) if you want to be really precise
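A small sketch of the difference between the three checks discussed in this thread. The helper names are hypothetical. Note that on current Python versions the abstract base class lives at `collections.abc.Mapping`; the `collections.Mapping` alias used above was removed in Python 3.10.

```python
from collections import OrderedDict
from collections.abc import Mapping
from types import MappingProxyType


def is_exact_dict(obj):
    # Mirrors the original `type(dtype) == dict` check: rejects subclasses.
    return type(obj) == dict


def is_dict(obj):
    # `isinstance` accepts dict and any dict subclass.
    return isinstance(obj, dict)


def is_mapping(obj):
    # The abstract base class also accepts dict-like objects that do not
    # inherit from dict at all.
    return isinstance(obj, Mapping)


ordered = OrderedDict([("my_bool", "bool"), ("my_int", "int")])
read_only = MappingProxyType({"my_bool": "bool"})

print(is_exact_dict(ordered), is_dict(ordered), is_mapping(ordered))
print(is_exact_dict(read_only), is_dict(read_only), is_mapping(read_only))
```

An `OrderedDict` fails only the exact-type check, while a read-only mapping proxy passes only the `Mapping` check.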

if isinstance(dtype, collections.Mapping):
df = DataFrame(data=self, copy=copy, **kwargs)
for col_key, col_dtype in dtype.items():
df.ix[:, col_key] = df.ix[:, col_key].astype(col_dtype)
Contributor

this iterative updating will be quite inefficient as it will potentially copy the entire new dtype each time as it grows; this should change with the internals refactor #11970 , but for now do something like:

astyped_cols = pd.concat([df[col_key].astype(col_dtype) for col_key, col_dtype in dtype.items()], axis=1)
return pd.concat([astyped_cols, df[df.columns.difference(dtype.keys())]], axis=1).reindex(columns=df.columns)

which should minimize the copying.
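The suggestion above can be sketched as a standalone function. This is not the code that was merged; `astype_by_column` is a hypothetical name, and the sketch assumes unique column labels and adds its own guards for an empty mapping and for unknown columns.

```python
import pandas as pd


def astype_by_column(df, dtype):
    """Hypothetical helper: cast the columns named in `dtype`, keep the rest."""
    if not dtype:
        # Nothing to cast; pd.concat would reject an empty list.
        return df.copy()

    missing = [key for key in dtype if key not in df.columns]
    if missing:
        raise KeyError("columns not found: {}".format(missing))

    # Cast each requested column once, then assemble them in a single concat.
    astyped_cols = pd.concat(
        [df[col_key].astype(col_dtype) for col_key, col_dtype in dtype.items()],
        axis=1,
    )
    # Columns that were not mentioned in the mapping are carried over as-is.
    untouched = df[df.columns.difference(list(dtype))]
    # Restore the original column order.
    return pd.concat([astyped_cols, untouched], axis=1).reindex(columns=df.columns)


df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": ["x", "y", "z"], "c": [1, 2, 3]})
out = astype_by_column(df, {"c": "float64", "a": "int64"})
print(out.dtypes)
```

Building the result with two `concat` calls avoids assigning into the frame once per column, which is the repeated copying the comment warns about.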

@jreback
Contributor

jreback commented Jan 19, 2016

astype currently has the copy kw. This is a bit bogus, as astype almost always actually copies when it is changing to a new dtype (the exception is when it is converting to the same dtype).

Related to #5769. I think we should change this to inplace (and deprecate copy), as it's a bit clearer what's actually going on.

This could be done in another PR (before this one), or in this one is ok too.

@jreback
Contributor

jreback commented Jan 19, 2016

Further, I wouldn't use .ix here, as that would interpret numeric values as positional indexers for columns. That is a potentially confusing situation (and we don't do this anywhere else). Instead use []; if a non-matching column is passed, it will simply raise a KeyError.

So pls add some more tests, including some invalid columns.
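A short sketch of the label-based behaviour requested here, using integer column labels so the difference from positional lookup is visible. The frame is invented for the example, and the `KeyError` shown is what the pandas releases I am aware of raise for a dict key that is not a column label.

```python
import pandas as pd

# Integer column *labels* 10 and 20; neither is a valid position-as-label.
df = pd.DataFrame({10: [1, 0], 20: [3.5, 4.5]})

# Keys are matched against column labels, so 10 selects the first column.
converted = df.astype({10: "bool"})

# 0 is a valid position but not a label, so the cast is rejected.
error = None
try:
    df.astype({0: "bool"})
except KeyError as exc:
    error = exc

print(converted.dtypes)
print(type(error).__name__)
```

A test for invalid columns can follow the same pattern: pass a key that is absent from `df.columns` and assert that a `KeyError` is raised.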

df.ix[:, col_key] = df.ix[:, col_key].astype(col_dtype)
return df
else:
return super(DataFrame, self).astype(dtype=dtype, copy=copy,
Contributor

You can simply return here instead of using else.

@jreback jreback added the Enhancement and Dtype Conversions (unexpected or buggy dtype conversions) labels Jan 19, 2016
@jreback
Contributor

jreback commented Mar 12, 2016

@StephenKappel can you rebase/update

@jreback jreback added this to the Next Major Release milestone Apr 5, 2016
@jreback
Contributor

jreback commented May 7, 2016

@StephenKappel can you rebase / update

@jreback jreback removed this from the Next Major Release milestone May 7, 2016
@StephenKappel
Contributor Author

Sorry for neglecting this. The semester is finally over, so I'll get this cleaned up today :-)

@StephenKappel StephenKappel force-pushed the 7271-df-astype-dict branch 2 times, most recently from 5cb7b25 to 6dc0800 Compare May 8, 2016 23:27
@StephenKappel StephenKappel force-pushed the 7271-df-astype-dict branch from 6dc0800 to 0aeee8d Compare May 9, 2016 02:33
@StephenKappel
Contributor Author

@jreback - Rebased and updated based on your earlier feedback. Please let me know if any further updates are needed.

Parameters
----------
dtype : numpy.dtype or Python type (to cast entire DataFrame to the
same type). Alternatively, {col: dtype, ...}, where col is a column
Contributor

Why are you adding a method here? This should all be done in generic.py.

pijucha and others added 23 commits May 31, 2016 10:12
…s-dev#13288)

closes pandas-dev#13104
closes pandas-dev#13288

Author: Piotr Jucha <[email protected]>

Closes pandas-dev#13298 from pijucha/bug13104 and squashes the following commits:

9a6bd6e [Piotr Jucha] BUG: Fix describe(): percentiles (pandas-dev#13104), col index (pandas-dev#13288)
Title is self-explanatory.  Closes pandas-dev#13304.

Author: gfyoung <[email protected]>

Closes pandas-dev#13309 from gfyoung/ordereddict-key-ordering-init and squashes the following commits:

4f311cc [gfyoung] ENH: Respect key ordering for OrderedDict list in DataFrame init
closes pandas-dev#13324

Author: Roger Thomas <[email protected]>

Closes pandas-dev#13326 from RogerThomas/fix_maybe_convert_numeric_for_unhashable_objects and squashes the following commits:

76a0738 [Roger Thomas] Fix maybe_convert_numeric for unhashable objects
Author: Jeff Reback <[email protected]>

Closes pandas-dev#13336 from jreback/is_monotonic and squashes the following commits:

0a50ff9 [Jeff Reback] ENH: Series has gained the properties .is_monotonic, .is_monotonic_increasing, .is_monotonic_decreasing
closes pandas-dev#13338

Author: Jeff Reback <[email protected]>

Closes pandas-dev#13339 from jreback/eval and squashes the following commits:

b2ee5e8 [Jeff Reback] TST: computation/test_eval.py tests (slow)
Fixes bug in which the Python parser failed to detect trailing `NaN`
values in rows

Author: gfyoung <[email protected]>

Closes pandas-dev#13320 from gfyoung/trailing-nan-conversion and squashes the following commits:

590874d [gfyoung] BUG: Parse trailing NaN values for the Python parser
Small thing I just noticed in the docs (the note on the other version
was not updated when the example was changed from cythonmagic ->
Cython)

Author: Joris Van den Bossche <[email protected]>

Closes pandas-dev#13343 from jorisvandenbossche/doc-cythonmagic and squashes the following commits:

902352c [Joris Van den Bossche] DOC: fix comment on previous versions cythonmagic
closes pandas-dev#13306

Author: Uwe Hoffmann <[email protected]>

Closes pandas-dev#13313 from uwedeportivo/master and squashes the following commits:

be3ed90 [Uwe Hoffmann] whatsnew entry for issue pandas-dev#13306
1f5f7a5 [Uwe Hoffmann] Code Review jreback
82f263a [Uwe Hoffmann] Use vectorized searchsorted and tests.
a1ed5a5 [Uwe Hoffmann] Fix pandas-dev#13306: Hour overflow in tz-aware datetime conversions.
Title is self-explanatory.    xref pandas-dev#12686 - I don't quite understand
why these are marked (if at all) as internal to the C engine only, as
the benefits for having these options accepted for the Python engine
is quite clear based on the documentation I added as well.
Implementation simply just calls the already-written function in
`pandas/parsers.pyx` - as it isn't specific to the `TextReader` class,
crossing over to grab this function from Cython (instead of
duplicating in pure Python) seems reasonable while maintaining that
separation between the C and Python engines.

Author: gfyoung <[email protected]>

Closes pandas-dev#13323 from gfyoung/python-engine-compact-ints and squashes the following commits:

95f7ba8 [gfyoung] ENH: Add support for compact_ints and use_unsigned in Python engine
…y arrays

closes pandas-dev#13006

Author: Gábor Lipták <[email protected]>

Closes pandas-dev#13307 from gliptak/seriescomp1 and squashes the following commits:

4967db4 [Gábor Lipták] Fix series comparison operators when dealing with zero rank numpy arrays
Author: Chris Warth <[email protected]>

Closes pandas-dev#12399 from cswarth/doc/df_filter and squashes the following commits:

f48e9ff [Chris Warth] DOC: Add example usage to DataFrame.filter
Author: babakkeyvani <[email protected]>

Closes pandas-dev#13366 from bkeyvani/master and squashes the following commits:

029ade7 [babakkeyvani] DOC: Fixed a minor typo
Title is self-explanatory.

Author: gfyoung <[email protected]>

Closes pandas-dev#13368 from gfyoung/doublequote-doc and squashes the following commits:

f3e01fc [gfyoung] DOC: document doublequote in read_csv
`buffer_lines` is not respected, as it is determined internally via a
heuristic involving `table_width` (see <a href="https://github.com/pyd
ata/pandas/blob/master/pandas/parser.pyx#L527">here</a> for how it is
computed).

Author: gfyoung <[email protected]>

Closes pandas-dev#13360 from gfyoung/buffer-lines-depr-doc and squashes the following commits:

a72ecbe [gfyoung] DEPR, DOC: Deprecate buffer_lines in read_csv
…ined categorical columns

closes pandas-dev#13231

Author: Christian Hudon <[email protected]>

Closes pandas-dev#13359 from chrish42/gh13231 and squashes the following commits:

e839638 [Christian Hudon] Raise a better exception when the HDF file is empty and kwy=None.
611aa28 [Christian Hudon] Formatting fixes.
e7c8313 [Christian Hudon] Add changelog entry.
df10016 [Christian Hudon] Make logic that detects if there is only one dataset in a HDF5 file work when storing a dataframe that contains categorical data.
2f41aef [Christian Hudon] Tweak comment to be clearer.
b3a5773 [Christian Hudon] Add test that fails for GitHub bug pandas-dev#13231
02f90d5 [Christian Hudon] Use if-expression.
* Typo correction

* removed deprecated script
Author: Jeff Reback <[email protected]>

Closes pandas-dev#13372 from jreback/skiplist and squashes the following commits:

e05ea24 [Jeff Reback] CLN: remove old skiplist code
@StephenKappel
Contributor Author

Ahh. Not sure how, but rebasing added everyone else's commits to this diff. Closing this PR and opening a clean one.

@jreback
Contributor

jreback commented Jun 5, 2016

You don't need to make a new PR; just rebase and force push.

@StephenKappel
Contributor Author

I created a clean branch off of prod, cherry-picked my commits, and force pushed to this branch, but for some reason GitHub isn't recognizing it. Perhaps it's because I already closed the PR? But, I don't think I have access to reopen the PR...

@StephenKappel
Contributor Author

Oh. It seems the PR can't be reopened because the branch has been subsequently force pushed :-(

Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: df.astype could accept a dict of {col: type}