ENH: accept dict of column:dtype as dtype argument in DataFrame.astype #12086

StephenKappel · 2016-01-18T22:33:29Z

Continued in #13375

By passing a dict of {column name/column index: dtype}, multiple columns can be cast to different data types in a single command. Now users can do:

df = df.astype({'my_bool': 'bool', 'my_int': 'int'})

or:

df = df.astype({0: 'bool', 1: 'int'})

instead of:

df['my_bool'] = df.my_bool.astype('bool')
df['my_int'] = df.my_int.astype('int')
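
As a rough illustration of the usage described above, here is a minimal sketch. It assumes a pandas version in which DataFrame.astype accepts a dict keyed by column label; the frame and its values are made up for the example, and only label keys are shown.

```python
import pandas as pd

# Hypothetical frame; the column names follow the description above.
df = pd.DataFrame({"my_bool": [1, 0, 1], "my_int": [1.0, 2.0, 3.0]})

# One call casts each listed column to its own dtype.
converted = df.astype({"my_bool": "bool", "my_int": "int64"})

# The column-by-column form, for comparison.
manual = df.copy()
manual["my_bool"] = manual["my_bool"].astype("bool")
manual["my_int"] = manual["my_int"].astype("int64")
```

Both forms should produce the same frame; the dict form just avoids repeating the assignment for every column.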

wesm · 2016-01-18T22:35:22Z

pandas/core/frame.py

+        -------
+        casted : type of caller
+        """
+        if type(dtype) == dict:


use isinstance here to get subclasses, too

isinstance(dtype, collections.Mapping) if you want to be really precise
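
The difference between the three checks can be sketched as below. This is only an illustration with made-up values; note that on current Python versions the Mapping ABC is imported from collections.abc rather than collections.

```python
from collections import OrderedDict
from collections.abc import Mapping
from types import MappingProxyType

# A dict subclass and a read-only mapping that is not a dict at all.
ordered_spec = OrderedDict([("my_bool", "bool"), ("my_int", "int")])
proxy_spec = MappingProxyType({"my_bool": "bool"})

# Exact type comparison rejects subclasses.
exact_match = type(ordered_spec) == dict

# isinstance against dict accepts subclasses, but not other mapping types.
subclass_match = isinstance(ordered_spec, dict)
proxy_is_dict = isinstance(proxy_spec, dict)

# isinstance against the Mapping ABC accepts any mapping.
ordered_is_mapping = isinstance(ordered_spec, Mapping)
proxy_is_mapping = isinstance(proxy_spec, Mapping)
```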

jreback · 2016-01-19T02:16:56Z

pandas/core/frame.py

+        if isinstance(dtype, collections.Mapping):
+            df = DataFrame(data=self, copy=copy, **kwargs)
+            for col_key, col_dtype in dtype.items():
+                df.ix[:, col_key] = df.ix[:, col_key].astype(col_dtype)


This iterative updating will be quite inefficient, as it will potentially copy the entire new dtype each time as it grows. This should change with the internals refactor (#11970), but for now do something like:

astyped_cols = pd.concat([df[col_key].astype(col_dtype) for col_key, col_dtype in dtype.items()], axis=1)
return pd.concat([astyped_cols, df[df.columns.difference(dtype.keys())]], axis=1).reindex(columns=df.columns)

which should minimize the copying.
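
A self-contained sketch of the suggestion above might look like this. The function name astype_columns and the sample frame are hypothetical, it assumes unique column labels, and it is not the code that ended up in pandas.

```python
import pandas as pd


def astype_columns(df, dtype_map):
    """Cast the columns named in dtype_map, leaving the others untouched."""
    if not dtype_map:
        # pd.concat raises on an empty list, so handle this case up front.
        return df.copy()

    # Convert each requested column once, then assemble them in a single concat.
    converted = pd.concat(
        [df[col].astype(col_dtype) for col, col_dtype in dtype_map.items()],
        axis=1,
    )
    # Columns that were not mentioned in the mapping.
    untouched = df[df.columns.difference(list(dtype_map))]

    # Index.difference may reorder labels, so restore the original order.
    return pd.concat([converted, untouched], axis=1).reindex(columns=df.columns)


frame = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [1, 0]})
result = astype_columns(frame, {"a": "float64", "c": "bool"})
```

The final reindex is what keeps the column order identical to the input, regardless of the order of the keys in the mapping.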

jreback · 2016-01-19T02:19:31Z

astype currently has the copy kw. This is a bit bogus, as it almost always actually does copy if it's changing to a new dtype (the exception is when it's going to the same dtype).

Related to #5769. I think we should change this to inplace (and deprecate copy), as it's a bit clearer what's actually going on.

This could be done in another PR (before this one), or in this one is ok too.

jreback · 2016-01-19T02:21:41Z

Further, I wouldn't use .ix here, as that would interpret numeric values as positional indexers for columns, which is a potentially confusing situation (and we don't do this anywhere else). Instead use []: if a non-matching column is passed, it will simply raise a KeyError.

So pls add some more tests, including some invalid columns.
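
The label-based behaviour requested here can be sketched as follows. The helper names cast_column and raises_key_error are hypothetical, and the frame is made up; the point is only that [] looks keys up as labels, so both an unknown name and an integer that is not a label raise KeyError.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})


def cast_column(frame, col_key, col_dtype):
    """Hypothetical helper: look the column up by label with [], then cast."""
    return frame[col_key].astype(col_dtype)


def raises_key_error(col_key):
    """Report whether casting the given key fails with KeyError."""
    try:
        cast_column(df, col_key, "float64")
    except KeyError:
        return True
    return False


missing_name_fails = raises_key_error("missing")  # unknown label
integer_key_fails = raises_key_error(0)           # 0 is not a label here
valid_name_fails = raises_key_error("a")          # existing label
```

A test along these lines covers the invalid-column case without relying on positional interpretation of numeric keys.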

jreback · 2016-01-19T02:22:01Z

pandas/core/frame.py

+                df.ix[:, col_key] = df.ix[:, col_key].astype(col_dtype)
+            return df
+        else:
+            return super(DataFrame, self).astype(dtype=dtype, copy=copy,


You can simply return here instead of using else.

jreback · 2016-03-12T17:48:47Z

@StephenKappel can you rebase/update

jreback · 2016-05-07T18:51:48Z

@StephenKappel can you rebase / update

StephenKappel · 2016-05-08T15:50:39Z

Sorry for neglecting this. The semester is finally over, so I'll get this cleaned up today :-)

StephenKappel · 2016-05-09T02:34:02Z

@jreback - Rebased and updated based on your earlier feedback. Please let me know if any further updates are needed.

jreback · 2016-05-09T13:28:56Z

pandas/core/frame.py

+        Parameters
+        ----------
+        dtype : numpy.dtype or Python type (to cast entire DataFrame to the
+            same type). Alternatively, {col: dtype, ...}, where col is a column


Why are you adding a method here? This should all be done in generic.py.


StephenKappel · 2016-06-05T23:41:37Z

Ahh. Not sure how, but rebasing added everyone else's commits to this diff. Closing this PR and opening a clean one.

jreback · 2016-06-05T23:50:01Z

you don't need to make a new PR just rebase and force push

StephenKappel · 2016-06-06T00:02:56Z

I created a clean branch off of prod, cherry-picked my commits, and force pushed to this branch, but for some reason GitHub isn't recognizing it. Perhaps it's because I already closed the PR? But, I don't think I have access to reopen the PR...

StephenKappel · 2016-06-06T00:05:13Z

Oh. It seems the PR can't be reopened because the branch has been subsequently force pushed :-(

wesm reviewed Jan 18, 2016

StephenKappel force-pushed the 7271-df-astype-dict branch from 68f73e4 to 957e1cb Compare January 18, 2016 23:58

jreback reviewed Jan 19, 2016

jreback added the Enhancement and Dtype Conversions labels Jan 19, 2016

jreback mentioned this pull request Apr 5, 2016

pd.DataFrame.astype should allow dict as argument #12801

Closed

jreback added Difficulty Intermediate labels Apr 5, 2016

jreback added this to the Next Major Release milestone Apr 5, 2016

jreback removed this from the Next Major Release milestone May 7, 2016

StephenKappel force-pushed the 7271-df-astype-dict branch 2 times, most recently from 5cb7b25 to 6dc0800 Compare May 8, 2016 23:27

ENH: inplace dtype changes, df per-column dtype changes; GH7271

0aeee8d

StephenKappel force-pushed the 7271-df-astype-dict branch from 6dc0800 to 0aeee8d Compare May 9, 2016 02:33

jreback reviewed May 9, 2016

ENH: NDFrame astype() now accepts inplace arg and dtype arg can be a …

58dd71b

…mapping of col to type; GH7271

pijucha and others added 23 commits May 31, 2016 10:12

TST: computation/test_eval.py tests (slow)

2e3c82e

closes pandas-dev#13338 Author: Jeff Reback <[email protected]> Closes pandas-dev#13339 from jreback/eval and squashes the following commits: b2ee5e8 [Jeff Reback] TST: computation/test_eval.py tests (slow)

BUG: GH13219 Fixed. Allow unicode values in usecols

fcd73ad

closes pandas-dev#13219 closes pandas-dev#13233

DOC: Add example usage to DataFrame.filter

103f7d3

Author: Chris Warth <[email protected]> Closes pandas-dev#12399 from cswarth/doc/df_filter and squashes the following commits: f48e9ff [Chris Warth] DOC: Add example usage to DataFrame.filter

DOC: Fixed a minor typo

faf9b7d

Author: babakkeyvani <[email protected]> Closes pandas-dev#13366 from bkeyvani/master and squashes the following commits: 029ade7 [babakkeyvani] DOC: Fixed a minor typo

DOC: document doublequote in read_csv

eca7891

Title is self-explanatory. Author: gfyoung <[email protected]> Closes pandas-dev#13368 from gfyoung/doublequote-doc and squashes the following commits: f3e01fc [gfyoung] DOC: document doublequote in read_csv

DOC: remove obsolete cron job script (pandas-dev#13369)

e90d411

* Typo correction * removed deprecated script

CLN: remove old skiplist code

b722222

Author: Jeff Reback <[email protected]> Closes pandas-dev#13372 from jreback/skiplist and squashes the following commits: e05ea24 [Jeff Reback] CLN: remove old skiplist code

ENH: incorporate PR feedback; GH7271

3600bca

ENH: inplace dtype changes, df per-column dtype changes; GH7271

29ecec0

ENH: NDFrame astype() now accepts inplace arg and dtype arg can be a …

95a029b

…mapping of col to type; GH7271

ENH: incorporate PR feedback; GH7271

9d8e1b5

resolve merge conflict in rebasing of 7271-df-astype-dict

c960523

StephenKappel closed this Jun 5, 2016

StephenKappel mentioned this pull request Jun 6, 2016

ENH: astype() can now take col label -> dtype mapping as arg; GH7271 #13375

Closed

jorisvandenbossche added this to the No action milestone Sep 23, 2016
