
BUG: Series dtype casting to platform numeric (GH #2751) #2838


Merged (9 commits, Feb 14, 2013)
3 changes: 3 additions & 0 deletions RELEASE.rst
@@ -53,6 +53,9 @@ pandas 0.11.0

- Do not automatically upcast numeric specified dtypes to ``int64`` or
``float64`` (GH622_ and GH797_)
- DataFrame construction from lists and scalars, with no dtype specified, will
  result in ``int64`` or ``float64`` dtypes, regardless of platform.
  This is not an apparent API change, but is worth noting.
- Guarantee that ``convert_objects()`` for Series/DataFrame always returns a
copy
- groupby operations will respect dtypes for numeric float operations
82 changes: 57 additions & 25 deletions doc/source/v0.11.0.txt
@@ -3,7 +3,7 @@
v0.11.0 (March ??, 2013)
------------------------

This is a major release from 0.10.1 and includes many new features and
enhancements along with a large number of bug fixes. There are also a number of
important API changes that long-time pandas users should pay close attention
to.
@@ -13,7 +13,8 @@ API changes

Numeric dtypes will propagate and can coexist in DataFrames. If a dtype is passed (either directly via the ``dtype`` keyword, a passed ``ndarray``, or a passed ``Series``), then it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will **NOT** be combined. The following example will give you a taste.

Dtype Specification
~~~~~~~~~~~~~~~~~~~

.. ipython:: python

@@ -29,7 +30,8 @@ Numeric dtypes will propagate and can coexist in DataFrames. If a dtype is passe
df3
df3.dtypes

Dtype Conversion
~~~~~~~~~~~~~~~~

.. ipython:: python

@@ -54,6 +56,26 @@ Numeric dtypes will propagate and can coexist in DataFrames. If a dtype is passe
Timestamp('20010104'), '20010105'],dtype='O')
s.convert_objects(convert_dates='coerce')

Dtype Gotchas
~~~~~~~~~~~~~

**Platform Gotchas**

Starting in 0.11.0, construction of DataFrame/Series will use default dtypes of ``int64`` and ``float64``,
*regardless of platform*. This is not an apparent change from earlier versions of pandas. If you specify
dtypes, they *WILL* be respected, however (GH2837_).

The following will all result in ``int64`` dtypes

.. ipython:: python

DataFrame([1,2],columns=['a']).dtypes
DataFrame({'a' : [1,2] }).dtypes
DataFrame({'a' : 1 }, index=range(2)).dtypes

Keep in mind that ``DataFrame(np.array([1,2]))`` **WILL** result in ``int32`` on 32-bit platforms!
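To sidestep the 32-bit gotcha, an explicit dtype can be passed and will be respected. A minimal standalone sketch of the behavior described above (written as a plain script rather than the surrounding ipython snippets):

```python
import numpy as np
import pandas as pd

# without a dtype, np.array([1, 2]) is platform-dependent (int32 on 32-bit);
# specifying int64 explicitly pins the DataFrame dtype on every platform
df = pd.DataFrame(np.array([1, 2], dtype='int64'), columns=['a'])
print(df.dtypes['a'])  # int64
```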


**Upcasting Gotchas**

Performing indexing operations on integer type data can easily upcast the data.
@@ -82,21 +104,13 @@ While float dtypes are unchanged.
casted
casted.dtypes
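To make the upcasting gotcha concrete (a standalone sketch; the index labels here are invented for illustration), introducing missing values into an integer Series promotes it to ``float64``:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(s.dtype)  # int64
# reindexing introduces a missing 'd' entry, which forces float64
print(s.reindex(['a', 'b', 'c', 'd']).dtype)  # float64
```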

Datetimes Conversion
~~~~~~~~~~~~~~~~~~~~

Datetime64[ns] columns in a DataFrame (or a Series) allow the use of ``np.nan`` to indicate a nan value,
in addition to the traditional ``NaT``, or not-a-time. This allows convenient nan setting in a generic way.
Furthermore, ``datetime64[ns]`` columns are created by default when passed datetimelike objects (*this change was introduced in 0.10.1*)
(GH2809_, GH2810_)

.. ipython:: python

@@ -111,8 +125,7 @@ Datetime64[ns] columns in a DataFrame (or a Series) allow the use of ``np.nan``
df.ix[2:4,['A','timestamp']] = np.nan
df

Astype conversion on ``datetime64[ns]`` to ``object`` implicitly converts ``NaT`` to ``np.nan``

.. ipython:: python

@@ -127,13 +140,32 @@ Astype conversion on datetime64[ns] to object, implicity converts ``NaT`` to ``n
s.dtype


New features
~~~~~~~~~~~~

**Enhancements**

- In ``HDFStore``, provide dotted attribute access to ``get`` from stores
  (e.g. store.df == store['df'])

- ``Squeeze`` to possibly remove length 1 dimensions from an object.

  .. ipython:: python

     p = Panel(randn(3,4,4),items=['ItemA','ItemB','ItemC'],
               major_axis=date_range('20010102',periods=4),
               minor_axis=['A','B','C','D'])
     p
     p.reindex(items=['ItemA']).squeeze()
     p.reindex(items=['ItemA'],minor=['B']).squeeze()

**Bug Fixes**

See the `full release notes
<https://github.com/pydata/pandas/blob/master/RELEASE.rst>`__ or issue tracker
on GitHub for a complete list.

.. _GH2809: https://github.com/pydata/pandas/issues/2809
.. _GH2810: https://github.com/pydata/pandas/issues/2810
.. _GH2837: https://github.com/pydata/pandas/issues/2837

157 changes: 104 additions & 53 deletions pandas/core/common.py
@@ -24,6 +24,7 @@
from pandas.util.py3compat import StringIO, BytesIO

from pandas.core.config import get_option
from pandas.core import array as pa

# XXX: HACK for NumPy 1.5.1 to suppress warnings
try:
@@ -503,7 +504,7 @@ def take_1d(arr, indexer, out=None, fill_value=np.nan):
dtype, fill_value = arr.dtype, arr.dtype.type()
else:
indexer = _ensure_int64(indexer)
dtype = _maybe_promote(arr.dtype, fill_value)[0]
if dtype != arr.dtype:
mask = indexer == -1
needs_masking = mask.any()
@@ -551,7 +552,7 @@ def take_2d_multi(arr, row_idx, col_idx, fill_value=np.nan, out=None):
else:
col_idx = _ensure_int64(col_idx)

dtype = _maybe_promote(arr.dtype, fill_value)[0]
if dtype != arr.dtype:
row_mask = row_idx == -1
col_mask = col_idx == -1
@@ -587,7 +588,7 @@ def diff(arr, n, axis=0):
n = int(n)
dtype = arr.dtype
if issubclass(dtype.type, np.integer):
dtype = np.float64
elif issubclass(dtype.type, np.bool_):
dtype = np.object_

@@ -628,7 +629,7 @@ def take_fast(arr, indexer, mask, needs_masking, axis=0, out=None,
else:
indexer = _ensure_int64(indexer)
if needs_masking:
dtype = _maybe_promote(arr.dtype, fill_value)[0]
if dtype != arr.dtype and out is not None and out.dtype != dtype:
raise Exception('Incompatible type for fill_value')
else:
@@ -644,49 +645,110 @@
return out


def _infer_dtype_from_scalar(val):
    """ interpret the dtype from a scalar, upcast floats and ints
        return the new value and the dtype """

    dtype = np.object_

    # a 1-element ndarray
    if isinstance(val, pa.Array):
        if val.ndim != 0:
            raise ValueError("invalid ndarray passed to _infer_dtype_from_scalar")

        dtype = val.dtype
        val = val.item()

    elif isinstance(val, basestring):

        # If we create an empty array using a string to infer
        # the dtype, NumPy will only allocate one character per entry
        # so this is kind of bad. Alternately we could use np.repeat
        # instead of np.empty (but then you still don't want things
        # coming out as np.str_!

        dtype = np.object_

    elif isinstance(val, np.datetime64):
        # ugly hacklet
        val = lib.Timestamp(val).value
        dtype = np.dtype('M8[ns]')

    elif is_bool(val):
        dtype = np.bool_

    # provide implicit upcast on scalars
    elif is_integer(val):
        dtype = np.int64

    elif is_float(val):
        dtype = np.float64

    elif is_complex(val):
        dtype = np.complex_

    return dtype, val
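A runnable approximation of the scalar branches above (the function name here is a hypothetical stand-in for ``_infer_dtype_from_scalar``; the string, datetime64, and ndarray handling are omitted):

```python
import numpy as np

def infer_dtype_from_scalar_sketch(val):
    # bool must be tested before int: bool is a subclass of int in Python
    if isinstance(val, (bool, np.bool_)):
        return np.bool_, val
    if isinstance(val, (int, np.integer)):
        return np.int64, val      # upcast to 64 bits regardless of platform
    if isinstance(val, (float, np.floating)):
        return np.float64, val
    if isinstance(val, (complex, np.complexfloating)):
        return np.complex128, val
    return np.object_, val

dtype, val = infer_dtype_from_scalar_sketch(5)
# dtype is np.int64 even on a 32-bit platform
```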

def _maybe_promote(dtype, fill_value=np.nan):
    # returns tuple of (dtype, fill_value)
    if issubclass(dtype.type, np.datetime64):
        # for now: refuse to upcast datetime64
        # (this is because datetime64 will not implicitly upconvert
        #  to object correctly as of numpy 1.6.1)
        if isnull(fill_value):
            fill_value = tslib.iNaT
        else:
            try:
                fill_value = lib.Timestamp(fill_value).value
            except:
                # the proper thing to do here would probably be to upcast to
                # object (but numpy 1.6.1 doesn't do this properly)
                fill_value = tslib.iNaT
    elif is_float(fill_value):
        if issubclass(dtype.type, np.bool_):
            dtype = np.object_
        elif issubclass(dtype.type, np.integer):
            dtype = np.float64
    elif is_bool(fill_value):
        if not issubclass(dtype.type, np.bool_):
            dtype = np.object_
    elif is_integer(fill_value):
        if issubclass(dtype.type, np.bool_):
            dtype = np.object_
        elif issubclass(dtype.type, np.integer):
            # upcast to prevent overflow
            arr = np.asarray(fill_value)
            if arr != arr.astype(dtype):
                dtype = arr.dtype
    elif is_complex(fill_value):
        if issubclass(dtype.type, np.bool_):
            dtype = np.object_
        elif issubclass(dtype.type, (np.integer, np.floating)):
            dtype = np.complex128
    else:
        dtype = np.object_
    return dtype, fill_value

def _maybe_upcast(values, fill_value=np.nan, copy=False):
    """ provide explicit type promotion and coercion
        if copy == True, then a copy is created even if no upcast is required """

    new_dtype, fill_value = _maybe_promote(values.dtype, fill_value)
    if new_dtype != values.dtype:
        values = values.astype(new_dtype)
    elif copy:
        values = values.copy()
    return values, fill_value

def _possibly_cast_item(obj, item, dtype):
    chunk = obj[item]

    if chunk.values.dtype != dtype:
        if dtype in (np.object_, np.bool_):
            obj[item] = chunk.astype(np.object_)
        elif not issubclass(dtype, (np.integer, np.bool_)):  # pragma: no cover
            raise ValueError("Unexpected dtype encountered: %s" % dtype)

def _interp_wrapper(f, wrap_dtype, na_override=None):
def wrapper(arr, mask, limit=None):
@@ -808,7 +870,8 @@ def _consensus_name_attr(objs):
def _possibly_convert_objects(values, convert_dates=True, convert_numeric=True):
    """ if we have an object dtype, try to coerce dates and/or numbers """

    # convert dates
    if convert_dates and values.dtype == np.object_:

# we take an aggressive stance and convert to datetime64[ns]
if convert_dates == 'coerce':
@@ -821,7 +884,8 @@ def _possibly_convert_objects(values, convert_dates=True, convert_numeric=True):
else:
values = lib.maybe_convert_objects(values, convert_datetime=convert_dates)

    # convert to numeric
    if convert_numeric and values.dtype == np.object_:
try:
new_values = lib.maybe_convert_numeric(values,set(),coerce_numeric=True)

@@ -834,6 +898,16 @@

return values

def _possibly_convert_platform(values):
    """ try to do platform conversion, allow ndarray or list here """

    if isinstance(values, (list, tuple)):
        values = lib.list_to_object_array(values)
    if values.dtype == np.object_:
        values = lib.maybe_convert_objects(values)

    return values
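An illustrative, numpy-only stand-in for this helper (the real inference lives in the pandas C routine ``lib.maybe_convert_objects``; this sketch handles only the all-integer case):

```python
import numpy as np

def convert_platform_sketch(values):
    # list/tuple -> object ndarray, as lib.list_to_object_array does
    if isinstance(values, (list, tuple)):
        arr = np.empty(len(values), dtype=object)
        arr[:] = values
        values = arr
    # crude inference: an all-integer object array becomes int64,
    # independent of the platform's default integer size
    if values.dtype == np.object_ and all(isinstance(v, int) for v in values):
        values = values.astype(np.int64)
    return values

out = convert_platform_sketch([1, 2, 3])
print(out.dtype)  # int64
```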


def _possibly_cast_to_datetime(value, dtype, coerce = False):
""" try to cast the array/value to a datetimelike dtype, converting float nan to iNaT """
@@ -876,29 +950,6 @@ def _possibly_cast_to_datetime(value, dtype, coerce = False):
return value


def _is_bool_indexer(key):
if isinstance(key, np.ndarray) and key.dtype == np.object_:
key = np.asarray(key)