ENH/BUG/DOC: allow propagation and coexistence of numeric dtypes #2708

Merged
merged 1 commit into from
Feb 10, 2013
39 changes: 38 additions & 1 deletion RELEASE.rst
@@ -22,6 +22,42 @@ Where to get it
* Binary installers on PyPI: http://pypi.python.org/pypi/pandas
* Documentation: http://pandas.pydata.org

pandas 0.10.2
=============

**Release date:** 2013-??-??

**New features**

- Allow mixed dtypes (e.g. ``float32/float64/int32/int16/int8``) to coexist in DataFrames and propagate in operations

**Improvements to existing features**

- added ``blocks`` attribute to DataFrames, to return a dict of dtypes to homogeneously dtyped DataFrames
- added keyword ``convert_numeric`` to ``convert_objects()`` to try to convert object dtypes to numeric types
- ``convert_dates`` in ``convert_objects`` can now be ``coerce`` which will return a datetime64[ns] dtype
  with non-convertibles set as ``NaT``; will preserve an all-nan object (e.g. strings)
- Series print output now includes the dtype by default

**API Changes**

- Do not automatically upcast specified numeric dtypes to ``int64`` or ``float64`` (GH622_ and GH797_)
- Guarantee that ``convert_objects()`` for Series/DataFrame always returns a copy
- groupby operations will respect dtypes for numeric float operations (float32/float64); other types will be operated on,
  and will try to cast back to the input dtype (e.g. if an int is passed, as long as the output doesn't have nans,
  then an int will be returned)
- backfill/pad/take/diff/ohlc will now support ``float32/int16/int8`` operations
- Integer block types will upcast as needed in where operations (GH2793_)

**Bug Fixes**

- Fix seg fault on empty data frame when fillna with ``pad`` or ``backfill`` (GH2778_)

.. _GH622: https://github.com/pydata/pandas/issues/622
.. _GH797: https://github.com/pydata/pandas/issues/797
.. _GH2778: https://github.com/pydata/pandas/issues/2778
.. _GH2793: https://github.com/pydata/pandas/issues/2793

pandas 0.10.1
=============

@@ -36,6 +72,7 @@ pandas 0.10.1
- Restored inplace=True behavior returning self (same object) with
deprecation warning until 0.11 (GH1893_)
- ``HDFStore``

- refactored HDFStore to deal with non-table stores as objects, will allow future enhancements
- removed keyword ``compression`` from ``put`` (replaced by keyword
``complib`` to be consistent across library)
@@ -49,7 +86,7 @@ pandas 0.10.1
- support data column indexing and selection, via ``data_columns`` keyword in append
- support write chunking to reduce memory footprint, via ``chunksize``
keyword to append
- support automagic indexing via ``index`` keywork to append
- support automagic indexing via ``index`` keyword to append
- support ``expectedrows`` keyword in append to inform ``PyTables`` about
the expected tablesize
- support ``start`` and ``stop`` keywords in select to limit the row
114 changes: 90 additions & 24 deletions doc/source/dsintro.rst
@@ -450,15 +450,101 @@ DataFrame:
df.xs('b')
df.ix[2]

Note if a DataFrame contains columns of multiple dtypes, the dtype of the row
will be chosen to accommodate all of the data types (dtype=object is the most
general).

For a more exhaustive treatment of more sophisticated label-based indexing and
slicing, see the :ref:`section on indexing <indexing>`. We will address the
fundamentals of reindexing / conforming to new sets of labels in the
:ref:`section on reindexing <basics.reindexing>`.

DataTypes
~~~~~~~~~

.. _dsintro.column_types:

The main types stored in pandas objects are float, int, boolean, datetime64[ns],
and object. A convenient ``dtypes`` attribute returns a Series with the data type of
each column.

.. ipython:: python

   df['integer'] = 1
   df['int32'] = df['integer'].astype('int32')
   df['float32'] = Series([1.0]*len(df),dtype='float32')
   df['timestamp'] = Timestamp('20010102')
   df.dtypes

If a column contains data of multiple dtypes, the dtype of the column
will be chosen to accommodate all of the data types (``dtype=object`` is the most
general).

The related method ``get_dtype_counts`` will return the number of columns of
each type:

.. ipython:: python

   df.get_dtype_counts()
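
As a rough sketch of what these counts represent, the same tally can be derived from ``dtypes`` itself. The toy frame and its column names are illustrative assumptions, and ``value_counts`` is used here in place of ``get_dtype_counts``; it is not claimed to be how ``get_dtype_counts`` is implemented.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one int32 column and two float32 columns.
df = pd.DataFrame({
    "a": np.arange(3, dtype="int32"),
    "b": np.ones(3, dtype="float32"),
    "c": np.zeros(3, dtype="float32"),
})

# dtypes is a Series of dtype objects, so counting its values
# gives the number of columns per dtype.
counts = {str(dtype): int(n) for dtype, n in df.dtypes.value_counts().items()}
print(counts)
```

Converting the dtype keys to strings only makes the result easier to compare and print.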

Numeric dtypes will propagate and can coexist in DataFrames (starting in v0.10.2).
If a dtype is passed (either directly via the ``dtype`` keyword, a passed ``ndarray``,
or a passed ``Series``), then it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will **NOT** be combined. The following example will give you a taste.

.. ipython:: python

   df1 = DataFrame(randn(8, 1), columns = ['A'], dtype = 'float32')
   df1
   df1.dtypes
   df2 = DataFrame(dict( A = Series(randn(8),dtype='float16'),
                         B = Series(randn(8)),
                         C = Series(np.array(randn(8),dtype='uint8')) ))
   df2
   df2.dtypes

   # here you get some upcasting
   df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
   df3
   df3.dtypes

   # this is lowest-common-denominator upcasting (meaning you get the dtype which can accommodate all of the types)
   df3.values.dtype

Upcasting is always according to the **numpy** rules. If two different dtypes are involved in an operation, then the more *general* one will be used as the result of the operation.
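
The numpy rules referred to here can be inspected directly. The following is a minimal sketch using ``np.promote_types`` and ``np.result_type``; the particular dtype pairs are chosen only for illustration.

```python
import numpy as np

# Two float widths: the wider float wins.
f16_f32 = np.promote_types(np.float16, np.float32)

# A small unsigned integer fits losslessly in float64.
u8_f64 = np.promote_types(np.uint8, np.float64)

# int32 cannot be represented exactly in float32, so numpy widens to float64.
i32_f32 = np.promote_types(np.int32, np.float32)

# The common dtype across several column dtypes, which is the kind of
# result ``.values.dtype`` reports for a frame mixing float32 and float64.
common = np.result_type(np.float32, np.float64, np.float64)

print(f16_f32, u8_f64, i32_f32, common)
```

The ``int32``/``float32`` case shows that the more general dtype is not always one of the two inputs.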

DataType Conversion
~~~~~~~~~~~~~~~~~~~

You can use the ``astype`` method to convert dtypes from one to another. It *always* returns a copy.
In addition, ``convert_objects`` will attempt a *soft* conversion of any *object* dtypes, meaning that if all the objects in a Series are of the same type, the Series
will have that dtype.

.. ipython:: python

   df3
   df3.dtypes

   # conversion of dtypes
   df3.astype('float32').dtypes
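
The copy semantics of ``astype`` can be checked with a small sketch. The frame and the variable names ``original`` and ``converted`` are hypothetical and serve only this illustration.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"A": np.array([1.5, 2.5, 3.5], dtype="float64")})

# astype returns a new object with the requested dtype.
converted = original.astype("float32")

# Writing to the converted frame does not touch the original data,
# and the original keeps its float64 dtype.
converted.loc[0, "A"] = 0.0
print(original.dtypes["A"], converted.dtypes["A"])
print(original.loc[0, "A"])
```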

To force numeric conversion, pass ``convert_numeric = True``.
This will force strings and numbers alike to be numbers if possible; otherwise they will be set to ``np.nan``.
To force conversion to ``datetime64[ns]``, pass ``convert_dates = 'coerce'``.
This will convert any datetimelike object to dates, forcing other values to ``NaT``.

.. ipython:: python

   # mixed type conversions
   df3['D'] = '1.'
   df3['E'] = '1'
   df3.convert_objects(convert_numeric=True).dtypes

   # same, but specific dtype conversion
   df3['D'] = df3['D'].astype('float16')
   df3['E'] = df3['E'].astype('int32')
   df3.dtypes

   # forcing date coercion
   s = Series([datetime(2001,1,1,0,0), 'foo', 1.0, 1, Timestamp('20010104'), '20010105'],dtype='O')
   s
   s.convert_objects(convert_dates='coerce')
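
The coercion behaviour described above (unconvertible values become ``np.nan`` or ``NaT``) can also be sketched without ``convert_objects``. This example assumes a later pandas release in which ``pandas.to_numeric`` and ``pandas.to_datetime`` accept ``errors='coerce'``; those functions are not part of the change documented here, and the sample values are made up.

```python
import pandas as pd

# Strings that look like numbers are parsed; anything else becomes NaN.
raw_numbers = pd.Series(["1.", "1", "foo"], dtype="object")
numbers = pd.to_numeric(raw_numbers, errors="coerce")

# Strings that look like dates are parsed; anything else becomes NaT.
raw_dates = pd.Series(["20010105", "foo"], dtype="object")
dates = pd.to_datetime(raw_dates, errors="coerce")

print(numbers)
print(dates)
```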

Data alignment and arithmetic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -633,26 +719,6 @@ You can also disable this feature via the ``expand_frame_repr`` option:
reset_option('expand_frame_repr')


DataFrame column types
~~~~~~~~~~~~~~~~~~~~~~

.. _dsintro.column_types:

The four main types stored in pandas objects are float, int, boolean, and
object. A convenient ``dtypes`` attribute return a Series with the data type of
each column:

.. ipython:: python

baseball.dtypes

The related method ``get_dtype_counts`` will return the number of columns of
each type:

.. ipython:: python

baseball.get_dtype_counts()

DataFrame column attribute access and IPython completion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

28 changes: 28 additions & 0 deletions doc/source/indexing.rst
@@ -304,6 +304,34 @@ so that the original data can be modified without creating a copy:

df.mask(df >= 0)

Upcasting Gotchas
~~~~~~~~~~~~~~~~~

Performing indexing operations on ``integer`` type data can easily upcast the data to ``floating``.
The dtype of the input data will be preserved in cases where ``nans`` are not introduced (coming soon).

.. ipython:: python

   dfi = df.astype('int32')
   dfi['E'] = 1
   dfi
   dfi.dtypes

   casted = dfi[dfi>0]
   casted
   casted.dtypes

Float dtypes, by contrast, are unchanged.

.. ipython:: python

   df2 = df.copy()
   df2['A'] = df2['A'].astype('float32')
   df2.dtypes

   casted = df2[df2>0]
   casted
   casted.dtypes
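
The reason behind the gotcha is that integer dtypes have no representation for ``NaN``, so a mask that introduces missing values forces a float result, while a float column can hold ``NaN`` as it is. A minimal sketch with two hypothetical single-column frames:

```python
import numpy as np
import pandas as pd

ints = pd.DataFrame({"A": np.array([-1, 2, 3], dtype="int32")})
floats = pd.DataFrame({"A": np.array([-1.0, 2.0, 3.0], dtype="float32")})

# Boolean indexing replaces the non-matching entry with NaN.
masked_ints = ints[ints > 0]        # integer column is upcast to a float dtype
masked_floats = floats[floats > 0]  # float32 column keeps its dtype

print(masked_ints.dtypes)
print(masked_floats.dtypes)
```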

Take Methods
~~~~~~~~~~~~
95 changes: 95 additions & 0 deletions doc/source/v0.10.2.txt
@@ -0,0 +1,95 @@
.. _whatsnew_0102:

v0.10.2 (February ??, 2013)
---------------------------

This is a minor release from 0.10.1 and includes many new features and
enhancements along with a large number of bug fixes. There are also a number of
important API changes that long-time pandas users should pay close attention
to.

API changes
~~~~~~~~~~~

Numeric dtypes will propagate and can coexist in DataFrames. If a dtype is passed (either directly via the ``dtype`` keyword, a passed ``ndarray``, or a passed ``Series``), then it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will **NOT** be combined. The following example will give you a taste.

**Dtype Specification**

.. ipython:: python

   df1 = DataFrame(randn(8, 1), columns = ['A'], dtype = 'float32')
   df1
   df1.dtypes
   df2 = DataFrame(dict( A = Series(randn(8),dtype='float16'), B = Series(randn(8)), C = Series(randn(8),dtype='uint8') ))
   df2
   df2.dtypes

   # here you get some upcasting
   df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
   df3
   df3.dtypes
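
A smaller sketch of the same idea, assuming three hypothetical single-column frames: an operation between equal dtypes keeps that dtype, and an operation between different dtypes yields the dtype numpy considers general enough for both.

```python
import numpy as np
import pandas as pd

narrow = pd.DataFrame({"A": np.ones(4, dtype="float32")})
small_ints = pd.DataFrame({"A": np.arange(4, dtype="int16")})
wide = pd.DataFrame({"A": np.ones(4, dtype="float64")})

same = narrow + narrow           # float32 + float32
mixed_int = narrow + small_ints  # float32 + int16
mixed_float = narrow + wide      # float32 + float64

print(same.dtypes["A"], mixed_int.dtypes["A"], mixed_float.dtypes["A"])
```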

**Dtype conversion**

.. ipython:: python

   # this is lowest-common-denominator upcasting (meaning you get the dtype which can accommodate all of the types)
   df3.values.dtype

   # conversion of dtypes
   df3.astype('float32').dtypes

   # mixed type conversions
   df3['D'] = '1.'
   df3['E'] = '1'
   df3.convert_objects(convert_numeric=True).dtypes

   # same, but specific dtype conversion
   df3['D'] = df3['D'].astype('float16')
   df3['E'] = df3['E'].astype('int32')
   df3.dtypes

   # forcing date coercion
   s = Series([datetime(2001,1,1,0,0), 'foo', 1.0, 1,
               Timestamp('20010104'), '20010105'],dtype='O')
   s.convert_objects(convert_dates='coerce')

**Upcasting Gotchas**

Performing indexing operations on integer type data can easily upcast the data.
The dtype of the input data will be preserved in cases where ``nans`` are not introduced (coming soon).

.. ipython:: python

   dfi = df3.astype('int32')
   dfi['D'] = dfi['D'].astype('int64')
   dfi
   dfi.dtypes

   casted = dfi[dfi>0]
   casted
   casted.dtypes

Float dtypes, by contrast, are unchanged.

.. ipython:: python

   df4 = df3.copy()
   df4['A'] = df4['A'].astype('float32')
   df4.dtypes

   casted = df4[df4>0]
   casted
   casted.dtypes

New features
~~~~~~~~~~~~

**Enhancements**

**Bug Fixes**

See the `full release notes
<https://github.com/pydata/pandas/blob/master/RELEASE.rst>`__ or issue tracker
on GitHub for a complete list.

2 changes: 2 additions & 0 deletions doc/source/whatsnew.rst
@@ -16,6 +16,8 @@ What's New

These are new features and improvements of note in each release.

.. include:: v0.10.2.txt

.. include:: v0.10.1.txt

.. include:: v0.10.0.txt