doc/source/v0.14.0.txt

.. _whatsnew_0140:

v0.14.0 (May ? , 2014)
----------------------

This is a major release from 0.13.1 and includes a small number of API changes, several new features,
enhancements, and performance improvements along with a large number of bug fixes. We recommend that all
users upgrade to this version.

Highlights include:

- MultIndexing Using Slicers
- Joining a singly-indexed DataFrame with a multi-indexed DataFrame
- More flexible groupby specifications

API changes
~~~~~~~~~~~

- ``read_excel`` uses 0 as the default sheet (:issue:`6573`)
- ``iloc`` will now accept out-of-bounds indexers for slices, e.g. a value that exceeds the length of the object being
  indexed. These will be excluded. This will make pandas conform more with pandas/numpy indexing of out-of-bounds
  values. A single indexer / list of indexers that is out-of-bounds will still raise
  ``IndexError`` (:issue:`6296`, :issue:`6299`). This could result in an empty axis (e.g. an empty DataFrame being returned)

.. ipython:: python

   dfl = DataFrame(np.random.randn(5,2),columns=list('AB'))
   dfl
   dfl.iloc[:,2:3]
   dfl.iloc[:,1:3]
   dfl.iloc[4:6]

These are out-of-bounds selections

.. code-block:: python

   dfl.iloc[[4,5,6]]
   IndexError: positional indexers are out-of-bounds

   dfl.iloc[:,4]
   IndexError: single positional indexer is out-of-bounds


- The ``DataFrame.interpolate()`` ``downcast`` keyword default has been changed from ``infer`` to
  ``None``. This is to preseve the original dtype unless explicitly requested otherwise (:issue:`6290`).
- allow a Series to utilize index methods depending on its index type, e.g. ``Series.year`` is now defined
  for a Series with a ``DatetimeIndex`` or a ``PeriodIndex``; trying this on a non-supported Index type will
  now raise a ``TypeError``. (:issue:`4551`, :issue:`4056`, :issue:`5519`)

  The following affected:

  - ``date,time,year,month,day``
  - ``hour,minute,second,weekofyear``
  - ``week,dayofweek,dayofyear,quarter``
  - ``microsecond,nanosecond,qyear``
  - ``min(),max()``
  - ``pd.infer_freq()``

  .. ipython:: python

     s = Series(np.random.randn(5),index=tm.makeDateIndex(5))
     s
     s.year
     s.index.year

- More consistent behaviour for some groupby methods:
   - groupby ``head`` and ``tail`` now act more like ``filter`` rather than an aggregation:

  .. ipython:: python

     df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
     g = df.groupby('A')
     g.head(1)  # filters DataFrame

     g.apply(lambda x: x.head(1))  # used to simply fall-through

   - groupby head and tail respect column selection:

  .. ipython:: python

     g[['B']].head(1)

   - groupby ``nth`` now filters by default, with optional dropna argument to ignore
     NaN (to replicate the previous behaviour.), See :ref:`the docs <groupby.nth>`.

  .. ipython:: python

     DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=['A', 'B'])
     g = df.groupby('A')
     g.nth(0)  # can also use negative ints

     g.nth(0, dropna='any')  # similar to old behaviour

- Allow specification of a more complex groupby via ``pd.Groupby``, such as grouping
  by a Time and a string field simultaneously. See :ref:`the docs <groupby.specify>`. (:issue:`3794`)

- Local variable usage has changed in
  :func:`pandas.eval`/:meth:`DataFrame.eval`/:meth:`DataFrame.query`
  (:issue:`5987`). For the :class:`~pandas.DataFrame` methods, two things have
  changed

   - Column names are now given precedence over locals
   - Local variables must be referred to explicitly. This means that even if
     you have a local variable that is *not* a column you must still refer to
     it with the ``'@'`` prefix.
   - You can have an expression like ``df.query('@a < a')`` with no complaints
     from ``pandas`` about ambiguity of the name ``a``.

- The top-level :func:`pandas.eval` function does not allow you use the
  ``'@'`` prefix and provides you with an error message telling you so.
- ``NameResolutionError`` was removed because it isn't necessary anymore.
- ``concat`` will now concatenate mixed Series and DataFrames using the Series name
  or numbering columns as needed (:issue:`2385`). See :ref:`the docs <merging.mixed_ndims>`
- Slicing and advanced/boolean indexing operations on ``Index`` classes will no
  longer change type of the resulting index (:issue:`6440`)

  .. ipython:: python

     i = pd.Index([1, 2, 3, 'a' , 'b', 'c'])
     i[[0,1,2]]

  Previously, the above operation would return ``Int64Index``.  If you'd like
  to do this manually, use :meth:`Index.astype`

  .. ipython:: python

     i[[0,1,2]].astype(np.int_)

- ``set_index`` no longer converts MultiIndexes to an Index of tuples. For example,
  the old behavior returned an Index in this case (:issue:`6459`):

  .. ipython:: python
     :suppress:

     from itertools import product
     tuples = list(product(('a', 'b'), ('c', 'd')))
     mi = MultiIndex.from_tuples(tuples)
     df_multi = DataFrame(np.random.randn(4, 2), index=mi)
     tuple_ind = pd.Index(tuples)

  .. ipython:: python

     df_multi.index

     @suppress
     df_multi.index = tuple_ind

     # Old behavior, casted MultiIndex to an Index
     df_multi.set_index(df_multi.index)

     @suppress
     df_multi.index = mi

     # New behavior
     df_multi.set_index(df_multi.index)

  This also applies when passing multiple indices to ``set_index``:

  .. ipython:: python

    @suppress
    df_multi.index = tuple_ind

    # Old output, 2-level MultiIndex of tuples
    df_multi.set_index([df_multi.index, df_multi.index])

    @suppress
    df_multi.index = mi

    # New output, 4-level MultiIndex
    df_multi.set_index([df_multi.index, df_multi.index])


MultiIndexing Using Slicers
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In 0.14.0 we added a new way to slice multi-indexed objects.
You can slice a multi-index by providing multiple indexers.

You can provide any of the selectors as if you are indexing by label, see :ref:`Selection by Label <indexing.label>`,
including slices, lists of labels, labels, and boolean indexers.

You can use ``slice(None)`` to select all the contents of *that* level. You do not need to specify all the
*deeper* levels, they will be implied as ``slice(None)``.

As usual, **both sides** of the slicers are included as this is label indexing.

See :ref:`the docs<indexing.mi_slicers>`
See also issues (:issue:`6134`, :issue:`4036`, :issue:`3057`, :issue:`2598`, :issue:`5641`)

.. warning::

   You should specify all axes in the ``.loc`` specifier, meaning the indexer for the **index** and
   for the **columns**. Their are some ambiguous cases where the passed indexer could be mis-interpreted
   as indexing *both* axes, rather than into say the MuliIndex for the rows.

   You should do this:

   .. code-block:: python

      df.loc[(slice('A1','A3'),.....),:]

   rather than this:

   .. code-block:: python

      df.loc[(slice('A1','A3'),.....)]

.. warning::

   You will need to make sure that the selection axes are fully lexsorted!

.. ipython:: python

   def mklbl(prefix,n):
       return ["%s%s" % (prefix,i)  for i in range(n)]

   index = MultiIndex.from_product([mklbl('A',4),
                                    mklbl('B',2),
                                    mklbl('C',4),
                                    mklbl('D',2)])
   columns = MultiIndex.from_tuples([('a','foo'),('a','bar'),
                                     ('b','foo'),('b','bah')],
                                      names=['lvl0', 'lvl1'])
   df = DataFrame(np.arange(len(index)*len(columns)).reshape((len(index),len(columns))),
                  index=index,
                  columns=columns).sortlevel().sortlevel(axis=1)
   df

Basic multi-index slicing using slices, lists, and labels.

.. ipython:: python

   df.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:]

You can use a ``pd.IndexSlice`` to shortcut the creation of these slices

.. ipython:: python

   idx = pd.IndexSlice
   df.loc[idx[:,:,['C1','C3']],idx[:,'foo']]

It is possible to perform quite complicated selections using this method on multiple
axes at the same time.

.. ipython:: python

   df.loc['A1',(slice(None),'foo')]
   df.loc[idx[:,:,['C1','C3']],idx[:,'foo']]

Using a boolean indexer you can provide selection related to the *values*.

.. ipython:: python

   mask = df[('a','foo')]>200
   df.loc[idx[mask,:,['C1','C3']],idx[:,'foo']]

You can also specify the ``axis`` argument to ``.loc`` to interpret the passed
slicers on a single axis.

.. ipython:: python

   df.loc(axis=0)[:,:,['C1','C3']]

Furthermore you can *set* the values using these methods

.. ipython:: python

   df2 = df.copy()
   df2.loc(axis=0)[:,:,['C1','C3']] = -10
   df2

You can use a right-hand-side of an alignable object as well.

.. ipython:: python

   df2 = df.copy()
   df2.loc[idx[:,:,['C1','C3']],:] = df2*1000
   df2

Prior Version Deprecations/Changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are no announced changes in 0.13.1 or prior that are taking effect as of 0.14.0

Deprecations
~~~~~~~~~~~~

There are no deprecations of prior behavior in 0.14.0

Enhancements
~~~~~~~~~~~~

- ``DataFrame.to_latex`` now takes a longtable keyword, which if True will return a table in a longtable environment. (:issue:`6617`)
- pd.read_clipboard will, if 'sep' is unspecified, try to detect data copied from a spreadsheet
  and parse accordingly. (:issue:`6223`)
- ``plot(legend='reverse')`` will now reverse the order of legend labels for
  most plot kinds. (:issue:`6014`)
- improve performance of slice indexing on Series with string keys (:issue:`6341`, :issue:`6372`)
- Hexagonal bin plots from ``DataFrame.plot`` with ``kind='hexbin'`` (:issue:`5478`)
- Joining a singly-indexed DataFrame with a multi-indexed DataFrame (:issue:`3662`)

  See :ref:`the docs<merging.join_on_mi>`. Joining multi-index DataFrames on both the left and right is not yet supported ATM.

  .. ipython:: python

     household = DataFrame(dict(household_id = [1,2,3],
                                male = [0,1,0],
                                wealth = [196087.3,316478.7,294750]),
                           columns = ['household_id','male','wealth']
                          ).set_index('household_id')
     household
     portfolio = DataFrame(dict(household_id = [1,2,2,3,3,3,4],
                                asset_id = ["nl0000301109","nl0000289783","gb00b03mlx29",
                                            "gb00b03mlx29","lu0197800237","nl0000289965",np.nan],
                                name = ["ABN Amro","Robeco","Royal Dutch Shell","Royal Dutch Shell",
                                        "AAB Eastern Europe Equity Fund","Postbank BioTech Fonds",np.nan],
                                share = [1.0,0.4,0.6,0.15,0.6,0.25,1.0]),
                           columns = ['household_id','asset_id','name','share']
                          ).set_index(['household_id','asset_id'])
     portfolio

     household.join(portfolio, how='inner')

- ``quotechar``, ``doublequote``, and ``escapechar`` can now be specified when
  using ``DataFrame.to_csv`` (:issue:`5414`, :issue:`4528`)
- Added a ``to_julian_date`` function to ``TimeStamp`` and ``DatetimeIndex``
  to convert to the Julian Date used primarily in astronomy. (:issue:`4041`)
- ``DataFrame.to_stata`` will now check data for compatibility with Stata data types
  and will upcast when needed.  When it isn't possibly to losslessly upcast, a warning
  is raised (:issue:`6327`)
- ``DataFrame.to_stata`` and ``StataWriter`` will accept keyword arguments time_stamp
  and data_label which allow the time stamp and dataset label to be set when creating a
  file. (:issue:`6545`)

Performance
~~~~~~~~~~~

- perf improvements in DataFrame construction with certain offsets, by removing faulty caching
  (e.g. MonthEnd,BusinessMonthEnd), (:issue:`6479`)

Experimental
~~~~~~~~~~~~

There are no experimental changes in 0.14.0

Bug Fixes
~~~~~~~~~

See :ref:`V0.14.0 Bug Fixes<release.bug_fixes-0.14.0>` for an extensive list of bugs that have been fixed in 0.14.0.

See the :ref:`full release notes
<release>` or issue tracker
on GitHub for a complete list of all API changes, Enhancements and Bug Fixes.