
Commit b80d69b

Merge branch 'master' of https://github.com/pydata/pandas into fix-column-dtype-mixing
Conflicts: doc/source/whatsnew/v0.17.0.txt
2 parents 6e3ddf1 + c74820e commit b80d69b

17 files changed, +476 -35 lines changed

doc/source/api.rst

+2
Original file line number | Diff line number | Diff line change
@@ -904,6 +904,8 @@ Reshaping, sorting, transposing
904904
DataFrame.sort
905905
DataFrame.sort_index
906906
DataFrame.sortlevel
907+
DataFrame.nlargest
908+
DataFrame.nsmallest
907909
DataFrame.swaplevel
908910
DataFrame.stack
909911
DataFrame.unstack

doc/source/basics.rst

+14
@@ -1497,6 +1497,20 @@ faster than sorting the entire Series and calling ``head(n)`` on the result.
14971497
s.nsmallest(3)
14981498
s.nlargest(3)
14991499
1500+
.. versionadded:: 0.17.0
1501+
1502+
``DataFrame`` also has the ``nlargest`` and ``nsmallest`` methods.
1503+
1504+
.. ipython:: python
1505+
1506+
df = DataFrame({'a': [-2, -1, 1, 10, 8, 11, -1],
1507+
'b': list('abdceff'),
1508+
'c': [1.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0]})
1509+
df.nlargest(3, 'a')
1510+
df.nlargest(5, ['a', 'c'])
1511+
df.nsmallest(3, 'a')
1512+
df.nsmallest(5, ['a', 'c'])
1513+
15001514
15011515
.. _basics.multi-index_sorting:
15021516
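The context line of this hunk notes that `nsmallest`/`nlargest` are faster than sorting the entire Series and calling `head(n)`. A rough pure-Python analogy of why partial selection helps is sketched below. It uses the standard library's `heapq` and the `'a'` column from the example above; it is an illustration only and does not claim to mirror how pandas implements these methods.

```python
import heapq

# Values of column 'a' from the documentation example above.
values = [-2, -1, 1, 10, 8, 11, -1]

# Full sort, then slice: orders every element even though only 3 are needed.
top3_by_sort = sorted(values, reverse=True)[:3]

# Partial selection: tracks only n candidates while scanning the input once.
top3_by_heap = heapq.nlargest(3, values)
bottom3_by_heap = heapq.nsmallest(3, values)

print(top3_by_sort)
print(top3_by_heap)
print(bottom3_by_heap)
```

Both approaches return the same elements; the difference is only in how much ordering work is done to find them.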

doc/source/install.rst

+3-3
@@ -267,11 +267,11 @@ Optional Dependencies
267267
installation.
268268
* Google's `python-gflags <http://code.google.com/p/python-gflags/>`__
269269
and `google-api-python-client <http://github.com/google/google-api-python-client>`__
270-
* Needed for :mod:`~pandas.io.gbq`
270+
* Needed for :mod:`~pandas.io.gbq`
271271
* `setuptools <https://pypi.python.org/pypi/setuptools/>`__
272-
* Needed for :mod:`~pandas.io.gbq` (specifically, it utilizes `pkg_resources`)
272+
* Needed for :mod:`~pandas.io.gbq` (specifically, it utilizes `pkg_resources`)
273273
* `httplib2 <http://pypi.python.org/pypi/httplib2>`__
274-
* Needed for :mod:`~pandas.io.gbq`
274+
* Needed for :mod:`~pandas.io.gbq`
275275
* One of the following combinations of libraries is needed to use the
276276
top-level :func:`~pandas.io.html.read_html` function:
277277

doc/source/io.rst

+1-1
@@ -3610,7 +3610,7 @@ below and the SQLAlchemy `documentation <http://docs.sqlalchemy.org/en/rel_0_9/c
36103610
36113611
If you want to manage your own connections you can pass one of those instead:
36123612

3613-
.. ipython:: python
3613+
.. code-block:: python
36143614
36153615
with engine.connect() as conn, conn.begin():
36163616
data = pd.read_sql_table('data', conn)

doc/source/timeseries.rst

+3-11
@@ -208,21 +208,13 @@ Pass ``errors='coerce'`` to convert invalid data to ``NaT`` (not a time):
208208
:okexcept:
209209
210210
# this is the default, raise when unparseable
211-
to_datetime(['2009-07-31', 'asd'], errors='raise')
211+
to_datetime(['2009/07/31', 'asd'], errors='raise')
212212
213213
# return the original input when unparseable
214-
to_datetime(['2009-07-31', 'asd'], errors='ignore')
214+
to_datetime(['2009/07/31', 'asd'], errors='ignore')
215215
216216
# return NaT for input when unparseable
217-
to_datetime(['2009-07-31', 'asd'], errors='coerce')
218-
219-
220-
Take care, ``to_datetime`` may not act as you expect on mixed data:
221-
222-
.. ipython:: python
223-
:okexcept:
224-
225-
to_datetime([1, '1'])
217+
to_datetime(['2009/07/31', 'asd'], errors='coerce')
226218
227219
Epoch Timestamps
228220
~~~~~~~~~~~~~~~~
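The three `errors` modes documented in this hunk (`'raise'`, `'ignore'`, `'coerce'`) can be sketched without pandas. The helper `parse_dates` below is hypothetical and uses only the standard library; `None` stands in for `NaT`, and the fixed `fmt` is an assumption made so the example stays self-contained.

```python
from datetime import datetime


def parse_dates(values, errors="raise", fmt="%Y/%m/%d"):
    """Hypothetical helper imitating the three ``errors`` modes."""
    if errors not in ("raise", "ignore", "coerce"):
        raise ValueError("errors must be 'raise', 'ignore' or 'coerce'")
    parsed = []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, fmt))
        except ValueError:
            if errors == "raise":
                raise  # default: fail on the first unparseable value
            if errors == "ignore":
                return list(values)  # hand back the original input
            parsed.append(None)  # 'coerce': None plays the role of NaT
    return parsed


data = ['2009/07/31', 'asd']
print(parse_dates(data, errors='coerce'))
print(parse_dates(data, errors='ignore'))
```

With `errors='raise'` the same input raises `ValueError`, which is why the documentation block is marked `:okexcept:`.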

doc/source/visualization.rst

+7
@@ -1649,6 +1649,7 @@ values, the resulting grid has two columns and two rows. A histogram is
16491649
displayed for each cell of the grid.
16501650

16511651
.. ipython:: python
1652+
:okwarning:
16521653
16531654
plt.figure()
16541655
@@ -1680,6 +1681,7 @@ Example below is the same as previous except the plot is set to kernel density
16801681
estimation. A ``seaborn`` example is included beneath.
16811682

16821683
.. ipython:: python
1684+
:okwarning:
16831685
16841686
plt.figure()
16851687
@@ -1706,6 +1708,7 @@ The plot below shows that it is possible to have two or more plots for the same
17061708
data displayed on the same Trellis grid cell.
17071709

17081710
.. ipython:: python
1711+
:okwarning:
17091712
17101713
plt.figure()
17111714
@@ -1745,6 +1748,7 @@ Below is a similar plot but with 2D kernel density estimation plot superimposed,
17451748
followed by a ``seaborn`` equivalent:
17461749

17471750
.. ipython:: python
1751+
:okwarning:
17481752
17491753
plt.figure()
17501754
@@ -1774,6 +1778,7 @@ only uses 'sex' attribute. If the second grouping attribute is not specified,
17741778
the plots will be arranged in a column.
17751779

17761780
.. ipython:: python
1781+
:okwarning:
17771782
17781783
plt.figure()
17791784
@@ -1792,6 +1797,7 @@ the plots will be arranged in a column.
17921797
If the first grouping attribute is not specified the plots will be arranged in a row.
17931798

17941799
.. ipython:: python
1800+
:okwarning:
17951801
17961802
plt.figure()
17971803
@@ -1816,6 +1822,7 @@ scale objects to specify these mappings. The list of scale classes is
18161822
given below with initialization arguments for quick reference.
18171823

18181824
.. ipython:: python
1825+
:okwarning:
18191826
18201827
plt.figure()
18211828

doc/source/whatsnew/v0.17.0.txt

+17-10
@@ -13,13 +13,13 @@ users upgrade to this version.
1313

1414
Highlights include:
1515

16-
- Release the Global Interpreter Lock (GIL) on some cython operations, see :ref:`here <whatsnew_0170.gil>`
17-
- The default for ``to_datetime`` will now be to ``raise`` when presented with unparseable formats,
18-
previously this would return the original input, see :ref:`here <whatsnew_0170.api_breaking.to_datetime>`
19-
- The default for ``dropna`` in ``HDFStore`` has changed to ``False``, to store by default all rows even
20-
if they are all ``NaN``, see :ref:`here <whatsnew_0170.api_breaking.hdf_dropna>`
21-
- Support for ``Series.dt.strftime`` to generate formatted strings for datetime-likes, see :ref:`here <whatsnew_0170.strftime>`
22-
- Development installed versions of pandas will now have ``PEP440`` compliant version strings (:issue:`9518`)
16+
- Release the Global Interpreter Lock (GIL) on some cython operations, see :ref:`here <whatsnew_0170.gil>`
17+
- The default for ``to_datetime`` will now be to ``raise`` when presented with unparseable formats,
18+
previously this would return the original input, see :ref:`here <whatsnew_0170.api_breaking.to_datetime>`
19+
- The default for ``dropna`` in ``HDFStore`` has changed to ``False``, to store by default all rows even
20+
if they are all ``NaN``, see :ref:`here <whatsnew_0170.api_breaking.hdf_dropna>`
21+
- Support for ``Series.dt.strftime`` to generate formatted strings for datetime-likes, see :ref:`here <whatsnew_0170.strftime>`
22+
- Development installed versions of pandas will now have ``PEP440`` compliant version strings (:issue:`9518`)
2323

2424
Check the :ref:`API Changes <whatsnew_0170.api>` and :ref:`deprecations <whatsnew_0170.deprecations>` before updating.
2525

@@ -32,6 +32,7 @@ Check the :ref:`API Changes <whatsnew_0170.api>` and :ref:`deprecations <whatsne
3232
New features
3333
~~~~~~~~~~~~
3434

35+
- ``DataFrame`` has the ``nlargest`` and ``nsmallest`` methods (:issue:`10393`)
3536
- SQL io functions now accept a SQLAlchemy connectable. (:issue:`7877`)
3637
- Enable writing complex values to HDF stores when using table format (:issue:`10447`)
3738
- Enable reading gzip compressed files via URL, either by explicitly setting the compression parameter or by inferring from the presence of the HTTP Content-Encoding header in the response (:issue:`8685`)
@@ -448,6 +449,7 @@ from ``7``.
448449

449450
.. ipython:: python
450451
:suppress:
452+
451453
pd.set_option('display.precision', 6)
452454

453455

@@ -457,6 +459,8 @@ Other API Changes
457459
^^^^^^^^^^^^^^^^^
458460

459461
- Line and kde plot with ``subplots=True`` now uses default colors, not all black. Specify ``color='k'`` to draw all lines in black (:issue:`9894`)
462+
- Calling the ``.value_counts`` method on a Series with ``categorical`` dtype now returns a
463+
Series with a ``CategoricalIndex`` (:issue:`10704`)
460464
- Enable writing Excel files in :ref:`memory <_io.excel_writing_buffer>` using StringIO/BytesIO (:issue:`7074`)
461465
- Enable serialization of lists and dicts to strings in ExcelWriter (:issue:`8188`)
462466
- Allow passing `kwargs` to the interpolation methods (:issue:`10378`).
@@ -479,9 +483,9 @@ Other API Changes
479483
- ``groupby`` using ``Categorical`` follows the same rule as ``Categorical.unique`` described above (:issue:`10508`)
480484
- ``NaT``'s methods now either raise ``ValueError``, or return ``np.nan`` or ``NaT`` (:issue:`9513`)
481485

482-
=============================== ==============================================================
486+
=============================== ===============================================================
483487
Behavior Methods
484-
=============================== ==============================================================
488+
=============================== ===============================================================
485489
``return np.nan`` ``weekday``, ``isoweekday``
486490
``return NaT`` ``date``, ``now``, ``replace``, ``to_datetime``, ``today``
487491
``return np.datetime64('NaT')`` ``to_datetime64`` (unchanged)
@@ -544,6 +548,8 @@ Performance Improvements
544548
Bug Fixes
545549
~~~~~~~~~
546550

551+
552+
- Bug in ``DataFrame.to_html(index=False)`` rendering an unnecessary ``name`` row (:issue:`10344`)
547553
- Bug in ``DataFrame.apply`` when function returns categorical series. (:issue:`9573`)
548554
- Bug in ``to_datetime`` with invalid dates and formats supplied (:issue:`10154`)
549555
- Bug in ``Index.drop_duplicates`` dropping name(s) (:issue:`10115`)
@@ -606,4 +612,5 @@ Bug Fixes
606612
- Bug in vectorised setting of timestamp columns with python ``datetime.date`` and numpy ``datetime64`` (:issue:`10408`, :issue:`10412`)
607613

608614
- Bug in ``pd.DataFrame`` when constructing an empty DataFrame with a string dtype (:issue:`9428`)
609-
- Bug in ``read_stata`` when reading a file with a different order set in ``columns`` (:issue:`10739`)
615+
- Bug in ``read_stata`` when reading a file with a different order set in ``columns`` (:issue:`10739`)
616+
- Bug in ``pd.unique`` for arrays with the ``datetime64`` or ``timedelta64`` dtype that meant an array with object dtype was returned instead of the original dtype (:issue:`9431`)

pandas/core/algorithms.py

+8-2
@@ -36,7 +36,7 @@ def match(to_match, values, na_sentinel=-1):
3636
values = np.array(values, dtype='O')
3737

3838
f = lambda htype, caster: _match_generic(to_match, values, htype, caster)
39-
result = _hashtable_algo(f, values.dtype)
39+
result = _hashtable_algo(f, values.dtype, np.int64)
4040

4141
if na_sentinel != -1:
4242

@@ -66,14 +66,20 @@ def unique(values):
6666
return _hashtable_algo(f, values.dtype)
6767

6868

69-
def _hashtable_algo(f, dtype):
69+
def _hashtable_algo(f, dtype, return_dtype=None):
7070
"""
7171
f(HashTable, type_caster) -> result
7272
"""
7373
if com.is_float_dtype(dtype):
7474
return f(htable.Float64HashTable, com._ensure_float64)
7575
elif com.is_integer_dtype(dtype):
7676
return f(htable.Int64HashTable, com._ensure_int64)
77+
elif com.is_datetime64_dtype(dtype):
78+
return_dtype = return_dtype or 'M8[ns]'
79+
return f(htable.Int64HashTable, com._ensure_int64).view(return_dtype)
80+
elif com.is_timedelta64_dtype(dtype):
81+
return_dtype = return_dtype or 'm8[ns]'
82+
return f(htable.Int64HashTable, com._ensure_int64).view(return_dtype)
7783
else:
7884
return f(htable.PyObjectHashTable, com._ensure_object)
7985
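The new `datetime64`/`timedelta64` branches reuse the integer hash table and then view the result back as the original dtype. A loose standard-library analogy of that round trip is sketched below: timestamps are encoded as integers, de-duplicated through a hash-based container, and decoded again. The helper names (`to_int`, `from_int`, `unique_datetimes`) are hypothetical, and microsecond resolution is used here because that is what `datetime` supports, whereas the patch works with nanosecond `M8[ns]` values.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
ONE_US = timedelta(microseconds=1)


def to_int(ts):
    """Encode a datetime as integer microseconds since the epoch."""
    return (ts - EPOCH) // ONE_US


def from_int(n):
    """Decode integer microseconds back into a datetime."""
    return EPOCH + timedelta(microseconds=n)


def unique_datetimes(values):
    keys = [to_int(v) for v in values]
    seen = dict.fromkeys(keys)  # hash-based, keeps first-seen order
    # Convert back so the caller gets datetimes, not raw integers.
    return [from_int(k) for k in seen]


stamps = [datetime(2015, 1, 1), datetime(2015, 1, 2), datetime(2015, 1, 1)]
print(unique_datetimes(stamps))
```

The final conversion step is the point of the fix: without it, the caller receives values in a different type from the one passed in.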

pandas/core/categorical.py

+4-1
@@ -1027,6 +1027,7 @@ def value_counts(self, dropna=True):
10271027
"""
10281028
import pandas.hashtable as htable
10291029
from pandas.core.series import Series
1030+
from pandas.core.index import CategoricalIndex
10301031

10311032
cat = self.dropna() if dropna else self
10321033
keys, counts = htable.value_count_int64(com._ensure_int64(cat._codes))
@@ -1036,10 +1037,12 @@ def value_counts(self, dropna=True):
10361037
if not dropna and -1 in keys:
10371038
ix = np.append(ix, -1)
10381039
result = result.reindex(ix, fill_value=0)
1039-
result.index = (np.append(cat.categories, np.nan)
1040+
index = (np.append(cat.categories, np.nan)
10401041
if not dropna and -1 in keys
10411042
else cat.categories)
10421043

1044+
result.index = CategoricalIndex(index, self.categories, self.ordered)
1045+
10431046
return result
10441047

10451048
def get_values(self):
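The visible part of `value_counts` reindexes the counts over every category with `fill_value=0`, so categories that never occur still appear in the result. A minimal pure-Python sketch of that behaviour follows; `value_counts_by_category` is a hypothetical name, and the sketch covers only the zero-filling step shown in the hunk, not the `CategoricalIndex` construction or any lines of the method that the diff does not display.

```python
from collections import Counter


def value_counts_by_category(values, categories):
    """Count values, reporting every declared category in category order."""
    counts = Counter(values)
    # Unseen categories get an explicit zero instead of being dropped.
    return [(cat, counts.get(cat, 0)) for cat in categories]


print(value_counts_by_category(list('aabc'), ['a', 'b', 'c', 'd']))
```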

pandas/core/format.py

+1-1
@@ -1037,7 +1037,7 @@ def _column_header():
10371037
self.write_tr(col_row, indent, self.indent_delta, header=True,
10381038
align=align)
10391039

1040-
if self.fmt.has_index_names:
1040+
if self.fmt.has_index_names and self.fmt.index:
10411041
row = [
10421042
x if x is not None else '' for x in self.frame.index.names
10431043
] + [''] * min(len(self.columns), self.max_cols)

pandas/core/frame.py

+73
@@ -3127,6 +3127,79 @@ def sortlevel(self, level=0, axis=0, ascending=True,
31273127
else:
31283128
return self._constructor(new_data).__finalize__(self)
31293129

3130+
def _nsorted(self, columns, n, method, take_last):
3131+
if not com.is_list_like(columns):
3132+
columns = [columns]
3133+
columns = list(columns)
3134+
ser = getattr(self[columns[0]], method)(n, take_last=take_last)
3135+
ascending = dict(nlargest=False, nsmallest=True)[method]
3136+
return self.loc[ser.index].sort(columns, ascending=ascending,
3137+
kind='mergesort')
3138+
3139+
def nlargest(self, n, columns, take_last=False):
3140+
"""Get the rows of a DataFrame sorted by the `n` largest
3141+
values of `columns`.
3142+
3143+
.. versionadded:: 0.17.0
3144+
3145+
Parameters
3146+
----------
3147+
n : int
3148+
Number of items to retrieve
3149+
columns : list or str
3150+
Column name or names to order by
3151+
take_last : bool, optional
3152+
Where there are duplicate values, take the last duplicate
3153+
3154+
Returns
3155+
-------
3156+
DataFrame
3157+
3158+
Examples
3159+
--------
3160+
>>> df = DataFrame({'a': [1, 10, 8, 11, -1],
3161+
... 'b': list('abdce'),
3162+
... 'c': [1.0, 2.0, np.nan, 3.0, 4.0]})
3163+
>>> df.nlargest(3, 'a')
3164+
a b c
3165+
3 11 c 3
3166+
1 10 b 2
3167+
2 8 d NaN
3168+
"""
3169+
return self._nsorted(columns, n, 'nlargest', take_last)
3170+
3171+
def nsmallest(self, n, columns, take_last=False):
3172+
"""Get the rows of a DataFrame sorted by the `n` smallest
3173+
values of `columns`.
3174+
3175+
.. versionadded:: 0.17.0
3176+
3177+
Parameters
3178+
----------
3179+
n : int
3180+
Number of items to retrieve
3181+
columns : list or str
3182+
Column name or names to order by
3183+
take_last : bool, optional
3184+
Where there are duplicate values, take the last duplicate
3185+
3186+
Returns
3187+
-------
3188+
DataFrame
3189+
3190+
Examples
3191+
--------
3192+
>>> df = DataFrame({'a': [1, 10, 8, 11, -1],
3193+
... 'b': list('abdce'),
3194+
... 'c': [1.0, 2.0, np.nan, 3.0, 4.0]})
3195+
>>> df.nsmallest(3, 'a')
3196+
a b c
3197+
4 -1 e 4
3198+
0 1 a 1
3199+
2 8 d NaN
3200+
"""
3201+
return self._nsorted(columns, n, 'nsmallest', take_last)
3202+
31303203
def swaplevel(self, i, j, axis=0):
31313204
"""
31323205
Swap levels i and j in a MultiIndex on a particular axis
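As written in this hunk, `_nsorted` works in two steps: it selects `n` rows using only the first entry of `columns`, then sorts the selected rows by all requested columns. The sketch below reproduces that shape on a list of dicts using the standard library. `nsorted_rows` is a hypothetical stand-in, not pandas code, and it omits `take_last` and any `NaN` handling.

```python
import heapq


def nsorted_rows(rows, columns, n, largest=True):
    """Hypothetical stand-in for the two steps in ``_nsorted``."""
    if isinstance(columns, str):
        columns = [columns]
    first = columns[0]
    pick = heapq.nlargest if largest else heapq.nsmallest
    # Step 1: choose n rows by looking at the first column only.
    selected = pick(n, rows, key=lambda row: row[first])
    # Step 2: order the chosen rows by every requested column.
    return sorted(selected,
                  key=lambda row: tuple(row[c] for c in columns),
                  reverse=largest)


# Columns 'a' and 'b' from the docstring example above.
rows = [{'a': a, 'b': b} for a, b in zip([1, 10, 8, 11, -1], 'abdce')]
print(nsorted_rows(rows, 'a', 3))
print(nsorted_rows(rows, ['a', 'b'], 3, largest=False))
```

One consequence of this design is that the later columns only influence the order of the returned rows, not which rows are selected.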

pandas/core/strings.py

+1-1
@@ -193,7 +193,7 @@ def str_contains(arr, pat, case=True, flags=0, na=np.nan, regex=True):
193193
194194
See Also
195195
--------
196-
match : analagous, but stricter, relying on re.match instead of re.search
196+
match : analogous, but stricter, relying on re.match instead of re.search
197197
198198
"""
199199
if regex:

pandas/io/tests/test_packers.py

+17-1
@@ -532,14 +532,30 @@ class TestMsgpack():
532532
http://stackoverflow.com/questions/6689537/nose-test-generators-inside-class
533533
"""
534534
def setUp(self):
535-
from pandas.io.tests.generate_legacy_storage_files import create_msgpack_data
535+
from pandas.io.tests.generate_legacy_storage_files import (
536+
create_msgpack_data, create_data)
536537
self.data = create_msgpack_data()
538+
self.all_data = create_data()
537539
self.path = u('__%s__.msgpack' % tm.rands(10))
540+
self.minimum_structure = {'series': ['float', 'int', 'mixed', 'ts', 'mi', 'dup'],
541+
'frame': ['float', 'int', 'mixed', 'mi'],
542+
'panel': ['float'],
543+
'index': ['int', 'date', 'period'],
544+
'mi': ['reg2']}
545+
546+
def check_min_structure(self, data):
547+
for typ, v in self.minimum_structure.items():
548+
assert typ in data, '"{0}" not found in unpacked data'.format(typ)
549+
for kind in v:
550+
assert kind in data[typ], '"{0}" not found in data["{1}"]'.format(kind, typ)
538551

539552
def compare(self, vf):
540553
data = read_msgpack(vf)
554+
self.check_min_structure(data)
541555
for typ, dv in data.items():
556+
assert typ in self.all_data, 'unpacked data contains extra key "{0}"'.format(typ)
542557
for dt, result in dv.items():
558+
assert dt in self.all_data[typ], 'data["{0}"] contains extra key "{1}"'.format(typ, dt)
543559
try:
544560
expected = self.data[typ][dt]
545561
except KeyError:
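The new `check_min_structure` helper asserts that unpacked data contains at least the keys listed in `minimum_structure`. A standalone sketch of the same check is shown below, with the structure copied from the hunk. The `complete` and `incomplete` dictionaries are made-up inputs for illustration; real test data comes from `create_data()` in the pandas test suite.

```python
MINIMUM_STRUCTURE = {
    'series': ['float', 'int', 'mixed', 'ts', 'mi', 'dup'],
    'frame': ['float', 'int', 'mixed', 'mi'],
    'panel': ['float'],
    'index': ['int', 'date', 'period'],
    'mi': ['reg2'],
}


def check_min_structure(data, minimum=MINIMUM_STRUCTURE):
    """Fail if any required top-level type or nested kind is missing."""
    for typ, kinds in minimum.items():
        assert typ in data, '"{0}" not found in unpacked data'.format(typ)
        for kind in kinds:
            assert kind in data[typ], \
                '"{0}" not found in data["{1}"]'.format(kind, typ)


# Made-up payload that satisfies the minimum structure.
complete = {typ: {kind: object() for kind in kinds}
            for typ, kinds in MINIMUM_STRUCTURE.items()}
check_min_structure(complete)

# The same payload with the panel entries removed fails the check.
incomplete = dict(complete, panel={})
try:
    check_min_structure(incomplete)
except AssertionError as exc:
    print(exc)
```

Checking in both directions, as `compare` now does (minimum keys present, no unexpected extra keys), guards against a test that passes trivially because the unpacked file is empty.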
