Commit a2ca9c4

BUG: fix bug that caused DataFrame.replace to not respect the replacer's dtype (#26632)

Parents: deea374, ad4c4d5


41 files changed: +1687 / -1443 lines

.travis.yml

+3-5
@@ -30,11 +30,9 @@ matrix:
     - python: 3.5

   include:
-    - dist: bionic
-      # 18.04
-      python: 3.8.0
+    - dist: trusty
       env:
-        - JOB="3.8-dev" PATTERN="(not slow and not network)"
+        - JOB="3.8" ENV_FILE="ci/deps/travis-38.yaml" PATTERN="(not slow and not network)"

     - dist: trusty
       env:
@@ -88,7 +86,7 @@ install:
 script:
   - echo "script start"
   - echo "$JOB"
-  - if [ "$JOB" != "3.8-dev" ]; then source activate pandas-dev; fi
+  - source activate pandas-dev
   - ci/run_tests.sh

 after_script:

ci/azure/windows.yml

+15-19
@@ -11,49 +11,45 @@ jobs:
   py36_np15:
     ENV_FILE: ci/deps/azure-windows-36.yaml
     CONDA_PY: "36"
+    PATTERN: "not slow and not network"

   py37_np141:
     ENV_FILE: ci/deps/azure-windows-37.yaml
     CONDA_PY: "37"
+    PATTERN: "not slow and not network"

 steps:
   - powershell: |
       Write-Host "##vso[task.prependpath]$env:CONDA\Scripts"
       Write-Host "##vso[task.prependpath]$HOME/miniconda3/bin"
     displayName: 'Add conda to PATH'
   - script: conda update -q -n base conda
-    displayName: Update conda
-  - script: |
-      call activate
+    displayName: 'Update conda'
+  - bash: |
       conda env create -q --file ci\\deps\\azure-windows-$(CONDA_PY).yaml
     displayName: 'Create anaconda environment'
-  - script: |
-      call activate pandas-dev
-      call conda list
+  - bash: |
+      source activate pandas-dev
+      conda list
       ci\\incremental\\build.cmd
     displayName: 'Build'
-  - script: |
-      call activate pandas-dev
-      pytest -m "not slow and not network" --junitxml=test-data.xml pandas -n 2 -r sxX --strict --durations=10 %*
+  - bash: |
+      source activate pandas-dev
+      ci/run_tests.sh
     displayName: 'Test'
   - task: PublishTestResults@2
     inputs:
       testResultsFiles: 'test-data.xml'
       testRunTitle: 'Windows-$(CONDA_PY)'
   - powershell: |
-      $junitXml = "test-data.xml"
-      $(Get-Content $junitXml | Out-String) -match 'failures="(.*?)"'
-      if ($matches[1] -eq 0)
-      {
+      $(Get-Content "test-data.xml" | Out-String) -match 'failures="(.*?)"'
+      if ($matches[1] -eq 0) {
         Write-Host "No test failures in test-data"
-      }
-      else
-      {
-        # note that this will produce $LASTEXITCODE=1
-        Write-Error "$($matches[1]) tests failed"
+      } else {
+        Write-Error "$($matches[1]) tests failed"  # note that this will produce $LASTEXITCODE=1
       }
     displayName: 'Check for test failures'
-  - script: |
+  - bash: |
       source activate pandas-dev
       python ci/print_skipped.py
     displayName: 'Print skipped tests'

ci/build38.sh

-19
This file was deleted.

ci/deps/travis-38.yaml

+16
@@ -0,0 +1,16 @@
+name: pandas-dev
+channels:
+  - defaults
+  - conda-forge
+dependencies:
+  - python=3.8.*
+  - cython>=0.29.13
+  - numpy
+  - python-dateutil
+  - nomkl
+  - pytz
+  # universal
+  - pytest>=5.0.0
+  - pytest-xdist>=1.29.0
+  - hypothesis>=3.58.0
+  - pip

ci/setup_env.sh

-5
@@ -1,10 +1,5 @@
 #!/bin/bash -e

-if [ "$JOB" == "3.8-dev" ]; then
-    /bin/bash ci/build38.sh
-    exit 0
-fi
-
 # edit the locale file if needed
 if [ -n "$LOCALE_OVERRIDE" ]; then
     echo "Adding locale to the first line of pandas/__init__.py"

doc/source/development/extending.rst

+42
@@ -251,6 +251,48 @@ To use a test, subclass it:
 See https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/base/__init__.py
 for a list of all the tests available.

+.. _extending.extension.arrow:
+
+Compatibility with Apache Arrow
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An ``ExtensionArray`` can support conversion to / from ``pyarrow`` arrays
+(and thus support, for example, serialization to the Parquet file format)
+by implementing two methods: ``ExtensionArray.__arrow_array__`` and
+``ExtensionDtype.__from_arrow__``.
+
+The ``ExtensionArray.__arrow_array__`` method ensures that ``pyarrow`` knows how
+to convert the specific extension array into a ``pyarrow.Array`` (also when
+included as a column in a pandas DataFrame):
+
+.. code-block:: python
+
+    class MyExtensionArray(ExtensionArray):
+        ...
+
+        def __arrow_array__(self, type=None):
+            # convert the underlying array values to a pyarrow Array
+            import pyarrow
+            return pyarrow.array(..., type=type)
+
+The ``ExtensionDtype.__from_arrow__`` method then controls the conversion
+back from pyarrow to a pandas ExtensionArray. This method receives a pyarrow
+``Array`` or ``ChunkedArray`` as its only argument and is expected to return the
+appropriate pandas ``ExtensionArray`` for this dtype and the passed values:
+
+.. code-block:: none
+
+    class ExtensionDtype:
+        ...
+
+        def __from_arrow__(self, array: pyarrow.Array/ChunkedArray) -> ExtensionArray:
+            ...
+
+See more in the `Arrow documentation <https://arrow.apache.org/docs/python/extending_types.html>`__.
+
+These methods have been implemented for the nullable integer and string extension
+dtypes included in pandas, and ensure a roundtrip to pyarrow and the Parquet file format.
+
 .. _extension dtype dtypes: https://github.com/pandas-dev/pandas/blob/master/pandas/core/dtypes/dtypes.py
 .. _extension dtype source: https://github.com/pandas-dev/pandas/blob/master/pandas/core/dtypes/base.py
 .. _extension array source: https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/base.py
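The two hooks above form a symmetric protocol: ``pyarrow.array`` looks for an ``__arrow_array__`` method on the object it is given, and the dtype's ``__from_arrow__`` rebuilds the extension array on the way back. A minimal plain-Python sketch of that dispatch (not pyarrow's actual implementation; ``FakeArrowArray`` and ``to_arrow`` are hypothetical stand-ins for ``pyarrow.Array`` and ``pyarrow.array``):

```python
class FakeArrowArray:
    """Hypothetical stand-in for pyarrow.Array, used only for illustration."""
    def __init__(self, values):
        self.values = list(values)


class MyDtype:
    def __from_arrow__(self, array):
        # rebuild the pandas-side extension array from arrow data
        return MyExtensionArray(array.values)


class MyExtensionArray:
    dtype = MyDtype()

    def __init__(self, values):
        self.values = list(values)

    def __arrow_array__(self, type=None):
        # convert the underlying values into an (stand-in) arrow array
        return FakeArrowArray(self.values)


def to_arrow(obj):
    # sketch of the dispatch: prefer the object's own __arrow_array__ hook
    if hasattr(obj, "__arrow_array__"):
        return obj.__arrow_array__()
    return FakeArrowArray(obj)


arr = MyExtensionArray([1, 2, 3])
roundtripped = arr.dtype.__from_arrow__(to_arrow(arr))
```

With real pyarrow, the same shape applies: ``pyarrow.array(arr)`` triggers ``__arrow_array__``, and reading back (e.g. from Parquet) calls ``__from_arrow__`` on the registered dtype.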

doc/source/user_guide/io.rst

+3
@@ -4716,6 +4716,9 @@ Several caveats.
 * The ``pyarrow`` engine preserves the ``ordered`` flag of categorical dtypes with string types. ``fastparquet`` does not preserve the ``ordered`` flag.
 * Unsupported types include ``Period`` and actual Python object types. These will raise a helpful error message
   on an attempt at serialization.
+* The ``pyarrow`` engine preserves extension data types such as the nullable integer and string data
+  type (requiring pyarrow >= 1.0.0, and requiring the extension type to implement the needed protocols,
+  see the :ref:`extension types documentation <extending.extension.arrow>`).

 You can specify an ``engine`` to direct the serialization. This can be one of ``pyarrow``, or ``fastparquet``, or ``auto``.
 If the engine is NOT specified, then the ``pd.options.io.parquet.engine`` option is checked; if this is also ``auto``,

doc/source/user_guide/text.rst

+35-2
@@ -13,7 +13,7 @@ Text Data Types

 .. versionadded:: 1.0.0

-There are two main ways to store text data
+There are two ways to store text data in pandas:

 1. ``object`` -dtype NumPy array.
 2. :class:`StringDtype` extension type.
@@ -63,7 +63,40 @@ Or ``astype`` after the ``Series`` or ``DataFrame`` is created
   s
   s.astype("string")

-Everything that follows in the rest of this document applies equally to
+.. _text.differences:
+
+Behavior differences
+^^^^^^^^^^^^^^^^^^^^
+
+These are places where the behavior of ``StringDtype`` objects differs from
+``object`` dtype:
+
+1. For ``StringDtype``, :ref:`string accessor methods <api.series.str>`
+   that return **numeric** output will always return a nullable integer dtype,
+   rather than either int or float dtype depending on the presence of NA values.
+
+   .. ipython:: python
+
+      s = pd.Series(["a", None, "b"], dtype="string")
+      s
+      s.str.count("a")
+      s.dropna().str.count("a")
+
+   Both outputs are ``Int64`` dtype. Compare that with object-dtype:
+
+   .. ipython:: python
+
+      s.astype(object).str.count("a")
+      s.astype(object).dropna().str.count("a")
+
+   When NA values are present, the output dtype is float64.
+
+2. Some string methods, like :meth:`Series.str.decode`, are not available
+   on ``StringArray`` because ``StringArray`` only holds strings, not
+   bytes.
+
+
+Everything else that follows in the rest of this document applies equally to
 ``string`` and ``object`` dtype.

 .. _text.string_methods:

doc/source/whatsnew/v1.0.0.rst

+16-1
@@ -63,7 +63,7 @@ Previously, strings were typically stored in object-dtype NumPy arrays.
 ``StringDtype`` is currently considered experimental. The implementation
 and parts of the API may change without warning.

-The text extension type solves several issues with object-dtype NumPy arrays:
+The ``'string'`` extension type solves several issues with object-dtype NumPy arrays:

 1. You can accidentally store a *mixture* of strings and non-strings in an
    ``object`` dtype array. A ``StringArray`` can only store strings.
@@ -88,9 +88,17 @@ You can use the alias ``"string"`` as well.
 The usual string accessor methods work. Where appropriate, the return type
 of the Series or columns of a DataFrame will also have string dtype.

+.. ipython:: python
+
    s.str.upper()
    s.str.split('b', expand=True).dtypes

+String accessor methods returning integers will return a value with :class:`Int64Dtype`
+
+.. ipython:: python
+
+   s.str.count("a")
+
 We recommend explicitly using the ``string`` data type when working with strings.
 See :ref:`text.types` for more.

@@ -114,6 +122,9 @@ Other enhancements
 - Added ``encoding`` argument to :meth:`DataFrame.to_string` for non-ascii text (:issue:`28766`)
 - Added ``encoding`` argument to :func:`DataFrame.to_html` for non-ascii text (:issue:`28663`)
 - :meth:`Styler.background_gradient` now accepts ``vmin`` and ``vmax`` arguments (:issue:`12145`)
+- Roundtripping DataFrames with nullable integer or string data types to parquet
+  (:meth:`~DataFrame.to_parquet` / :func:`read_parquet`) using the ``'pyarrow'`` engine
+  now preserves those data types with pyarrow >= 1.0.0 (:issue:`20612`).

 Build Changes
 ^^^^^^^^^^^^^
@@ -268,6 +279,7 @@ or ``matplotlib.Axes.plot``. See :ref:`plotting.formatters` for more.
 - Removed the previously deprecated ``reduce`` and ``broadcast`` arguments from :meth:`DataFrame.apply` (:issue:`18577`)
 - Removed the previously deprecated ``assert_raises_regex`` function in ``pandas.util.testing`` (:issue:`29174`)
 - Removed :meth:`Index.is_lexsorted_for_tuple` (:issue:`29305`)
+- Removed support for nested renaming in :meth:`DataFrame.aggregate`, :meth:`Series.aggregate`, :meth:`DataFrameGroupBy.aggregate`, :meth:`SeriesGroupBy.aggregate`, :meth:`Rolling.aggregate` (:issue:`29608`)
 -

 .. _whatsnew_1000.performance:
@@ -342,6 +354,7 @@ Numeric
 - :class:`DataFrame` flex inequality comparison methods (:meth:`DataFrame.lt`, :meth:`DataFrame.le`, :meth:`DataFrame.gt`, :meth:`DataFrame.ge`) with object-dtype and ``complex`` entries failing to raise ``TypeError`` like their :class:`Series` counterparts (:issue:`28079`)
 - Bug in :class:`DataFrame` logical operations (``&``, ``|``, ``^``) not matching :class:`Series` behavior by filling NA values (:issue:`28741`)
 - Bug in :meth:`DataFrame.interpolate` where specifying axis by name references variable before it is assigned (:issue:`29142`)
+- Bug in :meth:`Series.var` not computing the right value for a nullable integer dtype series because the ``ddof`` argument was not passed through (:issue:`29128`)
 - Improved error message when using ``frac`` > 1 and ``replace`` = False (:issue:`27451`)
 - Bug in numeric indexes resulted in it being possible to instantiate an :class:`Int64Index`, :class:`UInt64Index`, or :class:`Float64Index` with an invalid dtype (e.g. datetime-like) (:issue:`29539`)
 - Bug in :class:`UInt64Index` precision loss while constructing from a list with values in the ``np.uint64`` range (:issue:`29526`)
@@ -432,6 +445,7 @@ Groupby/resample/rolling

 -
 - Bug in :meth:`DataFrame.groupby` with multiple groups where an ``IndexError`` would be raised if any group contained all NA values (:issue:`20519`)
+- Bug in :meth:`pandas.core.resample.Resampler.size` and :meth:`pandas.core.resample.Resampler.count` returning wrong dtype when used with an empty series or dataframe (:issue:`28427`)
 - Bug in :meth:`DataFrame.rolling` not allowing for rolling over datetimes when ``axis=1`` (:issue:`28192`)
 - Bug in :meth:`DataFrame.rolling` not allowing rolling over multi-index levels (:issue:`15584`).
 - Bug in :meth:`DataFrame.rolling` not allowing rolling on monotonic decreasing time indexes (:issue:`19248`).
@@ -452,6 +466,7 @@ Reshaping
 - Better error message in :func:`get_dummies` when ``columns`` isn't a list-like value (:issue:`28383`)
 - Bug in :meth:`Series.pct_change` where supplying an anchored frequency would throw a ValueError (:issue:`28664`)
 - Bug in :meth:`DataFrame.replace` that caused a non-numeric replacer's dtype to not be respected (:issue:`26632`)
+- Bug where :meth:`DataFrame.equals` returned True incorrectly in some cases when two DataFrames had the same columns in different orders (:issue:`28839`)

 Sparse
 ^^^^^^
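The commit's headline fix (:issue:`26632`) concerns ``DataFrame.replace`` keeping the replacement value's dtype. A small illustrative sketch of the intended behavior (the specific data here is hypothetical, not taken from the issue):

```python
import pandas as pd

# replace a string value with another string: the replacer's (object) dtype
# should be respected rather than being coerced away
df = pd.DataFrame({"a": ["0", "1"]})
out = df.replace({"0": "zero"})

# "0" is replaced, "1" is untouched, and the column stays object dtype
```

Before the fix, replacements with non-numeric replacers could end up with a dtype that did not match the replacement values.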

pandas/_libs/internals.pyx

+5-12
@@ -1,21 +1,14 @@
 import cython
 from cython import Py_ssize_t

-from cpython.object cimport PyObject
+from cpython.slice cimport PySlice_GetIndicesEx

 cdef extern from "Python.h":
     Py_ssize_t PY_SSIZE_T_MAX

 import numpy as np
 from numpy cimport int64_t

-cdef extern from "compat_helper.h":
-    cdef int slice_get_indices(PyObject* s, Py_ssize_t length,
-                               Py_ssize_t *start, Py_ssize_t *stop,
-                               Py_ssize_t *step,
-                               Py_ssize_t *slicelength) except -1
-
-
 from pandas._libs.algos import ensure_int64


@@ -258,8 +251,8 @@ cpdef Py_ssize_t slice_len(
     if slc is None:
         raise TypeError("slc must be slice")

-    slice_get_indices(<PyObject *>slc, objlen,
-                      &start, &stop, &step, &length)
+    PySlice_GetIndicesEx(slc, objlen,
+                         &start, &stop, &step, &length)

     return length

@@ -278,8 +271,8 @@ cdef slice_get_indices_ex(slice slc, Py_ssize_t objlen=PY_SSIZE_T_MAX):
     if slc is None:
         raise TypeError("slc should be a slice")

-    slice_get_indices(<PyObject *>slc, objlen,
-                      &start, &stop, &step, &length)
+    PySlice_GetIndicesEx(slc, objlen,
+                         &start, &stop, &step, &length)

     return start, stop, step, length
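The change above swaps a vendored ``compat_helper.h`` shim for CPython's own ``PySlice_GetIndicesEx``, which normalizes a slice's ``(start, stop, step)`` against a sequence length and reports the resulting item count. A pure-Python sketch of the same computation, using ``slice.indices`` (the Python-level counterpart of ``PySlice_GetIndicesEx``):

```python
def slice_len(slc, objlen):
    """Number of items a slice selects from a sequence of length objlen."""
    if slc is None:
        raise TypeError("slc must be slice")
    # slice.indices clips start/stop into [0, objlen] and resolves negative
    # or missing bounds, exactly as PySlice_GetIndicesEx does in C
    start, stop, step = slc.indices(objlen)
    return len(range(start, stop, step))
```

For example, ``slice_len(slice(None, None, -2), 5)`` counts the items ``4, 2, 0``.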
