Commit cd32fce

Backport PR #51853 on branch 2.0.x (Remove use_nullable_dtypes and add dtype_backend keyword) (#51935)

Remove use_nullable_dtypes and add dtype_backend keyword (#51853)

1 parent: bd23fff

43 files changed: +518 −898 lines

doc/source/user_guide/io.rst (+10 −7)

@@ -170,12 +170,15 @@ dtype : Type name or dict of column -> type, default ``None``
     the default determines the dtype of the columns which are not explicitly
     listed.
-use_nullable_dtypes : bool = False
-    Whether or not to use nullable dtypes as default when reading data. If
-    set to True, nullable dtypes are used for all dtypes that have a nullable
-    implementation, even if no nulls are present.
+dtype_backend : {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames
+    Which dtype_backend to use, e.g. whether a DataFrame should have NumPy
+    arrays: nullable dtypes are used for all dtypes that have a nullable
+    implementation when "numpy_nullable" is set, and pyarrow is used for all
+    dtypes if "pyarrow" is set.

-    .. versionadded:: 2.0
+    The dtype_backends are still experimental.
+
+    .. versionadded:: 2.0

 engine : {``'c'``, ``'python'``, ``'pyarrow'``}
     Parser engine to use. The C and pyarrow engines are faster, while the python engine
@@ -475,7 +478,7 @@ worth trying.

     os.remove("foo.csv")

-Setting ``use_nullable_dtypes=True`` will result in nullable dtypes for every column.
+Setting ``dtype_backend="numpy_nullable"`` will result in nullable dtypes for every column.

 .. ipython:: python
@@ -484,7 +487,7 @@ Setting ``use_nullable_dtypes=True`` will result in nullable dtypes for every co
        3,4.5,False,b,6,7.5,True,a,12-31-2019,
        """
-    df = pd.read_csv(StringIO(data), use_nullable_dtypes=True, parse_dates=["i"])
+    df = pd.read_csv(StringIO(data), dtype_backend="numpy_nullable", parse_dates=["i"])
     df
     df.dtypes
doc/source/user_guide/pyarrow.rst (+4 −14)

@@ -145,8 +145,8 @@ functions provide an ``engine`` keyword that can dispatch to PyArrow to accelera
    df

 By default, these functions and all other IO reader functions return NumPy-backed data. These readers can return
-PyArrow-backed data by specifying the parameter ``use_nullable_dtypes=True`` **and** the global configuration option ``"mode.dtype_backend"``
-set to ``"pyarrow"``. A reader does not need to set ``engine="pyarrow"`` to necessarily return PyArrow-backed data.
+PyArrow-backed data by specifying the parameter ``dtype_backend="pyarrow"``. A reader does not need to set
+``engine="pyarrow"`` to necessarily return PyArrow-backed data.

 .. ipython:: python
@@ -155,20 +155,10 @@ set to ``"pyarrow"``. A reader does not need to set ``engine="pyarrow"`` to nece
    1,2.5,True,a,,,,,
    3,4.5,False,b,6,7.5,True,a,
    """)
-   with pd.option_context("mode.dtype_backend", "pyarrow"):
-       df_pyarrow = pd.read_csv(data, use_nullable_dtypes=True)
+   df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow")
    df_pyarrow.dtypes

-To simplify specifying ``use_nullable_dtypes=True`` in several functions, you can set a global option ``nullable_dtypes``
-to ``True``. You will still need to set the global configuration option ``"mode.dtype_backend"`` to ``pyarrow``.
-
-.. code-block:: ipython
-
-   In [1]: pd.set_option("mode.dtype_backend", "pyarrow")
-
-   In [2]: pd.options.mode.nullable_dtypes = True
-
-Several non-IO reader functions can also use the ``"mode.dtype_backend"`` option to return PyArrow-backed data including:
+Several non-IO reader functions can also use the ``dtype_backend`` argument to return PyArrow-backed data including:

 * :func:`to_numeric`
 * :meth:`DataFrame.convert_dtypes`

doc/source/whatsnew/v2.0.0.rst (+12 −26)

@@ -103,12 +103,12 @@ Below is a possibly non-exhaustive list of changes:
     pd.Index([1, 2, 3], dtype=np.float16)

-.. _whatsnew_200.enhancements.io_use_nullable_dtypes_and_dtype_backend:
+.. _whatsnew_200.enhancements.io_dtype_backend:

-Configuration option, ``mode.dtype_backend``, to return pyarrow-backed dtypes
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Argument ``dtype_backend``, to return pyarrow-backed or numpy-backed nullable dtypes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-The ``use_nullable_dtypes`` keyword argument has been expanded to the following functions to enable automatic conversion to nullable dtypes (:issue:`36712`)
+The following functions gained a new keyword ``dtype_backend`` (:issue:`36712`)

 * :func:`read_csv`
 * :func:`read_clipboard`
@@ -124,19 +124,13 @@ The ``use_nullable_dtypes`` keyword argument has been expanded to the following
 * :func:`read_feather`
 * :func:`read_spss`
 * :func:`to_numeric`
+* :meth:`DataFrame.convert_dtypes`
+* :meth:`Series.convert_dtypes`

-To simplify opting-in to nullable dtypes for these functions, a new option ``nullable_dtypes`` was added that allows setting
-the keyword argument globally to ``True`` if not specified directly. The option can be enabled
-through:
-
-.. ipython:: python
-
-    pd.options.mode.nullable_dtypes = True
-
-The option will only work for functions with the keyword ``use_nullable_dtypes``.
+When this keyword is set to ``"numpy_nullable"`` it will return a :class:`DataFrame` that is
+backed by nullable dtypes.

-Additionally a new global configuration, ``mode.dtype_backend`` can now be used in conjunction with the parameter ``use_nullable_dtypes=True`` in the following functions
-to select the nullable dtypes implementation.
+When this keyword is set to ``"pyarrow"``, then these functions will return pyarrow-backed nullable :class:`ArrowDtype` DataFrames (:issue:`48957`, :issue:`49997`):

 * :func:`read_csv`
 * :func:`read_clipboard`
@@ -153,30 +147,21 @@ to select the nullable dtypes implementation.
 * :func:`read_feather`
 * :func:`read_spss`
 * :func:`to_numeric`
-
-
-And the following methods will also utilize the ``mode.dtype_backend`` option.
-
 * :meth:`DataFrame.convert_dtypes`
 * :meth:`Series.convert_dtypes`

-By default, ``mode.dtype_backend`` is set to ``"pandas"`` to return existing, numpy-backed nullable dtypes, but it can also
-be set to ``"pyarrow"`` to return pyarrow-backed, nullable :class:`ArrowDtype` (:issue:`48957`, :issue:`49997`).
-
 .. ipython:: python

     import io
     data = io.StringIO("""a,b,c,d,e,f,g,h,i
     1,2.5,True,a,,,,,
     3,4.5,False,b,6,7.5,True,a,
     """)
-    with pd.option_context("mode.dtype_backend", "pandas"):
-        df = pd.read_csv(data, use_nullable_dtypes=True)
+    df = pd.read_csv(data, dtype_backend="pyarrow")
     df.dtypes

     data.seek(0)
-    with pd.option_context("mode.dtype_backend", "pyarrow"):
-        df_pyarrow = pd.read_csv(data, use_nullable_dtypes=True, engine="pyarrow")
+    df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow", engine="pyarrow")
     df_pyarrow.dtypes

 Copy-on-Write improvements
@@ -811,6 +796,7 @@ Deprecations
 - Deprecated :meth:`Grouper.obj`, use :meth:`Groupby.obj` instead (:issue:`51206`)
 - Deprecated :meth:`Grouper.indexer`, use :meth:`Resampler.indexer` instead (:issue:`51206`)
 - Deprecated :meth:`Grouper.ax`, use :meth:`Resampler.ax` instead (:issue:`51206`)
+- Deprecated keyword ``use_nullable_dtypes`` in :func:`read_parquet`, use ``dtype_backend`` instead (:issue:`51853`)
 - Deprecated :meth:`Series.pad` in favor of :meth:`Series.ffill` (:issue:`33396`)
 - Deprecated :meth:`Series.backfill` in favor of :meth:`Series.bfill` (:issue:`33396`)
 - Deprecated :meth:`DataFrame.pad` in favor of :meth:`DataFrame.ffill` (:issue:`33396`)
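The whatsnew entry above notes that ``convert_dtypes`` also gained the new keyword. A minimal sketch (assuming pandas >= 2.0) of converting object columns to the numpy-backed nullable dtypes, which is also the keyword's default:

```python
import pandas as pd

# Object columns holding ints/bools plus missing values.
df = pd.DataFrame({"a": [1, 2, None], "b": [True, False, None]}, dtype=object)

# "numpy_nullable" is the default; spelled out here for clarity.
converted = df.convert_dtypes(dtype_backend="numpy_nullable")
print(converted.dtypes)
```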

pandas/_libs/parsers.pyi (+6)

@@ -67,3 +67,9 @@ class TextReader:
     def close(self) -> None: ...
     def read(self, rows: int | None = ...) -> dict[int, ArrayLike]: ...
     def read_low_memory(self, rows: int | None) -> list[dict[int, ArrayLike]]: ...
+
+# _maybe_upcast, na_values are only exposed for testing
+
+def _maybe_upcast(
+    arr, use_dtype_backend: bool = ..., dtype_backend: str = ...
+) -> np.ndarray: ...

pandas/_libs/parsers.pyx (+12 −16)

@@ -339,7 +339,6 @@ cdef class TextReader:
         object index_col
         object skiprows
         object dtype
-        bint use_nullable_dtypes
         object usecols
         set unnamed_cols  # set[str]
         str dtype_backend
@@ -379,8 +378,7 @@
                   float_precision=None,
                   bint skip_blank_lines=True,
                   encoding_errors=b"strict",
-                  use_nullable_dtypes=False,
-                  dtype_backend="pandas"):
+                  dtype_backend="numpy"):

         # set encoding for native Python and C library
         if isinstance(encoding_errors, str):
@@ -501,7 +499,6 @@
         # - DtypeObj
         # - dict[Any, DtypeObj]
         self.dtype = dtype
-        self.use_nullable_dtypes = use_nullable_dtypes
         self.dtype_backend = dtype_backend

         self.noconvert = set()
@@ -928,7 +925,6 @@
             bint na_filter = 0
             int64_t num_cols
             dict results
-            bint use_nullable_dtypes

         start = self.parser_start

@@ -1049,12 +1045,12 @@
                 # don't try to upcast EAs
                 if (
                     na_count > 0 and not is_extension_array_dtype(col_dtype)
-                    or self.use_nullable_dtypes
+                    or self.dtype_backend != "numpy"
                 ):
-                    use_nullable_dtypes = self.use_nullable_dtypes and col_dtype is None
+                    use_dtype_backend = self.dtype_backend != "numpy" and col_dtype is None
                     col_res = _maybe_upcast(
                         col_res,
-                        use_nullable_dtypes=use_nullable_dtypes,
+                        use_dtype_backend=use_dtype_backend,
                         dtype_backend=self.dtype_backend,
                     )

@@ -1389,11 +1385,11 @@ _NA_VALUES = _ensure_encoded(list(STR_NA_VALUES))


 def _maybe_upcast(
-    arr, use_nullable_dtypes: bool = False, dtype_backend: str = "pandas"
+    arr, use_dtype_backend: bool = False, dtype_backend: str = "numpy"
 ):
     """Sets nullable dtypes or upcasts if nans are present.

-    Upcast, if use_nullable_dtypes is false and nans are present so that the
+    Upcast, if use_dtype_backend is false and nans are present so that the
     current dtype can not hold the na value. We use nullable dtypes if the
     flag is true for every array.

@@ -1402,7 +1398,7 @@ def _maybe_upcast(
     arr: ndarray
         Numpy array that is potentially being upcast.

-    use_nullable_dtypes: bool, default False
+    use_dtype_backend: bool, default False
         If true, we cast to the associated nullable dtypes.

     Returns
@@ -1419,7 +1415,7 @@
     if issubclass(arr.dtype.type, np.integer):
         mask = arr == na_value

-        if use_nullable_dtypes:
+        if use_dtype_backend:
             arr = IntegerArray(arr, mask)
         else:
             arr = arr.astype(float)
@@ -1428,22 +1424,22 @@
     elif arr.dtype == np.bool_:
         mask = arr.view(np.uint8) == na_value

-        if use_nullable_dtypes:
+        if use_dtype_backend:
             arr = BooleanArray(arr, mask)
         else:
             arr = arr.astype(object)
             np.putmask(arr, mask, np.nan)

     elif issubclass(arr.dtype.type, float) or arr.dtype.type == np.float32:
-        if use_nullable_dtypes:
+        if use_dtype_backend:
             mask = np.isnan(arr)
             arr = FloatingArray(arr, mask)

     elif arr.dtype == np.object_:
-        if use_nullable_dtypes:
+        if use_dtype_backend:
             arr = StringDtype().construct_array_type()._from_sequence(arr)

-    if use_nullable_dtypes and dtype_backend == "pyarrow":
+    if use_dtype_backend and dtype_backend == "pyarrow":
         import pyarrow as pa
         if isinstance(arr, IntegerArray) and arr.isna().all():
             # use null instead of int64 in pyarrow
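The ``_maybe_upcast`` logic in the diff above can be illustrated without Cython. Below is a simplified, numpy-only sketch of the integer branch (hypothetical helper; the real function additionally wraps results in ``IntegerArray``/``BooleanArray``/``FloatingArray`` and handles the pyarrow backend):

```python
import numpy as np

def maybe_upcast_sketch(arr: np.ndarray, sentinel: int, use_dtype_backend: bool):
    """Simplified model of parsers._maybe_upcast for integer columns.

    The parser marks missing integers with a sentinel value. When the
    nullable backend is requested, the integers are kept alongside a
    boolean mask (the real code builds IntegerArray(arr, mask));
    otherwise the array is upcast to float64 so NaN can be stored.
    """
    mask = arr == sentinel
    if use_dtype_backend:
        return arr, mask
    out = arr.astype(np.float64)
    out[mask] = np.nan
    return out, mask

values = np.array([1, 2, -1, 4], dtype=np.int64)  # -1 marks a missing value
floats, _ = maybe_upcast_sketch(values, sentinel=-1, use_dtype_backend=False)
ints, mask = maybe_upcast_sketch(values, sentinel=-1, use_dtype_backend=True)
```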

pandas/_typing.py (+1)

@@ -370,3 +370,4 @@ def closed(self) -> bool:
     Literal["pearson", "kendall", "spearman"], Callable[[np.ndarray, np.ndarray], float]
 ]
 AlignJoin = Literal["outer", "inner", "left", "right"]
+DtypeBackend = Literal["pyarrow", "numpy_nullable"]
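A ``Literal`` alias like the new ``DtypeBackend`` above lets static type checkers reject bad backend strings, and the same alias can back a runtime check via ``typing.get_args``. A standalone sketch (the validator function here is hypothetical, not pandas API):

```python
from typing import Literal, get_args

# Mirrors the alias added to pandas/_typing.py.
DtypeBackend = Literal["pyarrow", "numpy_nullable"]

def check_dtype_backend(value: str) -> str:
    """Runtime counterpart of the static Literal check (hypothetical helper)."""
    allowed = get_args(DtypeBackend)
    if value not in allowed:
        raise ValueError(f"dtype_backend {value!r} is not one of {allowed}")
    return value

check_dtype_backend("pyarrow")    # accepted
# check_dtype_backend("pandas")   # would raise ValueError: the old name is gone
```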

pandas/conftest.py (+1 −1)

@@ -1305,7 +1305,7 @@ def string_storage(request):

 @pytest.fixture(
     params=[
-        "pandas",
+        "numpy_nullable",
         pytest.param("pyarrow", marks=td.skip_if_no("pyarrow")),
     ]
 )

pandas/core/arrays/numeric.py (+1 −1)

@@ -285,7 +285,7 @@ def _from_sequence_of_strings(
     ) -> T:
         from pandas.core.tools.numeric import to_numeric

-        scalars = to_numeric(strings, errors="raise", use_nullable_dtypes=True)
+        scalars = to_numeric(strings, errors="raise", dtype_backend="numpy_nullable")
         return cls._from_sequence(scalars, dtype=dtype, copy=copy)

     _HANDLED_TYPES = (np.ndarray, numbers.Number)
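``_from_sequence_of_strings`` above now routes through ``to_numeric`` with the renamed keyword; the same call works from user code. A minimal sketch (assuming pandas >= 2.0):

```python
import pandas as pd

strings = pd.array(["1", "2", None], dtype="string")

# dtype_backend="numpy_nullable" yields a masked nullable result, so the
# missing entry survives as pd.NA instead of forcing an object/float cast.
result = pd.to_numeric(strings, errors="raise", dtype_backend="numpy_nullable")
print(result.dtype)
```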

pandas/core/config_init.py (−28)

@@ -487,41 +487,13 @@ def use_inf_as_na_cb(key) -> None:
     The default storage for StringDtype.
 """

-dtype_backend_doc = """
-: string
-    The nullable dtype implementation to return. Only applicable to certain
-    operations where documented. Available options: 'pandas', 'pyarrow',
-    the default is 'pandas'.
-"""
-
 with cf.config_prefix("mode"):
     cf.register_option(
         "string_storage",
         "python",
         string_storage_doc,
         validator=is_one_of_factory(["python", "pyarrow"]),
     )
-    cf.register_option(
-        "dtype_backend",
-        "pandas",
-        dtype_backend_doc,
-        validator=is_one_of_factory(["pandas", "pyarrow"]),
-    )
-
-
-nullable_dtypes_doc = """
-: bool
-    If nullable dtypes should be returned. This is only applicable to functions
-    where the ``use_nullable_dtypes`` keyword is implemented.
-"""
-
-with cf.config_prefix("mode"):
-    cf.register_option(
-        "nullable_dtypes",
-        False,
-        nullable_dtypes_doc,
-        validator=is_bool,
-    )


 # Set up the io.excel specific reader configuration.

pandas/core/dtypes/cast.py (+3 −3)

@@ -1006,7 +1006,7 @@ def convert_dtypes(
     convert_boolean: bool = True,
     convert_floating: bool = True,
     infer_objects: bool = False,
-    dtype_backend: Literal["pandas", "pyarrow"] = "pandas",
+    dtype_backend: Literal["numpy_nullable", "pyarrow"] = "numpy_nullable",
 ) -> DtypeObj:
     """
     Convert objects to best possible type, and optionally,
@@ -1028,10 +1028,10 @@
     infer_objects : bool, defaults False
         Whether to also infer objects to float/int if possible. Is only hit if the
         object array contains pd.NA.
-    dtype_backend : str, default "pandas"
+    dtype_backend : str, default "numpy_nullable"
         Nullable dtype implementation to use.

-        * "pandas" returns numpy-backed nullable types
+        * "numpy_nullable" returns numpy-backed nullable types
         * "pyarrow" returns pyarrow-backed nullable types using ``ArrowDtype``

     Returns
