Remove use_nullable_dtypes and add dtype_backend keyword #51853

Merged · 12 commits · Mar 13, 2023
Changes from 9 commits
11 changes: 5 additions & 6 deletions doc/source/user_guide/io.rst
@@ -170,10 +170,9 @@ dtype : Type name or dict of column -> type, default ``None``
the default determines the dtype of the columns which are not explicitly
listed.

-use_nullable_dtypes : bool = False
-    Whether or not to use nullable dtypes as default when reading data. If
-    set to True, nullable dtypes are used for all dtypes that have a nullable
-    implementation, even if no nulls are present.
+dtype_backend : {"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames.
+    Which dtype backend to use. If
+    set to True, nullable dtypes or pyarrow dtypes are used for all dtypes.
Review thread on the new ``dtype_backend`` description:

> Member: Is True also an option here?

> Member: Good catch, this shouldn't accept True

> Member (Author): Thx, updated

.. versionadded:: 2.0

@@ -475,7 +474,7 @@ worth trying.

os.remove("foo.csv")

-Setting ``use_nullable_dtypes=True`` will result in nullable dtypes for every column.
+Setting ``dtype_backend="numpy_nullable"`` will result in nullable dtypes for every column.

.. ipython:: python

@@ -484,7 +483,7 @@ Setting ``use_nullable_dtypes=True`` will result in nullable dtypes for every co
3,4.5,False,b,6,7.5,True,a,12-31-2019,
"""

-df = pd.read_csv(StringIO(data), use_nullable_dtypes=True, parse_dates=["i"])
+df = pd.read_csv(StringIO(data), dtype_backend="numpy_nullable", parse_dates=["i"])
df
df.dtypes
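Following up the review thread above: after the update, ``dtype_backend`` is a plain string enum, so a boolean is rejected up front. A minimal sketch (assuming pandas >= 2.0; the exact error message may differ by version):

```python
import io
import pandas as pd

# dtype_backend is validated against {"numpy_nullable", "pyarrow"};
# anything else, including True, is rejected before parsing starts.
try:
    pd.read_csv(io.StringIO("a\n1"), dtype_backend=True)
except ValueError as err:
    print(err)
```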

18 changes: 4 additions & 14 deletions doc/source/user_guide/pyarrow.rst
@@ -145,8 +145,8 @@ functions provide an ``engine`` keyword that can dispatch to PyArrow to accelera
df

By default, these functions and all other IO reader functions return NumPy-backed data. These readers can return
-PyArrow-backed data by specifying the parameter ``use_nullable_dtypes=True`` **and** the global configuration option ``"mode.dtype_backend"``
-set to ``"pyarrow"``. A reader does not need to set ``engine="pyarrow"`` to necessarily return PyArrow-backed data.
+PyArrow-backed data by specifying the parameter ``dtype_backend="pyarrow"``. A reader does not need to set
+``engine="pyarrow"`` to necessarily return PyArrow-backed data.

.. ipython:: python

@@ -155,20 +155,10 @@ set to ``"pyarrow"``. A reader does not need to set ``engine="pyarrow"`` to nece
1,2.5,True,a,,,,,
3,4.5,False,b,6,7.5,True,a,
""")
-with pd.option_context("mode.dtype_backend", "pyarrow"):
-    df_pyarrow = pd.read_csv(data, use_nullable_dtypes=True)
+df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow")
df_pyarrow.dtypes

-To simplify specifying ``use_nullable_dtypes=True`` in several functions, you can set a global option ``nullable_dtypes``
-to ``True``. You will still need to set the global configuration option ``"mode.dtype_backend"`` to ``pyarrow``.
-
-.. code-block:: ipython
-
-   In [1]: pd.set_option("mode.dtype_backend", "pyarrow")
-
-   In [2]: pd.options.mode.nullable_dtypes = True
-
-Several non-IO reader functions can also use the ``"mode.dtype_backend"`` option to return PyArrow-backed data including:
+Several non-IO reader functions can also use the ``dtype_backend`` argument to return PyArrow-backed data including:

* :func:`to_numeric`
* :meth:`DataFrame.convert_dtypes`
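To illustrate one of the non-IO entry points in this list, a minimal sketch (assuming pandas >= 2.0 with pyarrow installed):

```python
import pandas as pd

# to_numeric can now hand back pyarrow-backed values directly,
# no global option needed anymore.
parsed = pd.to_numeric(pd.Series(["1", "2", None]), dtype_backend="pyarrow")
print(parsed.dtype)  # int64[pyarrow]
```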
38 changes: 12 additions & 26 deletions doc/source/whatsnew/v2.0.0.rst
@@ -103,12 +103,12 @@ Below is a possibly non-exhaustive list of changes:
pd.Index([1, 2, 3], dtype=np.float16)


-.. _whatsnew_200.enhancements.io_use_nullable_dtypes_and_dtype_backend:
+.. _whatsnew_200.enhancements.io_dtype_backend:

-Configuration option, ``mode.dtype_backend``, to return pyarrow-backed dtypes
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Argument ``dtype_backend``, to return pyarrow-backed or numpy-backed nullable dtypes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-The ``use_nullable_dtypes`` keyword argument has been expanded to the following functions to enable automatic conversion to nullable dtypes (:issue:`36712`)
+The following functions gained a new keyword ``dtype_backend`` (:issue:`36712`)

* :func:`read_csv`
* :func:`read_clipboard`
@@ -124,19 +124,13 @@ The ``use_nullable_dtypes`` keyword argument has been expanded to the following
* :func:`read_feather`
* :func:`read_spss`
* :func:`to_numeric`
+* :meth:`DataFrame.convert_dtypes`
+* :meth:`Series.convert_dtypes`

-To simplify opting-in to nullable dtypes for these functions, a new option ``nullable_dtypes`` was added that allows setting
-the keyword argument globally to ``True`` if not specified directly. The option can be enabled
-through:
-
-.. ipython:: python
-
-    pd.options.mode.nullable_dtypes = True
-
-The option will only work for functions with the keyword ``use_nullable_dtypes``.
+When this option is set to ``"numpy_nullable"`` it will return a :class:`DataFrame` that is
+backed by nullable dtypes.

-Additionally a new global configuration, ``mode.dtype_backend`` can now be used in conjunction with the parameter ``use_nullable_dtypes=True`` in the following functions
-to select the nullable dtypes implementation.
+When this keyword is set to ``"pyarrow"``, then these functions will return pyarrow-backed nullable :class:`ArrowDtype` DataFrames (:issue:`48957`, :issue:`49997`):

* :func:`read_csv`
* :func:`read_clipboard`
@@ -153,30 +153,21 @@ to select the nullable dtypes implementation.
* :func:`read_feather`
* :func:`read_spss`
* :func:`to_numeric`


-And the following methods will also utilize the ``mode.dtype_backend`` option.
-
-* :meth:`DataFrame.convert_dtypes`
-* :meth:`Series.convert_dtypes`
-
-By default, ``mode.dtype_backend`` is set to ``"pandas"`` to return existing, numpy-backed nullable dtypes, but it can also
-be set to ``"pyarrow"`` to return pyarrow-backed, nullable :class:`ArrowDtype` (:issue:`48957`, :issue:`49997`).

.. ipython:: python

import io
data = io.StringIO("""a,b,c,d,e,f,g,h,i
1,2.5,True,a,,,,,
3,4.5,False,b,6,7.5,True,a,
""")
-with pd.option_context("mode.dtype_backend", "pandas"):
-    df = pd.read_csv(data, use_nullable_dtypes=True)
+df = pd.read_csv(data, dtype_backend="pyarrow")
df.dtypes

data.seek(0)
-with pd.option_context("mode.dtype_backend", "pyarrow"):
-    df_pyarrow = pd.read_csv(data, use_nullable_dtypes=True, engine="pyarrow")
+df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow", engine="pyarrow")
df_pyarrow.dtypes

Copy-on-Write improvements
@@ -810,6 +795,7 @@ Deprecations
- Deprecated :meth:`Grouper.obj`, use :meth:`Groupby.obj` instead (:issue:`51206`)
- Deprecated :meth:`Grouper.indexer`, use :meth:`Resampler.indexer` instead (:issue:`51206`)
- Deprecated :meth:`Grouper.ax`, use :meth:`Resampler.ax` instead (:issue:`51206`)
+- Deprecated keyword ``use_nullable_dtypes`` in :func:`read_parquet`, use ``dtype_backend`` instead (:issue:`51853`)
- Deprecated :meth:`Series.pad` in favor of :meth:`Series.ffill` (:issue:`33396`)
- Deprecated :meth:`Series.backfill` in favor of :meth:`Series.bfill` (:issue:`33396`)
- Deprecated :meth:`DataFrame.pad` in favor of :meth:`DataFrame.ffill` (:issue:`33396`)
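For the ``read_parquet`` deprecation added above, a migration sketch (assuming pandas >= 2.0 and a parquet engine such as pyarrow installed):

```python
import pandas as pd

pd.DataFrame({"a": [1.0, None]}).to_parquet("tmp.parquet")

# Before: pd.read_parquet("tmp.parquet", use_nullable_dtypes=True)  # now warns
# After:
df = pd.read_parquet("tmp.parquet", dtype_backend="numpy_nullable")
print(df.dtypes)  # a    Float64
```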
2 changes: 1 addition & 1 deletion pandas/_libs/parsers.pyi
@@ -72,5 +72,5 @@ class TextReader:
na_values: dict

def _maybe_upcast(
-arr, use_nullable_dtypes: bool = ..., dtype_backend: str = ...
+arr, use_dtype_backend: bool = ..., dtype_backend: str = ...
) -> np.ndarray: ...
28 changes: 12 additions & 16 deletions pandas/_libs/parsers.pyx
@@ -339,7 +339,6 @@ cdef class TextReader:
object index_col
object skiprows
object dtype
-bint use_nullable_dtypes
object usecols
set unnamed_cols # set[str]
str dtype_backend
@@ -379,8 +378,7 @@
float_precision=None,
bint skip_blank_lines=True,
encoding_errors=b"strict",
-use_nullable_dtypes=False,
-dtype_backend="pandas"):
+dtype_backend="numpy"):

# set encoding for native Python and C library
if isinstance(encoding_errors, str):
@@ -501,7 +499,6 @@
# - DtypeObj
# - dict[Any, DtypeObj]
self.dtype = dtype
-self.use_nullable_dtypes = use_nullable_dtypes
self.dtype_backend = dtype_backend

self.noconvert = set()
@@ -928,7 +925,6 @@
bint na_filter = 0
int64_t num_cols
dict results
-bint use_nullable_dtypes

start = self.parser_start

@@ -1049,12 +1045,12 @@
# don't try to upcast EAs
if (
na_count > 0 and not is_extension_array_dtype(col_dtype)
-or self.use_nullable_dtypes
+or self.dtype_backend != "numpy"
):
-use_nullable_dtypes = self.use_nullable_dtypes and col_dtype is None
+use_dtype_backend = self.dtype_backend != "numpy" and col_dtype is None
col_res = _maybe_upcast(
col_res,
-use_nullable_dtypes=use_nullable_dtypes,
+use_dtype_backend=use_dtype_backend,
dtype_backend=self.dtype_backend,
)

@@ -1389,11 +1385,11 @@ _NA_VALUES = _ensure_encoded(list(STR_NA_VALUES))


def _maybe_upcast(
-arr, use_nullable_dtypes: bool = False, dtype_backend: str = "pandas"
+arr, use_dtype_backend: bool = False, dtype_backend: str = "numpy"
):
"""Sets nullable dtypes or upcasts if nans are present.

-Upcast, if use_nullable_dtypes is false and nans are present so that the
+Upcast, if use_dtype_backend is false and nans are present so that the
current dtype can not hold the na value. We use nullable dtypes if the
flag is true for every array.

@@ -1402,7 +1398,7 @@ def _maybe_upcast(
arr: ndarray
Numpy array that is potentially being upcast.

-use_nullable_dtypes: bool, default False
+use_dtype_backend: bool, default False
If true, we cast to the associated nullable dtypes.

Returns
@@ -1419,7 +1415,7 @@
if issubclass(arr.dtype.type, np.integer):
mask = arr == na_value

-if use_nullable_dtypes:
+if use_dtype_backend:
arr = IntegerArray(arr, mask)
else:
arr = arr.astype(float)
@@ -1428,22 +1424,22 @@
elif arr.dtype == np.bool_:
mask = arr.view(np.uint8) == na_value

-if use_nullable_dtypes:
+if use_dtype_backend:
arr = BooleanArray(arr, mask)
else:
arr = arr.astype(object)
np.putmask(arr, mask, np.nan)

elif issubclass(arr.dtype.type, float) or arr.dtype.type == np.float32:
-if use_nullable_dtypes:
+if use_dtype_backend:
mask = np.isnan(arr)
arr = FloatingArray(arr, mask)

elif arr.dtype == np.object_:
-if use_nullable_dtypes:
+if use_dtype_backend:
arr = StringDtype().construct_array_type()._from_sequence(arr)

-if use_nullable_dtypes and dtype_backend == "pyarrow":
+if use_dtype_backend and dtype_backend == "pyarrow":
import pyarrow as pa
if isinstance(arr, IntegerArray) and arr.isna().all():
# use null instead of int64 in pyarrow
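A sketch of the upcast behavior ``_maybe_upcast`` implements, using the public constructors rather than the cython internals:

```python
import numpy as np
from pandas.arrays import IntegerArray

# With use_dtype_backend set, integer columns containing NAs become masked
# IntegerArrays; without it, they would be upcast to float64 with NaN.
values = np.array([1, 2, 0], dtype=np.int64)   # 0 fills the NA slot
mask = np.array([False, False, True])
print(IntegerArray(values, mask))  # [1, 2, <NA>], dtype: Int64
```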
1 change: 1 addition & 0 deletions pandas/_typing.py
@@ -377,3 +377,4 @@ def closed(self) -> bool:
Literal["pearson", "kendall", "spearman"], Callable[[np.ndarray, np.ndarray], float]
]
AlignJoin = Literal["outer", "inner", "left", "right"]
+DtypeBackend = Literal["pyarrow", "numpy_nullable"]
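A sketch of how the new alias is meant to be consumed in signatures (``read_something`` is hypothetical):

```python
from typing import Literal

DtypeBackend = Literal["pyarrow", "numpy_nullable"]

# Annotating the keyword with the shared alias lets type checkers reject
# anything outside the two supported strings.
def read_something(path: str, dtype_backend: DtypeBackend = "numpy_nullable") -> None:
    ...
```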
2 changes: 1 addition & 1 deletion pandas/conftest.py
@@ -1274,7 +1274,7 @@ def string_storage(request):

@pytest.fixture(
params=[
"pandas",
"numpy_nullable",
pytest.param("pyarrow", marks=td.skip_if_no("pyarrow")),
]
)
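For context, a sketch of how a test would consume this fixture (the fixture name sits below the fold here; assuming it is ``dtype_backend``, and the test itself is hypothetical):

```python
import pandas as pd

# pytest runs this once per param: "numpy_nullable", then "pyarrow"
# (the latter skipped when pyarrow is not installed).
def test_convert_dtypes_backend(dtype_backend):
    ser = pd.Series([1, None]).convert_dtypes(dtype_backend=dtype_backend)
    assert ser.isna().sum() == 1
```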
2 changes: 1 addition & 1 deletion pandas/core/arrays/numeric.py
@@ -285,7 +285,7 @@ def _from_sequence_of_strings(
) -> T:
from pandas.core.tools.numeric import to_numeric

-scalars = to_numeric(strings, errors="raise", use_nullable_dtypes=True)
+scalars = to_numeric(strings, errors="raise", dtype_backend="numpy_nullable")
return cls._from_sequence(scalars, dtype=dtype, copy=copy)

_HANDLED_TYPES = (np.ndarray, numbers.Number)
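Sketched with the public API, the changed call now returns a masked array (assuming pandas >= 2.0):

```python
import numpy as np
import pandas as pd

strings = np.array(["1.5", "2", None], dtype=object)
# dtype_backend="numpy_nullable" yields a FloatingArray that keeps <NA>,
# which _from_sequence can wrap without an object/float round trip.
print(pd.to_numeric(strings, errors="raise", dtype_backend="numpy_nullable"))
```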
28 changes: 0 additions & 28 deletions pandas/core/config_init.py
@@ -487,41 +487,13 @@ def use_inf_as_na_cb(key) -> None:
The default storage for StringDtype.
"""

dtype_backend_doc = """
: string
The nullable dtype implementation to return. Only applicable to certain
operations where documented. Available options: 'pandas', 'pyarrow',
the default is 'pandas'.
"""

with cf.config_prefix("mode"):
cf.register_option(
"string_storage",
"python",
string_storage_doc,
validator=is_one_of_factory(["python", "pyarrow"]),
)
-cf.register_option(
-    "dtype_backend",
-    "pandas",
-    dtype_backend_doc,
-    validator=is_one_of_factory(["pandas", "pyarrow"]),
-)


-nullable_dtypes_doc = """
-: bool
-    If nullable dtypes should be returned. This is only applicable to functions
-    where the ``use_nullable_dtypes`` keyword is implemented.
-"""
-
-with cf.config_prefix("mode"):
-    cf.register_option(
-        "nullable_dtypes",
-        False,
-        nullable_dtypes_doc,
-        validator=is_bool,
-    )


# Set up the io.excel specific reader configuration.
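Since both options are deleted outright here, looking them up should now fail; a quick check (assuming a build that includes this change):

```python
import pandas as pd
from pandas.errors import OptionError

for opt in ("mode.dtype_backend", "mode.nullable_dtypes"):
    try:
        pd.get_option(opt)
    except OptionError:
        print(f"{opt} no longer exists")
```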
6 changes: 3 additions & 3 deletions pandas/core/dtypes/cast.py
@@ -1007,7 +1007,7 @@ def convert_dtypes(
convert_boolean: bool = True,
convert_floating: bool = True,
infer_objects: bool = False,
dtype_backend: Literal["pandas", "pyarrow"] = "pandas",
dtype_backend: Literal["numpy_nullable", "pyarrow"] = "numpy_nullable",
) -> DtypeObj:
"""
Convert objects to best possible type, and optionally,
@@ -1029,10 +1029,10 @@
infer_objects : bool, defaults False
Whether to also infer objects to float/int if possible. Is only hit if the
object array contains pd.NA.
dtype_backend : str, default "pandas"
dtype_backend : str, default "numpy_nullable"
Nullable dtype implementation to use.

* "pandas" returns numpy-backed nullable types
* "numpy_nullable" returns numpy-backed nullable types
* "pyarrow" returns pyarrow-backed nullable types using ``ArrowDtype``

Returns
9 changes: 9 additions & 0 deletions pandas/core/generic.py
@@ -52,6 +52,7 @@
CompressionOptions,
Dtype,
DtypeArg,
+DtypeBackend,
DtypeObj,
FilePath,
FillnaOptions,
@@ -6547,6 +6548,7 @@ def convert_dtypes(
convert_integer: bool_t = True,
convert_boolean: bool_t = True,
convert_floating: bool_t = True,
+dtype_backend: DtypeBackend = "numpy_nullable",
) -> NDFrameT:
"""
Convert columns to the best possible dtypes using dtypes supporting ``pd.NA``.
Expand All @@ -6567,6 +6569,11 @@ def convert_dtypes(
dtypes if the floats can be faithfully casted to integers.

.. versionadded:: 1.2.0
dtype_backend : {"numpy_nullable", "pyarrow"}, default "numpy_nullable"
Which dtype_backend to use, e.g. whether a DataFrame should have nullable
extension dtypes or pyarrow dtypes.

.. versionadded:: 2.0

Returns
-------
@@ -6686,6 +6693,7 @@ def convert_dtypes(
convert_integer,
convert_boolean,
convert_floating,
+dtype_backend=dtype_backend,
)
else:
results = [
@@ -6695,6 +6703,7 @@
convert_integer,
convert_boolean,
convert_floating,
+dtype_backend=dtype_backend,
)
for col_name, col in self.items()
]
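End to end, the public method this wires up behaves like the following sketch (assuming pandas >= 2.0; the second call additionally needs pyarrow installed, and exact dtype strings may vary by version):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, None], "y": ["a", "b", None]})
print(df.convert_dtypes().dtypes)                         # Int64, string (default backend)
print(df.convert_dtypes(dtype_backend="pyarrow").dtypes)  # ArrowDtype columns
```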