diff --git a/.github/workflows/wheels.yml b/.github/workflows/wheels.yml index efa14dd966eb1..4c7a7b329777b 100644 --- a/.github/workflows/wheels.yml +++ b/.github/workflows/wheels.yml @@ -138,7 +138,7 @@ jobs: run: echo "sdist_name=$(cd ./dist && ls -d */)" >> "$GITHUB_ENV" - name: Build wheels - uses: pypa/cibuildwheel@v2.16.1 + uses: pypa/cibuildwheel@v2.16.0 with: package-dir: ./dist/${{ matrix.buildplat[1] == 'macosx_*' && env.sdist_name || needs.build_sdist.outputs.sdist_file }} env: diff --git a/.gitignore b/.gitignore index cd22c2bb8cb5b..d936e1a5269bc 100644 --- a/.gitignore +++ b/.gitignore @@ -129,8 +129,10 @@ doc/source/index.rst doc/build/html/index.html # Windows specific leftover: doc/tmp.sv +doc/tmp.csv env/ doc/source/savefig/ +doc/source/_build # Interactive terminal generated files # ######################################## diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index c911edfa03670..b0b511e1048c6 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -84,7 +84,7 @@ repos: '--filter=-readability/casting,-runtime/int,-build/include_subdir,-readability/fn_size' ] - repo: https://github.com/pylint-dev/pylint - rev: v3.0.0b0 + rev: v3.0.0a7 hooks: - id: pylint stages: [manual] diff --git a/doc/source/reference/arrays.rst b/doc/source/reference/arrays.rst index fe65364896f54..41ddbd048e6c5 100644 --- a/doc/source/reference/arrays.rst +++ b/doc/source/reference/arrays.rst @@ -134,6 +134,11 @@ is the missing value for datetime data. Timestamp +.. autosummary:: + :toctree: api/ + + NaT + Properties ~~~~~~~~~~ .. autosummary:: @@ -252,6 +257,11 @@ is the missing value for timedelta data. Timedelta +.. autosummary:: + :toctree: api/ + + NaT + Properties ~~~~~~~~~~ .. autosummary:: @@ -455,6 +465,7 @@ pandas provides this through :class:`arrays.IntegerArray`. UInt16Dtype UInt32Dtype UInt64Dtype + NA .. _api.arrays.float_na: @@ -473,6 +484,7 @@ Nullable float Float32Dtype Float64Dtype + NA .. _api.arrays.categorical: @@ -609,6 +621,7 @@ with a bool :class:`numpy.ndarray`. :template: autosummary/class_without_autosummary.rst BooleanDtype + NA .. Dtype attributes which are manually listed in their docstrings: including diff --git a/doc/source/reference/index.rst b/doc/source/reference/index.rst index 7da02f7958416..6d3ce3d31f005 100644 --- a/doc/source/reference/index.rst +++ b/doc/source/reference/index.rst @@ -53,7 +53,6 @@ are mentioned in the documentation. options extensions testing - missing_value .. This is to prevent warnings in the doc build. We don't want to encourage .. these methods. diff --git a/doc/source/reference/missing_value.rst b/doc/source/reference/missing_value.rst deleted file mode 100644 index 3bf22aef765d1..0000000000000 --- a/doc/source/reference/missing_value.rst +++ /dev/null @@ -1,24 +0,0 @@ -{{ header }} - -.. _api.missing_value: - -============== -Missing values -============== -.. currentmodule:: pandas - -NA is the way to represent missing values for nullable dtypes (see below): - -.. autosummary:: - :toctree: api/ - :template: autosummary/class_without_autosummary.rst - - NA - -NaT is the missing value for timedelta and datetime data (see below): - -.. 
autosummary:: - :toctree: api/ - :template: autosummary/class_without_autosummary.rst - - NaT diff --git a/doc/source/whatsnew/v2.1.2.rst b/doc/source/whatsnew/v2.1.2.rst index 38ef8c8455b9d..bc6c26b27885e 100644 --- a/doc/source/whatsnew/v2.1.2.rst +++ b/doc/source/whatsnew/v2.1.2.rst @@ -15,7 +15,7 @@ Fixed regressions ~~~~~~~~~~~~~~~~~ - Fixed bug in :meth:`DataFrame.resample` where bin edges were not correct for :class:`~pandas.tseries.offsets.MonthBegin` (:issue:`55271`) - Fixed bug where PDEP-6 warning about setting an item of an incompatible dtype was being shown when creating a new conditional column (:issue:`55025`) -- Fixed regression in :meth:`DataFrame.join` where result has missing values and dtype is arrow backed string (:issue:`55348`) +- .. --------------------------------------------------------------------------- .. _whatsnew_212.bug_fixes: diff --git a/doc/source/whatsnew/v2.2.0.rst b/doc/source/whatsnew/v2.2.0.rst index fa3cef6d9457d..de529c9beaf8d 100644 --- a/doc/source/whatsnew/v2.2.0.rst +++ b/doc/source/whatsnew/v2.2.0.rst @@ -14,6 +14,27 @@ including other versions of pandas. Enhancements ~~~~~~~~~~~~ + +.. _whatsnew_220.enhancements.case_when: + +Create Series based on one or more conditions +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The :func:`case_when` function has been added to create a Series object based on one or more conditions. (:issue:`39154`) + +.. ipython:: python + + import pandas as pd + + df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6])) + df.assign( + new_column=pd.case_when( + df.a == 1, 'first', # condition, replacement + df.a.gt(1) & df.b.eq(5), 'second', + default='default', # optional + ) + ) + .. _whatsnew_220.enhancements.calamine: Calamine engine for :func:`read_excel` @@ -78,7 +99,6 @@ Other enhancements - :meth:`ExtensionArray._explode` interface method added to allow extension type implementations of the ``explode`` method (:issue:`54833`) - :meth:`ExtensionArray.duplicated` added to allow extension type implementations of the ``duplicated`` method (:issue:`55255`) - DataFrame.apply now allows the usage of numba (via ``engine="numba"``) to JIT compile the passed function, allowing for potential speedups (:issue:`54666`) -- Implement masked algorithms for :meth:`Series.value_counts` (:issue:`54984`) - .. --------------------------------------------------------------------------- @@ -220,7 +240,6 @@ Other Deprecations - Deprecated allowing non-keyword arguments in :meth:`DataFrame.to_parquet` except ``path``. (:issue:`54229`) - Deprecated allowing non-keyword arguments in :meth:`DataFrame.to_pickle` except ``path``. (:issue:`54229`) - Deprecated allowing non-keyword arguments in :meth:`DataFrame.to_string` except ``buf``. (:issue:`54229`) -- Deprecated allowing non-keyword arguments in :meth:`DataFrame.to_xml` except ``path_or_buffer``. (:issue:`54229`) - Deprecated automatic downcasting of object-dtype results in :meth:`Series.replace` and :meth:`DataFrame.replace`, explicitly call ``result = result.infer_objects(copy=False)`` instead. To opt in to the future version, use ``pd.set_option("future.no_silent_downcasting", True)`` (:issue:`54710`) - Deprecated downcasting behavior in :meth:`Series.where`, :meth:`DataFrame.where`, :meth:`Series.mask`, :meth:`DataFrame.mask`, :meth:`Series.clip`, :meth:`DataFrame.clip`; in a future version these will not infer object-dtype columns to non-object dtype, or all-round floats to integer dtype. 
Call ``result.infer_objects(copy=False)`` on the result for object inference, or explicitly cast floats to ints. To opt in to the future version, use ``pd.set_option("future.no_silent_downcasting", True)`` (:issue:`53656`) - Deprecated including the groups in computations when using :meth:`DataFrameGroupBy.apply` and :meth:`DataFrameGroupBy.resample`; pass ``include_groups=False`` to exclude the groups (:issue:`7155`) @@ -283,7 +302,7 @@ Numeric Conversion ^^^^^^^^^^ -- Bug in :meth:`Series.convert_dtypes` not converting all NA column to ``null[pyarrow]`` (:issue:`55346`) +- - Strings @@ -312,7 +331,7 @@ Missing MultiIndex ^^^^^^^^^^ -- Bug in :meth:`MultiIndex.get_indexer` not raising ``ValueError`` when ``method`` provided and index is non-monotonic (:issue:`53452`) +- - I/O diff --git a/pandas/__init__.py b/pandas/__init__.py index 41e34309232ee..18a462a3a08b6 100644 --- a/pandas/__init__.py +++ b/pandas/__init__.py @@ -73,6 +73,7 @@ notnull, # indexes Index, + case_when, CategoricalIndex, RangeIndex, MultiIndex, @@ -252,6 +253,7 @@ __all__ = [ "ArrowDtype", "BooleanDtype", + "case_when", "Categorical", "CategoricalDtype", "CategoricalIndex", diff --git a/pandas/_libs/hashtable.pyi b/pandas/_libs/hashtable.pyi index 0ac914e86f699..2bc6d74fe6aee 100644 --- a/pandas/_libs/hashtable.pyi +++ b/pandas/_libs/hashtable.pyi @@ -240,7 +240,7 @@ def value_count( values: np.ndarray, dropna: bool, mask: npt.NDArray[np.bool_] | None = ..., -) -> tuple[np.ndarray, npt.NDArray[np.int64], int]: ... # np.ndarray[same-as-values] +) -> tuple[np.ndarray, npt.NDArray[np.int64]]: ... # np.ndarray[same-as-values] # arr and values should have same dtype def ismember( diff --git a/pandas/_libs/hashtable_func_helper.pxi.in b/pandas/_libs/hashtable_func_helper.pxi.in index 19acd4acbdee7..b9cf6011481af 100644 --- a/pandas/_libs/hashtable_func_helper.pxi.in +++ b/pandas/_libs/hashtable_func_helper.pxi.in @@ -36,7 +36,7 @@ cdef value_count_{{dtype}}(ndarray[{{dtype}}] values, bint dropna, const uint8_t cdef value_count_{{dtype}}(const {{dtype}}_t[:] values, bint dropna, const uint8_t[:] mask=None): {{endif}} cdef: - Py_ssize_t i = 0, na_counter = 0, na_add = 0 + Py_ssize_t i = 0 Py_ssize_t n = len(values) kh_{{ttype}}_t *table @@ -49,6 +49,9 @@ cdef value_count_{{dtype}}(const {{dtype}}_t[:] values, bint dropna, const uint8 bint uses_mask = mask is not None bint isna_entry = False + if uses_mask and not dropna: + raise NotImplementedError("uses_mask not implemented with dropna=False") + # we track the order in which keys are first seen (GH39009), # khash-map isn't insertion-ordered, thus: # table maps keys to counts @@ -79,31 +82,25 @@ cdef value_count_{{dtype}}(const {{dtype}}_t[:] values, bint dropna, const uint8 for i in range(n): val = {{to_c_type}}(values[i]) - if uses_mask: - isna_entry = mask[i] - if dropna: - if not uses_mask: + if uses_mask: + isna_entry = mask[i] + else: isna_entry = is_nan_{{c_type}}(val) if not dropna or not isna_entry: - if uses_mask and isna_entry: - na_counter += 1 + k = kh_get_{{ttype}}(table, val) + if k != table.n_buckets: + table.vals[k] += 1 else: - k = kh_get_{{ttype}}(table, val) - if k != table.n_buckets: - table.vals[k] += 1 - else: - k = kh_put_{{ttype}}(table, val, &ret) - table.vals[k] = 1 - result_keys.append(val) + k = kh_put_{{ttype}}(table, val, &ret) + table.vals[k] = 1 + result_keys.append(val) {{endif}} # collect counts in the order corresponding to result_keys: - if na_counter > 0: - na_add = 1 cdef: - int64_t[::1] result_counts = np.empty(table.size + na_add, 
dtype=np.int64)
+        int64_t[::1] result_counts = np.empty(table.size, dtype=np.int64)

     for i in range(table.size):
         {{if dtype == 'object'}}
@@ -113,13 +110,9 @@ cdef value_count_{{dtype}}(const {{dtype}}_t[:] values, bint dropna, const uint8
         {{endif}}
         result_counts[i] = table.vals[k]

-    if na_counter > 0:
-        result_counts[table.size] = na_counter
-        result_keys.append(val)
-
     kh_destroy_{{ttype}}(table)

-    return result_keys.to_array(), result_counts.base, na_counter
+    return result_keys.to_array(), result_counts.base


@cython.wraparound(False)
@@ -406,10 +399,10 @@ def mode(ndarray[htfunc_t] values, bint dropna, const uint8_t[:] mask=None):
         ndarray[htfunc_t] modes
         int64_t[::1] counts
-        int64_t count, _, max_count = -1
+        int64_t count, max_count = -1
         Py_ssize_t nkeys, k, j = 0

-    keys, counts, _ = value_count(values, dropna, mask=mask)
+    keys, counts = value_count(values, dropna, mask=mask)
     nkeys = len(keys)

     modes = np.empty(nkeys, dtype=values.dtype)
diff --git a/pandas/_libs/tslib.pyx b/pandas/_libs/tslib.pyx
index 43252ffb5bf13..20a18cf56779f 100644
--- a/pandas/_libs/tslib.pyx
+++ b/pandas/_libs/tslib.pyx
@@ -455,18 +455,18 @@ cpdef array_to_datetime(
         set out_tzoffset_vals = set()
         tzinfo tz_out = None
         bint found_tz = False, found_naive = False
-        cnp.flatiter it = cnp.PyArray_IterNew(values)
+        cnp.broadcast mi

    # specify error conditions
    assert is_raise or is_ignore or is_coerce

    result = np.empty((<object>values).shape, dtype="M8[ns]")
+    mi = cnp.PyArray_MultiIterNew2(result, values)
    iresult = result.view("i8").ravel()

    for i in range(n):
        # Analogous to `val = values[i]`
-        val = cnp.PyArray_GETITEM(values, cnp.PyArray_ITER_DATA(it))
-        cnp.PyArray_ITER_NEXT(it)
+        val = <object>(<PyObject**>cnp.PyArray_MultiIter_DATA(mi, 1))[0]

        try:
            if checknull_with_nat_and_na(val):
@@ -511,6 +511,7 @@ cpdef array_to_datetime(
                    if parse_today_now(val, &iresult[i], utc):
                        # We can't _quite_ dispatch this to convert_str_to_tsobject
                        # bc there isn't a nice way to pass "utc"
+                        cnp.PyArray_MultiIter_NEXT(mi)
                        continue

                    _ts = convert_str_to_tsobject(
@@ -539,10 +540,13 @@ cpdef array_to_datetime(
            else:
                raise TypeError(f"{type(val)} is not convertible to datetime")

+            cnp.PyArray_MultiIter_NEXT(mi)
+
        except (TypeError, OverflowError, ValueError) as ex:
            ex.args = (f"{ex}, at position {i}",)
            if is_coerce:
                iresult[i] = NPY_NAT
+                cnp.PyArray_MultiIter_NEXT(mi)
                continue
            elif is_raise:
                raise
diff --git a/pandas/_libs/tslibs/offsets.pyx b/pandas/_libs/tslibs/offsets.pyx
index 8fdba8992f627..74398eb0e2405 100644
--- a/pandas/_libs/tslibs/offsets.pyx
+++ b/pandas/_libs/tslibs/offsets.pyx
@@ -4329,6 +4329,28 @@ cdef class CustomBusinessHour(BusinessHour):


cdef class _CustomBusinessMonth(BusinessMixin):
+    """
+    DateOffset subclass representing custom business month(s).
+
+    Increments between beginning/end of month dates.
+
+    Parameters
+    ----------
+    n : int, default 1
+        The number of months represented.
+    normalize : bool, default False
+        Normalize start/end dates to midnight before generating date range.
+    weekmask : str, Default 'Mon Tue Wed Thu Fri'
+        Weekmask of valid business days, passed to ``numpy.busdaycalendar``.
+    holidays : list
+        List/array of dates to exclude from the set of valid business days,
+        passed to ``numpy.busdaycalendar``.
+    calendar : np.busdaycalendar
+        Calendar to integrate.
+    offset : timedelta, default timedelta(0)
+        Time offset to apply.
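+
+    See Also
+    --------
+    :class:`~pandas.tseries.offsets.DateOffset` : Standard kind of date increment.
+
+    Examples
+    --------
+    In the examples below we use the default parameters.
+
+    >>> ts = pd.Timestamp(2022, 8, 5)
+    >>> ts + pd.offsets.CustomBusinessMonthEnd()
+    Timestamp('2022-08-31 00:00:00')
+    >>> ts + pd.offsets.CustomBusinessMonthBegin()
+    Timestamp('2022-09-01 00:00:00')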
+ """ + _attributes = tuple( ["n", "normalize", "weekmask", "holidays", "calendar", "offset"] ) @@ -4404,124 +4426,10 @@ cdef class _CustomBusinessMonth(BusinessMixin): cdef class CustomBusinessMonthEnd(_CustomBusinessMonth): - """ - DateOffset subclass representing custom business month(s). - - Increments between end of month dates. - - Parameters - ---------- - n : int, default 1 - The number of months represented. - normalize : bool, default False - Normalize end dates to midnight before generating date range. - weekmask : str, Default 'Mon Tue Wed Thu Fri' - Weekmask of valid business days, passed to ``numpy.busdaycalendar``. - holidays : list - List/array of dates to exclude from the set of valid business days, - passed to ``numpy.busdaycalendar``. - calendar : np.busdaycalendar - Calendar to integrate. - offset : timedelta, default timedelta(0) - Time offset to apply. - - See Also - -------- - :class:`~pandas.tseries.offsets.DateOffset` : Standard kind of date increment. - - Examples - -------- - In the example below we use the default parameters. - - >>> ts = pd.Timestamp(2022, 8, 5) - >>> ts + pd.offsets.CustomBusinessMonthEnd() - Timestamp('2022-08-31 00:00:00') - - Custom business month end can be specified by ``weekmask`` parameter. - To convert the returned datetime object to its string representation - the function strftime() is used in the next example. - - >>> import datetime as dt - >>> freq = pd.offsets.CustomBusinessMonthEnd(weekmask="Wed Thu") - >>> pd.date_range(dt.datetime(2022, 7, 10), dt.datetime(2022, 12, 18), - ... freq=freq).strftime('%a %d %b %Y %H:%M') - Index(['Thu 28 Jul 2022 00:00', 'Wed 31 Aug 2022 00:00', - 'Thu 29 Sep 2022 00:00', 'Thu 27 Oct 2022 00:00', - 'Wed 30 Nov 2022 00:00'], - dtype='object') - - Using NumPy business day calendar you can define custom holidays. - - >>> import datetime as dt - >>> bdc = np.busdaycalendar(holidays=['2022-08-01', '2022-09-30', - ... '2022-10-31', '2022-11-01']) - >>> freq = pd.offsets.CustomBusinessMonthEnd(calendar=bdc) - >>> pd.date_range(dt.datetime(2022, 7, 10), dt.datetime(2022, 11, 10), freq=freq) - DatetimeIndex(['2022-07-29', '2022-08-31', '2022-09-29', '2022-10-28'], - dtype='datetime64[ns]', freq='CBM') - """ - _prefix = "CBM" cdef class CustomBusinessMonthBegin(_CustomBusinessMonth): - """ - DateOffset subclass representing custom business month(s). - - Increments between beginning of month dates. - - Parameters - ---------- - n : int, default 1 - The number of months represented. - normalize : bool, default False - Normalize start dates to midnight before generating date range. - weekmask : str, Default 'Mon Tue Wed Thu Fri' - Weekmask of valid business days, passed to ``numpy.busdaycalendar``. - holidays : list - List/array of dates to exclude from the set of valid business days, - passed to ``numpy.busdaycalendar``. - calendar : np.busdaycalendar - Calendar to integrate. - offset : timedelta, default timedelta(0) - Time offset to apply. - - See Also - -------- - :class:`~pandas.tseries.offsets.DateOffset` : Standard kind of date increment. - - Examples - -------- - In the example below we use the default parameters. - - >>> ts = pd.Timestamp(2022, 8, 5) - >>> ts + pd.offsets.CustomBusinessMonthBegin() - Timestamp('2022-09-01 00:00:00') - - Custom business month start can be specified by ``weekmask`` parameter. - To convert the returned datetime object to its string representation - the function strftime() is used in the next example. 
- - >>> import datetime as dt - >>> freq = pd.offsets.CustomBusinessMonthBegin(weekmask="Wed Thu") - >>> pd.date_range(dt.datetime(2022, 7, 10), dt.datetime(2022, 12, 18), - ... freq=freq).strftime('%a %d %b %Y %H:%M') - Index(['Wed 03 Aug 2022 00:00', 'Thu 01 Sep 2022 00:00', - 'Wed 05 Oct 2022 00:00', 'Wed 02 Nov 2022 00:00', - 'Thu 01 Dec 2022 00:00'], - dtype='object') - - Using NumPy business day calendar you can define custom holidays. - - >>> import datetime as dt - >>> bdc = np.busdaycalendar(holidays=['2022-08-01', '2022-09-30', - ... '2022-10-31', '2022-11-01']) - >>> freq = pd.offsets.CustomBusinessMonthBegin(calendar=bdc) - >>> pd.date_range(dt.datetime(2022, 7, 10), dt.datetime(2022, 11, 10), freq=freq) - DatetimeIndex(['2022-08-02', '2022-09-01', '2022-10-03', '2022-11-02'], - dtype='datetime64[ns]', freq='CBMS') - """ - _prefix = "CBMS" diff --git a/pandas/_typing.py b/pandas/_typing.py index f18c67fcb0c90..0e2a0881f0122 100644 --- a/pandas/_typing.py +++ b/pandas/_typing.py @@ -4,7 +4,6 @@ Hashable, Iterator, Mapping, - MutableMapping, Sequence, ) from datetime import ( @@ -104,7 +103,6 @@ TypeGuard: Any = None HashableT = TypeVar("HashableT", bound=Hashable) -MutableMappingT = TypeVar("MutableMappingT", bound=MutableMapping) # array-like diff --git a/pandas/core/algorithms.py b/pandas/core/algorithms.py index 4ff3de2fc7b2b..8c14d8c030ee3 100644 --- a/pandas/core/algorithms.py +++ b/pandas/core/algorithms.py @@ -923,7 +923,7 @@ def value_counts_internal( else: values = _ensure_arraylike(values, func_name="value_counts") - keys, counts, _ = value_counts_arraylike(values, dropna) + keys, counts = value_counts_arraylike(values, dropna) if keys.dtype == np.float16: keys = keys.astype(np.float32) @@ -948,7 +948,7 @@ def value_counts_internal( # Called once from SparseArray, otherwise could be private def value_counts_arraylike( values: np.ndarray, dropna: bool, mask: npt.NDArray[np.bool_] | None = None -) -> tuple[ArrayLike, npt.NDArray[np.int64], int]: +) -> tuple[ArrayLike, npt.NDArray[np.int64]]: """ Parameters ---------- @@ -964,7 +964,7 @@ def value_counts_arraylike( original = values values = _ensure_data(values) - keys, counts, na_counter = htable.value_count(values, dropna, mask=mask) + keys, counts = htable.value_count(values, dropna, mask=mask) if needs_i8_conversion(original.dtype): # datetime, timedelta, or period @@ -974,7 +974,7 @@ def value_counts_arraylike( keys, counts = keys[mask], counts[mask] res_keys = _reconstruct_data(keys, original.dtype, original) - return res_keys, counts, na_counter + return res_keys, counts def duplicated( diff --git a/pandas/core/api.py b/pandas/core/api.py index 2cfe5ffc0170d..f13ccdd98586f 100644 --- a/pandas/core/api.py +++ b/pandas/core/api.py @@ -42,6 +42,7 @@ UInt64Dtype, ) from pandas.core.arrays.string_ import StringDtype +from pandas.core.case_when import case_when from pandas.core.construction import array from pandas.core.flags import Flags from pandas.core.groupby import ( @@ -85,6 +86,7 @@ "ArrowDtype", "bdate_range", "BooleanDtype", + "case_when", "Categorical", "CategoricalDtype", "CategoricalIndex", diff --git a/pandas/core/arrays/masked.py b/pandas/core/arrays/masked.py index 56d3711c7d13b..819a4370e5510 100644 --- a/pandas/core/arrays/masked.py +++ b/pandas/core/arrays/masked.py @@ -1052,22 +1052,28 @@ def value_counts(self, dropna: bool = True) -> Series: ) from pandas.arrays import IntegerArray - keys, value_counts, na_counter = algos.value_counts_arraylike( - self._data, dropna=dropna, mask=self._mask + keys, 
value_counts = algos.value_counts_arraylike( + self._data, dropna=True, mask=self._mask ) - mask_index = np.zeros((len(value_counts),), dtype=np.bool_) - mask = mask_index.copy() - if na_counter > 0: - mask_index[-1] = True + if dropna: + res = Series(value_counts, index=keys, name="count", copy=False) + res.index = res.index.astype(self.dtype) + res = res.astype("Int64") + return res - arr = IntegerArray(value_counts, mask) - index = Index( - self.dtype.construct_array_type()( - keys, mask_index # type: ignore[arg-type] - ) - ) - return Series(arr, index=index, name="count", copy=False) + # if we want nans, count the mask + counts = np.empty(len(value_counts) + 1, dtype="int64") + counts[:-1] = value_counts + counts[-1] = self._mask.sum() + + index = Index(keys, dtype=self.dtype).insert(len(keys), self.dtype.na_value) + index = index.astype(self.dtype) + + mask = np.zeros(len(counts), dtype="bool") + counts_array = IntegerArray(counts, mask) + + return Series(counts_array, index=index, name="count", copy=False) @doc(ExtensionArray.equals) def equals(self, other) -> bool: diff --git a/pandas/core/arrays/sparse/array.py b/pandas/core/arrays/sparse/array.py index cf349220e4ba7..608468da486b5 100644 --- a/pandas/core/arrays/sparse/array.py +++ b/pandas/core/arrays/sparse/array.py @@ -890,7 +890,7 @@ def value_counts(self, dropna: bool = True) -> Series: Series, ) - keys, counts, _ = algos.value_counts_arraylike(self.sp_values, dropna=dropna) + keys, counts = algos.value_counts_arraylike(self.sp_values, dropna=dropna) fcounts = self.sp_index.ngaps if fcounts > 0 and (not self._null_fill_value or not dropna): mask = isna(keys) if self._null_fill_value else keys == self.fill_value diff --git a/pandas/core/case_when.py b/pandas/core/case_when.py new file mode 100644 index 0000000000000..11e4a319f7b71 --- /dev/null +++ b/pandas/core/case_when.py @@ -0,0 +1,209 @@ +from __future__ import annotations + +from typing import TYPE_CHECKING + +import numpy as np + +from pandas.core.dtypes.cast import ( + construct_1d_arraylike_from_scalar, + find_common_type, +) +from pandas.core.dtypes.common import ( + is_bool_dtype, + is_scalar, +) +from pandas.core.dtypes.generic import ( + ABCDataFrame, + ABCSeries, +) +from pandas.core.dtypes.missing import na_value_for_dtype + +from pandas.core.common import convert_to_list_like +from pandas.core.construction import array as pd_array + +if TYPE_CHECKING: + from pandas._typing import ( + ListLike, + Scalar, + Series, + ) + + +def case_when( + *args: ListLike | Scalar, + default: Scalar | ListLike | None = None, + level: int | None = None, +) -> Series: + """ + Replace values where the conditions are True. + + Parameters + ---------- + *args : array-like, scalar + Variable argument of conditions and expected replacements. + Takes the form: `condition0`, `replacement0`, + `condition1`, `replacement1`, ... . + `condition` should be a 1-D boolean array. + The provided boolean conditions should have the same size. + `replacement` should be a 1-D array or a scalar. + If `replacement` is a 1-D array, it should have the same + shape as the paired `condition`. + When multiple boolean conditions are satisfied, + the first replacement is used. + + default : scalar, array-like, default None + If provided, it is the replacement value to use + if all conditions evaluate to False. + If default is a 1-D array, it should have the same shape as + `condition` and `replacement`. 
If not specified,
+        entries will be filled with the corresponding
+        NULL value (``np.nan`` for numpy dtypes, ``pd.NA`` for extension
+        dtypes).
+
+    level : int, default None
+        Alignment level if needed.
+
+        .. versionadded:: 2.2.0
+
+    Returns
+    -------
+    Series
+
+    See Also
+    --------
+    Series.mask : Replace values where the condition is True.
+
+    Examples
+    --------
+    >>> df = pd.DataFrame({
+    ...     "a": [0,0,1,2],
+    ...     "b": [0,3,4,5],
+    ...     "c": [6,7,8,9]
+    ... })
+    >>> df
+       a  b  c
+    0  0  0  6
+    1  0  3  7
+    2  1  4  8
+    3  2  5  9
+
+    >>> pd.case_when(df.a.gt(0), df.a,  # condition, replacement
+    ...              df.b.gt(0), df.b,
+    ...              default=df.c)      # optional
+    0    6
+    1    3
+    2    1
+    3    2
+    Name: c, dtype: int64
+    """
+    from pandas import Series
+
+    len_args = len(args)
+    if not len_args:
+        raise ValueError(
+            "Kindly provide at least one boolean condition, "
+            "with a corresponding replacement."
+        )
+    if len_args % 2:
+        raise ValueError(
+            "The number of boolean conditions should be equal "
+            "to the number of replacements. "
+            "However, the total number of conditions and replacements "
+            f"is {len(args)}, which is an odd number."
+        )
+
+    counter = len_args // 2 - 1
+    counter = range(counter, -1, -1)
+    conditions = []
+    for num, condition in zip(counter, args[-2::-2]):
+        if not hasattr(condition, "shape"):
+            condition = np.asanyarray(condition)
+        if condition.ndim > 1:
+            raise ValueError(f"condition{num} is not a one dimensional array.")
+        if not is_bool_dtype(condition):
+            raise TypeError(f"condition{num} is not a boolean array.")
+        conditions.append(condition)
+    bool_length = {condition.size for condition in conditions}
+    if len(bool_length) > 1:
+        raise ValueError("All boolean conditions should have the same length.")
+    bool_length = conditions[0].size
+    if default is not None:
+        if is_scalar(default):
+            default = construct_1d_arraylike_from_scalar(
+                default, length=bool_length, dtype=None
+            )
+        if not isinstance(default, ABCDataFrame):
+            default = convert_to_list_like(default)
+        if not hasattr(default, "shape"):
+            default = pd_array(default, copy=False)
+        if default.ndim > 1:
+            raise ValueError(
+                "The provided default argument should "
+                "either be a scalar or a 1-D array."
+            )
+        if default.size != conditions[0].size:
+            raise ValueError(
+                "The length of the default argument should "
+                "be the same as the length of any "
+                "of the boolean conditions."
+            )
+    replacements = []
+    # ideally we could skip these checks and let `Series.mask`
+    # handle it - however, this step is necessary to get
+    # a common dtype for multiple conditions and replacements.
+    for num, replacement in zip(counter, args[-1::-2]):
+        if is_scalar(replacement):
+            replacement = construct_1d_arraylike_from_scalar(
+                replacement, length=bool_length, dtype=None
+            )
+        if not isinstance(replacement, ABCDataFrame):
+            replacement = convert_to_list_like(replacement)
+        if not hasattr(replacement, "shape"):
+            replacement = pd_array(replacement, copy=False)
+        if replacement.ndim > 1:
+            raise ValueError(f"replacement{num} should be a 1-D array.")
+        if replacement.size != bool_length:
+            raise ValueError(
+                f"The size of replacement{num} array "
+                f"does not match the size of condition{num} array."
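+                # adjacent f-strings concatenate implicitly; the trailing
+                # space after "array " keeps the rendered message readable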
+ ) + replacements.append(replacement) + common_dtype = [arr.dtype for arr in replacements] + if default is not None: + common_dtype.append(default.dtype) + if len(set(common_dtype)) > 1: + common_dtype = find_common_type(common_dtype) + replacements = [ + arr.astype(common_dtype, copy=False) + if isinstance(arr, ABCSeries) + else pd_array(arr, dtype=common_dtype, copy=False) + for arr in replacements + ] + if (default is not None) and isinstance(default, ABCSeries): + default = default.astype(common_dtype, copy=False) + elif default is not None: + default = pd_array(default, dtype=common_dtype, copy=False) + else: + common_dtype = common_dtype[0] + + if default is None: + default = construct_1d_arraylike_from_scalar( + na_value_for_dtype(common_dtype, compat=False), + length=bool_length, + dtype=common_dtype, + ) + default = Series(default) + elif not isinstance(default, ABCSeries): + default = Series(default) + + for position, condition, replacement in zip(counter, conditions, replacements): + try: + default = default.mask( + condition, other=replacement, axis=0, inplace=False, level=level + ) + except Exception as error: + raise ValueError( + f"condition{position} and replacement{position} failed to evaluate. " + f"Original error message: {error}" + ) from error + return default diff --git a/pandas/core/dtypes/cast.py b/pandas/core/dtypes/cast.py index 3208a742738a3..74e785be06356 100644 --- a/pandas/core/dtypes/cast.py +++ b/pandas/core/dtypes/cast.py @@ -1133,16 +1133,7 @@ def convert_dtypes( base_dtype = np.dtype(str) else: base_dtype = inferred_dtype - if ( - base_dtype.kind == "O" # type: ignore[union-attr] - and len(input_array) > 0 - and isna(input_array).all() - ): - import pyarrow as pa - - pa_type = pa.null() - else: - pa_type = to_pyarrow_type(base_dtype) + pa_type = to_pyarrow_type(base_dtype) if pa_type is not None: inferred_dtype = ArrowDtype(pa_type) elif dtype_backend == "numpy_nullable" and isinstance(inferred_dtype, ArrowDtype): diff --git a/pandas/core/frame.py b/pandas/core/frame.py index 51fceb1f09a62..432c0a745c7a0 100644 --- a/pandas/core/frame.py +++ b/pandas/core/frame.py @@ -230,7 +230,6 @@ Level, MergeHow, MergeValidate, - MutableMappingT, NaAction, NaPosition, NsmallestNlargestKeep, @@ -1928,27 +1927,6 @@ def _create_data_for_split_and_tight_to_dict( def to_dict( self, orient: Literal["dict", "list", "series", "split", "tight", "index"] = ..., - *, - into: type[MutableMappingT] | MutableMappingT, - index: bool = ..., - ) -> MutableMappingT: - ... - - @overload - def to_dict( - self, - orient: Literal["records"], - *, - into: type[MutableMappingT] | MutableMappingT, - index: bool = ..., - ) -> list[MutableMappingT]: - ... - - @overload - def to_dict( - self, - orient: Literal["dict", "list", "series", "split", "tight", "index"] = ..., - *, into: type[dict] = ..., index: bool = ..., ) -> dict: @@ -1958,14 +1936,11 @@ def to_dict( def to_dict( self, orient: Literal["records"], - *, into: type[dict] = ..., index: bool = ..., ) -> list[dict]: ... 
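A minimal usage sketch of the simplified ``to_dict`` signature retained above (``df`` and its values here are illustrative, not taken from this diff):

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

df.to_dict("dict")     # {'a': {0: 1, 1: 2}, 'b': {0: 3.0, 1: 4.0}}
df.to_dict("records")  # [{'a': 1, 'b': 3.0}, {'a': 2, 'b': 4.0}]

# per the docstring, `into` may also be an initialized mapping instance
df.to_dict("dict", into=defaultdict(list))
```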
- # error: Incompatible default for argument "into" (default has type "type - # [dict[Any, Any]]", argument has type "type[MutableMappingT] | MutableMappingT") @deprecate_nonkeyword_arguments( version="3.0", allowed_args=["self", "orient"], name="to_dict" ) @@ -1974,10 +1949,9 @@ def to_dict( orient: Literal[ "dict", "list", "series", "split", "tight", "records", "index" ] = "dict", - into: type[MutableMappingT] - | MutableMappingT = dict, # type: ignore[assignment] + into: type[dict] = dict, index: bool = True, - ) -> MutableMappingT | list[MutableMappingT]: + ) -> dict | list[dict]: """ Convert the DataFrame to a dictionary. @@ -2005,7 +1979,7 @@ def to_dict( 'tight' as an allowed value for the ``orient`` argument into : class, default dict - The collections.abc.MutableMapping subclass used for all Mappings + The collections.abc.Mapping subclass used for all Mappings in the return value. Can be the actual class or an empty instance of the mapping type you want. If you want a collections.defaultdict, you must pass it initialized. @@ -2019,10 +1993,9 @@ def to_dict( Returns ------- - dict, list or collections.abc.MutableMapping - Return a collections.abc.MutableMapping object representing the - DataFrame. The resulting transformation depends on the `orient` - parameter. + dict, list or collections.abc.Mapping + Return a collections.abc.Mapping object representing the DataFrame. + The resulting transformation depends on the `orient` parameter. See Also -------- @@ -2081,7 +2054,7 @@ def to_dict( """ from pandas.core.methods.to_dict import to_dict - return to_dict(self, orient, into=into, index=index) + return to_dict(self, orient, into, index) @deprecate_nonkeyword_arguments( version="3.0", allowed_args=["self", "destination_table"], name="to_gbq" @@ -3299,55 +3272,6 @@ def to_html( render_links=render_links, ) - @overload - def to_xml( - self, - path_or_buffer: None = ..., - *, - index: bool = ..., - root_name: str | None = ..., - row_name: str | None = ..., - na_rep: str | None = ..., - attr_cols: list[str] | None = ..., - elem_cols: list[str] | None = ..., - namespaces: dict[str | None, str] | None = ..., - prefix: str | None = ..., - encoding: str = ..., - xml_declaration: bool | None = ..., - pretty_print: bool | None = ..., - parser: XMLParsers | None = ..., - stylesheet: FilePath | ReadBuffer[str] | ReadBuffer[bytes] | None = ..., - compression: CompressionOptions = ..., - storage_options: StorageOptions | None = ..., - ) -> str: - ... - - @overload - def to_xml( - self, - path_or_buffer: FilePath | WriteBuffer[bytes] | WriteBuffer[str], - *, - index: bool = ..., - root_name: str | None = ..., - row_name: str | None = ..., - na_rep: str | None = ..., - attr_cols: list[str] | None = ..., - elem_cols: list[str] | None = ..., - namespaces: dict[str | None, str] | None = ..., - prefix: str | None = ..., - encoding: str = ..., - xml_declaration: bool | None = ..., - pretty_print: bool | None = ..., - parser: XMLParsers | None = ..., - stylesheet: FilePath | ReadBuffer[str] | ReadBuffer[bytes] | None = ..., - compression: CompressionOptions = ..., - storage_options: StorageOptions | None = ..., - ) -> None: - ... 
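The overloads removed here encoded a contract that still holds at runtime: ``to_xml()`` with no target returns the XML as ``str``, while passing a path or buffer writes it out and returns ``None``. A minimal sketch (``parser="etree"`` avoids the optional lxml dependency; the file name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"shape": ["square", "circle"], "sides": [4, 0]})

xml_text = df.to_xml(parser="etree")     # no path_or_buffer -> returns str
df.to_xml("shapes.xml", parser="etree")  # path given -> writes file, returns None
```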
- - @deprecate_nonkeyword_arguments( - version="3.0", allowed_args=["self", "path_or_buffer"], name="to_xml" - ) @doc( storage_options=_shared_docs["storage_options"], compression_options=_shared_docs["compression_options"] % "path_or_buffer", diff --git a/pandas/core/indexes/base.py b/pandas/core/indexes/base.py index 9017ff121976b..e23887159c9c6 100644 --- a/pandas/core/indexes/base.py +++ b/pandas/core/indexes/base.py @@ -4011,8 +4011,8 @@ def _get_fill_indexer( self, target: Index, method: str_t, limit: int | None = None, tolerance=None ) -> npt.NDArray[np.intp]: if self._is_multi: - if not (self.is_monotonic_increasing or self.is_monotonic_decreasing): - raise ValueError("index must be monotonic increasing or decreasing") + # TODO: get_indexer_with_fill docstring says values must be _sorted_ + # but that doesn't appear to be enforced # error: "IndexEngine" has no attribute "get_indexer_with_fill" engine = self._engine with warnings.catch_warnings(): diff --git a/pandas/core/methods/to_dict.py b/pandas/core/methods/to_dict.py index 3295c4741c03d..f4e0dcddcd34a 100644 --- a/pandas/core/methods/to_dict.py +++ b/pandas/core/methods/to_dict.py @@ -3,7 +3,6 @@ from typing import ( TYPE_CHECKING, Literal, - overload, ) import warnings @@ -17,66 +16,17 @@ from pandas.core import common as com if TYPE_CHECKING: - from pandas._typing import MutableMappingT - from pandas import DataFrame -@overload -def to_dict( - df: DataFrame, - orient: Literal["dict", "list", "series", "split", "tight", "index"] = ..., - *, - into: type[MutableMappingT] | MutableMappingT, - index: bool = ..., -) -> MutableMappingT: - ... - - -@overload -def to_dict( - df: DataFrame, - orient: Literal["records"], - *, - into: type[MutableMappingT] | MutableMappingT, - index: bool = ..., -) -> list[MutableMappingT]: - ... - - -@overload -def to_dict( - df: DataFrame, - orient: Literal["dict", "list", "series", "split", "tight", "index"] = ..., - *, - into: type[dict] = ..., - index: bool = ..., -) -> dict: - ... - - -@overload -def to_dict( - df: DataFrame, - orient: Literal["records"], - *, - into: type[dict] = ..., - index: bool = ..., -) -> list[dict]: - ... - - -# error: Incompatible default for argument "into" (default has type "type[dict -# [Any, Any]]", argument has type "type[MutableMappingT] | MutableMappingT") def to_dict( df: DataFrame, orient: Literal[ "dict", "list", "series", "split", "tight", "records", "index" ] = "dict", - *, - into: type[MutableMappingT] | MutableMappingT = dict, # type: ignore[assignment] + into: type[dict] = dict, index: bool = True, -) -> MutableMappingT | list[MutableMappingT]: +) -> dict | list[dict]: """ Convert the DataFrame to a dictionary. @@ -104,7 +54,7 @@ def to_dict( 'tight' as an allowed value for the ``orient`` argument into : class, default dict - The collections.abc.MutableMapping subclass used for all Mappings + The collections.abc.Mapping subclass used for all Mappings in the return value. Can be the actual class or an empty instance of the mapping type you want. If you want a collections.defaultdict, you must pass it initialized. @@ -119,8 +69,8 @@ def to_dict( Returns ------- dict, list or collections.abc.Mapping - Return a collections.abc.MutableMapping object representing the - DataFrame. The resulting transformation depends on the `orient` parameter. + Return a collections.abc.Mapping object representing the DataFrame. + The resulting transformation depends on the `orient` parameter. 
""" if not df.columns.is_unique: warnings.warn( @@ -153,7 +103,7 @@ def to_dict( are_all_object_dtype_cols = len(box_native_indices) == len(df.dtypes) if orient == "dict": - return into_c((k, v.to_dict(into=into)) for k, v in df.items()) + return into_c((k, v.to_dict(into)) for k, v in df.items()) elif orient == "list": object_dtype_indices_as_set: set[int] = set(box_native_indices) diff --git a/pandas/core/resample.py b/pandas/core/resample.py index 7d37a9f1d5113..e9b2bacd9e1df 100644 --- a/pandas/core/resample.py +++ b/pandas/core/resample.py @@ -986,7 +986,7 @@ def interpolate( downcast : optional, 'infer' or None, defaults to None Downcast dtypes if possible. - .. deprecated:: 2.1.0 + .. deprecated::2.1.0 ``**kwargs`` : optional Keyword arguments to pass on to the interpolating function. diff --git a/pandas/core/reshape/merge.py b/pandas/core/reshape/merge.py index ba6579a739f54..4b9fcc80af4bb 100644 --- a/pandas/core/reshape/merge.py +++ b/pandas/core/reshape/merge.py @@ -2443,8 +2443,6 @@ def _factorize_keys( .astype(np.intp, copy=False), len(dc.dictionary), ) - if dc.null_count > 0: - count += 1 if how == "right": return rlab, llab, count return llab, rlab, count diff --git a/pandas/core/series.py b/pandas/core/series.py index d5785a2171cb3..2228c07c19f5e 100644 --- a/pandas/core/series.py +++ b/pandas/core/series.py @@ -48,7 +48,6 @@ from pandas.util._decorators import ( Appender, Substitution, - deprecate_nonkeyword_arguments, doc, ) from pandas.util._exceptions import find_stack_level @@ -100,6 +99,7 @@ from pandas.core.arrays.arrow import StructAccessor from pandas.core.arrays.categorical import CategoricalAccessor from pandas.core.arrays.sparse import SparseAccessor +from pandas.core.case_when import case_when from pandas.core.construction import ( extract_array, sanitize_array, @@ -168,7 +168,7 @@ IndexKeyFunc, IndexLabel, Level, - MutableMappingT, + ListLike, NaPosition, NumpySorter, NumpyValueArrayLike, @@ -1924,40 +1924,21 @@ def keys(self) -> Index: """ return self.index - @overload - def to_dict( - self, *, into: type[MutableMappingT] | MutableMappingT - ) -> MutableMappingT: - ... - - @overload - def to_dict(self, *, into: type[dict] = ...) -> dict: - ... - - # error: Incompatible default for argument "into" (default has type "type[ - # dict[Any, Any]]", argument has type "type[MutableMappingT] | MutableMappingT") - @deprecate_nonkeyword_arguments( - version="3.0", allowed_args=["self"], name="to_dict" - ) - def to_dict( - self, - into: type[MutableMappingT] - | MutableMappingT = dict, # type: ignore[assignment] - ) -> MutableMappingT: + def to_dict(self, into: type[dict] = dict) -> dict: """ Convert Series to {label -> value} dict or dict-like object. Parameters ---------- into : class, default dict - The collections.abc.MutableMapping subclass to use as the return - object. Can be the actual class or an empty instance of the mapping - type you want. If you want a collections.defaultdict, you must - pass it initialized. + The collections.abc.Mapping subclass to use as the return + object. Can be the actual class or an empty + instance of the mapping type you want. If you want a + collections.defaultdict, you must pass it initialized. Returns ------- - collections.abc.MutableMapping + collections.abc.Mapping Key-value representation of Series. 
        Examples
        --------
        >>> s = pd.Series([1, 2, 3, 4])
        >>> s.to_dict()
        {0: 1, 1: 2, 2: 3, 3: 4}
        >>> from collections import OrderedDict, defaultdict
-        >>> s.to_dict(into=OrderedDict)
+        >>> s.to_dict(OrderedDict)
        OrderedDict([(0, 1), (1, 2), (2, 3), (3, 4)])
        >>> dd = defaultdict(list)
-        >>> s.to_dict(into=dd)
+        >>> s.to_dict(dd)
        defaultdict(<class 'list'>, {0: 1, 1: 2, 2: 3, 3: 4})
        """
        # GH16122
@@ -5462,6 +5443,70 @@ def between(

        return lmask & rmask

+    def case_when(
+        self,
+        *args: ListLike | Callable | Scalar,
+        level: int | None = None,
+    ) -> Series:
+        """
+        Replace values where the conditions are True.
+
+        Parameters
+        ----------
+        *args : array-like, scalar
+            Variable argument of conditions and expected replacements.
+            Takes the form: `condition0`, `replacement0`,
+            `condition1`, `replacement1`, ... .
+            `condition` should be a 1-D boolean array-like object
+            or a callable. If `condition` is a callable,
+            it is computed on the Series
+            and should return a boolean Series or array.
+            The callable must not change the input Series
+            (though pandas doesn't check it). `replacement` should be a
+            1-D array-like object, a scalar or a callable.
+            If `replacement` is a callable, it is computed on the Series
+            and should return a scalar or Series. The callable
+            must not change the input Series
+            (though pandas doesn't check it).
+
+        level : int, default None
+            Alignment level if needed.
+
+            .. versionadded:: 2.2.0
+
+        Returns
+        -------
+        Series
+
+        See Also
+        --------
+        Series.mask : Replace values where the condition is True.
+
+        Examples
+        --------
+        >>> df = pd.DataFrame({
+        ...     "a": [0,0,1,2],
+        ...     "b": [0,3,4,5],
+        ...     "c": [6,7,8,9]
+        ... })
+        >>> df
+           a  b  c
+        0  0  0  6
+        1  0  3  7
+        2  1  4  8
+        3  2  5  9
+
+        >>> df.c.case_when(df.a.gt(0), df.a,  # condition, replacement
+        ...                df.b.gt(0), df.b)
+        0    6
+        1    3
+        2    1
+        3    2
+        Name: c, dtype: int64
+        """
+        args = [com.apply_if_callable(arg, self) for arg in args]
+        return case_when(*args, default=self, level=level)
+
    # ----------------------------------------------------------------------
    # Convert to types that support pd.NA
diff --git a/pandas/io/xml.py b/pandas/io/xml.py
index bd3b515dbca2f..918fe4d22ea62 100644
--- a/pandas/io/xml.py
+++ b/pandas/io/xml.py
@@ -88,7 +88,7 @@ class _XMLFrameParser:
        Parse only the attributes at the specified ``xpath``.

    names : list
-        Column names for :class:`~pandas.DataFrame` of parsed XML data.
+        Column names for :class:`~pandas.DataFrame` of parsed XML data.

    dtype : dict
        Data type for data or columns. E.g.
{{'a': np.float64, diff --git a/pandas/tests/api/test_api.py b/pandas/tests/api/test_api.py index 60bcb97aaa364..ae4fe5d56ebf6 100644 --- a/pandas/tests/api/test_api.py +++ b/pandas/tests/api/test_api.py @@ -106,6 +106,7 @@ class TestPDApi(Base): funcs = [ "array", "bdate_range", + "case_when", "concat", "crosstab", "cut", diff --git a/pandas/tests/frame/methods/test_join.py b/pandas/tests/frame/methods/test_join.py index 3d21faf8b1729..2d4ac1d4a4444 100644 --- a/pandas/tests/frame/methods/test_join.py +++ b/pandas/tests/frame/methods/test_join.py @@ -158,14 +158,9 @@ def test_join_invalid_validate(left_no_dup, right_no_dup): left_no_dup.merge(right_no_dup, on="a", validate="invalid") -@pytest.mark.parametrize("dtype", ["object", "string[pyarrow]"]) -def test_join_on_single_col_dup_on_right(left_no_dup, right_w_dups, dtype): +def test_join_on_single_col_dup_on_right(left_no_dup, right_w_dups): # GH 46622 # Dups on right allowed by one_to_many constraint - if dtype == "string[pyarrow]": - pytest.importorskip("pyarrow") - left_no_dup = left_no_dup.astype(dtype) - right_w_dups.index = right_w_dups.index.astype(dtype) left_no_dup.join( right_w_dups, on="a", diff --git a/pandas/tests/indexes/multi/test_indexing.py b/pandas/tests/indexes/multi/test_indexing.py index d86692477f381..78b2c493ec116 100644 --- a/pandas/tests/indexes/multi/test_indexing.py +++ b/pandas/tests/indexes/multi/test_indexing.py @@ -342,19 +342,6 @@ def test_get_indexer_methods(self): expected = np.array([4, 6, 7], dtype=pad_indexer.dtype) tm.assert_almost_equal(expected, pad_indexer) - @pytest.mark.parametrize("method", ["pad", "ffill", "backfill", "bfill", "nearest"]) - def test_get_indexer_methods_raise_for_non_monotonic(self, method): - # 53452 - mi = MultiIndex.from_arrays([[0, 4, 2], [0, 4, 2]]) - if method == "nearest": - err = NotImplementedError - msg = "not implemented yet for MultiIndex" - else: - err = ValueError - msg = "index must be monotonic increasing or decreasing" - with pytest.raises(err, match=msg): - mi.get_indexer([(1, 1)], method=method) - def test_get_indexer_three_or_more_levels(self): # https://github.com/pandas-dev/pandas/issues/29896 # tests get_indexer() on MultiIndexes with 3+ levels diff --git a/pandas/tests/libs/test_hashtable.py b/pandas/tests/libs/test_hashtable.py index 2c8f4c4149528..b78e6426ca17f 100644 --- a/pandas/tests/libs/test_hashtable.py +++ b/pandas/tests/libs/test_hashtable.py @@ -586,26 +586,15 @@ def test_value_count(self, dtype, writable): expected = (np.arange(N) + N).astype(dtype) values = np.repeat(expected, 5) values.flags.writeable = writable - keys, counts, _ = ht.value_count(values, False) + keys, counts = ht.value_count(values, False) tm.assert_numpy_array_equal(np.sort(keys), expected) assert np.all(counts == 5) - def test_value_count_mask(self, dtype): - if dtype == np.object_: - pytest.skip("mask not implemented for object dtype") - values = np.array([1] * 5, dtype=dtype) - mask = np.zeros((5,), dtype=np.bool_) - mask[1] = True - mask[4] = True - keys, counts, na_counter = ht.value_count(values, False, mask=mask) - assert len(keys) == 2 - assert na_counter == 2 - def test_value_count_stable(self, dtype, writable): # GH12679 values = np.array([2, 1, 5, 22, 3, -1, 8]).astype(dtype) values.flags.writeable = writable - keys, counts, _ = ht.value_count(values, False) + keys, counts = ht.value_count(values, False) tm.assert_numpy_array_equal(keys, values) assert np.all(counts == 1) @@ -696,9 +685,9 @@ def test_unique_label_indices(): class 
TestHelpFunctionsWithNans: def test_value_count(self, dtype): values = np.array([np.nan, np.nan, np.nan], dtype=dtype) - keys, counts, _ = ht.value_count(values, True) + keys, counts = ht.value_count(values, True) assert len(keys) == 0 - keys, counts, _ = ht.value_count(values, False) + keys, counts = ht.value_count(values, False) assert len(keys) == 1 and np.all(np.isnan(keys)) assert counts[0] == 3 diff --git a/pandas/tests/series/methods/test_convert_dtypes.py b/pandas/tests/series/methods/test_convert_dtypes.py index f621604faae4b..d1c79d0f00365 100644 --- a/pandas/tests/series/methods/test_convert_dtypes.py +++ b/pandas/tests/series/methods/test_convert_dtypes.py @@ -265,11 +265,3 @@ def test_convert_dtypes_pyarrow_to_np_nullable(self): result = ser.convert_dtypes(dtype_backend="numpy_nullable") expected = pd.Series(range(2), dtype="Int32") tm.assert_series_equal(result, expected) - - def test_convert_dtypes_pyarrow_null(self): - # GH#55346 - pa = pytest.importorskip("pyarrow") - ser = pd.Series([None, None]) - result = ser.convert_dtypes(dtype_backend="pyarrow") - expected = pd.Series([None, None], dtype=pd.ArrowDtype(pa.null())) - tm.assert_series_equal(result, expected) diff --git a/pandas/tests/series/methods/test_to_dict.py b/pandas/tests/series/methods/test_to_dict.py index 41c01f4537f23..4c3d9592eebe3 100644 --- a/pandas/tests/series/methods/test_to_dict.py +++ b/pandas/tests/series/methods/test_to_dict.py @@ -13,12 +13,12 @@ class TestSeriesToDict: ) def test_to_dict(self, mapping, datetime_series): # GH#16122 - result = Series(datetime_series.to_dict(into=mapping), name="ts") + result = Series(datetime_series.to_dict(mapping), name="ts") expected = datetime_series.copy() expected.index = expected.index._with_freq(None) tm.assert_series_equal(result, expected) - from_method = Series(datetime_series.to_dict(into=collections.Counter)) + from_method = Series(datetime_series.to_dict(collections.Counter)) from_constructor = Series(collections.Counter(datetime_series.items())) tm.assert_series_equal(from_method, from_constructor) diff --git a/pandas/tests/series/methods/test_value_counts.py b/pandas/tests/series/methods/test_value_counts.py index bde9902fec6e9..f54489ac8a8b4 100644 --- a/pandas/tests/series/methods/test_value_counts.py +++ b/pandas/tests/series/methods/test_value_counts.py @@ -250,22 +250,3 @@ def test_value_counts_complex_numbers(self, input_array, expected): # GH 17927 result = Series(input_array).value_counts() tm.assert_series_equal(result, expected) - - def test_value_counts_masked(self): - # GH#54984 - dtype = "Int64" - ser = Series([1, 2, None, 2, None, 3], dtype=dtype) - result = ser.value_counts(dropna=False) - expected = Series( - [2, 2, 1, 1], - index=Index([2, None, 1, 3], dtype=dtype), - dtype=dtype, - name="count", - ) - tm.assert_series_equal(result, expected) - - result = ser.value_counts(dropna=True) - expected = Series( - [2, 1, 1], index=Index([2, 1, 3], dtype=dtype), dtype=dtype, name="count" - ) - tm.assert_series_equal(result, expected) diff --git a/pandas/tests/test_case_when.py b/pandas/tests/test_case_when.py new file mode 100644 index 0000000000000..743a0012bf305 --- /dev/null +++ b/pandas/tests/test_case_when.py @@ -0,0 +1,155 @@ +import numpy as np +import pytest + +from pandas import ( + DataFrame, + Series, + case_when, +) +import pandas._testing as tm + + +@pytest.fixture +def df(): + """ + base dataframe for testing + """ + return DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) + + +# use fixture and parametrize +def 
test_case_when_no_args():
+    """
+    Raise ValueError if no arguments are provided.
+    """
+    msg = "Kindly provide at least one boolean condition, "
+    msg += "with a corresponding replacement."
+    with pytest.raises(ValueError, match=msg):
+        case_when()
+
+
+def test_case_when_odd_args(df):
+    """
+    Raise ValueError if the number of args is odd.
+    """
+    msg = "The number of boolean conditions should be equal "
+    msg += "to the number of replacements. "
+    msg += "However, the total number of conditions and replacements "
+    msg += "is 3, which is an odd number."
+    with pytest.raises(ValueError, match=msg):
+        case_when(df["a"].eq(1), 1, df.a.gt(1))
+
+
+def test_case_when_boolean_ndim(df):
+    """
+    Raise ValueError if boolean array ndim is greater than 1.
+    """
+    with pytest.raises(ValueError, match="condition0 is not a one dimensional array."):
+        case_when(df, 2)
+
+
+def test_case_when_not_boolean(df):
+    """
+    Raise TypeError if condition is not a boolean array.
+    """
+    with pytest.raises(TypeError, match="condition1 is not a boolean array."):
+        case_when(df["a"].eq(1), 1, df["a"], 2)
+
+
+def test_case_when_multiple_bool_lengths(df):
+    """
+    Raise ValueError if the boolean conditions do not have the same length.
+    """
+    with pytest.raises(
+        ValueError, match="All boolean conditions should have the same length."
+    ):
+        case_when(df["a"].eq(1), 1, [True, False], 2)
+
+
+def test_case_when_default_ndim(df):
+    """
+    Raise ValueError if the default is not a 1D array.
+    """
+    msg = "The provided default argument should "
+    msg += "either be a scalar or a 1-D array."
+    with pytest.raises(ValueError, match=msg):
+        case_when(df["a"].eq(1), 1, default=df)
+
+
+def test_case_when_default_length(df):
+    """
+    Raise ValueError if the default is not
+    the same length as any of the boolean conditions.
+    """
+    msg = "The length of the default argument should "
+    msg += "be the same as the length of any "
+    msg += "of the boolean conditions."
+    with pytest.raises(ValueError, match=msg):
+        case_when(df["a"].eq(1), 1, default=[2])
+
+
+def test_case_when_replacement_ndim(df):
+    """
+    Raise ValueError if the ndim of the replacement value is greater than 1.
+    """
+    with pytest.raises(ValueError, match="replacement0 should be a 1-D array."):
+        case_when(df["a"].eq(1), df)
+
+
+def test_case_when_replacement_length(df):
+    """
+    Raise ValueError if the replacement size is not the same as the boolean array.
+    """
+    msg = "The size of replacement0 array "
+    msg += "does not match the size of condition0 array."
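+    # `match` is interpreted as a regular expression (re.search); the literal
+    # dots in `msg` are regex wildcards, which still match this fixed message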
+ with pytest.raises(ValueError, match=msg): + case_when(df["a"].eq(1), np.array([2])) + + +def test_case_when_multiple_conditions(df): + """ + Test output when booleans are derived from a computation + """ + result = case_when(df.a.eq(1), 1, Series([False, True, False]), 2) + expected = Series([1, 2, np.nan]) + tm.assert_series_equal(result, expected) + + +def test_case_when_multiple_conditions_replacement_list(df): + """ + Test output when replacement is a list + """ + result = case_when( + [True, False, False], 1, df["a"].gt(1) & df["b"].eq(5), [1, 2, 3] + ) + expected = Series([1, 2, np.nan], dtype="Int64") + tm.assert_series_equal(result, expected) + + +def test_case_when_multiple_conditions_replacement_series(df): + """ + Test output when replacement is a Series + """ + result = case_when( + np.array([True, False, False]), + 1, + df["a"].gt(1) & df["b"].eq(5), + Series([1, 2, 3]), + ) + expected = Series([1, 2, np.nan]) + tm.assert_series_equal(result, expected) + + +def test_case_when_multiple_conditions_default_is_not_none(df): + """ + Test output when default is not None + """ + result = case_when( + [True, False, False], + 1, + df["a"].gt(1) & df["b"].eq(5), + Series([1, 2, 3]), + default=-1, + ) + expected = Series([1, 2, -1]) + tm.assert_series_equal(result, expected) diff --git a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md b/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md deleted file mode 100644 index 4fe4b935f144b..0000000000000 --- a/web/pandas/pdeps/0012-compact-and-reversible-JSON-interface.md +++ /dev/null @@ -1,437 +0,0 @@ -# PDEP-12: Compact and reversible JSON interface - -- Created: 16 June 2023 -- Status: Rejected -- Discussion: - [#53252](https://github.com/pandas-dev/pandas/issues/53252) - [#55038](https://github.com/pandas-dev/pandas/issues/55038) -- Author: [Philippe THOMY](https://github.com/loco-philippe) -- Revision: 3 - - -#### Summary -- [Abstract](./0012-compact-and-reversible-JSON-interface.md/#Abstract) - - [Problem description](./0012-compact-and-reversible-JSON-interface.md/#Problem-description) - - [Feature Description](./0012-compact-and-reversible-JSON-interface.md/#Feature-Description) -- [Scope](./0012-compact-and-reversible-JSON-interface.md/#Scope) -- [Motivation](./0012-compact-and-reversible-JSON-interface.md/#Motivation) - - [Why is it important to have a compact and reversible JSON interface ?](./0012-compact-and-reversible-JSON-interface.md/#Why-is-it-important-to-have-a-compact-and-reversible-JSON-interface-?) - - [Is it relevant to take an extended type into account ?](./0012-compact-and-reversible-JSON-interface.md/#Is-it-relevant-to-take-an-extended-type-into-account-?) - - [Is this only useful for pandas ?](./0012-compact-and-reversible-JSON-interface.md/#Is-this-only-useful-for-pandas-?) 
-- [Description](./0012-compact-and-reversible-JSON-interface.md/#Description) - - [Data typing](./0012-compact-and-reversible-JSON-interface.md/#Data-typing) - - [Correspondence between TableSchema and pandas](./panda0012-compact-and-reversible-JSON-interfaces_PDEP.md/#Correspondence-between-TableSchema-and-pandas) - - [JSON format](./0012-compact-and-reversible-JSON-interface.md/#JSON-format) - - [Conversion](./0012-compact-and-reversible-JSON-interface.md/#Conversion) -- [Usage and impact](./0012-compact-and-reversible-JSON-interface.md/#Usage-and-impact) - - [Usage](./0012-compact-and-reversible-JSON-interface.md/#Usage) - - [Compatibility](./0012-compact-and-reversible-JSON-interface.md/#Compatibility) - - [Impacts on the pandas framework](./0012-compact-and-reversible-JSON-interface.md/#Impacts-on-the-pandas-framework) - - [Risk to do / risk not to do](./0012-compact-and-reversible-JSON-interface.md/#Risk-to-do-/-risk-not-to-do) -- [Implementation](./0012-compact-and-reversible-JSON-interface.md/#Implementation) - - [Modules](./0012-compact-and-reversible-JSON-interface.md/#Modules) - - [Implementation options](./0012-compact-and-reversible-JSON-interface.md/#Implementation-options) -- [F.A.Q.](./0012-compact-and-reversible-JSON-interface.md/#F.A.Q.) -- [Synthesis](./0012-compact-and-reversible-JSON-interface.md/Synthesis) -- [Core team decision](./0012-compact-and-reversible-JSON-interface.md/#Core-team-decision) -- [Timeline](./0012-compact-and-reversible-JSON-interface.md/#Timeline) -- [PDEP history](./0012-compact-and-reversible-JSON-interface.md/#PDEP-history) -------------------------- -## Abstract - -### Problem description -The `dtype` and "Python type" are not explicitly taken into account in the current JSON interface. - -So, the JSON interface is not always reversible and has inconsistencies related to the consideration of the `dtype`. - -Another consequence is the partial application of the Table Schema specification in the `orient="table"` option (6 Table Schema data types are taken into account out of the 24 defined). - -Some JSON-interface problems are detailed in the [linked NoteBook](https://nbviewer.org/github/loco-philippe/ntv-pandas/blob/main/example/example_json_pandas.ipynb#Current-Json-interface) - - -### Feature Description -To have a simple, compact and reversible solution, I propose to use the [JSON-NTV format (Named and Typed Value)](https://github.com/loco-philippe/NTV#readme) - which integrates the notion of type - and its JSON-TAB variation for tabular data (the JSON-NTV format is defined in an [IETF Internet-Draft](https://datatracker.ietf.org/doc/draft-thomy-json-ntv/) (not yet an RFC !!) ). - -This solution allows to include a large number of types (not necessarily pandas `dtype`) which allows to have: -- a Table Schema JSON interface (`orient="table"`) which respects the Table Schema specification (going from 6 types to 20 types), -- a global JSON interface for all pandas data formats. - -#### Global JSON interface example -In the example below, a DataFrame with several data types is converted to JSON. - -The DataFrame resulting from this JSON is identical to the initial DataFrame (reversibility). - -With the existing JSON interface, this conversion is not possible. - -This example uses `ntv_pandas` module defined in the [ntv-pandas repository](https://github.com/loco-philippe/ntv-pandas#readme). 
-
-*data example*
-
-```python
-In [1]: from shapely.geometry import Point
-        from datetime import date
-        from pprint import pprint
-        import pandas as pd
-        import ntv_pandas as npd
-
-In [2]: data = {'index':        [100, 200, 300, 400, 500, 600],
-                'dates::date':  [date(1964,1,1), date(1985,2,5), date(2022,1,21), date(1964,1,1), date(1985,2,5), date(2022,1,21)],
-                'value':        [10, 10, 20, 20, 30, 30],
-                'value32':      pd.Series([12, 12, 22, 22, 32, 32], dtype='int32'),
-                'res':          [10, 20, 30, 10, 20, 30],
-                'coord::point': [Point(1,2), Point(3,4), Point(5,6), Point(7,8), Point(3,4), Point(5,6)],
-                'names':        pd.Series(['john', 'eric', 'judith', 'mila', 'hector', 'maria'], dtype='string'),
-                'unique':       True}
-
-In [3]: df = pd.DataFrame(data).set_index('index')
-
-In [4]: df
-Out[4]:       dates::date  value  value32  res coord::point   names  unique
-        index
-        100    1964-01-01     10       12   10  POINT (1 2)    john    True
-        200    1985-02-05     10       12   20  POINT (3 4)    eric    True
-        300    2022-01-21     20       22   30  POINT (5 6)  judith    True
-        400    1964-01-01     20       22   10  POINT (7 8)    mila    True
-        500    1985-02-05     30       32   20  POINT (3 4)  hector    True
-        600    2022-01-21     30       32   30  POINT (5 6)   maria    True
-```
-
-*JSON representation*
-
-```python
-In [5]: df_to_json = npd.to_json(df)
-        pprint(df_to_json, width=120)
-Out[5]: {':tab': {'coord::point': [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [3.0, 4.0], [5.0, 6.0]],
-                  'dates::date': ['1964-01-01', '1985-02-05', '2022-01-21', '1964-01-01', '1985-02-05', '2022-01-21'],
-                  'index': [100, 200, 300, 400, 500, 600],
-                  'names::string': ['john', 'eric', 'judith', 'mila', 'hector', 'maria'],
-                  'res': [10, 20, 30, 10, 20, 30],
-                  'unique': [True, True, True, True, True, True],
-                  'value': [10, 10, 20, 20, 30, 30],
-                  'value32::int32': [12, 12, 22, 22, 32, 32]}}
-```
-
-*Reversibility*
-
-```python
-In [6]: df_from_json = npd.read_json(df_to_json)
-        print('df created from JSON is equal to initial df ? ', df_from_json.equals(df))
-Out[6]: df created from JSON is equal to initial df ?  True
-```
-
-Several other examples are provided in the [linked Notebook](https://nbviewer.org/github/loco-philippe/ntv-pandas/blob/main/example/example_ntv_pandas.ipynb).
-
-#### Table Schema JSON interface example
-In the example below, a DataFrame with several Table Schema data types is converted to JSON.
-
-The DataFrame resulting from this JSON is identical to the initial DataFrame (reversibility).
-
-With the existing Table Schema JSON interface, this conversion is not possible.
-
-```python
-In [1]: from shapely.geometry import Point
-        from datetime import date
-        from pprint import pprint
-        import pandas as pd
-        import ntv_pandas as npd
-
-In [2]: df = pd.DataFrame({
-            'end february::date': ['date(2023,2,28)', 'date(2024,2,29)', 'date(2025,2,28)'],
-            'coordinates::point': ['Point([2.3, 48.9])', 'Point([5.4, 43.3])', 'Point([4.9, 45.8])'],
-            'contact::email':     ['john.doe@table.com', 'lisa.minelli@schema.com', 'walter.white@breaking.com']
-        })
-
-In [3]: df
-Out[3]:   end february::date coordinates::point             contact::email
-        0         2023-02-28   POINT (2.3 48.9)         john.doe@table.com
-        1         2024-02-29   POINT (5.4 43.3)    lisa.minelli@schema.com
-        2         2025-02-28   POINT (4.9 45.8)  walter.white@breaking.com
-```
-
-*JSON representation*
-
-```python
-In [4]: df_to_table = npd.to_json(df, table=True)
-        pprint(df_to_table, width=140, sort_dicts=False)
-Out[4]: {'schema': {'fields': [{'name': 'index', 'type': 'integer'},
-                               {'name': 'end february', 'type': 'date'},
-                               {'name': 'coordinates', 'type': 'geopoint', 'format': 'array'},
-                               {'name': 'contact', 'type': 'string', 'format': 'email'}],
-                    'primaryKey': ['index'],
-                    'pandas_version': '1.4.0'},
-         'data': [{'index': 0, 'end february': '2023-02-28', 'coordinates': [2.3, 48.9], 'contact': 'john.doe@table.com'},
-                  {'index': 1, 'end february': '2024-02-29', 'coordinates': [5.4, 43.3], 'contact': 'lisa.minelli@schema.com'},
-                  {'index': 2, 'end february': '2025-02-28', 'coordinates': [4.9, 45.8], 'contact': 'walter.white@breaking.com'}]}
-```
-
-*Reversibility*
-
-```python
-In [5]: df_from_table = npd.read_json(df_to_table)
-        print('df created from JSON is equal to initial df ? ', df_from_table.equals(df))
-Out[5]: df created from JSON is equal to initial df ?  True
-```
-
-Several other examples are provided in the [linked Notebook](https://nbviewer.org/github/loco-philippe/ntv-pandas/blob/main/example/example_table_pandas.ipynb).
-
-## Scope
-The objective is to make the proposed JSON interface available for any type of data, either through the `orient="table"` option or through a new `orient="ntv"` option.
-
-The proposed interface is compatible with existing data.
-
-## Motivation
-
-### Why extend the `orient=table` option to other data types?
-- The Table Schema specification defines 24 data types, of which only 6 are taken into account in the pandas interface.
-
-### Why is it important to have a compact and reversible JSON interface?
-- a reversible interface provides an exchange format,
-- a textual exchange format facilitates exchanges between platforms (e.g. OpenData),
-- a JSON exchange format can be used at the API level.
-
-### Is it relevant to take an extended type into account?
-- it avoids the addition of an extra data schema,
-- it increases the semantic scope of the data processed by pandas,
-- it addresses several issues (e.g. #12997, #14358, #16492, #35420, #35464, #36211, #39537, #49585, #50782, #51375, #52595, #53252),
-- the use of a complementary type avoids having to modify the pandas data model.
-
-### Is this only useful for pandas?
-- the JSON-TAB format is applicable to tabular and multi-dimensional data,
-- this JSON interface can therefore be used for any application using tabular or multi-dimensional data. It would, for example, allow reversible data exchanges between pandas `DataFrame` and Xarray `DataArray` (Xarray issue under construction); [see the example DataFrame / DataArray](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#Multidimensional-data).
-
-## Description
-
-The proposed solution is based on several key points:
-- data typing
-- correspondence between TableSchema and pandas
-- JSON format for tabular data
-- conversion to and from the JSON format
-
-### Data typing
-Data types are defined and managed in the NTV project (name, JSON encoder and decoder).
-
-Pandas `dtype` are compatible with NTV types:
-
-| **pandas dtype**  | **NTV type** |
-|-------------------|--------------|
-| intxx             | intxx        |
-| uintxx            | uintxx       |
-| floatxx           | floatxx      |
-| datetime[ns]      | datetime     |
-| datetime[ns, tz]  | datetimetz   |
-| timedelta[ns]     | durationiso  |
-| string            | string       |
-| boolean           | boolean      |
-
-Note:
-- datetime with timezone is a single NTV type (ISO 8601 string),
-- `CategoricalDtype` and `SparseDtype` are included in the tabular JSON format,
-- the `object` `dtype` depends on the context (see below),
-- `PeriodDtype` and `IntervalDtype` are still to be defined.
-
-JSON types (implicit or explicit) are converted to `dtype` following the pandas JSON interface:
-
-| **JSON type** | **pandas dtype** |
-|---------------|------------------|
-| number        | int64 / float64  |
-| string        | string / object  |
-| array         | object           |
-| object        | object           |
-| true, false   | boolean          |
-| null          | NaT / NaN / None |
-
-Note:
-- if an NTV type is defined, the `dtype` is adjusted accordingly,
-- the handling of null data still needs to be clarified.
-
-The other NTV types are associated with the `object` `dtype`.
-
-### Correspondence between TableSchema and pandas
-The TableSchema typing is carried by two attributes, `format` and `type`.
-
-The table below shows the correspondence between the TableSchema format / type and the pandas NTV type / dtype:
-
-| **format / type**   | **NTV type / dtype** |
-|---------------------|----------------------|
-| default / datetime  |  / datetime64[ns]    |
-| default / number    |  / float64           |
-| default / integer   |  / int64             |
-| default / boolean   |  / bool              |
-| default / string    |  / object            |
-| default / duration  |  / timedelta64[ns]   |
-| email / string      | email / string       |
-| uri / string        | uri / string         |
-| default / object    | object / object      |
-| default / array     | array / object       |
-| default / date      | date / object        |
-| default / time      | time / object        |
-| default / year      | year / int64         |
-| default / yearmonth | month / int64        |
-| array / geopoint    | point / object       |
-| default / geojson   | geojson / object     |
-
-Note:
-- other TableSchema formats are defined and remain to be studied (uuid, binary, topojson, specific formats for geopoint and datation),
-- the first six lines correspond to the existing interface.
-
-### JSON format
-The JSON format for the TableSchema interface is the existing one.
-
-The JSON format for the Global interface is defined in the [JSON-TAB](https://github.com/loco-philippe/NTV/blob/main/documentation/JSON-TAB-standard.pdf) specification.
-It includes the naming rules originally defined in the [JSON-ND project](https://github.com/glenkleidon/JSON-ND) and support for categorical data.
-The specification still has to be updated to include sparse data.
-
-### Conversion
-When data is associated with a non-`object` `dtype`, pandas conversion methods are used.
-Otherwise, the NTV conversion is used, as sketched below.
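
As a rough illustration of this dispatch rule, here is a minimal sketch; the helper name `choose_conversion` is hypothetical and not part of the proposal (the detailed cases follow):

```python
import pandas as pd


def choose_conversion(column: pd.Series, ntv_type: str | None) -> str:
    """Return which conversion path applies to a column."""
    # Non-object dtypes are handled by the existing pandas methods
    # (to_json / read_json); the NTV conversion is only needed when
    # an NTV type is carried by an object-dtype column.
    if ntv_type is None or column.dtype != object:
        return "pandas"
    return "ntv"


print(choose_conversion(pd.Series([1, 2, 3]), None))      # pandas
print(choose_conversion(pd.Series(["a", "b"]), "email"))  # ntv
```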
-
-#### pandas -> JSON
-- `NTV type` is not defined: use `to_json()`
-- `NTV type` is defined and `dtype` is not `object`: use `to_json()`
-- `NTV type` is defined and `dtype` is `object`: use the NTV conversion (if a pandas conversion does not exist)
-
-#### JSON -> pandas
-- `NTV type` is compatible with a `dtype`: use `read_json()`
-- `NTV type` is not compatible with a `dtype`: use the NTV conversion (if a pandas conversion does not exist)
-
-## Usage and impact
-
-### Usage
-It seems to me that this proposal responds to important issues:
-- having an efficient text format for data exchange
-
-    The alternative CSV format is not reversible and is obsolete (last revised in 2005). Current CSV tools do not comply with the standard.
-
-- taking "semantic" data into account in pandas objects
-
-- having a complete Table Schema interface
-
-### Compatibility
-The interface can be used without NTV types (compatibility with existing data - [see examples](https://nbviewer.org/github/loco-philippe/ntv-pandas/blob/main/example/example_ntv_pandas.ipynb#Appendix-:-Series-tests)).
-
-If the interface is made available through a new `orient` option in the JSON interface, the use of this feature is decoupled from the other features.
-
-### Impacts on the pandas framework
-Initially, the impacts are very limited:
-- modification of the `name` of `Series` or `DataFrame` columns (no functional impact; a short illustration of this naming convention follows the implementation options below),
-- addition of an option in the JSON interface (e.g. `orient='ntv'`) and of the associated methods (no functional interference with the other methods).
-
-In later stages, several developments could be considered:
-- validation of the `name` of `Series` or `DataFrame` columns,
-- management of the NTV type as a "complementary-object-dtype",
-- functional extensions depending on the NTV type.
-
-### Risk to do / risk not to do
-The JSON-NTV format and the JSON-TAB format are not (yet) recognized and widely used formats. The risk for pandas is that this feature goes unused (no functional impact).
-
-On the other hand, early adoption by pandas would allow the format to take the expectations and needs of pandas into account, and would prompt a reflection on the evolution of the types supported by pandas.
-
-## Implementation
-
-### Modules
-Two modules are defined for NTV:
-
-- json-ntv
-
-    This module manages NTV data without dependencies on other modules.
-
-- ntvconnector
-
-    These modules manage the conversion between objects and JSON data; they depend on the corresponding object modules (e.g. the connectors for shapely locations depend on shapely).
-
-The pandas integration of the JSON interface requires importing only the json-ntv module.
-
-### Implementation options
-The interface can be implemented as NTV connectors (`SeriesConnector` and `DataFrameConnector`) and as a new `orient` option of the pandas JSON interface.
-
-Several pandas implementations are possible:
-
-1. External:
-
-   In this implementation, the interface is available only on the NTV side.
-   This option means that this evolution of the JSON interface is not useful or strategic for pandas.
-
-2. NTV side:
-
-   In this implementation, the interface is available on both sides and the conversion is located inside NTV.
-   This option is the one that minimizes the impacts on the pandas side.
-
-3. pandas side:
-
-   In this implementation, the interface is available on both sides and the conversion is located inside pandas.
-   This option allows pandas to keep control of this evolution.
-
-4. pandas restricted:
-
-   In this implementation, the pandas interface and the conversion are located inside pandas, and only for non-`object` `dtype`.
-   This option makes it possible to offer a compact and reversible interface while prohibiting the introduction of types incompatible with the existing `dtype`.
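
As referenced in the Impacts section above, here is a tiny runnable sketch of the `name::type` column-naming convention, using plain pandas with labels borrowed from the earlier examples (no NTV dependency is assumed):

```python
import pandas as pd

# An NTV-typed column carries its type in the label ("name::type");
# for plain pandas this is just an ordinary string column name.
df = pd.DataFrame({"dates::date": ["1964-01-01"], "value32::int32": [12]})

for label in df.columns:
    name, _, ntv_type = label.partition("::")
    print(name, "->", ntv_type or "no NTV type")
# dates -> date
# value32 -> int32
```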
-
-## F.A.Q.
-
-**Q: Does `orient="table"` not do what you are proposing already?**
-
-**A**: In principle, yes, this option takes the notion of type into account.
-
-But it is very limited (see the examples added in the [Notebook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb)):
-- **Types and JSON interface**
-  - the only way to keep the types in the JSON interface is to use the `orient='table'` option,
-  - a few dtypes are not allowed in the json-table interface: period, timedelta64, interval,
-  - allowed types are not always kept in the json-table interface,
-  - data with `object` dtype is kept only if the data is string,
-  - with categorical dtype, the underlying dtype is not included in the JSON interface.
-- **Data compactness**
-  - the json-table interface is not compact: in the example in the [Notebook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#data-compactness), the size is three to four times that of the compact format.
-- **Reversibility**
-  - the interface is reversible only for a few dtypes: int64, float64, bool, string, datetime64 and, partially, categorical.
-- **External types**
-  - the interface does not accept external types,
-  - Table Schema defines 20 data types but the `orient="table"` interface takes only 5 data types into account (see [table](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#Converting-table-schema-type-to-pandas-dtype)),
-  - to integrate external types, it is necessary to first create ExtensionArray and ExtensionDtype objects.
-
-The current interface is not compatible with the data structure defined by Table Schema. For this to be possible, it is necessary to integrate a "type extension" like the one proposed (this has moreover been partially achieved with the notion of `extDtype` found in the interface for several formats).
-
-**Q: In general, we should only have 1 `"table"` format for pandas in read_json/to_json. There is also the issue of backwards compatibility if we do change the format. The fact that the table interface is buggy is not a reason to add a new interface (I'd rather fix those bugs). Can the existing format be adapted in a way that fixes the type issues/issues with roundtripping?**
-
-**A**: I will add two additional remarks:
-- the types defined in Table Schema are only partially taken into account (examples of types not taken into account in the interface: string-uri, array, date, time, year, geopoint, string-email),
-- the `read_json()` interface also works with data such as `{'simple': [1,2,3]}` (contrary to what is indicated in the documentation), but it is impossible to recreate this simple JSON with `to_json()`.
-
-I think the problem cannot be limited to bug fixes: a clear strategy must be defined for the JSON interface, in particular given the gradual abandonment, in open-data solutions, of the obsolete CSV format in favor of a JSON format.
-
-As stated, the proposed solution addresses several shortcomings of the current interface and could fit simply into the pandas environment, regardless of the `orient='table'` option (the other option would be to consider that the JSON interface is a peripheral function of pandas and can remain external to pandas).
-
-It is nevertheless possible to merge the proposed format and the `orient='table'` format in order to have explicit management of the notion of `extDtype`.
-
-**Q: As far as I can tell, JSON NTV is not in any form a standardised JSON format. I believe that pandas (and geopandas, which is where I came from to this issue) should try to follow either de facto or de jure standards and not opt in to a file format that does not have any community support at this moment. This can obviously change in the future and that is where this PR should be revised. Why would pandas use this standard?**
-
-**A**: As indicated in the issue (and detailed in [the attached Notebook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb)), the JSON interface is not reversible (`to_json` followed by `read_json` does not always return the initial object) and several shortcomings and bugs are present. The main cause of this problem is that the data type is not taken into account in the JSON format (or only very partially with the `orient='table'` option).
-
-The proposal answers this problem ([the example at the beginning of the Notebook](https://nbviewer.org/github/loco-philippe/NTV/blob/main/example/example_pandas.ipynb#0---Simple-example) illustrates the interest of the proposal simply and clearly).
-
-Regarding the underlying JSON-NTV format, its impact is quite low for tabular data (it is limited to adding the type to the field name).
-Nevertheless, the question is relevant: the JSON-NTV format ([IETF Internet-Draft](https://datatracker.ietf.org/doc/draft-thomy-json-ntv/)) is a shared, documented, supported and implemented format; admittedly, community support is limited for the moment, but it is ready to grow.
-
-## Synthesis
-To conclude:
-- if it is important (or strategic) to have a reversible JSON interface for any type of data, the proposal can be accepted,
-- if not, a third-party package listed in the [ecosystem](https://pandas.pydata.org/community/ecosystem.html) that reads/writes this format to/from pandas DataFrames should be considered.
-
-## Core team decision
-The vote was open from September 11 to September 26:
-- The final tally is 0 approvals, 5 abstentions and 7 disapprovals. The quorum has been met. The PDEP fails.
-
-**Disapprove comments**:
-- 1 Given the newness of the proposed JSON NTV format, I would support (as described in the PDEP): "if not, a third-party package listed in the ecosystem that reads/writes this format to/from pandas DataFrames should be considered".
-- 2 Same reason as comment 1: this should be a third-party package for now.
-- 3 Not mature enough, and not clear what the market size would be.
-- 4 For the same reason I left in the PDEP: "I think this (JSON-NTV format) does not meet the bar of being a commonly used format for implementation within pandas".
-- 5 Agree with comment 4.
-- 6 Agree with the other core-dev responders. I think work on the existing JSON interface is extremely valuable. A number of the original issues raised are just bug fixes / extensions of already existing functionality. Trying to start anew is likely not worth the migration effort. That said, if a format is well supported in the community, we can reconsider in the future (obviously JSON is well supported, but the actual specification detailed here is too new / not accepted as a standard).
-- 7 While I do think having a more comprehensive JSON format would be worthwhile, making a new format part of pandas means an implicit endorsement of a standard that is still being reviewed by the broader community.
-
-**Decision**:
-- add the `ntv-pandas` package to the [ecosystem](https://pandas.pydata.org/community/ecosystem.html),
-- revisit this PDEP at a later stage, for example in six months to a year (based on the evolution of the Internet-Draft [JSON semantic format (JSON-NTV)](https://www.ietf.org/archive/id/draft-thomy-json-ntv-01.html) and on the usage of [ntv-pandas](https://github.com/loco-philippe/ntv-pandas#readme)).
-
-## Timeline
-Not applicable
-
-## PDEP History
-
-- 16 June 2023: Initial draft
-- 22 July 2023: Add F.A.Q.
-- 06 September 2023: Add Table Schema extension
-- 01 October 2023: Add Core team decision