diff --git a/doc/source/whatsnew/v2.2.0.rst b/doc/source/whatsnew/v2.2.0.rst index dce776755ad7e..15b6dd93ebb15 100644 --- a/doc/source/whatsnew/v2.2.0.rst +++ b/doc/source/whatsnew/v2.2.0.rst @@ -105,10 +105,41 @@ For a full list of ADBC drivers and their development status, see the `ADBC Driv Implementation Status `_ documentation. +.. _whatsnew_220.enhancements.to_numpy_ea: + +ExtensionArray.to_numpy converts to suitable NumPy dtype +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +:meth:`ExtensionArray.to_numpy` will now convert to a suitable NumPy dtype instead +of ``object`` dtype for nullable extension dtypes. + +*Old behavior:* + +.. code-block:: ipython + + In [1]: ser = pd.Series([1, 2, 3], dtype="Int64") + In [2]: ser.to_numpy() + Out[2]: array([1, 2, 3], dtype=object) + +*New behavior:* + +.. ipython:: python + + ser = pd.Series([1, 2, 3], dtype="Int64") + ser.to_numpy() + +The default NumPy dtype (without any arguments) is determined as follows: + +- float dtypes are cast to NumPy floats +- integer dtypes without missing values are cast to NumPy integer dtypes +- integer dtypes with missing values are cast to NumPy float dtypes and ``NaN`` is used as missing value indicator +- boolean dtypes without missing values are cast to NumPy bool dtype +- boolean dtypes with missing values keep object dtype + .. _whatsnew_220.enhancements.struct_accessor: -Series.struct accessor to with PyArrow structured data -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Series.struct accessor for PyArrow structured data +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The ``Series.struct`` accessor provides attributes and methods for processing data with ``struct[pyarrow]`` dtype Series. For example, @@ -161,7 +192,7 @@ a Series. (:issue:`55323`) .. 
_whatsnew_220.enhancements.calamine: Calamine engine for :func:`read_excel` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The ``calamine`` engine was added to :func:`read_excel`. It uses ``python-calamine``, which provides Python bindings for the Rust library `calamine `__. @@ -185,9 +216,11 @@ For more, see :ref:`io.calamine` in the user guide on IO tools. Other enhancements ^^^^^^^^^^^^^^^^^^ -- :meth:`to_sql` with method parameter set to ``multi`` works with Oracle on the backend +- :meth:`~DataFrame.to_sql` with method parameter set to ``multi`` works with Oracle on the backend - :attr:`Series.attrs` / :attr:`DataFrame.attrs` now uses a deepcopy for propagating ``attrs`` (:issue:`54134`). +- :func:`get_dummies` now returns extension dtypes ``boolean`` or ``bool[pyarrow]`` that are compatible with the input dtype (:issue:`56273`) - :func:`read_csv` now supports ``on_bad_lines`` parameter with ``engine="pyarrow"``. (:issue:`54480`) +- :func:`read_sas` returns ``datetime64`` dtypes with resolutions better matching those stored natively in SAS, and avoids returning object-dtype in cases that cannot be stored with ``datetime64[ns]`` dtype (:issue:`56127`) - :func:`read_spss` now returns a :class:`DataFrame` that stores the metadata in :attr:`DataFrame.attrs`. 
(:issue:`54264`) - :func:`tseries.api.guess_datetime_format` is now part of the public API (:issue:`54727`) - :meth:`ExtensionArray._explode` interface method added to allow extension type implementations of the ``explode`` method (:issue:`54833`) @@ -195,6 +228,8 @@ Other enhancements - Allow passing ``read_only``, ``data_only`` and ``keep_links`` arguments to openpyxl using ``engine_kwargs`` of :func:`read_excel` (:issue:`55027`) - DataFrame.apply now allows the usage of numba (via ``engine="numba"``) to JIT compile the passed function, allowing for potential speedups (:issue:`54666`) - Implement masked algorithms for :meth:`Series.value_counts` (:issue:`54984`) +- Implemented :meth:`Series.str.extract` for :class:`ArrowDtype` (:issue:`56268`) +- Improved error message that appears in :meth:`DatetimeIndex.to_period` with frequencies which are not supported as period frequencies, such as "BMS" (:issue:`56243`) - Improved error message when constructing :class:`Period` with invalid offsets such as "QS" (:issue:`55785`) .. --------------------------------------------------------------------------- @@ -212,7 +247,7 @@ These are bug fixes that might have notable behavior changes. In previous versions of pandas, :func:`merge` and :meth:`DataFrame.join` did not always return a result that followed the documented sort behavior. pandas now -follows the documented sort behavior in merge and join operations (:issue:`54611`). +follows the documented sort behavior in merge and join operations (:issue:`54611`, :issue:`56426`, :issue:`56443`). As documented, ``sort=True`` sorts the join keys lexicographically in the resulting :class:`DataFrame`. 
With ``sort=False``, the order of the join keys depends on the @@ -324,19 +359,30 @@ Other API changes Deprecations ~~~~~~~~~~~~ -Deprecate aliases ``M``, ``SM``, ``BM``, ``CBM``, ``Q``, ``BQ``, ``Y``, and ``BY`` in favour of ``ME``, ``SME``, ``BME``, ``CBME``, ``QE``, ``BQE``, ``YE``, and ``BYE`` for offsets -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Deprecate aliases ``M``, ``Q``, ``Y``, etc. in favour of ``ME``, ``QE``, ``YE``, etc. for offsets +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Deprecated the following frequency aliases (:issue:`9586`): -- ``M`` (month end) has been renamed ``ME`` for offsets -- ``SM`` (semi month end) has been renamed ``SME`` for offsets -- ``BM`` (business month end) has been renamed ``BME`` for offsets -- ``CBM`` (custom business month end) has been renamed ``CBME`` for offsets -- ``Q`` (quarter end) has been renamed ``QE`` for offsets -- ``BQ`` (business quarter end) has been renamed ``BQE`` for offsets -- ``Y`` (year end) has been renamed ``YE`` for offsets -- ``BY`` (business year end) has been renamed ``BYE`` for offsets ++-------------------------------+------------------+------------------+ +|offsets |deprecated aliases|new aliases | ++===============================+==================+==================+ +|:class:`MonthEnd` | ``M`` | ``ME`` | ++-------------------------------+------------------+------------------+ +|:class:`BusinessMonthEnd` | ``BM`` | ``BME`` | ++-------------------------------+------------------+------------------+ +|:class:`SemiMonthEnd` | ``SM`` | ``SME`` | ++-------------------------------+------------------+------------------+ +|:class:`CustomBusinessMonthEnd`| ``CBM`` | ``CBME`` | ++-------------------------------+------------------+------------------+ +|:class:`QuarterEnd` | ``Q`` | ``QE`` | 
++-------------------------------+------------------+------------------+ +|:class:`BQuarterEnd` | ``BQ`` | ``BQE`` | ++-------------------------------+------------------+------------------+ +|:class:`YearEnd` | ``Y`` | ``YE`` | ++-------------------------------+------------------+------------------+ +|:class:`BYearEnd` | ``BY`` | ``BYE`` | ++-------------------------------+------------------+------------------+ For example: @@ -355,17 +401,46 @@ For example: pd.date_range('2020-01-01', periods=3, freq='QE-NOV') +Deprecated automatic downcasting +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Deprecated the automatic downcasting of object dtype results in a number of +methods. These would silently change the dtype in a hard to predict manner since the +behavior was value dependent. Additionally, pandas is moving away from silent dtype +changes (:issue:`54710`, :issue:`54261`). + +These methods are: + +- :meth:`Series.replace` and :meth:`DataFrame.replace` +- :meth:`DataFrame.fillna`, :meth:`Series.fillna` +- :meth:`DataFrame.ffill`, :meth:`Series.ffill` +- :meth:`DataFrame.bfill`, :meth:`Series.bfill` + +Explicitly call :meth:`DataFrame.infer_objects` to replicate the current behavior in the future. + +.. code-block:: ipython + + result = result.infer_objects(copy=False) + +Set the following option to opt into the future behavior: + +.. 
code-block:: ipython + + In [9]: pd.set_option("future.no_silent_downcasting", True) + Other Deprecations ^^^^^^^^^^^^^^^^^^ - Changed :meth:`Timedelta.resolution_string` to return ``h``, ``min``, ``s``, ``ms``, ``us``, and ``ns`` instead of ``H``, ``T``, ``S``, ``L``, ``U``, and ``N``, for compatibility with respective deprecations in frequency aliases (:issue:`52536`) - Deprecated :func:`pandas.api.types.is_interval` and :func:`pandas.api.types.is_period`, use ``isinstance(obj, pd.Interval)`` and ``isinstance(obj, pd.Period)`` instead (:issue:`55264`) +- Deprecated :func:`pd.core.internals.api.make_block`, use public APIs instead (:issue:`40226`) - Deprecated :func:`read_gbq` and :meth:`DataFrame.to_gbq`. Use ``pandas_gbq.read_gbq`` and ``pandas_gbq.to_gbq`` instead https://pandas-gbq.readthedocs.io/en/latest/api.html (:issue:`55525`) - Deprecated :meth:`.DataFrameGroupBy.fillna` and :meth:`.SeriesGroupBy.fillna`; use :meth:`.DataFrameGroupBy.ffill`, :meth:`.DataFrameGroupBy.bfill` for forward and backward filling or :meth:`.DataFrame.fillna` to fill with a single value (or the Series equivalents) (:issue:`55718`) - Deprecated :meth:`Index.format`, use ``index.astype(str)`` or ``index.map(formatter)`` instead (:issue:`55413`) - Deprecated :meth:`Series.ravel`, the underlying array is already 1D, so ravel is not necessary (:issue:`52511`) -- Deprecated :meth:`Series.view`, use ``astype`` instead to change the dtype (:issue:`20251`) +- Deprecated :meth:`Series.view`, use :meth:`Series.astype` instead to change the dtype (:issue:`20251`) - Deprecated ``core.internals`` members ``Block``, ``ExtensionBlock``, and ``DatetimeTZBlock``, use public APIs instead (:issue:`55139`) - Deprecated ``year``, ``month``, ``quarter``, ``day``, ``hour``, ``minute``, and ``second`` keywords in the :class:`PeriodIndex` constructor, use :meth:`PeriodIndex.from_fields` instead (:issue:`55960`) +- Deprecated accepting a type as an argument in :meth:`Index.view`, call without any arguments 
instead (:issue:`55709`) - Deprecated allowing non-integer ``periods`` argument in :func:`date_range`, :func:`timedelta_range`, :func:`period_range`, and :func:`interval_range` (:issue:`56036`) - Deprecated allowing non-keyword arguments in :meth:`DataFrame.to_clipboard`. (:issue:`54229`) - Deprecated allowing non-keyword arguments in :meth:`DataFrame.to_csv` except ``path_or_buf``. (:issue:`54229`) @@ -382,11 +457,13 @@ Other Deprecations - Deprecated allowing non-keyword arguments in :meth:`DataFrame.to_string` except ``buf``. (:issue:`54229`) - Deprecated allowing non-keyword arguments in :meth:`DataFrame.to_xml` except ``path_or_buffer``. (:issue:`54229`) - Deprecated allowing passing :class:`BlockManager` objects to :class:`DataFrame` or :class:`SingleBlockManager` objects to :class:`Series` (:issue:`52419`) -- Deprecated automatic downcasting of object-dtype results in :meth:`Series.replace` and :meth:`DataFrame.replace`, explicitly call ``result = result.infer_objects(copy=False)`` instead. To opt in to the future version, use ``pd.set_option("future.no_silent_downcasting", True)`` (:issue:`54710`) +- Deprecated behavior of :meth:`Index.insert` with an object-dtype index silently performing type inference on the result, explicitly call ``result.infer_objects(copy=False)`` for the old behavior instead (:issue:`51363`) +- Deprecated casting non-datetimelike values (mainly strings) in :meth:`Series.isin` and :meth:`Index.isin` with ``datetime64``, ``timedelta64``, and :class:`PeriodDtype` dtypes (:issue:`53111`) - Deprecated downcasting behavior in :meth:`Series.where`, :meth:`DataFrame.where`, :meth:`Series.mask`, :meth:`DataFrame.mask`, :meth:`Series.clip`, :meth:`DataFrame.clip`; in a future version these will not infer object-dtype columns to non-object dtype, or all-round floats to integer dtype. Call ``result.infer_objects(copy=False)`` on the result for object inference, or explicitly cast floats to ints. 
To opt in to the future version, use ``pd.set_option("future.no_silent_downcasting", True)`` (:issue:`53656`) -- Deprecated including the groups in computations when using :meth:`DataFrameGroupBy.apply` and :meth:`DataFrameGroupBy.resample`; pass ``include_groups=False`` to exclude the groups (:issue:`7155`) +- Deprecated dtype inference when setting a :class:`Index` into a :class:`DataFrame`, cast explicitly instead (:issue:`56102`) +- Deprecated including the groups in computations when using :meth:`.DataFrameGroupBy.apply` and :meth:`.DataFrameGroupBy.resample`; pass ``include_groups=False`` to exclude the groups (:issue:`7155`) - Deprecated indexing an :class:`Index` with a boolean indexer of length zero (:issue:`55820`) -- Deprecated not passing a tuple to :class:`DataFrameGroupBy.get_group` or :class:`SeriesGroupBy.get_group` when grouping by a length-1 list-like (:issue:`25971`) +- Deprecated not passing a tuple to :class:`.DataFrameGroupBy.get_group` or :class:`.SeriesGroupBy.get_group` when grouping by a length-1 list-like (:issue:`25971`) - Deprecated string ``AS`` denoting frequency in :class:`YearBegin` and strings ``AS-DEC``, ``AS-JAN``, etc. denoting annual frequencies with various fiscal year starts (:issue:`54275`) - Deprecated string ``A`` denoting frequency in :class:`YearEnd` and strings ``A-DEC``, ``A-JAN``, etc. denoting annual frequencies with various fiscal year ends (:issue:`54275`) - Deprecated string ``BAS`` denoting frequency in :class:`BYearBegin` and strings ``BAS-DEC``, ``BAS-JAN``, etc. 
denoting annual frequencies with various fiscal year starts (:issue:`54275`) @@ -399,13 +476,15 @@ Other Deprecations - Deprecated the :class:`.Grouping` attributes ``group_index``, ``result_index``, and ``group_arraylike``; these will be removed in a future version of pandas (:issue:`56148`) - Deprecated the ``errors="ignore"`` option in :func:`to_datetime`, :func:`to_timedelta`, and :func:`to_numeric`; explicitly catch exceptions instead (:issue:`54467`) - Deprecated the ``fastpath`` keyword in the :class:`Series` constructor (:issue:`20110`) +- Deprecated the ``kind`` keyword in :meth:`Series.resample` and :meth:`DataFrame.resample`, explicitly cast the object's ``index`` instead (:issue:`55895`) - Deprecated the ``ordinal`` keyword in :class:`PeriodIndex`, use :meth:`PeriodIndex.from_ordinals` instead (:issue:`55960`) +- Deprecated the ``unit`` keyword in :class:`TimedeltaIndex` construction, use :func:`to_timedelta` instead (:issue:`55499`) +- Deprecated the behavior of :meth:`DataFrame.replace` and :meth:`Series.replace` with :class:`CategoricalDtype`; in a future version replace will change the values while preserving the categories. 
To change the categories, use ``ser.cat.rename_categories`` instead (:issue:`55147`) - Deprecated the behavior of :meth:`Series.value_counts` and :meth:`Index.value_counts` with object dtype; in a future version these will not perform dtype inference on the resulting :class:`Index`, do ``result.index = result.index.infer_objects()`` to retain the old behavior (:issue:`56161`) +- Deprecated the default of ``observed=False`` in :meth:`DataFrame.pivot_table`; will be ``True`` in a future version (:issue:`56236`) - Deprecated the extension test classes ``BaseNoReduceTests``, ``BaseBooleanReduceTests``, and ``BaseNumericReduceTests``, use ``BaseReduceTests`` instead (:issue:`54663`) - Deprecated the option ``mode.data_manager`` and the ``ArrayManager``; only the ``BlockManager`` will be available in future versions (:issue:`55043`) - Deprecated the previous implementation of :class:`DataFrame.stack`; specify ``future_stack=True`` to adopt the future version (:issue:`53515`) -- Deprecating downcasting the results of :meth:`DataFrame.fillna`, :meth:`Series.fillna`, :meth:`DataFrame.ffill`, :meth:`Series.ffill`, :meth:`DataFrame.bfill`, :meth:`Series.bfill` in object-dtype cases. To opt in to the future version, use ``pd.set_option("future.no_silent_downcasting", True)`` (:issue:`54261`) -- .. --------------------------------------------------------------------------- .. 
_whatsnew_220.performance: @@ -417,10 +496,11 @@ Performance improvements - Performance improvement in :func:`get_dummies` (:issue:`56089`) - Performance improvement in :func:`merge_asof` when ``by`` is not ``None`` (:issue:`55580`, :issue:`55678`) - Performance improvement in :func:`read_stata` for files with many variables (:issue:`55515`) -- Performance improvement in :func:`to_dict` on converting DataFrame to dictionary (:issue:`50990`) - Performance improvement in :meth:`DataFrame.groupby` when aggregating pyarrow timestamp and duration dtypes (:issue:`55031`) +- Performance improvement in :meth:`DataFrame.join` when joining on unordered categorical indexes (:issue:`56345`) - Performance improvement in :meth:`DataFrame.loc` and :meth:`Series.loc` when indexing with a :class:`MultiIndex` (:issue:`56062`) - Performance improvement in :meth:`DataFrame.sort_index` and :meth:`Series.sort_index` when indexed by a :class:`MultiIndex` (:issue:`54835`) +- Performance improvement in :meth:`DataFrame.to_dict` on converting DataFrame to dictionary (:issue:`50990`) - Performance improvement in :meth:`Index.difference` (:issue:`55108`) - Performance improvement in :meth:`Index.sort_values` when index is already sorted (:issue:`56128`) - Performance improvement in :meth:`MultiIndex.get_indexer` when ``method`` is not ``None`` (:issue:`55839`) @@ -428,7 +508,8 @@ Performance improvements - Performance improvement in :meth:`Series.str.get_dummies` when dtype is ``"string[pyarrow]"`` or ``"string[pyarrow_numpy]"`` (:issue:`56110`) - Performance improvement in :meth:`Series.str` methods (:issue:`55736`) - Performance improvement in :meth:`Series.value_counts` and :meth:`Series.mode` for masked dtypes (:issue:`54984`, :issue:`55340`) -- Performance improvement in :meth:`SeriesGroupBy.idxmax`, :meth:`SeriesGroupBy.idxmin`, :meth:`DataFrameGroupBy.idxmax`, :meth:`DataFrameGroupBy.idxmin` (:issue:`54234`) +- Performance improvement in :meth:`.DataFrameGroupBy.nunique` and 
:meth:`.SeriesGroupBy.nunique` (:issue:`55972`) +- Performance improvement in :meth:`.SeriesGroupBy.idxmax`, :meth:`.SeriesGroupBy.idxmin`, :meth:`.DataFrameGroupBy.idxmax`, :meth:`.DataFrameGroupBy.idxmin` (:issue:`54234`) - Performance improvement when indexing into a non-unique index (:issue:`55816`) - Performance improvement when indexing with more than 4 keys (:issue:`54550`) - Performance improvement when localizing time to UTC (:issue:`55241`) @@ -449,6 +530,7 @@ Datetimelike ^^^^^^^^^^^^ - Bug in :class:`DatetimeIndex` construction when passing both a ``tz`` and either ``dayfirst`` or ``yearfirst`` ignoring dayfirst/yearfirst (:issue:`55813`) - Bug in :class:`DatetimeIndex` when passing an object-dtype ndarray of float objects and a ``tz`` incorrectly localizing the result (:issue:`55780`) +- Bug in :meth:`Series.isin` with :class:`DatetimeTZDtype` dtype and comparison values that are all ``NaT`` incorrectly returning all-``False`` even if the series contains ``NaT`` entries (:issue:`56427`) - Bug in :func:`concat` raising ``AttributeError`` when concatenating all-NA DataFrame with :class:`DatetimeTZDtype` dtype DataFrame. 
(:issue:`52093`) - Bug in :func:`testing.assert_extension_array_equal` that could use the wrong unit when comparing resolutions (:issue:`55730`) - Bug in :func:`to_datetime` and :class:`DatetimeIndex` when passing a list of mixed-string-and-numeric types incorrectly raising (:issue:`55780`) @@ -458,6 +540,7 @@ Datetimelike - Bug in :meth:`Index.is_monotonic_increasing` and :meth:`Index.is_monotonic_decreasing` always caching :meth:`Index.is_unique` as ``True`` when first value in index is ``NaT`` (:issue:`55755`) - Bug in :meth:`Index.view` to a datetime64 dtype with non-supported resolution incorrectly raising (:issue:`55710`) - Bug in :meth:`Series.dt.round` with non-nanosecond resolution and ``NaT`` entries incorrectly raising ``OverflowError`` (:issue:`56158`) +- Bug in :meth:`Series.fillna` with non-nanosecond resolution dtypes and higher-resolution vector values returning incorrect (internally-corrupted) results (:issue:`56410`) - Bug in :meth:`Tick.delta` with very large ticks raising ``OverflowError`` instead of ``OutOfBoundsTimedelta`` (:issue:`55503`) - Bug in :meth:`Timestamp.unit` being inferred incorrectly from an ISO8601 format string with minute or hour resolution and a timezone offset (:issue:`56208`) - Bug in ``.astype`` converting from a higher-resolution ``datetime64`` dtype to a lower-resolution ``datetime64`` dtype (e.g. 
``datetime64[us]->datetim64[ms]``) silently overflowing with values near the lower implementation bound (:issue:`55979`) @@ -470,8 +553,7 @@ Datetimelike - Bug in creating a :class:`Index`, :class:`Series`, or :class:`DataFrame` with a non-nanosecond ``datetime64`` dtype and inputs that would be out of bounds for a ``datetime64[ns]`` incorrectly raising ``OutOfBoundsDatetime`` (:issue:`55756`) - Bug in parsing datetime strings with nanosecond resolution with non-ISO8601 formats incorrectly truncating sub-microsecond components (:issue:`56051`) - Bug in parsing datetime strings with sub-second resolution and trailing zeros incorrectly inferring second or millisecond resolution (:issue:`55737`) -- Bug in the results of :func:`pd.to_datetime` with an floating-dtype argument with ``unit`` not matching the pointwise results of :class:`Timestamp` (:issue:`56037`) -- +- Bug in the results of :func:`to_datetime` with a floating-dtype argument with ``unit`` not matching the pointwise results of :class:`Timestamp` (:issue:`56037`) Timedelta ^^^^^^^^^ @@ -492,7 +574,8 @@ Numeric Conversion ^^^^^^^^^^ -- Bug in :func:`astype` when called with ``str`` on unpickled array - the array might change in-place (:issue:`54654`) +- Bug in :meth:`DataFrame.astype` when called with ``str`` on unpickled array - the array might change in-place (:issue:`54654`) +- Bug in :meth:`DataFrame.astype` where ``errors="ignore"`` had no effect for extension types (:issue:`54654`) - Bug in :meth:`Series.convert_dtypes` not converting all NA column to ``null[pyarrow]`` (:issue:`55346`) - @@ -500,6 +583,11 @@ Strings ^^^^^^^ - Bug in :func:`pandas.api.types.is_string_dtype` while checking object array with no elements is of the string dtype (:issue:`54661`) - Bug in :meth:`DataFrame.apply` failing when ``engine="numba"`` and columns or index have ``StringDtype`` (:issue:`56189`) +- Bug in :meth:`DataFrame.reindex` not matching :class:`Index` with ``string[pyarrow_numpy]`` dtype (:issue:`56106`) +- Bug 
in :meth:`Index.str.cat` always casting result to object dtype (:issue:`56157`) +- Bug in :meth:`Series.__mul__` for :class:`ArrowDtype` with ``pyarrow.string`` dtype and ``string[pyarrow]`` for the pyarrow backend (:issue:`51970`) +- Bug in :meth:`Series.str.find` when ``start < 0`` for :class:`ArrowDtype` with ``pyarrow.string`` (:issue:`56411`) +- Bug in :meth:`Series.str.replace` when ``n < 0`` for :class:`ArrowDtype` with ``pyarrow.string`` (:issue:`56404`) - Bug in :meth:`Series.str.startswith` and :meth:`Series.str.endswith` with arguments of type ``tuple[str, ...]`` for ``string[pyarrow]`` (:issue:`54942`) Interval @@ -520,7 +608,7 @@ Indexing Missing ^^^^^^^ -- +- Bug in :meth:`DataFrame.update` not updating in-place for tz-aware datetime64 dtypes (:issue:`56227`) - MultiIndex @@ -530,14 +618,17 @@ MultiIndex I/O ^^^ +- Bug in :func:`read_csv` where ``engine="python"`` did not respect ``chunksize`` arg when ``skiprows`` was specified. (:issue:`56323`) +- Bug in :func:`read_csv` where ``engine="python"`` was causing a ``TypeError`` when a callable ``skiprows`` and a chunk size were specified. (:issue:`55677`) - Bug in :func:`read_csv` where ``on_bad_lines="warn"`` would write to ``stderr`` instead of raise a Python warning. 
This now yields a :class:`.errors.ParserWarning` (:issue:`54296`) +- Bug in :func:`read_csv` with ``engine="pyarrow"`` where ``quotechar`` was ignored (:issue:`52266`) - Bug in :func:`read_csv` with ``engine="pyarrow"`` where ``usecols`` wasn't working with a csv with no headers (:issue:`54459`) - Bug in :func:`read_excel`, with ``engine="xlrd"`` (``xls`` files) erroring when file contains NaNs/Infs (:issue:`54564`) - Bug in :func:`read_json` not handling dtype conversion properly if ``infer_string`` is set (:issue:`56195`) -- Bug in :func:`to_excel`, with ``OdsWriter`` (``ods`` files) writing boolean/string value (:issue:`54994`) +- Bug in :meth:`DataFrame.to_excel`, with ``OdsWriter`` (``ods`` files) writing boolean/string value (:issue:`54994`) - Bug in :meth:`DataFrame.to_hdf` and :func:`read_hdf` with ``datetime64`` dtypes with non-nanosecond resolution failing to round-trip correctly (:issue:`55622`) -- Bug in :meth:`pandas.read_excel` with ``engine="odf"`` (``ods`` files) when string contains annotation (:issue:`55200`) -- Bug in :meth:`pandas.read_excel` with an ODS file without cached formatted cell for float values (:issue:`55219`) +- Bug in :func:`~pandas.read_excel` with ``engine="odf"`` (``ods`` files) when string contains annotation (:issue:`55200`) +- Bug in :func:`~pandas.read_excel` with an ODS file without cached formatted cell for float values (:issue:`55219`) - Bug where :meth:`DataFrame.to_json` would raise an ``OverflowError`` instead of a ``TypeError`` with unsupported NumPy types (:issue:`55403`) Period @@ -558,22 +649,27 @@ Groupby/resample/rolling - Bug in :class:`.Rolling` where duplicate datetimelike indexes are treated as consecutive rather than equal with ``closed='left'`` and ``closed='neither'`` (:issue:`20712`) - Bug in :meth:`.DataFrameGroupBy.idxmin`, :meth:`.DataFrameGroupBy.idxmax`, :meth:`.SeriesGroupBy.idxmin`, and :meth:`.SeriesGroupBy.idxmax` would not retain :class:`.Categorical` dtype when the index was a 
:class:`.CategoricalIndex` that contained NA values (:issue:`54234`) - Bug in :meth:`.DataFrameGroupBy.transform` and :meth:`.SeriesGroupBy.transform` when ``observed=False`` and ``f="idxmin"`` or ``f="idxmax"`` would incorrectly raise on unobserved categories (:issue:`54234`) +- Bug in :meth:`.DataFrameGroupBy.value_counts` and :meth:`.SeriesGroupBy.value_counts` could result in incorrect sorting if the columns of the DataFrame or name of the Series are integers (:issue:`55951`) +- Bug in :meth:`.DataFrameGroupBy.value_counts` and :meth:`.SeriesGroupBy.value_counts` would not respect ``sort=False`` in :meth:`DataFrame.groupby` and :meth:`Series.groupby` (:issue:`55951`) +- Bug in :meth:`.DataFrameGroupBy.value_counts` and :meth:`.SeriesGroupBy.value_counts` would sort by proportions rather than frequencies when ``sort=True`` and ``normalize=True`` (:issue:`55951`) - Bug in :meth:`DataFrame.asfreq` and :meth:`Series.asfreq` with a :class:`DatetimeIndex` with non-nanosecond resolution incorrectly converting to nanosecond resolution (:issue:`55958`) +- Bug in :meth:`DataFrame.ewm` when passed ``times`` with non-nanosecond ``datetime64`` or :class:`DatetimeTZDtype` dtype (:issue:`56262`) - Bug in :meth:`DataFrame.resample` not respecting ``closed`` and ``label`` arguments for :class:`~pandas.tseries.offsets.BusinessDay` (:issue:`55282`) +- Bug in :meth:`DataFrame.resample` when resampling on a :class:`ArrowDtype` of ``pyarrow.timestamp`` or ``pyarrow.duration`` type (:issue:`55989`) - Bug in :meth:`DataFrame.resample` where bin edges were not correct for :class:`~pandas.tseries.offsets.BusinessDay` (:issue:`55281`) - Bug in :meth:`DataFrame.resample` where bin edges were not correct for :class:`~pandas.tseries.offsets.MonthBegin` (:issue:`55271`) -- Bug in :meth:`DataFrameGroupBy.value_counts` and :meth:`SeriesGroupBy.value_count` could result in incorrect sorting if the columns of the DataFrame or name of the Series are integers (:issue:`55951`) -- Bug in 
:meth:`DataFrameGroupBy.value_counts` and :meth:`SeriesGroupBy.value_count` would not respect ``sort=False`` in :meth:`DataFrame.groupby` and :meth:`Series.groupby` (:issue:`55951`) -- Bug in :meth:`DataFrameGroupBy.value_counts` and :meth:`SeriesGroupBy.value_count` would sort by proportions rather than frequencies when ``sort=True`` and ``normalize=True`` (:issue:`55951`) +- Reshaping ^^^^^^^^^ - Bug in :func:`concat` ignoring ``sort`` parameter when passed :class:`DatetimeIndex` indexes (:issue:`54769`) +- Bug in :func:`concat` renaming :class:`Series` when ``ignore_index=False`` (:issue:`15047`) - Bug in :func:`merge_asof` raising ``TypeError`` when ``by`` dtype is not ``object``, ``int64``, or ``uint64`` (:issue:`22794`) +- Bug in :func:`merge_asof` raising incorrect error for string dtype (:issue:`56444`) - Bug in :func:`merge` returning columns in incorrect order when left and/or right is empty (:issue:`51929`) -- Bug in :meth:`pandas.DataFrame.melt` where an exception was raised if ``var_name`` was not a string (:issue:`55948`) -- Bug in :meth:`pandas.DataFrame.melt` where it would not preserve the datetime (:issue:`55254`) -- Bug in :meth:`pandas.DataFrame.pivot_table` where the row margin is incorrect when the columns have numeric names (:issue:`26568`) +- Bug in :meth:`DataFrame.melt` where an exception was raised if ``var_name`` was not a string (:issue:`55948`) +- Bug in :meth:`DataFrame.melt` where it would not preserve the datetime (:issue:`55254`) +- Bug in :meth:`DataFrame.pivot_table` where the row margin is incorrect when the columns have numeric names (:issue:`26568`) Sparse ^^^^^^ @@ -593,13 +689,15 @@ Styler Other ^^^^^ - Bug in :func:`DataFrame.describe` when formatting percentiles in the resulting percentile 99.999% is rounded to 100% (:issue:`55765`) +- Bug in :func:`cut` and :func:`qcut` with ``datetime64`` dtype values with non-nanosecond units incorrectly returning nanosecond-unit bins (:issue:`56101`) - Bug in :func:`cut` incorrectly 
allowing cutting of timezone-aware datetimes with timezone-naive bins (:issue:`54964`) - Bug in :func:`infer_freq` and :meth:`DatetimeIndex.inferred_freq` with weekly frequencies and non-nanosecond resolutions (:issue:`55609`) - Bug in :meth:`DataFrame.apply` where passing ``raw=True`` ignored ``args`` passed to the applied function (:issue:`55009`) -- Bug in :meth:`Dataframe.from_dict` which would always sort the rows of the created :class:`DataFrame`. (:issue:`55683`) +- Bug in :meth:`DataFrame.from_dict` which would always sort the rows of the created :class:`DataFrame`. (:issue:`55683`) - Bug in rendering ``inf`` values inside a a :class:`DataFrame` with the ``use_inf_as_na`` option enabled (:issue:`55483`) - Bug in rendering a :class:`Series` with a :class:`MultiIndex` when one of the index level's names is 0 not having that name displayed (:issue:`55415`) - Bug in the error message when assigning an empty dataframe to a column (:issue:`55956`) +- .. ***DO NOT USE THIS SECTION*** @@ -610,4 +708,4 @@ Other .. _whatsnew_220.contributors: Contributors -~~~~~~~~~~~~ +~~~~~~~~~~~~ \ No newline at end of file diff --git a/pandas/core/algorithms.py b/pandas/core/algorithms.py index 82de8ae96160f..7a6d947ea9bd1 100644 --- a/pandas/core/algorithms.py +++ b/pandas/core/algorithms.py @@ -604,8 +604,6 @@ def factorize_array( codes = ensure_platform_int(codes) return codes, uniques - - @doc( values=dedent( """\ @@ -621,20 +619,37 @@ def factorize_array( relationship. """ ), + use_na_sentinel=dedent( + """\ + use_na_sentinel : bool, default True + If True, the sentinel -1 will be used for NaN values. If False, + NaN values will be encoded as non-negative integers and will not drop the + NaN from the uniques of the values. + + .. versionadded:: 1.5.0 + """ + ), size_hint=dedent( """\ size_hint : int, optional Hint to the hashtable sizer. 
""" ), + original_factorization=dedent( + """\ + original_factorization : tuple[np.ndarray, np.ndarray | Index] or None, optional, default: None + A tuple containing the original factorization arrays (labels and uniques) to be used as a reference. + """ + ), ) def factorize( values, sort: bool = False, use_na_sentinel: bool = True, size_hint: int | None = None, + original_factorization: tuple[np.ndarray, np.ndarray | Index] | None = None ) -> tuple[np.ndarray, np.ndarray | Index]: - """ + """ Encode the object as an enumerated type or categorical variable. This method is useful for obtaining a numeric representation of an @@ -651,6 +666,10 @@ def factorize( NaN from the uniques of the values. .. versionadded:: 1.5.0 + + original_factorization : tuple[np.ndarray, np.ndarray | Index] or None, optional, default: None + A tuple containing the original factorization arrays (labels and uniques) to be used as a reference. + {size_hint}\ Returns @@ -758,6 +777,27 @@ def factorize( # Step 2 is dispatched to extension types (like Categorical). They are # responsible only for factorization. All data coercion, sorting and boxing # should happen here. 
+    if original_factorization is not None:
+        original_uniques, original_codes = original_factorization
+        unique_to_code = dict(zip(original_uniques, range(len(original_uniques))))
+
+        # Map existing data to original codes, assign new codes to new uniques
+        new_codes = []
+        new_uniques = list(original_uniques)
+        next_code = len(original_uniques)
+
+        for item in values:
+            if item in unique_to_code:
+                new_codes.append(unique_to_code[item])
+            else:
+                unique_to_code[item] = next_code
+                new_uniques.append(item)
+                new_codes.append(next_code)
+                next_code += 1
+
+        return new_codes, new_uniques
+
+
     if isinstance(values, (ABCIndex, ABCSeries)):
         return values.factorize(sort=sort, use_na_sentinel=use_na_sentinel)
 
@@ -1833,3 +1873,6 @@ def map_array(
     return lib.map_infer_mask(
         values, mapper, mask=isna(values).view(np.uint8), convert=convert
     )
+
+
+
diff --git a/pandas/core/arrays/categorical.py b/pandas/core/arrays/categorical.py
index eec833c600177..1521f491337ed 100644
--- a/pandas/core/arrays/categorical.py
+++ b/pandas/core/arrays/categorical.py
@@ -678,7 +678,7 @@ def _from_inferred_categories(
             codes = inferred_codes
 
         return cls._simple_new(codes, dtype=dtype)
-
+
     @classmethod
     def from_codes(
         cls,
diff --git a/pandas/tests/arrays/categorical/test_algos.py b/pandas/tests/arrays/categorical/test_algos.py
index d4c19a4970135..247cfd38d3206 100644
--- a/pandas/tests/arrays/categorical/test_algos.py
+++ b/pandas/tests/arrays/categorical/test_algos.py
@@ -87,3 +87,77 @@ def test_diff():
     df = ser.to_frame(name="A")
     with pytest.raises(TypeError, match=msg):
         df.diff()
+
+
+def test_factorize_with_original_factorization():
+    # Original factorization setup
+    original_values = np.array(['apple', 'banana', 'apple', 'cherry'])
+    original_codes, original_uniques = pd.factorize(original_values)
+    original_factorization = (original_uniques, original_codes)
+
+    # New values including some from original and some new
+    new_values = np.array(['banana', 'apple', 'durian', 'apple', 'elderberry'])
+
+    # Expected outcome
+    expected_new_codes = np.array([1, 0, 3, 0, 4])  # 'banana' and 'apple' should have original codes
+    expected_new_uniques = np.array(['apple', 'banana', 'cherry', 'durian', 'elderberry'])
+
+    # Run factorize with original_factorization
+    new_codes, new_uniques = pd.factorize(new_values, original_factorization=original_factorization)
+
+    # Assertions
+    np.testing.assert_array_equal(new_codes, expected_new_codes)
+    np.testing.assert_array_equal(new_uniques, expected_new_uniques)
+
+
+def test_factorize_with_empty_original_factorization():
+    values = np.array(['x', 'y', 'z'])
+    original_factorization = (np.array([]), np.array([]))
+
+    expected_codes = np.array([0, 1, 2])
+    expected_uniques = np.array(['x', 'y', 'z'])
+
+    codes, uniques = pd.factorize(values, original_factorization=original_factorization)
+
+    np.testing.assert_array_equal(codes, expected_codes)
+    np.testing.assert_array_equal(uniques, expected_uniques)
+
+
+def test_factorize_with_non_overlapping_original_factorization():
+    original_values = np.array(['a', 'b', 'c'])
+    new_values = np.array(['x', 'y', 'z'])
+    original_codes, original_uniques = pd.factorize(original_values)
+    original_factorization = (original_uniques, original_codes)
+
+    expected_new_codes = np.array([3, 4, 5])  # completely new codes
+    expected_new_uniques = np.array(['a', 'b', 'c', 'x', 'y', 'z'])
+
+    new_codes, new_uniques = pd.factorize(new_values, original_factorization=original_factorization)
+
+    np.testing.assert_array_equal(new_codes, expected_new_codes)
+    np.testing.assert_array_equal(new_uniques, expected_new_uniques)
+
+def test_factorize_with_nan_in_original_factorization():
+    original_values = np.array(['a', 'b', np.nan, 'd'])
+    new_values = np.array(['a', 'b', 'c', np.nan])
+    original_codes, original_uniques = pd.factorize(original_values)
+    original_factorization = (original_uniques, original_codes)
+
+    expected_new_codes = np.array([0, 1, 4, 2])  # 'c' is new, NaN should match original
+    expected_new_uniques = np.array(['a', 'b', np.nan, 'd', 'c'])
+
+    new_codes, new_uniques = pd.factorize(new_values, original_factorization=original_factorization)
+
+    np.testing.assert_array_equal(new_codes, expected_new_codes)
+    np.testing.assert_array_equal(new_uniques, expected_new_uniques)
+
+def test_factorize_with_different_data_types():
+    original_values = np.array([1, 2, 3])  # integers
+    new_values = np.array(['1', '4', '2'])  # strings representing numbers
+    original_codes, original_uniques = pd.factorize(original_values)
+    original_factorization = (original_uniques, original_codes)
+
+    expected_new_codes = np.array([3, 4, 1])  # '1' and '2' as strings are new
+    expected_new_uniques = np.array([1, 2, 3, '1', '4'])
+
+    new_codes, new_uniques = pd.factorize(new_values, original_factorization=original_factorization)