
Commit 6d990d4

Merge branch 'main' into plot
2 parents 5cc8c28 + a393c31 commit 6d990d4

18 files changed: +427 additions, −119 deletions

doc/source/reference/groupby.rst

Lines changed: 4 additions & 0 deletions
@@ -79,6 +79,8 @@ Function application
    DataFrameGroupBy.cumsum
    DataFrameGroupBy.describe
    DataFrameGroupBy.diff
+   DataFrameGroupBy.ewm
+   DataFrameGroupBy.expanding
    DataFrameGroupBy.ffill
    DataFrameGroupBy.first
    DataFrameGroupBy.head
@@ -130,6 +132,8 @@ Function application
    SeriesGroupBy.cumsum
    SeriesGroupBy.describe
    SeriesGroupBy.diff
+   SeriesGroupBy.ewm
+   SeriesGroupBy.expanding
    SeriesGroupBy.ffill
    SeriesGroupBy.first
    SeriesGroupBy.head

doc/source/whatsnew/v3.0.0.rst

Lines changed: 6 additions & 0 deletions
@@ -61,6 +61,7 @@ Other enhancements
 - :meth:`Series.cummin` and :meth:`Series.cummax` now supports :class:`CategoricalDtype` (:issue:`52335`)
 - :meth:`Series.plot` now correctly handle the ``ylabel`` parameter for pie charts, allowing for explicit control over the y-axis label (:issue:`58239`)
 - :meth:`DataFrame.plot.scatter` argument ``c`` now accepts a column of strings, where rows with the same string are colored identically (:issue:`16827` and :issue:`16485`)
+- :meth:`Series.nlargest` uses a 'stable' sort internally and will preserve original ordering.
 - :class:`ArrowDtype` now supports ``pyarrow.JsonType`` (:issue:`60958`)
 - :class:`DataFrameGroupBy` and :class:`SeriesGroupBy` methods ``sum``, ``mean``, ``median``, ``prod``, ``min``, ``max``, ``std``, ``var`` and ``sem`` now accept ``skipna`` parameter (:issue:`15675`)
 - :class:`Rolling` and :class:`Expanding` now support ``nunique`` (:issue:`26958`)
@@ -421,6 +422,7 @@ Other Deprecations
 - Deprecated lowercase strings ``w``, ``w-mon``, ``w-tue``, etc. denoting frequencies in :class:`Week` in favour of ``W``, ``W-MON``, ``W-TUE``, etc. (:issue:`58998`)
 - Deprecated parameter ``method`` in :meth:`DataFrame.reindex_like` / :meth:`Series.reindex_like` (:issue:`58667`)
 - Deprecated strings ``w``, ``d``, ``MIN``, ``MS``, ``US`` and ``NS`` denoting units in :class:`Timedelta` in favour of ``W``, ``D``, ``min``, ``ms``, ``us`` and ``ns`` (:issue:`59051`)
+- Deprecated the ``arg`` parameter of ``Series.map``; pass the added ``func`` argument instead. (:issue:`61260`)
 - Deprecated using ``epoch`` date format in :meth:`DataFrame.to_json` and :meth:`Series.to_json`, use ``iso`` instead. (:issue:`57063`)

 .. ---------------------------------------------------------------------------
@@ -592,6 +594,7 @@ Performance improvements
 - :func:`concat` returns a :class:`RangeIndex` column when possible when ``objs`` contains :class:`Series` and :class:`DataFrame` and ``axis=0`` (:issue:`58119`)
 - :func:`concat` returns a :class:`RangeIndex` level in the :class:`MultiIndex` result when ``keys`` is a ``range`` or :class:`RangeIndex` (:issue:`57542`)
 - :meth:`RangeIndex.append` returns a :class:`RangeIndex` instead of a :class:`Index` when appending values that could continue the :class:`RangeIndex` (:issue:`57467`)
+- :meth:`Series.nlargest` has improved performance when there are duplicate values in the index (:issue:`55767`)
 - :meth:`Series.str.extract` returns a :class:`RangeIndex` columns instead of an :class:`Index` column when possible (:issue:`57542`)
 - :meth:`Series.str.partition` with :class:`ArrowDtype` returns a :class:`RangeIndex` columns instead of an :class:`Index` column when possible (:issue:`57768`)
 - Performance improvement in :class:`DataFrame` when ``data`` is a ``dict`` and ``columns`` is specified (:issue:`24368`)
@@ -622,6 +625,7 @@ Performance improvements
 - Performance improvement in :meth:`CategoricalDtype.update_dtype` when ``dtype`` is a :class:`CategoricalDtype` with non ``None`` categories and ordered (:issue:`59647`)
 - Performance improvement in :meth:`DataFrame.__getitem__` when ``key`` is a :class:`DataFrame` with many columns (:issue:`61010`)
 - Performance improvement in :meth:`DataFrame.astype` when converting to extension floating dtypes, e.g. "Float64" (:issue:`60066`)
+- Performance improvement in :meth:`DataFrame.stack` when using ``future_stack=True`` and the DataFrame does not have a :class:`MultiIndex` (:issue:`58391`)
 - Performance improvement in :meth:`DataFrame.where` when ``cond`` is a :class:`DataFrame` with many columns (:issue:`61010`)
 - Performance improvement in :meth:`to_hdf` avoid unnecessary reopenings of the HDF5 file to speedup data addition to files with a very large number of groups . (:issue:`58248`)
 - Performance improvement in ``DataFrameGroupBy.__len__`` and ``SeriesGroupBy.__len__`` (:issue:`57595`)
@@ -637,6 +641,7 @@ Bug fixes
 Categorical
 ^^^^^^^^^^^
 - Bug in :func:`Series.apply` where ``nan`` was ignored for :class:`CategoricalDtype` (:issue:`59938`)
+- Bug in :meth:`DataFrame.pivot` and :meth:`DataFrame.set_index` raising an ``ArrowNotImplementedError`` for columns with pyarrow dictionary dtype (:issue:`53051`)
 - Bug in :meth:`Series.convert_dtypes` with ``dtype_backend="pyarrow"`` where empty :class:`CategoricalDtype` :class:`Series` raised an error or got converted to ``null[pyarrow]`` (:issue:`59934`)
 -

@@ -649,6 +654,7 @@ Datetimelike
 - Bug in :func:`date_range` where using a negative frequency value would not include all points between the start and end values (:issue:`56147`)
 - Bug in :func:`tseries.api.guess_datetime_format` would fail to infer time format when "%Y" == "%H%M" (:issue:`57452`)
 - Bug in :func:`tseries.frequencies.to_offset` would fail to parse frequency strings starting with "LWOM" (:issue:`59218`)
+- Bug in :meth:`DataFrame.fillna` raising an ``AssertionError`` instead of ``OutOfBoundsDatetime`` when filling a ``datetime64[ns]`` column with an out-of-bounds timestamp. Now correctly raises ``OutOfBoundsDatetime``. (:issue:`61208`)
 - Bug in :meth:`DataFrame.min` and :meth:`DataFrame.max` casting ``datetime64`` and ``timedelta64`` columns to ``float64`` and losing precision (:issue:`60850`)
 - Bug in :meth:`Dataframe.agg` with df with missing values resulting in IndexError (:issue:`58810`)
 - Bug in :meth:`DatetimeIndex.is_year_start` and :meth:`DatetimeIndex.is_quarter_start` does not raise on Custom business days frequencies bigger then "1C" (:issue:`58664`)
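The "stable sort" enhancement noted above matters for ties. A minimal pure-Python sketch (not pandas code) of what stability buys when selecting the top-n:

```python
# A stable sort keeps tied elements in their original relative order, so
# taking the top-n after a stable descending sort preserves first-seen
# ordering among equal values -- the behavior Series.nlargest now guarantees.
rows = [("a", 3), ("b", 1), ("c", 3), ("d", 2)]

# Python's sorted() is stable, including with reverse=True.
top2 = sorted(rows, key=lambda kv: kv[1], reverse=True)[:2]
# "a" and "c" are tied on 3; stability keeps "a" ahead of "c".
print(top2)  # [('a', 3), ('c', 3)]
```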

pandas/__init__.py

Lines changed: 5 additions & 7 deletions
@@ -4,19 +4,17 @@

 # Let users know if they're missing any of our hard dependencies
 _hard_dependencies = ("numpy", "dateutil")
-_missing_dependencies = []

 for _dependency in _hard_dependencies:
     try:
         __import__(_dependency)
     except ImportError as _e:  # pragma: no cover
-        _missing_dependencies.append(f"{_dependency}: {_e}")
+        raise ImportError(
+            f"Unable to import required dependency {_dependency}. "
+            "Please see the traceback for details."
+        ) from _e

-if _missing_dependencies:  # pragma: no cover
-    raise ImportError(
-        "Unable to import required dependencies:\n" + "\n".join(_missing_dependencies)
-    )
-del _hard_dependencies, _dependency, _missing_dependencies
+del _hard_dependencies, _dependency

 try:
     # numpy compat
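The simplified check raises on the first missing dependency rather than collecting every failure. A standalone sketch of the same fail-fast pattern (the function name is illustrative, not pandas API):

```python
import importlib


def check_hard_dependencies(modules):
    """Raise immediately on the first missing module, chaining the cause."""
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError as exc:
            # Chain with `from exc` so the original traceback is preserved,
            # matching the "see the traceback for details" message.
            raise ImportError(
                f"Unable to import required dependency {name}. "
                "Please see the traceback for details."
            ) from exc


check_hard_dependencies(("math", "json"))  # stdlib modules: no error
```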

pandas/core/arrays/categorical.py

Lines changed: 1 addition & 1 deletion
@@ -452,7 +452,7 @@ def __init__(
         if isinstance(values, Index):
             arr = values._data._pa_array.combine_chunks()
         else:
-            arr = values._pa_array.combine_chunks()
+            arr = extract_array(values)._pa_array.combine_chunks()
         categories = arr.dictionary.to_pandas(types_mapper=ArrowDtype)
         codes = arr.indices.to_numpy()
         dtype = CategoricalDtype(categories, values.dtype.pyarrow_dtype.ordered)
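The one-line fix unwraps a Series to its backing array before touching `_pa_array`. A toy sketch of that unwrap-before-use pattern, with stand-in classes (none of these names are pandas API):

```python
class ToyArray:
    """Stand-in for an extension array carrying a backend-specific attribute."""

    def __init__(self, data):
        self._backend = data


class ToySeries:
    """Stand-in for a Series wrapping an array; it lacks the backend attribute."""

    def __init__(self, arr):
        self._values = arr


def extract_array_like(obj):
    # Unwrap a Series-like container to its backing array; pass raw arrays
    # (or anything else) through unchanged, so downstream code can rely on
    # backend attributes being present.
    return getattr(obj, "_values", obj)


arr = ToyArray([1, 2, 3])
print(extract_array_like(ToySeries(arr)) is arr)  # True: Series unwrapped
print(extract_array_like(arr) is arr)             # True: array passed through
```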

pandas/core/groupby/groupby.py

Lines changed: 112 additions & 6 deletions
@@ -3803,16 +3803,58 @@ def rolling(
         )

     @final
-    @Substitution(name="groupby")
-    @Appender(_common_see_also)
     def expanding(self, *args, **kwargs) -> ExpandingGroupby:
         """
-        Return an expanding grouper, providing expanding
-        functionality per group.
+        Return an expanding grouper, providing expanding functionality per group.
+
+        Arguments are the same as `:meth:DataFrame.rolling` except that ``step`` cannot
+        be specified.
+
+        Parameters
+        ----------
+        *args : tuple
+            Positional arguments passed to the expanding window constructor.
+        **kwargs : dict
+            Keyword arguments passed to the expanding window constructor.

         Returns
         -------
         pandas.api.typing.ExpandingGroupby
+            An object that supports expanding transformations over each group.
+
+        See Also
+        --------
+        Series.expanding : Expanding transformations for Series.
+        DataFrame.expanding : Expanding transformations for DataFrames.
+        Series.groupby : Apply a function groupby to a Series.
+        DataFrame.groupby : Apply a function groupby.
+
+        Examples
+        --------
+        >>> df = pd.DataFrame(
+        ...     {
+        ...         "Class": ["A", "A", "A", "B", "B", "B"],
+        ...         "Value": [10, 20, 30, 40, 50, 60],
+        ...     }
+        ... )
+        >>> df
+          Class  Value
+        0     A     10
+        1     A     20
+        2     A     30
+        3     B     40
+        4     B     50
+        5     B     60
+
+        >>> df.groupby("Class").expanding().mean()
+                 Value
+        Class
+        A     0   10.0
+              1   15.0
+              2   20.0
+        B     3   40.0
+              4   45.0
+              5   50.0
         """
         from pandas.core.window import ExpandingGroupby

@@ -3824,15 +3866,79 @@ def expanding(self, *args, **kwargs) -> ExpandingGroupby:
         )

     @final
-    @Substitution(name="groupby")
-    @Appender(_common_see_also)
     def ewm(self, *args, **kwargs) -> ExponentialMovingWindowGroupby:
         """
         Return an ewm grouper, providing ewm functionality per group.

+        Parameters
+        ----------
+        *args : tuple
+            Positional arguments passed to the EWM window constructor.
+        **kwargs : dict
+            Keyword arguments passed to the EWM window constructor, such as:
+
+            com : float, optional
+                Specify decay in terms of center of mass.
+                ``span``, ``halflife``, and ``alpha`` are alternative ways to specify
+                decay.
+            span : float, optional
+                Specify decay in terms of span.
+            halflife : float, optional
+                Specify decay in terms of half-life.
+            alpha : float, optional
+                Specify smoothing factor directly.
+            min_periods : int, default 0
+                Minimum number of observations in the window required to have a value;
+                otherwise, result is ``np.nan``.
+            adjust : bool, default True
+                Divide by decaying adjustment factor to account for imbalance in
+                relative weights.
+            ignore_na : bool, default False
+                Ignore missing values when calculating weights.
+            times : str or array-like of datetime64, optional
+                Times corresponding to the observations.
+            axis : {0 or 'index', 1 or 'columns'}, default 0
+                Axis along which the EWM function is applied.

         Returns
         -------
         pandas.api.typing.ExponentialMovingWindowGroupby
+            An object that supports exponentially weighted moving transformations over
+            each group.
+
+        See Also
+        --------
+        Series.ewm : EWM transformations for Series.
+        DataFrame.ewm : EWM transformations for DataFrames.
+        Series.groupby : Apply a function groupby to a Series.
+        DataFrame.groupby : Apply a function groupby.
+
+        Examples
+        --------
+        >>> df = pd.DataFrame(
+        ...     {
+        ...         "Class": ["A", "A", "A", "B", "B", "B"],
+        ...         "Value": [10, 20, 30, 40, 50, 60],
+        ...     }
+        ... )
+        >>> df
+          Class  Value
+        0     A     10
+        1     A     20
+        2     A     30
+        3     B     40
+        4     B     50
+        5     B     60
+
+        >>> df.groupby("Class").ewm(com=0.5).mean()
+                     Value
+        Class
+        A     0  10.000000
+              1  17.500000
+              2  26.153846
+        B     3  40.000000
+              4  47.500000
+              5  56.153846
         """
         from pandas.core.window import ExponentialMovingWindowGroupby

pandas/core/internals/blocks.py

Lines changed: 4 additions & 0 deletions
@@ -1679,6 +1679,8 @@ def where(self, other, cond) -> list[Block]:

         try:
             res_values = arr._where(cond, other).T
+        except OutOfBoundsDatetime:
+            raise
         except (ValueError, TypeError):
             if self.ndim == 1 or self.shape[0] == 1:
                 if isinstance(self.dtype, (IntervalDtype, StringDtype)):
@@ -1746,6 +1748,8 @@ def putmask(self, mask, new) -> list[Block]:
         try:
             # Caller is responsible for ensuring matching lengths
             values._putmask(mask, new)
+        except OutOfBoundsDatetime:
+            raise
         except (TypeError, ValueError):
             if self.ndim == 1 or self.shape[0] == 1:
                 if isinstance(self.dtype, IntervalDtype):
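Both hunks carve a specific exception out of a broad fallback handler, which is how the ``fillna`` ``AssertionError`` bug in the whatsnew entry gets fixed. A minimal standalone sketch of the re-raise pattern (the exception class is a stand-in for pandas' ``OutOfBoundsDatetime``):

```python
class OutOfBounds(Exception):
    """Stand-in for pandas' OutOfBoundsDatetime."""


def apply_with_fallback(op, fallback):
    try:
        return op()
    except OutOfBounds:
        # Re-raise instead of letting the generic handler below swallow it;
        # otherwise the caller would hit an unrelated failure much later.
        raise
    except (TypeError, ValueError):
        return fallback()


def bad_value():
    raise ValueError("cannot coerce")


print(apply_with_fallback(bad_value, lambda: "fallback"))  # fallback
```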

pandas/core/methods/selectn.py

Lines changed: 32 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@
1111
from typing import (
1212
TYPE_CHECKING,
1313
Generic,
14+
Literal,
1415
cast,
1516
final,
1617
)
@@ -54,7 +55,9 @@
5455

5556

5657
class SelectN(Generic[NDFrameT]):
57-
def __init__(self, obj: NDFrameT, n: int, keep: str) -> None:
58+
def __init__(
59+
self, obj: NDFrameT, n: int, keep: Literal["first", "last", "all"]
60+
) -> None:
5861
self.obj = obj
5962
self.n = n
6063
self.keep = keep
@@ -111,15 +114,25 @@ def compute(self, method: str) -> Series:
111114
if n <= 0:
112115
return self.obj[[]]
113116

114-
dropped = self.obj.dropna()
115-
nan_index = self.obj.drop(dropped.index)
117+
# Save index and reset to default index to avoid performance impact
118+
# from when index contains duplicates
119+
original_index: Index = self.obj.index
120+
default_index = self.obj.reset_index(drop=True)
116121

117-
# slow method
118-
if n >= len(self.obj):
122+
# Slower method used when taking the full length of the series
123+
# In this case, it is equivalent to a sort.
124+
if n >= len(default_index):
119125
ascending = method == "nsmallest"
120-
return self.obj.sort_values(ascending=ascending).head(n)
126+
result = default_index.sort_values(ascending=ascending, kind="stable").head(
127+
n
128+
)
129+
result.index = original_index.take(result.index)
130+
return result
131+
132+
# Fast method used in the general case
133+
dropped = default_index.dropna()
134+
nan_index = default_index.drop(dropped.index)
121135

122-
# fast method
123136
new_dtype = dropped.dtype
124137

125138
# Similar to algorithms._ensure_data
@@ -158,7 +171,7 @@ def compute(self, method: str) -> Series:
158171
else:
159172
kth_val = np.nan
160173
(ns,) = np.nonzero(arr <= kth_val)
161-
inds = ns[arr[ns].argsort(kind="mergesort")]
174+
inds = ns[arr[ns].argsort(kind="stable")]
162175

163176
if self.keep != "all":
164177
inds = inds[:n]
@@ -173,7 +186,9 @@ def compute(self, method: str) -> Series:
173186
# reverse indices
174187
inds = narr - 1 - inds
175188

176-
return concat([dropped.iloc[inds], nan_index]).iloc[:findex]
189+
result = concat([dropped.iloc[inds], nan_index]).iloc[:findex]
190+
result.index = original_index.take(result.index)
191+
return result
177192

178193

179194
class SelectNFrame(SelectN[DataFrame]):
@@ -192,7 +207,13 @@ class SelectNFrame(SelectN[DataFrame]):
192207
nordered : DataFrame
193208
"""
194209

195-
def __init__(self, obj: DataFrame, n: int, keep: str, columns: IndexLabel) -> None:
210+
def __init__(
211+
self,
212+
obj: DataFrame,
213+
n: int,
214+
keep: Literal["first", "last", "all"],
215+
columns: IndexLabel,
216+
) -> None:
196217
super().__init__(obj, n, keep)
197218
if not is_list_like(columns) or isinstance(columns, tuple):
198219
columns = [columns]
@@ -277,4 +298,4 @@ def get_indexer(current_indexer: Index, other_indexer: Index) -> Index:
277298

278299
ascending = method == "nsmallest"
279300

280-
return frame.sort_values(columns, ascending=ascending, kind="mergesort")
301+
return frame.sort_values(columns, ascending=ascending, kind="stable")
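The changes above switch selection to a stable sort and work on integer positions (via ``reset_index``) so a duplicate-laden index never enters the hot path. A pure-Python sketch of stable top-n selection over positions (illustrative only, not pandas internals):

```python
def nlargest_positions(values, n, keep="first"):
    # Sort positions by value, descending. Python's sort is stable (even
    # with reverse=True), so tied values keep their original relative
    # order -- exactly what keep="first" semantics require.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    if keep == "all" and 0 < n < len(order):
        # Extend past n while values still tie with the n-th kept value.
        cutoff = values[order[n - 1]]
        while n < len(order) and values[order[n]] == cutoff:
            n += 1
    return order[:n]


vals = [3, 1, 3, 2]
print(nlargest_positions(vals, 2))           # [0, 2]: tie broken by position
print(nlargest_positions(vals, 1, "all"))    # [0, 2]: keep="all" keeps ties
```

Mapping the selected positions back through the saved index (``original_index.take(...)`` in the diff) is what restores the caller's labels at the end.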
