Commit 456796e

Merge branch 'main' into main
2 parents d1cfb4d + b983366 commit 456796e

File tree

18 files changed: +310 −21 lines changed

.github/workflows/unit-tests.yml

Lines changed: 13 additions & 3 deletions

@@ -22,10 +22,11 @@ defaults:
 jobs:
   ubuntu:
-    runs-on: ubuntu-22.04
+    runs-on: ${{ matrix.platform }}
     timeout-minutes: 90
     strategy:
       matrix:
+        platform: [ubuntu-22.04, ubuntu-24.04-arm]
         env_file: [actions-310.yaml, actions-311.yaml, actions-312.yaml]
         # Prevent the include jobs from overriding other jobs
         pattern: [""]
@@ -35,9 +36,11 @@ jobs:
           env_file: actions-311-downstream_compat.yaml
           pattern: "not slow and not network and not single_cpu"
           pytest_target: "pandas/tests/test_downstream.py"
+          platform: ubuntu-22.04
         - name: "Minimum Versions"
           env_file: actions-310-minimum_versions.yaml
           pattern: "not slow and not network and not single_cpu"
+          platform: ubuntu-22.04
         - name: "Locale: it_IT"
           env_file: actions-311.yaml
           pattern: "not slow and not network and not single_cpu"
@@ -48,6 +51,7 @@ jobs:
           # Also install it_IT (its encoding is ISO8859-1) but do not activate it.
           # It will be temporarily activated during tests with locale.setlocale
           extra_loc: "it_IT"
+          platform: ubuntu-22.04
         - name: "Locale: zh_CN"
           env_file: actions-311.yaml
           pattern: "not slow and not network and not single_cpu"
@@ -58,25 +62,31 @@ jobs:
           # Also install zh_CN (its encoding is gb2312) but do not activate it.
           # It will be temporarily activated during tests with locale.setlocale
           extra_loc: "zh_CN"
+          platform: ubuntu-22.04
         - name: "Future infer strings"
           env_file: actions-312.yaml
           pandas_future_infer_string: "1"
+          platform: ubuntu-22.04
         - name: "Future infer strings (without pyarrow)"
           env_file: actions-311.yaml
           pandas_future_infer_string: "1"
+          platform: ubuntu-22.04
         - name: "Pypy"
           env_file: actions-pypy-39.yaml
           pattern: "not slow and not network and not single_cpu"
           test_args: "--max-worker-restart 0"
+          platform: ubuntu-22.04
         - name: "Numpy Dev"
           env_file: actions-311-numpydev.yaml
           pattern: "not slow and not network and not single_cpu"
           test_args: "-W error::DeprecationWarning -W error::FutureWarning"
+          platform: ubuntu-22.04
         - name: "Pyarrow Nightly"
           env_file: actions-311-pyarrownightly.yaml
           pattern: "not slow and not network and not single_cpu"
+          platform: ubuntu-22.04
       fail-fast: false
-    name: ${{ matrix.name || format('ubuntu-latest {0}', matrix.env_file) }}
+    name: ${{ matrix.name || format('ubuntu-latest {0}', matrix.env_file) }}-${{ matrix.platform }}
     env:
       PATTERN: ${{ matrix.pattern }}
       LANG: ${{ matrix.lang || 'C.UTF-8' }}
@@ -91,7 +101,7 @@ jobs:
       REMOVE_PYARROW: ${{ matrix.name == 'Future infer strings (without pyarrow)' && '1' || '0' }}
     concurrency:
       # https://github.community/t/concurrecy-not-work-for-push/183068/7
-      group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-${{ matrix.env_file }}-${{ matrix.pattern }}-${{ matrix.extra_apt || '' }}-${{ matrix.pandas_future_infer_string }}
+      group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-${{ matrix.env_file }}-${{ matrix.pattern }}-${{ matrix.extra_apt || '' }}-${{ matrix.pandas_future_infer_string }}-${{ matrix.platform }}
       cancel-in-progress: true

     services:
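The concurrency change above can be illustrated with a small, purely hypothetical Python sketch of how the ``group`` key is composed (the helper below is not a GitHub Actions API): without ``matrix.platform`` in the expression, the x86 job and the new ARM job with otherwise identical matrix values would compute the same group key and cancel each other under ``cancel-in-progress: true``.

```python
def group_key(ref, env_file, pattern, extra_apt="", infer_string="", platform=None):
    """Hypothetical mirror of the workflow's concurrency-group expression."""
    parts = [ref, env_file, pattern, extra_apt, infer_string]
    if platform is not None:
        parts.append(platform)  # the suffix added by this commit
    return "-".join(parts)

# Before the change: both platforms collapse to one group and cancel each other.
old_x86 = group_key("refs/pull/1", "actions-312.yaml", "")
old_arm = group_key("refs/pull/1", "actions-312.yaml", "")

# After the change: the platform suffix keeps the groups distinct.
new_x86 = group_key("refs/pull/1", "actions-312.yaml", "", platform="ubuntu-22.04")
new_arm = group_key("refs/pull/1", "actions-312.yaml", "", platform="ubuntu-24.04-arm")
```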

.github/workflows/wheels.yml

Lines changed: 1 addition & 0 deletions

@@ -94,6 +94,7 @@ jobs:
         buildplat:
           - [ubuntu-22.04, manylinux_x86_64]
           - [ubuntu-22.04, musllinux_x86_64]
+          - [ubuntu-24.04-arm, manylinux_aarch64]
           - [macos-13, macosx_x86_64]
           # Note: M1 images on Github Actions start from macOS 14
           - [macos-14, macosx_arm64]

doc/source/whatsnew/v3.0.0.rst

Lines changed: 2 additions & 0 deletions

@@ -30,6 +30,7 @@ Other enhancements
 ^^^^^^^^^^^^^^^^^^
 - :class:`pandas.api.typing.FrozenList` is available for typing the outputs of :attr:`MultiIndex.names`, :attr:`MultiIndex.codes` and :attr:`MultiIndex.levels` (:issue:`58237`)
 - :class:`pandas.api.typing.SASReader` is available for typing the output of :func:`read_sas` (:issue:`55689`)
+- :meth:`pandas.api.interchange.from_dataframe` now uses the `PyCapsule Interface <https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html>`_ if available, only falling back to the Dataframe Interchange Protocol if that fails (:issue:`60739`)
 - :class:`pandas.api.typing.NoDefault` is available for typing ``no_default``
 - :func:`DataFrame.to_excel` now raises a ``UserWarning`` when the character count in a cell exceeds Excel's limitation of 32767 characters (:issue:`56954`)
 - :func:`pandas.merge` now validates the ``how`` parameter input (merge type) (:issue:`59435`)
@@ -59,6 +60,7 @@ Other enhancements
 - :meth:`DataFrame.plot.scatter` argument ``c`` now accepts a column of strings, where rows with the same string are colored identically (:issue:`16827` and :issue:`16485`)
 - :class:`Rolling` and :class:`Expanding` now support aggregations ``first`` and ``last`` (:issue:`33155`)
 - :func:`read_parquet` accepts ``to_pandas_kwargs``, which are forwarded to :meth:`pyarrow.Table.to_pandas`, enabling additional keywords to customize the conversion to pandas, such as ``maps_as_pydicts`` to read the Parquet map data type as python dictionaries (:issue:`56842`)
+- :meth:`.DataFrameGroupBy.mean`, :meth:`.DataFrameGroupBy.sum`, :meth:`.SeriesGroupBy.mean` and :meth:`.SeriesGroupBy.sum` now accept a ``skipna`` parameter (:issue:`15675`)
 - :meth:`.DataFrameGroupBy.transform`, :meth:`.SeriesGroupBy.transform`, :meth:`.DataFrameGroupBy.agg`, :meth:`.SeriesGroupBy.agg`, :meth:`.SeriesGroupBy.apply`, :meth:`.DataFrameGroupBy.apply` now support ``kurt`` (:issue:`40139`)
 - :meth:`DataFrameGroupBy.transform`, :meth:`SeriesGroupBy.transform`, :meth:`DataFrameGroupBy.agg`, :meth:`SeriesGroupBy.agg`, :meth:`RollingGroupby.apply`, :meth:`ExpandingGroupby.apply`, :meth:`Rolling.apply`, :meth:`Expanding.apply`, :meth:`DataFrame.apply` with ``engine="numba"`` now support positional arguments passed as kwargs (:issue:`58995`)
 - :meth:`Rolling.agg`, :meth:`Expanding.agg` and :meth:`ExponentialMovingWindow.agg` now accept :class:`NamedAgg` aggregations through ``**kwargs`` (:issue:`28333`)
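A hedged usage sketch for the new groupby ``skipna`` entry above. The default behavior is unchanged (NaNs are skipped); the strict variant, which requires a pandas build that includes this change (3.0+), is shown commented out so the sketch runs on earlier versions too.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, np.nan, 2.0]})

# Default behavior (unchanged): the NaN is skipped, so group "a" sums to 1.0.
by_default = df.groupby("key")["val"].sum()

# New in 3.0 per the whatsnew entry: propagate NaN instead of skipping it.
# by_strict = df.groupby("key")["val"].sum(skipna=False)   # group "a" -> NaN
```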

pandas/_libs/groupby.pyi

Lines changed: 2 additions & 0 deletions

@@ -66,6 +66,7 @@ def group_sum(
     result_mask: np.ndarray | None = ...,
     min_count: int = ...,
     is_datetimelike: bool = ...,
+    skipna: bool = ...,
 ) -> None: ...
 def group_prod(
     out: np.ndarray,  # int64float_t[:, ::1]
@@ -115,6 +116,7 @@ def group_mean(
     is_datetimelike: bool = ...,  # bint
     mask: np.ndarray | None = ...,
     result_mask: np.ndarray | None = ...,
+    skipna: bool = ...,
 ) -> None: ...
 def group_ohlc(
     out: np.ndarray,  # floatingintuint_t[:, ::1]

pandas/_libs/groupby.pyx

Lines changed: 47 additions & 1 deletion

@@ -700,13 +700,14 @@ def group_sum(
     uint8_t[:, ::1] result_mask=None,
     Py_ssize_t min_count=0,
     bint is_datetimelike=False,
+    bint skipna=True,
 ) -> None:
     """
     Only aggregates on axis=0 using Kahan summation
     """
     cdef:
         Py_ssize_t i, j, N, K, lab, ncounts = len(counts)
-        sum_t val, t, y
+        sum_t val, t, y, nan_val
         sum_t[:, ::1] sumx, compensation
         int64_t[:, ::1] nobs
         Py_ssize_t len_values = len(values), len_labels = len(labels)
@@ -722,6 +723,15 @@ def group_sum(
     compensation = np.zeros((<object>out).shape, dtype=(<object>out).base.dtype)

     N, K = (<object>values).shape
+    if uses_mask:
+        nan_val = 0
+    elif is_datetimelike:
+        nan_val = NPY_NAT
+    elif sum_t is int64_t or sum_t is uint64_t:
+        # This has no effect as int64 can't be nan. Setting to 0 to avoid type error
+        nan_val = 0
+    else:
+        nan_val = NAN

     with nogil(sum_t is not object):
         for i in range(N):
@@ -734,6 +744,16 @@ def group_sum(
             for j in range(K):
                 val = values[i, j]

+                if not skipna and (
+                    (uses_mask and result_mask[lab, j]) or
+                    (is_datetimelike and sumx[lab, j] == NPY_NAT) or
+                    _treat_as_na(sumx[lab, j], False)
+                ):
+                    # If sum is already NA, don't add to it. This is important for
+                    # datetimelike because adding a value to NPY_NAT may not result
+                    # in NPY_NAT
+                    continue
+
                 if uses_mask:
                     isna_entry = mask[i, j]
                 else:
@@ -765,6 +785,11 @@ def group_sum(
                             # because of no gil
                             compensation[lab, j] = 0
                         sumx[lab, j] = t
+                elif not skipna:
+                    if uses_mask:
+                        result_mask[lab, j] = True
+                    else:
+                        sumx[lab, j] = nan_val

     _check_below_mincount(
         out, uses_mask, result_mask, ncounts, K, nobs, min_count, sumx
@@ -1100,6 +1125,7 @@ def group_mean(
     bint is_datetimelike=False,
     const uint8_t[:, ::1] mask=None,
     uint8_t[:, ::1] result_mask=None,
+    bint skipna=True,
 ) -> None:
     """
     Compute the mean per label given a label assignment for each value.
@@ -1125,6 +1151,8 @@ def group_mean(
         Mask of the input values.
     result_mask : ndarray[bool, ndim=2], optional
         Mask of the out array
+    skipna : bool, optional
+        If True, ignore nans in `values`.

     Notes
     -----
@@ -1168,6 +1196,16 @@ def group_mean(
             for j in range(K):
                 val = values[i, j]

+                if not skipna and (
+                    (uses_mask and result_mask[lab, j]) or
+                    (is_datetimelike and sumx[lab, j] == NPY_NAT) or
+                    _treat_as_na(sumx[lab, j], False)
+                ):
+                    # If sum is already NA, don't add to it. This is important for
+                    # datetimelike because adding a value to NPY_NAT may not result
+                    # in NPY_NAT
+                    continue
+
                 if uses_mask:
                     isna_entry = mask[i, j]
                 elif is_datetimelike:
@@ -1191,6 +1229,14 @@ def group_mean(
                             # because of no gil
                             compensation[lab, j] = 0.
                         sumx[lab, j] = t
+                elif not skipna:
+                    # Set the nobs to 0 so that in case of datetimelike,
+                    # dividing NPY_NAT by nobs may not result in NPY_NAT
+                    nobs[lab, j] = 0
+                    if uses_mask:
+                        result_mask[lab, j] = True
+                    else:
+                        sumx[lab, j] = nan_val

     for i in range(ncounts):
         for j in range(K):
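The skipna branches added to ``group_sum`` above can be mirrored in a short plain-Python sketch (the function name is illustrative, not the Cython API): once a group's running sum is NA it is never added to again, and with ``skipna=False`` a NaN input marks the whole group NA.

```python
import math

def group_sum_skipna(values, labels, ngroups, skipna=True):
    """Plain-Python mirror of the new skipna logic in group_sum: a running
    sum that is already NA is left untouched (important for datetimelike
    data, where adding to the NPY_NAT sentinel would corrupt it), and with
    skipna=False a NaN value poisons its group."""
    sumx = [0.0] * ngroups
    for val, lab in zip(values, labels):
        if lab < 0:
            continue  # unassigned row
        if not skipna and math.isnan(sumx[lab]):
            continue  # sum already NA: don't add to it
        if math.isnan(val):
            if not skipna:
                sumx[lab] = float("nan")  # propagate the NA
            continue  # skipna=True: just ignore the value
        sumx[lab] += val
    return sumx

group_sum_skipna([1.0, float("nan"), 2.0, 3.0], [0, 0, 1, 1], 2)
```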

pandas/core/_numba/kernels/mean_.py

Lines changed: 2 additions & 1 deletion

@@ -169,9 +169,10 @@ def grouped_mean(
     labels: npt.NDArray[np.intp],
     ngroups: int,
     min_periods: int,
+    skipna: bool,
 ) -> tuple[np.ndarray, list[int]]:
     output, nobs_arr, comp_arr, consecutive_counts, prev_vals = grouped_kahan_sum(
-        values, result_dtype, labels, ngroups
+        values, result_dtype, labels, ngroups, skipna
     )

     # Post-processing, replace sums that don't satisfy min_periods

pandas/core/_numba/kernels/sum_.py

Lines changed: 12 additions & 2 deletions

@@ -165,6 +165,7 @@ def grouped_kahan_sum(
     result_dtype: np.dtype,
     labels: npt.NDArray[np.intp],
     ngroups: int,
+    skipna: bool,
 ) -> tuple[
     np.ndarray, npt.NDArray[np.int64], np.ndarray, npt.NDArray[np.int64], np.ndarray
 ]:
@@ -180,7 +181,15 @@ def grouped_kahan_sum(
         lab = labels[i]
         val = values[i]

-        if lab < 0:
+        if lab < 0 or np.isnan(output[lab]):
+            continue
+
+        if not skipna and np.isnan(val):
+            output[lab] = np.nan
+            nobs_arr[lab] += 1
+            comp_arr[lab] = np.nan
+            consecutive_counts[lab] = 1
+            prev_vals[lab] = np.nan
             continue

         sum_x = output[lab]
@@ -219,11 +228,12 @@ def grouped_sum(
     labels: npt.NDArray[np.intp],
     ngroups: int,
     min_periods: int,
+    skipna: bool,
 ) -> tuple[np.ndarray, list[int]]:
     na_pos = []

     output, nobs_arr, comp_arr, consecutive_counts, prev_vals = grouped_kahan_sum(
-        values, result_dtype, labels, ngroups
+        values, result_dtype, labels, ngroups, skipna
     )

     # Post-processing, replace sums that don't satisfy min_periods
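The kernel above is built on Kahan (compensated) summation. A minimal Python sketch of the same idea, including the new ``skipna`` poisoning, shows how the per-group compensation term recovers low-order bits that plain accumulation drops (the function name is illustrative, not the numba kernel's API):

```python
import math

def kahan_grouped_sum(values, labels, ngroups, skipna=True):
    """Compensated grouped summation in the spirit of grouped_kahan_sum:
    each group carries a running compensation that captures rounding error;
    with skipna=False a NaN poisons its group, and a poisoned group is
    skipped on later iterations."""
    output = [0.0] * ngroups
    comp = [0.0] * ngroups  # per-group running compensation
    for val, lab in zip(values, labels):
        if lab < 0 or math.isnan(output[lab]):
            continue  # unassigned row, or group already NaN
        if math.isnan(val):
            if not skipna:
                output[lab] = float("nan")
            continue
        y = val - comp[lab]                 # correct the incoming value
        t = output[lab] + y
        comp[lab] = (t - output[lab]) - y   # newly lost low-order bits
        output[lab] = t
    return output

vals = [0.1] * 10
totals = kahan_grouped_sum(vals, [0] * 10, 1)
```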

pandas/core/frame.py

Lines changed: 2 additions & 1 deletion

@@ -6890,7 +6890,8 @@ def sort_values(
         builtin :meth:`sorted` function, with the notable difference that
         this `key` function should be *vectorized*. It should expect a
         ``Series`` and return a Series with the same shape as the input.
-        It will be applied to each column in `by` independently.
+        It will be applied to each column in `by` independently. The values in the
+        returned Series will be used as the keys for sorting.

     Returns
     -------

pandas/core/generic.py

Lines changed: 2 additions & 1 deletion

@@ -4884,7 +4884,8 @@ def sort_values(
         builtin :meth:`sorted` function, with the notable difference that
         this `key` function should be *vectorized*. It should expect a
         ``Series`` and return a Series with the same shape as the input.
-        It will be applied to each column in `by` independently.
+        It will be applied to each column in `by` independently. The values in the
+        returned Series will be used as the keys for sorting.

     Returns
     -------
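A minimal example of the clarified ``key`` semantics: the Series returned by ``key`` supplies the values actually compared during the sort, applied to each column in ``by`` independently.

```python
import pandas as pd

df = pd.DataFrame({"name": ["banana", "Apple", "cherry"]})

# The lowercased strings, not the originals, are the sort keys, so the
# capitalized "Apple" sorts before "banana".
result = df.sort_values(by="name", key=lambda s: s.str.lower())
```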
