Commit 439385c

Merge branch 'main' into bug/dt_attr/non-nano
2 parents: bee2282 + a6d5db7

133 files changed (+1723, -795 lines)

.github/workflows/assign.yml (-19)

This file was deleted.

.github/workflows/preview-docs.yml renamed to .github/workflows/comment-commands.yml (+11, -5)
@@ -1,19 +1,25 @@
-name: Preview docs
+name: Comment Commands
 on:
   issue_comment:
     types: created

 permissions:
   contents: read
+  issues: write
+  pull-requests: write

 jobs:
+  issue_assign:
+    runs-on: ubuntu-22.04
+    steps:
+      - if: (!github.event.issue.pull_request) && github.event.comment.body == 'take'
+        run: |
+          echo "Assigning issue ${{ github.event.issue.number }} to ${{ github.event.comment.user.login }}"
+          curl -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" -d '{"assignees": ["${{ github.event.comment.user.login }}"]}' https://api.github.com/repos/${{ github.repository }}/issues/${{ github.event.issue.number }}/assignees
   preview_docs:
-    permissions:
-      issues: write
-      pull-requests: write
     runs-on: ubuntu-22.04
     steps:
-      - if: github.event.comment.body == '/preview'
+      - if: github.event.issue.pull_request && github.event.comment.body == '/preview'
         run: |
           if curl --output /dev/null --silent --head --fail "https://pandas.pydata.org/preview/${{ github.event.issue.number }}/"; then
             curl -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" -d '{"body": "Website preview of this PR available at: https://pandas.pydata.org/preview/${{ github.event.issue.number }}/"}' https://api.github.com/repos/${{ github.repository }}/issues/${{ github.event.issue.number }}/comments

asv_bench/benchmarks/io/csv.py (+15)
@@ -555,4 +555,19 @@ def time_read_csv_index_col(self):
         read_csv(self.StringIO_input, index_col="a")


+class ReadCSVDatePyarrowEngine(StringIORewind):
+    def setup(self):
+        count_elem = 100_000
+        data = "a\n" + "2019-12-31\n" * count_elem
+        self.StringIO_input = StringIO(data)
+
+    def time_read_csv_index_col(self):
+        read_csv(
+            self.StringIO_input,
+            parse_dates=["a"],
+            engine="pyarrow",
+            dtype_backend="pyarrow",
+        )
+
+
 from ..pandas_vb_common import setup  # noqa: F401 isort:skip
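
Outside the asv harness, the new benchmark boils down to the following standalone sketch; the 1_000-row input is a scaled-down assumption (the benchmark itself uses 100_000 rows):

from io import StringIO

import pandas as pd

# What the benchmark measures: parsing a date column with the pyarrow
# engine and pyarrow-backed dtypes (pandas >= 2.0).
data = "a\n" + "2019-12-31\n" * 1_000
df = pd.read_csv(
    StringIO(data),
    parse_dates=["a"],
    engine="pyarrow",
    dtype_backend="pyarrow",
)
# With the GH 52546 fix, column "a" keeps a pyarrow timestamp dtype
# instead of being round-tripped through NumPy.
print(df.dtypes)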

doc/source/user_guide/cookbook.rst (+2, -2)
@@ -459,7 +459,7 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to
    df

    # List the size of the animals with the highest weight.
-   df.groupby("animal").apply(lambda subf: subf["size"][subf["weight"].idxmax()])
+   df.groupby("animal")[["size", "weight"]].apply(lambda subf: subf["size"][subf["weight"].idxmax()])

 `Using get_group
 <https://stackoverflow.com/questions/14734533/how-to-access-pandas-groupby-dataframe-by-key>`__
@@ -482,7 +482,7 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to
        return pd.Series(["L", avg_weight, True], index=["size", "weight", "adult"])

-   expected_df = gb.apply(GrowUp)
+   expected_df = gb[["size", "weight"]].apply(GrowUp)
    expected_df

 `Expanding apply
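
Both cookbook edits follow the same pattern: select the columns to operate on before apply, so the grouping column is no longer passed to the callable (something newer pandas deprecates). A small self-contained sketch of the first change; the frame here is illustrative, not the cookbook's:

import pandas as pd

df = pd.DataFrame(
    {
        "animal": ["cat", "dog", "cat", "dog"],
        "size": ["S", "M", "L", "XL"],
        "weight": [8, 10, 12, 30],
    }
)
# Selecting [["size", "weight"]] keeps "animal" out of the sub-DataFrame,
# so apply never touches the grouping column.
out = df.groupby("animal")[["size", "weight"]].apply(
    lambda subf: subf["size"][subf["weight"].idxmax()]
)
print(out)  # size of the heaviest animal per group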

doc/source/user_guide/groupby.rst (+10, -4)
@@ -430,6 +430,12 @@ This is mainly syntactic sugar for the alternative, which is much more verbose:
 Additionally, this method avoids recomputing the internal grouping information
 derived from the passed key.

+You can also include the grouping columns if you want to operate on them.
+
+.. ipython:: python
+
+   grouped[["A", "B"]].sum()
+
 .. _groupby.iterating-label:

 Iterating through groups
@@ -1067,7 +1073,7 @@ missing values with the ``ffill()`` method.
    ).set_index("date")
    df_re

-   df_re.groupby("group").resample("1D").ffill()
+   df_re.groupby("group")[["val"]].resample("1D").ffill()

 .. _groupby.filter:

@@ -1233,13 +1239,13 @@ the argument ``group_keys`` which defaults to ``True``. Compare

 .. ipython:: python

-   df.groupby("A", group_keys=True).apply(lambda x: x)
+   df.groupby("A", group_keys=True)[["B", "C", "D"]].apply(lambda x: x)

 with

 .. ipython:: python

-   df.groupby("A", group_keys=False).apply(lambda x: x)
+   df.groupby("A", group_keys=False)[["B", "C", "D"]].apply(lambda x: x)


 Numba Accelerated Routines
@@ -1722,7 +1728,7 @@ column index name will be used as the name of the inserted column:
        result = {"b_sum": x["b"].sum(), "c_mean": x["c"].mean()}
        return pd.Series(result, name="metrics")

-   result = df.groupby("a").apply(compute_metrics)
+   result = df.groupby("a")[["b", "c"]].apply(compute_metrics)

    result
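
The group_keys contrast these doc edits touch is easiest to see side by side. A compact sketch, using an illustrative frame rather than the guide's:

import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "y"], "B": [1, 2, 3]})

# group_keys=True prepends the group labels to the result's index ...
with_keys = df.groupby("A", group_keys=True)[["B"]].apply(lambda g: g)
# ... while group_keys=False leaves the original index untouched.
without_keys = df.groupby("A", group_keys=False)[["B"]].apply(lambda g: g)

print(with_keys.index)     # MultiIndex: ("x", 0), ("x", 1), ("y", 2)
print(without_keys.index)  # original RangeIndex: 0, 1, 2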

doc/source/user_guide/io.rst (+12)
@@ -3449,6 +3449,18 @@ Reading Excel files
 In the most basic use-case, ``read_excel`` takes a path to an Excel
 file, and the ``sheet_name`` indicating which sheet to parse.

+When using the ``engine_kwargs`` parameter, pandas will pass these arguments to the
+engine. For this, it is important to know which function pandas is
+using internally.
+
+* For the engine openpyxl, pandas is using :func:`openpyxl.load_workbook` to read in (``.xlsx``) and (``.xlsm``) files.
+
+* For the engine xlrd, pandas is using :func:`xlrd.open_workbook` to read in (``.xls``) files.
+
+* For the engine pyxlsb, pandas is using :func:`pyxlsb.open_workbook` to read in (``.xlsb``) files.
+
+* For the engine odf, pandas is using :func:`odf.opendocument.load` to read in (``.ods``) files.
+
 .. code-block:: python

    # Returns a DataFrame
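
For illustration, a hedged sketch of engine_kwargs in use. keep_vba is a genuine openpyxl.load_workbook option, but the file path is a placeholder; note that pandas typically passes some load_workbook arguments (such as read_only) itself, so engine_kwargs is for options beyond those defaults:

import pandas as pd

# engine_kwargs is forwarded to openpyxl.load_workbook; keep_vba=True asks
# openpyxl to preserve VBA content when opening a macro-enabled workbook.
df = pd.read_excel(
    "macro_book.xlsm",  # placeholder path
    sheet_name=0,
    engine="openpyxl",
    engine_kwargs={"keep_vba": True},
)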

doc/source/user_guide/scale.rst (+4)
@@ -257,6 +257,7 @@ We'll import ``dask.dataframe`` and notice that the API feels similar to pandas.
 We can use Dask's ``read_parquet`` function, but provide a globstring of files to read in.

 .. ipython:: python
+   :okwarning:

    import dask.dataframe as dd
@@ -286,6 +287,7 @@ column names and dtypes. That's because Dask hasn't actually read the data yet.
 Rather than executing immediately, doing operations build up a **task graph**.

 .. ipython:: python
+   :okwarning:

    ddf
    ddf["name"]
@@ -300,6 +302,7 @@ returns a Dask Series with the same dtype and the same name.
 To get the actual result you can call ``.compute()``.

 .. ipython:: python
+   :okwarning:

    %time ddf["name"].value_counts().compute()
@@ -345,6 +348,7 @@ known automatically. In this case, since we created the parquet files manually,
 we need to supply the divisions manually.

 .. ipython:: python
+   :okwarning:

    N = 12
    starts = [f"20{i:>02d}-01-01" for i in range(N)]
doc/source/whatsnew/v0.14.0.rst

+18-6
Original file line numberDiff line numberDiff line change
@@ -328,13 +328,25 @@ More consistent behavior for some groupby methods:
328328

329329
- groupby ``head`` and ``tail`` now act more like ``filter`` rather than an aggregation:
330330

331-
.. ipython:: python
331+
.. code-block:: ipython
332332
333-
df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
334-
g = df.groupby('A')
335-
g.head(1) # filters DataFrame
333+
In [1]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
334+
335+
In [2]: g = df.groupby('A')
336+
337+
In [3]: g.head(1) # filters DataFrame
338+
Out[3]:
339+
A B
340+
0 1 2
341+
2 5 6
342+
343+
In [4]: g.apply(lambda x: x.head(1)) # used to simply fall-through
344+
Out[4]:
345+
A B
346+
A
347+
1 0 1 2
348+
5 2 5 6
336349
337-
g.apply(lambda x: x.head(1)) # used to simply fall-through
338350
339351
- groupby head and tail respect column selection:
340352

@@ -494,7 +506,7 @@ See also issues (:issue:`6134`, :issue:`4036`, :issue:`3057`, :issue:`2598`, :is
494506

495507
You should specify all axes in the ``.loc`` specifier, meaning the indexer for the **index** and
496508
for the **columns**. Their are some ambiguous cases where the passed indexer could be mis-interpreted
497-
as indexing *both* axes, rather than into say the MuliIndex for the rows.
509+
as indexing *both* axes, rather than into say the MultiIndex for the rows.
498510

499511
You should do this:
500512
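
The .loc guidance in the second hunk is about disambiguation on a MultiIndex. A small sketch of what "specify all axes" means in practice; the frame is illustrative, not from the whatsnew:

import pandas as pd

df = pd.DataFrame(
    {"x": range(4)},
    index=pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["k1", "k2"]),
)

# A bare tuple is ambiguous: it could mean a row key, or (rows, columns).
# Giving indexers for BOTH axes makes the tuple unambiguously row-wise.
print(df.loc[("a", 1), :])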

doc/source/whatsnew/v0.18.1.rst (+87, -6)
@@ -77,9 +77,52 @@ Previously you would have to do this to get a rolling window mean per-group:
    df = pd.DataFrame({"A": [1] * 20 + [2] * 12 + [3] * 8, "B": np.arange(40)})
    df

-.. ipython:: python
+.. code-block:: ipython

-   df.groupby("A").apply(lambda x: x.rolling(4).B.mean())
+   In [1]: df.groupby("A").apply(lambda x: x.rolling(4).B.mean())
+   Out[1]:
+   A
+   1  0      NaN
+      1      NaN
+      2      NaN
+      3      1.5
+      4      2.5
+      5      3.5
+      6      4.5
+      7      5.5
+      8      6.5
+      9      7.5
+      10     8.5
+      11     9.5
+      12    10.5
+      13    11.5
+      14    12.5
+      15    13.5
+      16    14.5
+      17    15.5
+      18    16.5
+      19    17.5
+   2  20     NaN
+      21     NaN
+      22     NaN
+      23    21.5
+      24    22.5
+      25    23.5
+      26    24.5
+      27    25.5
+      28    26.5
+      29    27.5
+      30    28.5
+      31    29.5
+   3  32     NaN
+      33     NaN
+      34     NaN
+      35    33.5
+      36    34.5
+      37    35.5
+      38    36.5
+      39    37.5
+   Name: B, dtype: float64

 Now you can do:

@@ -101,15 +144,53 @@ For ``.resample(..)`` type of operations, previously you would have to:

    df

-.. ipython:: python
+.. code-block:: ipython

-   df.groupby("group").apply(lambda x: x.resample("1D").ffill())
+   In [1]: df.groupby("group").apply(lambda x: x.resample("1D").ffill())
+   Out[1]:
+                     group  val
+   group date
+   1     2016-01-03      1    5
+         2016-01-04      1    5
+         2016-01-05      1    5
+         2016-01-06      1    5
+         2016-01-07      1    5
+         2016-01-08      1    5
+         2016-01-09      1    5
+         2016-01-10      1    6
+   2     2016-01-17      2    7
+         2016-01-18      2    7
+         2016-01-19      2    7
+         2016-01-20      2    7
+         2016-01-21      2    7
+         2016-01-22      2    7
+         2016-01-23      2    7
+         2016-01-24      2    8

 Now you can do:

-.. ipython:: python
+.. code-block:: ipython

-   df.groupby("group").resample("1D").ffill()
+   In [1]: df.groupby("group").resample("1D").ffill()
+   Out[1]:
+                     group  val
+   group date
+   1     2016-01-03      1    5
+         2016-01-04      1    5
+         2016-01-05      1    5
+         2016-01-06      1    5
+         2016-01-07      1    5
+         2016-01-08      1    5
+         2016-01-09      1    5
+         2016-01-10      1    6
+   2     2016-01-17      2    7
+         2016-01-18      2    7
+         2016-01-19      2    7
+         2016-01-20      2    7
+         2016-01-21      2    7
+         2016-01-22      2    7
+         2016-01-23      2    7
+         2016-01-24      2    8

 .. _whatsnew_0181.enhancements.method_chain:

doc/source/whatsnew/v2.0.1.rst (+9)
@@ -15,6 +15,9 @@ Fixed regressions
 ~~~~~~~~~~~~~~~~~
 - Fixed regression for subclassed Series when constructing from a dictionary (:issue:`52445`)
 - Fixed regression in :meth:`Series.describe` showing ``RuntimeWarning`` for extension dtype :class:`Series` with one element (:issue:`52515`)
+- Fixed regression in :meth:`DataFrame.sort_values` not resetting index when :class:`DataFrame` is already sorted and ``ignore_index=True`` (:issue:`52553`)
+- Fixed regression in :meth:`MultiIndex.isin` raising ``TypeError`` for ``Generator`` (:issue:`52568`)
+- Fixed regression in :meth:`DataFrame.pivot` changing :class:`Index` name of input object (:issue:`52629`)

 .. ---------------------------------------------------------------------------
 .. _whatsnew_201.bug_fixes:
@@ -27,12 +30,18 @@ Bug fixes
 - Bug in :meth:`Series.describe` not returning :class:`ArrowDtype` with ``pyarrow.float64`` type with numeric data (:issue:`52427`)
 - Fixed segfault in :meth:`Series.to_numpy` with ``null[pyarrow]`` dtype (:issue:`52443`)
 - Bug in :func:`pandas.testing.assert_series_equal` where ``check_dtype=False`` would still raise for datetime or timedelta types with different resolutions (:issue:`52449`)
+- Bug in :meth:`DataFrame.max` and related casting different :class:`Timestamp` resolutions always to nanoseconds (:issue:`52524`)
+- Bug in :meth:`ArrowDtype.__from_arrow__` not respecting if dtype is explicitly given (:issue:`52533`)
+- Bug in :func:`read_csv` casting PyArrow datetimes to NumPy when ``dtype_backend="pyarrow"`` and ``parse_dates`` is set causing a performance bottleneck in the process (:issue:`52546`)
+- Bug in :class:`arrays.DatetimeArray` constructor returning an incorrect unit when passed a non-nanosecond numpy datetime array (:issue:`52555`)
+- Bug in :func:`to_numeric` with ``errors='coerce'`` and ``dtype_backend='pyarrow'`` with :class:`ArrowDtype` data (:issue:`52588`)

 .. ---------------------------------------------------------------------------
 .. _whatsnew_201.other:

 Other
 ~~~~~
+- Implemented :meth:`Series.str.split` and :meth:`Series.str.rsplit` for :class:`ArrowDtype` with ``pyarrow.string`` (:issue:`52401`)
 - :class:`DataFrame` created from empty dicts had :attr:`~DataFrame.columns` of dtype ``object``. It is now a :class:`RangeIndex` (:issue:`52404`)
 - :class:`Series` created from empty dicts had :attr:`~Series.index` of dtype ``object``. It is now a :class:`RangeIndex` (:issue:`52404`)
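
One of the entries above (GH 52401) adds native splitting for pyarrow-backed strings. A brief hedged sketch of the new behavior:

import pandas as pd
import pyarrow as pa

# Series.str.split now works directly on ArrowDtype(pyarrow.string()) data
# (GH 52401); the result stays pyarrow-backed as a list-of-string column.
s = pd.Series(["a,b,c", "d,e"], dtype=pd.ArrowDtype(pa.string()))
print(s.str.split(","))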
