Commit 55e8a0d

Merge branch 'main' into BUG-53846-extractall-with-arrow-returns-object-dtype

2 parents ee33c4e + 59d4e84

155 files changed: +1181 -593 lines


ci/code_checks.sh (+2 -3)

@@ -176,9 +176,8 @@ fi
 
 ### SINGLE-PAGE DOCS ###
 if [[ -z "$CHECK" || "$CHECK" == "single-docs" ]]; then
-    python doc/make.py --warnings-are-errors --single pandas.Series.value_counts
-    python doc/make.py --warnings-are-errors --single pandas.Series.str.split
-    python doc/make.py clean
+    python doc/make.py --warnings-are-errors --no-browser --single pandas.Series.value_counts
+    python doc/make.py --warnings-are-errors --no-browser --single pandas.Series.str.split
 fi
 
 exit $RET

doc/make.py (+11 -2)

@@ -45,12 +45,14 @@ def __init__(
         single_doc=None,
         verbosity=0,
         warnings_are_errors=False,
+        no_browser=False,
     ) -> None:
         self.num_jobs = num_jobs
         self.include_api = include_api
         self.whatsnew = whatsnew
         self.verbosity = verbosity
         self.warnings_are_errors = warnings_are_errors
+        self.no_browser = no_browser
 
         if single_doc:
             single_doc = self._process_single_doc(single_doc)
@@ -234,11 +236,11 @@ def html(self):
             os.remove(zip_fname)
 
         if ret_code == 0:
-            if self.single_doc_html is not None:
+            if self.single_doc_html is not None and not self.no_browser:
                 self._open_browser(self.single_doc_html)
             else:
                 self._add_redirects()
-                if self.whatsnew:
+                if self.whatsnew and not self.no_browser:
                     self._open_browser(os.path.join("whatsnew", "index.html"))
 
         return ret_code
@@ -349,6 +351,12 @@ def main():
         action="store_true",
         help="fail if warnings are raised",
     )
+    argparser.add_argument(
+        "--no-browser",
+        help="Don't open browser",
+        default=False,
+        action="store_true",
+    )
     args = argparser.parse_args()
 
     if args.command not in cmds:
@@ -374,6 +382,7 @@ def main():
         args.single,
         args.verbosity,
         args.warnings_are_errors,
+        args.no_browser,
     )
     return getattr(builder, args.command)()
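The new ``--no-browser`` option is a plain argparse boolean flag. A minimal, self-contained sketch of how it behaves — the flag definition mirrors the diff, while the ``maybe_open`` helper is hypothetical, standing in for the guarded ``_open_browser`` calls:

```python
import argparse
import webbrowser

# Same flag definition as the diff adds to doc/make.py: action="store_true"
# makes the option default to False and flip to True when passed.
argparser = argparse.ArgumentParser()
argparser.add_argument(
    "--no-browser",
    help="Don't open browser",
    default=False,
    action="store_true",
)

# argparse maps the dashed option name to an underscored attribute.
print(argparser.parse_args([]).no_browser)                # False
print(argparser.parse_args(["--no-browser"]).no_browser)  # True


# Hypothetical helper illustrating how the builder then gates browser opening.
def maybe_open(url: str, no_browser: bool) -> bool:
    if no_browser:
        return False
    webbrowser.open(url)
    return True
```

This is what lets the CI single-docs check (above) build pages headlessly without a browser popping up on a developer machine.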

doc/source/development/contributing_codebase.rst (+1 -1)

@@ -540,7 +540,7 @@ xfail during the testing phase. To do so, use the ``request`` fixture:
 
     def test_xfail(request):
         mark = pytest.mark.xfail(raises=TypeError, reason="Indicate why here")
-        request.node.add_marker(mark)
+        request.applymarker(mark)
 
 xfail is not to be used for tests involving failure due to invalid user arguments.
 For these tests, we need to verify the correct exception type and error message
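``request.applymarker`` is the current spelling of the older ``request.node.add_marker``. A sketch of the recommended pattern (assumes pytest is installed; the failing body is illustrative):

```python
import pytest


def test_xfail(request):
    # Dynamically mark this test as xfail during the run;
    # applymarker replaces the older request.node.add_marker spelling.
    mark = pytest.mark.xfail(raises=TypeError, reason="Indicate why here")
    request.applymarker(mark)
    raise TypeError("expected failure")  # reported as xfail, not as a failure
```

Run under pytest, the raised ``TypeError`` is reported as an expected failure rather than an error.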

doc/source/getting_started/comparison/comparison_with_sql.rst (+5 -5)

@@ -164,24 +164,24 @@ The pandas equivalent would be:
 
     tips.groupby("sex").size()
 
-Notice that in the pandas code we used :meth:`~pandas.core.groupby.DataFrameGroupBy.size` and not
-:meth:`~pandas.core.groupby.DataFrameGroupBy.count`. This is because
-:meth:`~pandas.core.groupby.DataFrameGroupBy.count` applies the function to each column, returning
+Notice that in the pandas code we used :meth:`.DataFrameGroupBy.size` and not
+:meth:`.DataFrameGroupBy.count`. This is because
+:meth:`.DataFrameGroupBy.count` applies the function to each column, returning
 the number of ``NOT NULL`` records within each.
 
 .. ipython:: python
 
     tips.groupby("sex").count()
 
-Alternatively, we could have applied the :meth:`~pandas.core.groupby.DataFrameGroupBy.count` method
+Alternatively, we could have applied the :meth:`.DataFrameGroupBy.count` method
 to an individual column:
 
 .. ipython:: python
 
     tips.groupby("sex")["total_bill"].count()
 
 Multiple functions can also be applied at once. For instance, say we'd like to see how tip amount
-differs by day of the week - :meth:`~pandas.core.groupby.DataFrameGroupBy.agg` allows you to pass a dictionary
+differs by day of the week - :meth:`.DataFrameGroupBy.agg` allows you to pass a dictionary
 to your grouped DataFrame, indicating which functions to apply to specific columns.
 
 .. code-block:: sql
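The ``size``/``count`` distinction the hunk describes can be checked on a tiny stand-in for the tips dataset (hypothetical data, not the doc's real tips file; the ``None`` plays the role of a SQL ``NULL``):

```python
import pandas as pd

# Tiny stand-in for the tips dataset.
tips = pd.DataFrame(
    {"sex": ["Female", "Male", "Male"], "total_bill": [16.99, None, 21.01]}
)

# size() counts rows per group, NULLs included...
print(tips.groupby("sex").size().to_dict())                 # {'Female': 1, 'Male': 2}
# ...while count() counts only NOT NULL values per column.
print(tips.groupby("sex")["total_bill"].count().to_dict())  # {'Female': 1, 'Male': 1}
```

The ``Male`` group has two rows but only one non-null ``total_bill``, which is exactly the ``COUNT(*)`` versus ``COUNT(column)`` difference from SQL.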

doc/source/user_guide/10min.rst (+3 -1)

@@ -525,7 +525,7 @@ See the :ref:`Grouping section <groupby>`.
     df
 
 Grouping by a column label, selecting column labels, and then applying the
-:meth:`~pandas.core.groupby.DataFrameGroupBy.sum` function to the resulting
+:meth:`.DataFrameGroupBy.sum` function to the resulting
 groups:
 
 .. ipython:: python
@@ -763,12 +763,14 @@ Parquet
 Writing to a Parquet file:
 
 .. ipython:: python
+    :okwarning:
 
     df.to_parquet("foo.parquet")
 
 Reading from a Parquet file Store using :func:`read_parquet`:
 
 .. ipython:: python
+    :okwarning:
 
     pd.read_parquet("foo.parquet")

doc/source/user_guide/copy_on_write.rst (+75 -55)

@@ -7,8 +7,8 @@ Copy-on-Write (CoW)
 *******************
 
 Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 most of the
-optimizations that become possible through CoW are implemented and supported. A complete list
-can be found at :ref:`Copy-on-Write optimizations <copy_on_write.optimizations>`.
+optimizations that become possible through CoW are implemented and supported. All possible
+optimizations are supported starting from pandas 2.1.
 
 We expect that CoW will be enabled by default in version 3.0.
 
@@ -154,66 +154,86 @@ With copy on write this can be done by using ``loc``.
 
     df.loc[df["bar"] > 5, "foo"] = 100
 
+Read-only NumPy arrays
+----------------------
+
+Accessing the underlying NumPy array of a DataFrame will return a read-only array if the array
+shares data with the initial DataFrame.
+
+The array is a copy if the initial DataFrame consists of more than one array:
+
+.. ipython:: python
+
+    df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
+    df.to_numpy()
+
+The array shares data with the DataFrame if the DataFrame consists of only one NumPy array:
+
+.. ipython:: python
+
+    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
+    df.to_numpy()
+
+This array is read-only, which means that it can't be modified in place:
+
+.. ipython:: python
+    :okexcept:
+
+    arr = df.to_numpy()
+    arr[0, 0] = 100
+
+The same holds true for a Series, since a Series always consists of a single array.
+
+There are two potential solutions to this:
+
+- Trigger a copy manually if you want to avoid updating DataFrames that share memory with your array.
+- Make the array writeable. This is a more performant solution but circumvents Copy-on-Write rules, so
+  it should be used with caution.
+
+.. ipython:: python
+
+    arr = df.to_numpy()
+    arr.flags.writeable = True
+    arr[0, 0] = 100
+    arr
+
+Patterns to avoid
+-----------------
+
+No defensive copy will be performed if two objects share the same data while
+you are modifying one object in place.
+
+.. ipython:: python
+
+    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
+    df2 = df.reset_index()
+    df2.iloc[0, 0] = 100
+
+This creates two objects that share data and thus the setitem operation will trigger a
+copy. This is not necessary if the initial object ``df`` isn't needed anymore.
+Simply reassigning to the same variable will invalidate the reference that is
+held by the object.
+
+.. ipython:: python
+
+    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
+    df = df.reset_index()
+    df.iloc[0, 0] = 100
+
+No copy is necessary in this example.
+Creating multiple references keeps unnecessary references alive
+and thus will hurt performance with Copy-on-Write.
+
 .. _copy_on_write.optimizations:
 
 Copy-on-Write optimizations
 ---------------------------
 
 A new lazy copy mechanism that defers the copy until the object in question is modified
 and only if this object shares data with another object. This mechanism was added to
-following methods:
-
-- :meth:`DataFrame.reset_index` / :meth:`Series.reset_index`
-- :meth:`DataFrame.set_index`
-- :meth:`DataFrame.set_axis` / :meth:`Series.set_axis`
-- :meth:`DataFrame.set_flags` / :meth:`Series.set_flags`
-- :meth:`DataFrame.rename_axis` / :meth:`Series.rename_axis`
-- :meth:`DataFrame.reindex` / :meth:`Series.reindex`
-- :meth:`DataFrame.reindex_like` / :meth:`Series.reindex_like`
-- :meth:`DataFrame.assign`
-- :meth:`DataFrame.drop`
-- :meth:`DataFrame.dropna` / :meth:`Series.dropna`
-- :meth:`DataFrame.select_dtypes`
-- :meth:`DataFrame.align` / :meth:`Series.align`
-- :meth:`Series.to_frame`
-- :meth:`DataFrame.rename` / :meth:`Series.rename`
-- :meth:`DataFrame.add_prefix` / :meth:`Series.add_prefix`
-- :meth:`DataFrame.add_suffix` / :meth:`Series.add_suffix`
-- :meth:`DataFrame.drop_duplicates` / :meth:`Series.drop_duplicates`
-- :meth:`DataFrame.droplevel` / :meth:`Series.droplevel`
-- :meth:`DataFrame.reorder_levels` / :meth:`Series.reorder_levels`
-- :meth:`DataFrame.between_time` / :meth:`Series.between_time`
-- :meth:`DataFrame.filter` / :meth:`Series.filter`
-- :meth:`DataFrame.head` / :meth:`Series.head`
-- :meth:`DataFrame.tail` / :meth:`Series.tail`
-- :meth:`DataFrame.isetitem`
-- :meth:`DataFrame.pipe` / :meth:`Series.pipe`
-- :meth:`DataFrame.pop` / :meth:`Series.pop`
-- :meth:`DataFrame.replace` / :meth:`Series.replace`
-- :meth:`DataFrame.shift` / :meth:`Series.shift`
-- :meth:`DataFrame.sort_index` / :meth:`Series.sort_index`
-- :meth:`DataFrame.sort_values` / :meth:`Series.sort_values`
-- :meth:`DataFrame.squeeze` / :meth:`Series.squeeze`
-- :meth:`DataFrame.swapaxes`
-- :meth:`DataFrame.swaplevel` / :meth:`Series.swaplevel`
-- :meth:`DataFrame.take` / :meth:`Series.take`
-- :meth:`DataFrame.to_timestamp` / :meth:`Series.to_timestamp`
-- :meth:`DataFrame.to_period` / :meth:`Series.to_period`
-- :meth:`DataFrame.truncate`
-- :meth:`DataFrame.iterrows`
-- :meth:`DataFrame.tz_convert` / :meth:`Series.tz_localize`
-- :meth:`DataFrame.fillna` / :meth:`Series.fillna`
-- :meth:`DataFrame.interpolate` / :meth:`Series.interpolate`
-- :meth:`DataFrame.ffill` / :meth:`Series.ffill`
-- :meth:`DataFrame.bfill` / :meth:`Series.bfill`
-- :meth:`DataFrame.where` / :meth:`Series.where`
-- :meth:`DataFrame.infer_objects` / :meth:`Series.infer_objects`
-- :meth:`DataFrame.astype` / :meth:`Series.astype`
-- :meth:`DataFrame.convert_dtypes` / :meth:`Series.convert_dtypes`
-- :meth:`DataFrame.join`
-- :meth:`DataFrame.eval`
-- :func:`concat`
-- :func:`merge`
+methods that don't require a copy of the underlying data. Popular examples are :meth:`DataFrame.drop` for ``axis=1``
+and :meth:`DataFrame.rename`.
 
 These methods return views when Copy-on-Write is enabled, which provides a significant
 performance improvement compared to the regular execution.
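The read-only protection described in the new "Read-only NumPy arrays" section is ordinary NumPy machinery. A sketch using NumPy directly (not pandas itself) of what flipping ``flags.writeable`` does to a shared buffer:

```python
import numpy as np

base = np.array([1, 2, 3, 4])
view = base[:]                # a view that shares memory with base
view.flags.writeable = False  # what CoW does to arrays handed out by to_numpy()

try:
    view[0] = 100             # rejected: the array is read-only
except ValueError as err:
    print(err)

# Opting back in, as the doc warns, bypasses the protection entirely.
view.flags.writeable = True
view[0] = 100
print(base[0])                # 100: the write went through the shared buffer
```

This is why the doc calls re-enabling writeability "more performant" but cautions against it: the write mutates every object sharing that buffer.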

doc/source/user_guide/enhancingperf.rst (+1 -1)

@@ -453,7 +453,7 @@ by evaluate arithmetic and boolean expression all at once for large :class:`~pan
 :func:`~pandas.eval` is many orders of magnitude slower for
 smaller expressions or objects than plain Python. A good rule of thumb is
 to only use :func:`~pandas.eval` when you have a
-:class:`~pandas.core.frame.DataFrame` with more than 10,000 rows.
+:class:`.DataFrame` with more than 10,000 rows.
 
 Supported syntax
 ~~~~~~~~~~~~~~~~
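The rule of thumb exists because expression evaluation has fixed parsing and setup overhead; on a small frame the plain vectorized expression wins. A sketch on a toy frame far below the 10,000-row threshold:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Both spellings compute the same result; the eval form only pays off
# once the frame is large enough to amortize its setup overhead.
plain = df["a"] + df["b"]
via_eval = df.eval("a + b")
print(via_eval.tolist())  # [5, 7, 9]
```

For frames this small, profiling would show the plain expression is faster.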

doc/source/user_guide/groupby.rst (+3 -3)

@@ -458,7 +458,7 @@ Selecting a group
 -----------------
 
 A single group can be selected using
-:meth:`~pandas.core.groupby.DataFrameGroupBy.get_group`:
+:meth:`.DataFrameGroupBy.get_group`:
 
 .. ipython:: python
 
@@ -1531,7 +1531,7 @@ Enumerate groups
 
 To see the ordering of the groups (as opposed to the order of rows
 within a group given by ``cumcount``) you can use
-:meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`.
+:meth:`.DataFrameGroupBy.ngroup`.
 
 
 
@@ -1660,7 +1660,7 @@ Regroup columns of a DataFrame according to their sum, and sum the aggregated on
 Multi-column factorization
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-By using :meth:`~pandas.core.groupby.DataFrameGroupBy.ngroup`, we can extract
+By using :meth:`.DataFrameGroupBy.ngroup`, we can extract
 information about the groups in a way similar to :func:`factorize` (as described
 further in the :ref:`reshaping API <reshaping.factorize>`) but which applies
 naturally to multiple columns of mixed type and different
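The ``ngroup``/``cumcount`` distinction the hunks refer to, shown on a toy frame (hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({"key": ["x", "y", "x", "z"]})
g = df.groupby("key")

# ngroup(): which group each row belongs to, numbered in group-sort order.
print(g.ngroup().tolist())    # [0, 1, 0, 2]

# cumcount(): each row's position within its own group.
print(g.cumcount().tolist())  # [0, 0, 1, 0]
```

``ngroup`` is what makes the factorize-like multi-column labeling possible: rows with identical keys get identical group numbers.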

doc/source/user_guide/io.rst (+5 -3)

@@ -2247,6 +2247,7 @@ For line-delimited json files, pandas can also return an iterator which reads in
 Line-limited json can also be read using the pyarrow reader by specifying ``engine="pyarrow"``.
 
 .. ipython:: python
+    :okwarning:
 
     from io import BytesIO
     df = pd.read_json(BytesIO(jsonl.encode()), lines=True, engine="pyarrow")
@@ -2701,7 +2702,7 @@ in the method ``to_string`` described above.
 .. note::
 
    Not all of the possible options for ``DataFrame.to_html`` are shown here for
-   brevity's sake. See :func:`~pandas.core.frame.DataFrame.to_html` for the
+   brevity's sake. See :func:`.DataFrame.to_html` for the
    full set of options.
 
 .. note::
@@ -5554,6 +5555,7 @@ Read from an orc file.
 Read only certain columns of an orc file.
 
 .. ipython:: python
+    :okwarning:
 
     result = pd.read_orc(
         "example_pa.orc",
@@ -6020,7 +6022,7 @@ Stata format
 Writing to stata format
 '''''''''''''''''''''''
 
-The method :func:`~pandas.core.frame.DataFrame.to_stata` will write a DataFrame
+The method :func:`.DataFrame.to_stata` will write a DataFrame
 into a .dta file. The format version of this file is always 115 (Stata 12).
 
 .. ipython:: python
@@ -6060,7 +6062,7 @@ outside of this range, the variable is cast to ``int16``.
 .. warning::
 
    :class:`~pandas.io.stata.StataWriter` and
-   :func:`~pandas.core.frame.DataFrame.to_stata` only support fixed width
+   :func:`.DataFrame.to_stata` only support fixed width
    strings containing up to 244 characters, a limitation imposed by the version
    115 dta file format. Attempting to write *Stata* dta files with strings
    longer than 244 characters raises a ``ValueError``.
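Both the Stata round trip and the 244-character limit from the warning above can be exercised in memory; a sketch (in-memory buffers instead of a ``.dta`` file on disk):

```python
import io

import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0]})

# Write to and read back from an in-memory buffer instead of a .dta file.
buf = io.BytesIO()
df.to_stata(buf)
buf.seek(0)
back = pd.read_stata(buf)
print(back["x"].tolist())  # [1.0, 2.0]

# Strings longer than 244 characters exceed the dta fixed-width limit.
too_long = pd.DataFrame({"s": ["a" * 300]})
try:
    too_long.to_stata(io.BytesIO())
except ValueError as err:
    print(err)
```

Note that ``read_stata`` also returns the index that ``to_stata`` wrote as a column, unless ``write_index=False`` was passed.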

doc/source/user_guide/pyarrow.rst (+3 -0)

@@ -104,6 +104,7 @@ To convert a :external+pyarrow:py:class:`pyarrow.Table` to a :class:`DataFrame`,
 :external+pyarrow:py:meth:`pyarrow.Table.to_pandas` method with ``types_mapper=pd.ArrowDtype``.
 
 .. ipython:: python
+    :okwarning:
 
     table = pa.table([pa.array([1, 2, 3], type=pa.int64())], names=["a"])
 
@@ -164,6 +165,7 @@ functions provide an ``engine`` keyword that can dispatch to PyArrow to accelera
 * :func:`read_feather`
 
 .. ipython:: python
+    :okwarning:
 
     import io
     data = io.StringIO("""a,b,c
@@ -178,6 +180,7 @@ PyArrow-backed data by specifying the parameter ``dtype_backend="pyarrow"``. A r
 ``engine="pyarrow"`` to necessarily return PyArrow-backed data.
 
 .. ipython:: python
+    :okwarning:
 
     import io
     data = io.StringIO("""a,b,c,d,e,f,g,h,i
