
Commit d0600f9

Merge remote-tracking branch 'upstream/master' into io_csv_docstring_fixed
* upstream/master:
  DOC: Fixes to docstring to add validation to CI (pandas-dev#23560)
  DOC: Remove incorrect periods at the end of parameter types (pandas-dev#23600)
  MAINT: tm.assert_raises_regex --> pytest.raises (pandas-dev#23592)
  DOC: Updating Series.resample and DataFrame.resample docstrings (pandas-dev#23197)
  ENH: Support for partition_cols in to_parquet (pandas-dev#23321)
  TST: Use intp as expected dtype in IntervalIndex indexing tests (pandas-dev#23609)
2 parents 5e85114 + 2cea659 commit d0600f9


239 files changed: +2335 -2180 lines changed

ci/code_checks.sh

+1-1
@@ -151,7 +151,7 @@ if [[ -z "$CHECK" || "$CHECK" == "doctests" ]]; then

     MSG='Doctests generic.py' ; echo $MSG
     pytest -q --doctest-modules pandas/core/generic.py \
-        -k"-_set_axis_name -_xs -describe -droplevel -groupby -interpolate -pct_change -pipe -reindex -reindex_axis -resample -to_json -transpose -values -xs"
+        -k"-_set_axis_name -_xs -describe -droplevel -groupby -interpolate -pct_change -pipe -reindex -reindex_axis -to_json -transpose -values -xs"
     RET=$(($RET + $?)) ; echo $MSG "DONE"

     MSG='Doctests top-level reshaping functions' ; echo $MSG

doc/source/io.rst

+37
@@ -4673,6 +4673,43 @@ Passing ``index=True`` will *always* write the index, even if that's not the
 underlying engine's default behavior.


+Partitioning Parquet files
+''''''''''''''''''''''''''
+
+.. versionadded:: 0.24.0
+
+Parquet supports partitioning of data based on the values of one or more columns.
+
+.. ipython:: python
+
+   df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1]})
+   df.to_parquet(fname='test', engine='pyarrow', partition_cols=['a'], compression=None)
+
+The `fname` specifies the parent directory to which data will be saved.
+The `partition_cols` are the column names by which the dataset will be partitioned.
+Columns are partitioned in the order they are given. The partition splits are
+determined by the unique values in the partition columns.
+The above example creates a partitioned dataset that may look like:
+
+.. code-block:: text
+
+   test
+   ├── a=0
+   │   ├── 0bac803e32dc42ae83fddfd029cbdebc.parquet
+   │   └── ...
+   └── a=1
+       ├── e6ab24a4f45147b49b54a662f0c412a3.parquet
+       └── ...
+
+.. ipython:: python
+   :suppress:
+
+   from shutil import rmtree
+   try:
+       rmtree('test')
+   except Exception:
+       pass
+
 .. _io.sql:

 SQL Queries
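
The partitioning rule described in the new `io.rst` text (one directory level per partition column, in the order given, with one directory per unique value) can be sketched in plain Python. This is only an illustration of the layout, not how pandas or pyarrow implement it: `partition_dirs` is a hypothetical helper, and dropping the partition columns from the stored rows is an assumption modelled on the `a=0` / `a=1` directory naming shown above.

```python
from collections import defaultdict


def partition_dirs(rows, partition_cols):
    """Group rows by their partition-column values (hypothetical helper).

    Returns a dict mapping a relative directory such as 'a=0/b=1'
    to the rows that would be written beneath it.
    """
    groups = defaultdict(list)
    for row in rows:
        # One path segment per partition column, in the order given.
        path = "/".join("{}={}".format(col, row[col]) for col in partition_cols)
        # Assumption: the partition values live in the path, so only the
        # remaining columns are kept with the row.
        rest = {k: v for k, v in row.items() if k not in partition_cols}
        groups[path].append(rest)
    return dict(groups)


# Same data as the io.rst example: columns 'a' and 'b'.
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
layout = partition_dirs(rows, ["a"])
for directory in sorted(layout):
    print(directory, layout[directory])
```

Passing `["a", "b"]` instead of `["a"]` would nest a `b=...` level under each `a=...` directory, which is what "partitioned in the order they are given" amounts to in this sketch.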

doc/source/whatsnew/v0.24.0.txt

+1
@@ -236,6 +236,7 @@ Other Enhancements
 - New attribute :attr:`__git_version__` will return git commit sha of current build (:issue:`21295`).
 - Compatibility with Matplotlib 3.0 (:issue:`22790`).
 - Added :meth:`Interval.overlaps`, :meth:`IntervalArray.overlaps`, and :meth:`IntervalIndex.overlaps` for determining overlaps between interval-like objects (:issue:`21998`)
+- :func:`~DataFrame.to_parquet` now supports writing a ``DataFrame`` as a directory of parquet files partitioned by a subset of the columns when ``engine = 'pyarrow'`` (:issue:`23283`)
 - :meth:`Timestamp.tz_localize`, :meth:`DatetimeIndex.tz_localize`, and :meth:`Series.tz_localize` have gained the ``nonexistent`` argument for alternative handling of nonexistent times. See :ref:`timeseries.timezone_nonexsistent` (:issue:`8917`)

 .. _whatsnew_0240.api_breaking:

pandas/core/dtypes/inference.py

+11-11
@@ -73,7 +73,7 @@ def is_string_like(obj):

     Parameters
     ----------
-    obj : The object to check.
+    obj : The object to check

     Examples
     --------
@@ -127,7 +127,7 @@ def is_iterator(obj):

     Parameters
     ----------
-    obj : The object to check.
+    obj : The object to check

     Returns
     -------
@@ -172,7 +172,7 @@ def is_file_like(obj):

     Parameters
     ----------
-    obj : The object to check.
+    obj : The object to check

     Returns
     -------
@@ -203,7 +203,7 @@ def is_re(obj):

     Parameters
     ----------
-    obj : The object to check.
+    obj : The object to check

     Returns
     -------
@@ -227,7 +227,7 @@ def is_re_compilable(obj):

     Parameters
     ----------
-    obj : The object to check.
+    obj : The object to check

     Returns
     -------
@@ -261,7 +261,7 @@ def is_list_like(obj, allow_sets=True):

     Parameters
     ----------
-    obj : The object to check.
+    obj : The object to check
     allow_sets : boolean, default True
         If this parameter is False, sets will not be considered list-like

@@ -310,7 +310,7 @@ def is_array_like(obj):

     Parameters
     ----------
-    obj : The object to check.
+    obj : The object to check

     Returns
     -------
@@ -343,7 +343,7 @@ def is_nested_list_like(obj):

     Parameters
     ----------
-    obj : The object to check.
+    obj : The object to check

     Returns
     -------
@@ -384,7 +384,7 @@ def is_dict_like(obj):

     Parameters
     ----------
-    obj : The object to check.
+    obj : The object to check

     Returns
     -------
@@ -408,7 +408,7 @@ def is_named_tuple(obj):

     Parameters
     ----------
-    obj : The object to check.
+    obj : The object to check

     Returns
     -------
@@ -468,7 +468,7 @@ def is_sequence(obj):

     Parameters
     ----------
-    obj : The object to check.
+    obj : The object to check

     Returns
     -------

pandas/core/frame.py

+35-20
@@ -864,12 +864,17 @@ def iterrows(self):
         data types, the iterator returns a copy and not a view, and writing
         to it will have no effect.

-        Returns
-        -------
+        Yields
+        ------
+        index : label or tuple of label
+            The index of the row. A tuple for a `MultiIndex`.
+        data : Series
+            The data of the row as a Series.
+
         it : generator
             A generator that iterates over the rows of the frame.

-        See also
+        See Also
         --------
         itertuples : Iterate over DataFrame rows as namedtuples of the values.
         iteritems : Iterate over (column name, Series) pairs.
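
The `iterrows` hunk swaps a `Returns` section for a `Yields` section that names each element of the yielded pair. A minimal sketch of the same docstring layout on a toy generator follows. `iter_rows` is a hypothetical function written for illustration and is not part of pandas; it uses plain dicts rather than `Series`.

```python
def iter_rows(table):
    """
    Iterate over the rows of a mapping of column name to list of values.

    Yields
    ------
    index : int
        The position of the row.
    data : dict
        The row as a mapping of column name to value.

    Examples
    --------
    >>> list(iter_rows({'a': [1, 2], 'b': [3, 4]}))
    [(0, {'a': 1, 'b': 3}), (1, {'a': 2, 'b': 4})]
    """
    columns = list(table)
    # An empty table has no columns and therefore no rows.
    n_rows = len(table[columns[0]]) if columns else 0
    for i in range(n_rows):
        # Each iteration yields an (index, data) pair, as documented above.
        yield i, {col: table[col][i] for col in columns}
```

The doctest in the `Examples` section can be checked with `python -m doctest` on the file, or with `pytest --doctest-modules`, which is the runner that `ci/code_checks.sh` uses in this commit.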
@@ -1970,7 +1975,7 @@ def to_feather(self, fname):
         to_feather(self, fname)

     def to_parquet(self, fname, engine='auto', compression='snappy',
-                   index=None, **kwargs):
+                   index=None, partition_cols=None, **kwargs):
         """
         Write a DataFrame to the binary parquet format.

@@ -1984,7 +1989,11 @@ def to_parquet(self, fname, engine='auto', compression='snappy',
         Parameters
         ----------
         fname : str
-            String file path.
+            File path or Root Directory path. Will be used as Root Directory
+            path while writing a partitioned dataset.
+
+            .. versionchanged:: 0.24.0
+
         engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
             Parquet library to use. If 'auto', then the option
             ``io.parquet.engine`` is used. The default ``io.parquet.engine``
@@ -1999,6 +2008,12 @@ def to_parquet(self, fname, engine='auto', compression='snappy',

             .. versionadded:: 0.24.0

+        partition_cols : list, optional, default None
+            Column names by which to partition the dataset
+            Columns are partitioned in the order they are given
+
+            .. versionadded:: 0.24.0
+
         **kwargs
             Additional arguments passed to the parquet library. See
             :ref:`pandas io <io.parquet>` for more details.
@@ -2027,7 +2042,8 @@ def to_parquet(self, fname, engine='auto', compression='snappy',
         """
         from pandas.io.parquet import to_parquet
         to_parquet(self, fname, engine,
-                   compression=compression, index=index, **kwargs)
+                   compression=compression, index=index,
+                   partition_cols=partition_cols, **kwargs)

     @Substitution(header='Write out the column names. If a list of strings '
                          'is given, it is assumed to be aliases for the '
@@ -3940,6 +3956,10 @@ def set_index(self, keys, drop=True, append=False, inplace=False,
             necessary. Setting to False will improve the performance of this
             method

+        Returns
+        -------
+        DataFrame
+
         Examples
         --------
         >>> df = pd.DataFrame({'month': [1, 4, 7, 10],
@@ -3980,10 +4000,6 @@ def set_index(self, keys, drop=True, append=False, inplace=False,
         2 2014 4 40
         3 2013 7 84
         4 2014 10 31
-
-        Returns
-        -------
-        dataframe : DataFrame
         """
         inplace = validate_bool_kwarg(inplace, 'inplace')
         if not isinstance(keys, list):
@@ -6683,6 +6699,15 @@ def round(self, decimals=0, *args, **kwargs):
             of `decimals` which are not columns of the input will be
             ignored.

+        Returns
+        -------
+        DataFrame
+
+        See Also
+        --------
+        numpy.around
+        Series.round
+
         Examples
         --------
         >>> df = pd.DataFrame(np.random.random([3, 3]),
@@ -6708,15 +6733,6 @@ def round(self, decimals=0, *args, **kwargs):
         first 0.0 1 0.17
         second 0.0 1 0.58
         third 0.9 0 0.49
-
-        Returns
-        -------
-        DataFrame object
-
-        See Also
-        --------
-        numpy.around
-        Series.round
         """
         from pandas.core.reshape.concat import concat

@@ -6782,7 +6798,6 @@ def corr(self, method='pearson', min_periods=1):

         Examples
         --------
-        >>> import numpy as np
         >>> histogram_intersection = lambda a, b: np.minimum(a, b
         ... ).sum().round(decimals=1)
         >>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
