
Commit 1f0978f

Updated documentation
Fix "Should raise error on using partition_cols and partition_on together"

1 parent a5164b8 commit 1f0978f

File tree

5 files changed: +48 -8 lines

doc/source/io.rst (+27 -2)

@@ -4574,8 +4574,6 @@ Several caveats.
 * Categorical dtypes can be serialized to parquet, but will de-serialize as ``object`` dtype.
 * Non supported types include ``Period`` and actual Python object types. These will raise a helpful error message
   on an attempt at serialization.
-* ``partition_cols`` will be used for partitioning the dataset, where the dataset will be written to multiple
-  files in the path specified. Therefore, the path specified, must be a directory path.
 
 You can specify an ``engine`` to direct the serialization. This can be one of ``pyarrow``, or ``fastparquet``, or ``auto``.
 If the engine is NOT specified, then the ``pd.options.io.parquet.engine`` option is checked; if this is also ``auto``,

@@ -4670,6 +4668,33 @@ Passing ``index=True`` will *always* write the index, even if that's not the
 underlying engine's default behavior.
 
 
+Partitioning Parquet files
+''''''''''''''''''''''''''
+
+Parquet supports partitioning of data based on the values of one or more columns.
+
+.. ipython:: python
+
+   df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1]})
+   df.to_parquet(fname='test', engine='pyarrow', partition_cols=['a'], compression=None)
+
+The `fname` specifies the parent directory to which data will be saved.
+The `partition_cols` are the column names by which the dataset will be partitioned.
+Columns are partitioned in the order they are given. The partition splits are
+determined by the unique values in the partition columns.
+The above example creates a partitioned dataset that may look like:
+
+::
+
+    test/
+        a=0/
+            0bac803e32dc42ae83fddfd029cbdebc.parquet
+            ...
+        a=1/
+            e6ab24a4f45147b49b54a662f0c412a3.parquet
+            ...
+
+
 .. _io.sql:
 
 SQL Queries
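
Not part of the commit, for context: a minimal round-trip sketch of the partitioned write that the new io.rst section documents, assuming pyarrow is installed. Pointing pd.read_parquet at the parent directory reassembles the partitions and recovers the partition column from the a=0/ and a=1/ directory names.

    # Sketch only (not from the diff): round trip of the io.rst example,
    # assuming pyarrow is installed. 'fname' matches the 0.24-era
    # to_parquet signature shown in pandas/core/frame.py below.
    import pandas as pd

    df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1]})
    df.to_parquet(fname='test', engine='pyarrow',
                  partition_cols=['a'], compression=None)

    # Reading the parent directory reassembles the partitions; pyarrow
    # recovers column 'a' from the directory names (typically as a
    # categorical column).
    result = pd.read_parquet('test', engine='pyarrow')
    print(result.sort_values(['a', 'b']).reset_index(drop=True))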

doc/source/whatsnew/v0.24.0.txt (+1 -1)

@@ -235,7 +235,7 @@ Other Enhancements
 - New attribute :attr:`__git_version__` will return git commit sha of current build (:issue:`21295`).
 - Compatibility with Matplotlib 3.0 (:issue:`22790`).
 - Added :meth:`Interval.overlaps`, :meth:`IntervalArray.overlaps`, and :meth:`IntervalIndex.overlaps` for determining overlaps between interval-like objects (:issue:`21998`)
-- :func:`~DataFrame.to_parquet` now supports writing a DataFrame as a directory of parquet files partitioned by a subset of the columns. (:issue:`23283`).
+- With the pyarrow engine, :func:`~DataFrame.to_parquet` now supports writing a DataFrame as a directory of parquet files partitioned by a subset of the columns. (:issue:`23283`).
 - :meth:`Timestamp.tz_localize`, :meth:`DatetimeIndex.tz_localize`, and :meth:`Series.tz_localize` have gained the ``nonexistent`` argument for alternative handling of nonexistent times. See :ref:`timeseries.timezone_nonexsistent` (:issue:`8917`)
 
 .. _whatsnew_0240.api_breaking:

pandas/core/frame.py (+2 -2)

@@ -2002,8 +2002,8 @@ def to_parquet(self, fname, engine='auto', compression='snappy',
         partition_cols : list, optional, default None
             Column names by which to partition the dataset
             Columns are partitioned in the order they are given
-            The behaviour applies only to pyarrow >= 0.7.0 and fastparquet
-            For other versions, this argument will be ignored.
+            The behaviour applies only to pyarrow >= 0.7.0 and fastparquet.
+            Raises a ValueError for other versions.
 
             .. versionadded:: 0.24.0
 

pandas/io/parquet.py (+8 -3)

@@ -227,7 +227,12 @@ def write(self, df, path, compression='snappy', index=None,
         # Use tobytes() instead.
 
         if 'partition_on' in kwargs:
-            partition_cols = kwargs.pop('partition_on')
+            if partition_cols is None:
+                partition_cols = kwargs.pop('partition_on')
+            else:
+                raise ValueError("Cannot use both partition_on and "
+                                 "partition_cols. Use partition_cols for "
+                                 "partitioning data")
 
         if partition_cols is not None:
             kwargs['file_scheme'] = 'hive'

@@ -290,8 +295,8 @@ def to_parquet(df, path, engine='auto', compression='snappy', index=None,
     partition_cols : list, optional
         Column names by which to partition the dataset
         Columns are partitioned in the order they are given
-        The behaviour applies only to pyarrow >= 0.7.0 and fastparquet
-        For other versions, this argument will be ignored.
+        The behaviour applies only to pyarrow >= 0.7.0 and fastparquet.
+        Raises a ValueError for other versions.
     .. versionadded:: 0.24.0
     kwargs
         Additional keyword arguments passed to the engine
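
Not part of the diff: the new guard above, extracted into a standalone sketch to make the precedence explicit. resolve_partition_cols is a hypothetical name; in the commit the logic lives inline in the fastparquet writer's write() method, where partition_on is fastparquet's native keyword and partition_cols is the unified pandas argument.

    # Sketch only: mirrors the mutual-exclusion check added in the diff.
    def resolve_partition_cols(partition_cols, kwargs):
        # Accept fastparquet's legacy 'partition_on' only when the unified
        # 'partition_cols' was not given; passing both is ambiguous.
        if 'partition_on' in kwargs:
            if partition_cols is None:
                partition_cols = kwargs.pop('partition_on')
            else:
                raise ValueError("Cannot use both partition_on and "
                                 "partition_cols. Use partition_cols for "
                                 "partitioning data")
        return partition_cols

    print(resolve_partition_cols(None, {'partition_on': ['a']}))  # ['a']
    print(resolve_partition_cols(['a'], {}))                      # ['a']
    # resolve_partition_cols(['a'], {'partition_on': ['a']})      # ValueError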

pandas/tests/io/test_parquet.py (+10 -0)

@@ -589,3 +589,13 @@ def test_partition_on_supported(self, fp, df_full):
         import fastparquet
         actual_partition_cols = fastparquet.ParquetFile(path, False).cats
         assert len(actual_partition_cols) == 2
+
+    def test_error_on_using_partition_cols_and_partition_on(self, fp, df_full):
+        # GH #23283
+        partition_cols = ['bool', 'int']
+        df = df_full
+        with pytest.raises(ValueError):
+            with tm.ensure_clean_dir() as path:
+                df.to_parquet(path, engine="fastparquet", compression=None,
+                              partition_on=partition_cols,
+                              partition_cols=partition_cols)
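
For contrast, not in the commit: a sketch of the passing path, modeled on the existing test_partition_on_supported shown in the context lines above. Supplying only one of the two keywords writes the partitioned dataset instead of raising; the test name and body below are illustrative, reusing the fp and df_full fixtures.

    # Sketch only: the non-error counterpart of the new test.
    def test_partition_cols_alone_succeeds(self, fp, df_full):
        partition_cols = ['bool', 'int']
        df = df_full
        with tm.ensure_clean_dir() as path:
            df.to_parquet(path, engine="fastparquet",
                          compression=None, partition_cols=partition_cols)
            import fastparquet
            # one entry in .cats per partition column
            assert len(fastparquet.ParquetFile(path, False).cats) == 2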
