Make pandas.to_parquet handles partition columns better #27117

dclong · 2019-06-29T05:20:36Z

Code Sample, a copy-pastable example if possible

Assume frame is a pandas DataFrame that contains a column cal_dt. To write the DataFrame to a Parquet dataset partitioned by cal_dt, I wrote the following code without reading the documentation carefully.

frame.to_parquet('partitioned_parquet', partition_cols='cal_dt')

Problem description

The above code raises "KeyError: 'c'", which does not tell users what they did wrong.
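One plausible reading of the 'c' in that message, assuming the engine iterates over partition_cols and looks up each element as a column name: a string iterates character by character, so the first lookup is for 'c', the first character of 'cal_dt'. The sketch below reproduces that behaviour with a plain dict standing in for the frame's columns. lookup_partition_columns is a hypothetical helper written for illustration, not pandas or pyarrow code.

```python
def lookup_partition_columns(columns, partition_cols):
    """Hypothetical stand-in for an engine that resolves each partition column by name."""
    # Iterating a str yields single characters, not the whole column name.
    return [columns[name] for name in partition_cols]


# A dict standing in for a DataFrame's columns (illustrative data only).
columns = {"cal_dt": ["2019-06-28", "2019-06-29"], "value": [1, 2]}

try:
    lookup_partition_columns(columns, "cal_dt")  # a bare string, as in the report
except KeyError as exc:
    error_message = f"KeyError: {exc}"

print(error_message)

# Passing a list resolves the column as intended.
resolved = lookup_partition_columns(columns, ["cal_dt"])
```

Under that assumption, the error names a single character rather than the column, which is why the message gives no hint that a list was expected.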

Expected Output

Of course, I know the right way is to pass a list of columns to partition_cols (see the code below).

frame.to_parquet('partitioned_parquet', partition_cols=['cal_dt'])

However, as I mentioned, people will likely write the first example instead (expecting that passing a single column name would work) without reading the doc carefully. I think the to_parquet method should be enhanced in one of the following ways.

1. Throw an exception with a clearer message, saying that a list is required for partition_cols, when a user passes a non-list object to it.
2. Support passing a single string to partition_cols, meaning that column is used as the partition column.

Either way, the implementation is simple, but it does improve the user experience.
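Both options can be sketched as a small normalization step applied before the value reaches the engine. This is a minimal illustration under stated assumptions, not the code that was eventually merged into pandas: normalize_partition_cols and its strict flag are hypothetical names introduced here.

```python
def normalize_partition_cols(partition_cols, strict=False):
    """Hypothetical helper illustrating the two proposed behaviours.

    strict=True  -> option 1: reject a bare string with a clear message.
    strict=False -> option 2: treat a bare string as a one-column list.
    """
    if partition_cols is None:
        return None
    if isinstance(partition_cols, str):
        if strict:
            raise TypeError(
                "partition_cols must be a list of column names, "
                f"got the string {partition_cols!r}; "
                f"did you mean [{partition_cols!r}]?"
            )
        return [partition_cols]
    # Any other iterable of names is passed through as a list.
    return list(partition_cols)


print(normalize_partition_cols("cal_dt"))
print(normalize_partition_cols(["cal_dt", "region"]))
```

With the second option, the call in the first code sample would behave the same as the list form, provided the engines treat a one-element list and a scalar alike.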


TomAugspurger · 2019-06-29T15:40:15Z

Does either parquet engine distinguish between a length-of-1 list partition_cols and a scalar partition_cols? If not, then option 2 seems fine.

jorisvandenbossche · 2019-06-29T21:34:32Z

Agreed the above is not a very nice user experience.

But it should maybe be the responsibility of the engines? (so report over there)

jorisvandenbossche · 2019-08-08T11:39:35Z

@dclong Interested in doing a PR for this?

dclong · 2019-08-08T18:08:51Z

@jorisvandenbossche Yes, this sounds like a very simple fix, and I am very willing to submit a PR. However, I've never contributed a PR to pandas before, so I will start by reading the guidelines. I would greatly appreciate any additional tips on how to contribute.

jorisvandenbossche · 2019-08-08T20:47:16Z

The most difficult part is already OK: picking a good issue to tackle :)
And for the rest: if you have any questions about the workflow or the guide, feel free to ask (here or at the gitter channel https://gitter.im/pydata/pandas)

ai006 · 2019-08-17T19:00:35Z

My team and I from #PandasHack2019 in Bentonville were hoping to tackle this issue by implementing the 2nd option as well.

jorisvandenbossche added the IO Parquet label Jun 29, 2019

jorisvandenbossche added the good first issue label Aug 8, 2019

jorisvandenbossche added this to the Contributions Welcome milestone Aug 8, 2019

unchecked9 mentioned this issue Aug 18, 2019: Support str param type for partition_cols in to_parquet function #27984 (Closed)

HawkinsBA mentioned this issue Dec 11, 2019: ENH: Support string arguments for partition_cols in pandas.to_parquet #30213 (Merged)

jreback modified the milestones: Contributions Welcome, 1.0 Dec 13, 2019

jorisvandenbossche closed this as completed in #30213 Dec 14, 2019
