Make pandas.to_parquet handle partition columns better #27117

Closed
dclong opened this issue Jun 29, 2019 · 6 comments · Fixed by #30213

dclong commented Jun 29, 2019

Code Sample, a copy-pastable example if possible

Assume frame is a pandas DataFrame that contains the column cal_dt. To write the DataFrame to a Parquet dataset partitioned by cal_dt, I wrote the following code without reading the documentation carefully.

frame.to_parquet('partitioned_parquet', partition_cols='cal_dt')

Problem description

The above code raises "KeyError: 'c'", which does not tell users what went wrong.
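The 'c' in the message is most likely the first character of 'cal_dt': a string is itself iterable, so code that loops over partition_cols sees one character at a time. The following is a minimal sketch of that effect. It uses a plain dict as a stand-in for column lookup, and select_partition_columns is a hypothetical helper, not the actual code of either Parquet engine.

```python
# Stand-in for a DataFrame's column lookup; this is not pandas or engine code.
columns = {"cal_dt": ["2019-06-28", "2019-06-29"], "value": [1, 2]}


def select_partition_columns(columns, partition_cols):
    """Hypothetical helper: treat each entry of partition_cols as a column name."""
    return [columns[name] for name in partition_cols]


# A list yields the column name as a whole, so the lookup succeeds.
selected = select_partition_columns(columns, ["cal_dt"])

# A bare string is iterated character by character, so the first lookup is "c".
try:
    select_partition_columns(columns, "cal_dt")
except KeyError as exc:
    error_message = f"KeyError: {exc}"
    print(error_message)
```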

Expected Output

Of course, I know the right way is to pass a list of columns to partition_cols (see the code below).

frame.to_parquet('partitioned_parquet', partition_cols=['cal_dt'])

However, as I mentioned, people who have not read the documentation carefully will likely write the first version, expecting that passing a single column name works. I think to_parquet should be enhanced in one of the following ways.

  1. Throw an exception with a clearer message saying that partition_cols requires a list when a user passes a non-list object.
  2. Support passing a single string to partition_cols, meaning that column is used as the partition column.

Either way, the implementation is simple and it improves the user experience.
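To make the two options concrete, here is a rough sketch of how the argument could be checked before it reaches an engine. The function name normalize_partition_cols and the allow_scalar flag are assumptions made for illustration; this is not the code pandas uses.

```python
def normalize_partition_cols(partition_cols, allow_scalar=True):
    """Hypothetical helper illustrating both options from the list above."""
    if partition_cols is None:
        return None
    if isinstance(partition_cols, str):
        if allow_scalar:
            # Option 2: treat a single column name as a one-element list.
            return [partition_cols]
        # Option 1: fail early with a message that names the actual problem.
        raise TypeError(
            "partition_cols must be a list of column names, "
            f"got the string {partition_cols!r}"
        )
    return list(partition_cols)
```

With option 2, normalize_partition_cols('cal_dt') and normalize_partition_cols(['cal_dt']) give the same result, so the engine only ever sees a list.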

jorisvandenbossche added the IO Parquet label Jun 29, 2019

@TomAugspurger
Contributor

Does either Parquet engine distinguish between a length-1 list and a scalar for partition_cols? If not, then option 2 seems fine.

@jorisvandenbossche
Member

Agreed, the above is not a very nice user experience.

But maybe this should be the responsibility of the engines? (If so, it should be reported there.)

jorisvandenbossche added this to the Contributions Welcome milestone Aug 8, 2019

@jorisvandenbossche
Member

@dclong Interested in doing a PR for this?

@dclong
Author

dclong commented Aug 8, 2019

@jorisvandenbossche Yes, this sounds like a very simple fix, and I am very willing to submit a PR. However, I have never contributed a PR to pandas before, so I will start by reading the guidelines. I would greatly appreciate any additional tips on how to contribute.

@jorisvandenbossche
Member

The most difficult part is already done: picking a good issue to tackle :)
As for the rest: if you have any questions about the workflow or the guide, feel free to ask (here or on the Gitter channel https://gitter.im/pydata/pandas)

ai006 commented Aug 17, 2019

My team and I, from #PandasHack2019 in Bentonville, were hoping to tackle this issue as well by implementing the second option.
