Make pandas.to_parquet handle partition columns better #27117
Comments
Does either parquet engine distinguish between a length-of-1 list
Agreed, the above is not a very nice user experience. But maybe it should be the responsibility of the engines? (If so, report it over there.)
@dclong Interested in doing a PR for this?
@jorisvandenbossche Yes, this sounds like a very simple fix, and I am very willing to submit a PR. However, I've never contributed a PR to pandas before, so I will start by reading the guidelines. I would greatly appreciate any additional tips on how to contribute.
The most difficult part is already OK: picking a good issue to tackle :)
My team and I from #PandasHack2019 in Bentonville were hoping to tackle this issue by implementing the 2nd option as well.
Code Sample, a copy-pastable example if possible
Assuming `frame` is a pandas DataFrame which contains a column `cal_dt`. If I want to write the DataFrame to a Parquet dataset partitioned by the column `cal_dt`, I am likely to write code that passes the column name directly as a string (`partition_cols='cal_dt'`) without reading the docs carefully.

Problem description
The above code raises "KeyError: 'c'", which does not tell users what went wrong.
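
One plausible explanation for the 'c' in the message, stated here as an assumption rather than something confirmed in this thread, is that the engine iterates over `partition_cols` and looks up each element as a column name. A bare string iterates character by character, so the first lookup is 'c'. The sketch below reproduces that behaviour in plain Python; the dict standing in for the DataFrame and both helper functions are hypothetical and are not pandas or engine code.

```python
# Hypothetical stand-in for a DataFrame: a mapping from column name to values.
frame = {"cal_dt": ["2019-06-01", "2019-06-02"], "value": [1, 2]}


def lookup_partition_columns(table, partition_cols):
    """Look up each entry of partition_cols as a column name (illustrative only)."""
    return [table[col] for col in partition_cols]


def first_missing_key(table, partition_cols):
    """Return the key that fails the lookup, or None if every lookup succeeds."""
    try:
        lookup_partition_columns(table, partition_cols)
    except KeyError as exc:
        return exc.args[0]
    return None


# A bare string is iterated character by character: 'c', 'a', 'l', ...
print(first_missing_key(frame, "cal_dt"))
# A one-element list is iterated column name by column name.
print(first_missing_key(frame, ["cal_dt"]))
```

Under this assumption the error names the first character of the column name, not the argument that was actually mistyped, which is why it is hard to interpret.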
Expected Output
Of course, I know the right way is to pass a list of columns to `partition_cols` (for example, `partition_cols=['cal_dt']`).

However, as I mentioned, people are likely to write the first version instead (expecting that passing a single column name would work) without reading the docs carefully. I think the method `to_parquet` should be enhanced in either of the following ways.

1. Raise a clearer error about `partition_cols` when a user passes a non-list object to it.
2. Accept a single column name for `partition_cols`, in which case it means to use that column as the partition column.

Either way, the implementation is simple, but it does improve the user experience.
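
Both options can be sketched with one small normalization step applied before the engine is called. This is only an illustration under stated assumptions: `normalize_partition_cols` and its `strict` flag are hypothetical names, not part of pandas, and the actual fix may look different.

```python
def normalize_partition_cols(partition_cols, strict=False):
    """Hypothetical helper (not a pandas function).

    strict=True  -> option 1: reject a bare string with a clear message.
    strict=False -> option 2: treat a bare string as a single partition column.
    """
    if partition_cols is None:
        return None
    if isinstance(partition_cols, str):
        if strict:
            raise TypeError(
                "partition_cols must be a list of column names, "
                f"got the string {partition_cols!r}; "
                f"did you mean [{partition_cols!r}]?"
            )
        # Wrap the single column name so it is not iterated per character.
        return [partition_cols]
    # Any other iterable of names is passed through as a list.
    return list(partition_cols)


print(normalize_partition_cols("cal_dt"))
print(normalize_partition_cols(["cal_dt", "region"]))
```

With option 2 a call that passes `partition_cols='cal_dt'` would behave like one that passes `['cal_dt']`; with option 1 the user would at least see which argument to fix.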