
Appending parquet file from pandas to s3 #20638


Closed
Jeeva-Ganesan opened this issue Apr 8, 2018 · 3 comments

@Jeeva-Ganesan

Here is my snippet in spark-shell

jdbcDF.write.mode("append").partitionBy("date").parquet("s3://bucket/Data/")

Problem description

Now, I am trying to do the same thing in pandas. I see pandas supports to_parquet without any issue; however, as per #19429, writing to s3 is not supported yet and will be supported in 0.23.0.
But I can't find a way to do the to_parquet in append mode. As per https://stackoverflow.com/questions/47191675/pandas-write-dataframe-to-parquet-format-with-append, the client API doesn't support it yet. But how come it works in Spark? Can anyone clarify this and let me know whether this kind of append is possible at all?

Thanks.

@jreback
Contributor

jreback commented Apr 9, 2018

you would be better off asking on the fastparquet or pyarrow tracker

pandas just passes this through to the engine
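
To make "passes through" concrete, here is a minimal sketch (the file name, column, and values are made up for illustration): keyword arguments that DataFrame.to_parquet does not recognise are forwarded verbatim to the chosen engine, so append/partition behaviour would have to come from fastparquet or pyarrow rather than from pandas itself.

import pandas as pd

df = pd.DataFrame({"date": ["2018-04-08"], "value": [1]})

# pandas forwards the extra keyword arguments straight to fastparquet.write,
# so whether append/partitioning works is decided by fastparquet, not pandas
df.to_parquet(
    "data.parquet",
    engine="fastparquet",
    file_scheme="hive",      # fastparquet-specific, passed through
    partition_on=["date"],   # fastparquet-specific, passed through
    append=False,            # set True on later writes to append
)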

@jreback jreback closed this as completed Apr 9, 2018
@Jeeva-Ganesan
Author

Jeeva-Ganesan commented Apr 18, 2018

This is possible using fastparquet; it works like this:

import s3fs
from fastparquet import write

s3 = s3fs.S3FileSystem()
myopen = s3.open

# append to a hive-style partitioned dataset in S3, opening files via s3fs
write('bucketpath', dataframe, file_scheme='hive', partition_on=['date'],
      append=True, open_with=myopen)

It would be nice to have the same in pandas.
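
For reading the appended, partitioned dataset back, a minimal sketch along the same lines ('bucketpath' is the same placeholder as above):

import s3fs
from fastparquet import ParquetFile

s3 = s3fs.S3FileSystem()
myopen = s3.open

# the hive file_scheme writes a _metadata file at the dataset root;
# opening it exposes every appended row group and partition as one frame
pf = ParquetFile('bucketpath/_metadata', open_with=myopen)
df = pf.to_pandas()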

@garci66

garci66 commented May 4, 2018

I still need to try it, but it seems like it should be possible to use the same syntax as in fastparquet. After all, the to_parquet method has **kwargs that passes parameters on to the fastparquet engine.

In my case I use it as follows:

df.to_parquet('./parquetstore/' + this_table + '.parquet', engine='fastparquet',
              partition_on=['partitionTime'], file_scheme='hive', append=True)

So it seems feasible that you should be able to use something like this:

import s3fs

s3 = s3fs.S3FileSystem()
myopen = s3.open
nop = lambda *args, **kwargs: None  # no-op replacement for mkdirs: S3 has no directories to create

# note: when writing through s3fs, the path should be the bucket/key prefix, not a local directory
df.to_parquet('bucket/parquetstore/' + this_table + '.parquet', engine='fastparquet',
              partition_on=['partitionTime'], file_scheme='hive', append=True,
              open_with=myopen, mkdirs=nop)

See https://fastparquet.readthedocs.io/en/latest/filesystems.html and https://github.com/dask/fastparquet/issues/327 for the reason for the nop lambda: it keeps spurious directories from appearing. I still need to test this, but I wanted to share it since we seem to be doing something similar.
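
Since jreback also pointed at the pyarrow tracker: for comparison, a hedged sketch of the closest pyarrow equivalent (write_to_dataset adds new files under the partition directories on each call, which is roughly what Spark's append mode does; the bucket path and frame are made up):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
df = pd.DataFrame({'date': ['2018-04-08'], 'value': [1]})
table = pa.Table.from_pandas(df)

# each call writes additional files under bucket/Data/date=.../
# instead of rewriting the existing ones
pq.write_to_dataset(table, root_path='bucket/Data',
                    partition_cols=['date'], filesystem=fs)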
