
Appending parquet file from pandas to s3 #20638


Closed
Jeeva-Ganesan opened this issue Apr 8, 2018 · 3 comments

@Jeeva-Ganesan

Here is my snippet in spark-shell

jdbcDF.write.mode("append").partitionBy("date").parquet("s3://bucket/Data/")

Problem description

Now, I am trying to do the same thing in pandas. I see pandas supports to_parquet without any issue; however, as per #19429, writing to s3 is not supported yet and will be supported in 0.23.0.
But I can't find a way to do the to_parquet in append mode. As per https://stackoverflow.com/questions/47191675/pandas-write-dataframe-to-parquet-format-with-append, the client API doesn't support it yet. But how come it works in Spark? Can anyone clarify this and let me know whether this kind of append is possible at all?

Thanks.

@jreback
Contributor

jreback commented Apr 9, 2018

you would be better off asking on the fastparquet or pyarrow tracker

pandas just passes this through to the engine
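
To make "passes through" concrete, here is a minimal sketch (the file name, column, and values are made up for illustration): keyword arguments that DataFrame.to_parquet does not recognise are forwarded verbatim to the chosen engine, so append/partition behaviour would have to come from fastparquet or pyarrow rather than from pandas itself.

import pandas as pd

df = pd.DataFrame({"date": ["2018-04-08"], "value": [1]})

# pandas forwards the extra keyword arguments straight to fastparquet.write,
# so whether append/partitioning works is decided by fastparquet, not pandas
df.to_parquet(
    "data.parquet",
    engine="fastparquet",
    file_scheme="hive",      # fastparquet-specific, passed through
    partition_on=["date"],   # fastparquet-specific, passed through
    append=False,            # set True on later writes to append
)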

@jreback jreback closed this as completed Apr 9, 2018
@Jeeva-Ganesan
Author

Jeeva-Ganesan commented Apr 18, 2018

This is possible using fastparquet; it works like this:

import s3fs
from fastparquet import write

s3 = s3fs.S3FileSystem()
myopen = s3.open

# append to a hive-style partitioned dataset in S3, opening files via s3fs
write('bucketpath', dataframe, file_scheme='hive', partition_on=['date'],
      append=True, open_with=myopen)

It would be nice to have the same in pandas.
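
For reading the appended, partitioned dataset back, a minimal sketch along the same lines ('bucketpath' is the same placeholder as above):

import s3fs
from fastparquet import ParquetFile

s3 = s3fs.S3FileSystem()
myopen = s3.open

# the hive file_scheme writes a _metadata file at the dataset root;
# opening it exposes every appended row group and partition as one frame
pf = ParquetFile('bucketpath/_metadata', open_with=myopen)
df = pf.to_pandas()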

@garci66

garci66 commented May 4, 2018

I still need to try it, but it seems like it should be possible to use the same syntax as in fastparquet. After all, the to_parquet method has **kwargs that passes parameters on to the fastparquet engine.

In my case I use it as follows:

df.to_parquet('./parquetstore/' + this_table + '.parquet', engine='fastparquet',
              partition_on=['partitionTime'], file_scheme='hive', append=True)

So it seems feasible that you should be able to use something like this:

import s3fs

s3 = s3fs.S3FileSystem()
myopen = s3.open
nop = lambda *args, **kwargs: None  # no-op replacement for mkdirs: S3 has no directories to create

# note: when writing through s3fs, the path should be the bucket/key prefix, not a local directory
df.to_parquet('bucket/parquetstore/' + this_table + '.parquet', engine='fastparquet',
              partition_on=['partitionTime'], file_scheme='hive', append=True,
              open_with=myopen, mkdirs=nop)

See https://fastparquet.readthedocs.io/en/latest/filesystems.html and https://github.com/dask/fastparquet/issues/327 for the reason for the nop lambda: it keeps spurious directories from appearing. I still need to test this, but I wanted to share it since we seem to be doing something similar.
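
Since jreback also pointed at the pyarrow tracker: for comparison, a hedged sketch of the closest pyarrow equivalent (write_to_dataset adds new files under the partition directories on each call, which is roughly what Spark's append mode does; the bucket path and frame are made up):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
df = pd.DataFrame({'date': ['2018-04-08'], 'value': [1]})
table = pa.Table.from_pandas(df)

# each call writes additional files under bucket/Data/date=.../
# instead of rewriting the existing ones
pq.write_to_dataset(table, root_path='bucket/Data',
                    partition_cols=['date'], filesystem=fs)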
