pd.DataFrame.to_csv('filename.zip') doesn't extract with a '.csv' extension #26023

Closed
stevesimmons opened this issue Apr 7, 2019 · 7 comments · Fixed by #26024
Labels
Enhancement IO CSV read_csv, to_csv

Comments

@stevesimmons
Contributor

stevesimmons commented Apr 7, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame({'name': ['Raphael', 'Donatello'], 'mask': ['red', 'purple']})

# When trying to create a compressed csv, these give odd results.
df.to_csv('out.csv', compression='zip')  # --> zip file named 'out.csv' containing csv file 'out.csv'
df.to_csv('out.zip')                     # --> zip file named 'out.zip' containing csv file 'out.zip'
df.to_csv('out.csv.zip')                 # --> zip file named 'out.csv.zip' containing csv file 'out.csv.zip'

# This would be the desired behaviour, if we had an 'arcname'
# parameter like zipfile.ZipFile.writestr(arcname, data)
df.to_csv('out.zip', arcname='data.csv') # --> zip file named 'out.zip' containing csv file 'data.csv'

Problem description

When pd.DataFrame.to_csv creates compressed zip files, the name of the csv file inside the archive is always the same as the name of the zip archive file itself. This is obviously problematic because the archive has a .zip extension but we want the csv file to have a .csv extension when it is extracted.

Other compression methods meant for a single file, like 'bz2', 'gzip', and 'xz', do not have this problem because a file such as 'file.csv.gz' automatically becomes 'file.csv' when decompressed.

This would be a relatively easy fix: add an arcname=None parameter to to_csv, pass it through pandas.io.formats.csvs.CSVFormatter to pandas.io.common._get_handle, and use it instead of ZipFile.filename when provided.
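
To illustrate the mechanism the proposal relies on, here is a minimal sketch using only the standard library. It is not the pandas implementation: csv_text_to_zip is a hypothetical helper introduced for this example, and it only shows that zipfile lets the member name differ from the archive name.

```python
import io
import zipfile


def csv_text_to_zip(csv_text: str, arcname: str) -> bytes:
    """Return the bytes of a zip archive that stores csv_text under the member name arcname.

    Hypothetical helper for illustration only; not part of pandas.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        # ZipFile.writestr takes the member name first, then the data to store.
        archive.writestr(arcname, csv_text)
    return buffer.getvalue()


csv_text = "name,mask\nRaphael,red\nDonatello,purple\n"
zip_bytes = csv_text_to_zip(csv_text, arcname="data.csv")

# Re-open the archive from memory and inspect what it contains.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    member_names = archive.namelist()
    round_trip = archive.read("data.csv").decode("utf-8")
```

With a real DataFrame, the CSV text could come from df.to_csv() called without a path, which returns the CSV as a string.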

Expected Output

See comments in Code Sample above for expected output.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.0-17-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: 4.3.0
pip: 19.0.3
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.16.2
scipy: 1.2.1
pyarrow: 0.11.1
xarray: None
IPython: 7.1.1
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: 2.4.11
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@drew-heenan
Contributor

I'll take a look at this!

@devforfu

Has this thing been fixed already?

@TomAugspurger
Contributor

There's an open PR: #26024.

@Sshetty2

Having the same issue.

@mullenkamp

I can confirm. The problem only occurs when using the 'zip' option.

@fingoldo

fingoldo commented Jan 29, 2021

Are you guys kidding? Has this been fixed? In pandas 1.2.1 I see the same dumb behavior again: arcname is not recognized as a valid dict option, and to_csv still saves a .zip inside the .zip, whereas it is expected to save a .csv inside the .zip. Can someone explain the correct syntax for achieving this, please? Ah sorry, found it. It can now be given as df.to_csv('name.csv.zip', compression={'method': 'zip', 'archive_name': 'name.csv'})
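
For anyone else landing here, a short sketch of that syntax with a check of the result. It assumes a pandas version that accepts a dict for compression with an archive_name entry (1.0 or later, as far as I know); the temporary directory and the index=False argument are choices made for this example only.

```python
import os
import tempfile
import zipfile

import pandas as pd

df = pd.DataFrame({"name": ["Raphael", "Donatello"], "mask": ["red", "purple"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "name.csv.zip")

    # 'archive_name' sets the name of the csv file stored inside the zip archive.
    df.to_csv(
        zip_path,
        index=False,
        compression={"method": "zip", "archive_name": "name.csv"},
    )

    # Inspect the archive to confirm the member name and peek at the header row.
    with zipfile.ZipFile(zip_path) as archive:
        member_names = archive.namelist()
        first_line = archive.read("name.csv").decode("utf-8").splitlines()[0]
```

Checking archive.namelist() is also a quick way to see what an existing archive written by to_csv actually contains.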

@CyberQin

Once my PR #40387 is accepted, you can use 'myfile.csv.zip' as the zip file name and the archive name will be inferred from it: it will use "myfile.csv" as archive_name, but it won't add ".csv" to the end automatically.

10 participants