Compression keyword for Stata and others? #26599

Closed
ozak opened this issue May 31, 2019 · 10 comments · Fixed by #34013 or #39432
Labels
Enhancement IO Stata read_stata, to_stata
Milestone

Comments

@ozak

ozak commented May 31, 2019

Hi,

I was trying to open zipped Stata files and thought one could do it as in read_csv, using the compression='zip' keyword option. Is this not implemented? I saw #15644 and its follow-ups, but at the time the discussion seems to have been about output files, not input files. In many cases providers distribute Stata files in zip format, which in my case may mean thousands of different zipped archives containing Stata and other files. #12103 seems to have added the functionality more generally, but at least with pd.__version__ == '0.23.4', read_stata still complains when passed the compression keyword. I do not see the option in pandas/io/stata.py, although it seems to be there in common.py. Any pointers? I am happy to contribute if I can figure out where and how this is implemented for other formats.

@WillAyd
Member

WillAyd commented Jun 2, 2019

Related issue #21640

@WillAyd WillAyd added the IO Stata read_stata, to_stata label Jun 2, 2019
@bashtage
Contributor

The compression option on the IO functions is for native compression. CSV files can be gzipped, so that looks like a native (and common) form of compression. Stata dta files do not support compression. All you need to do is use zipfile.ZipFile to traverse the .dta files, and then you can read them into DataFrames using read_stata. This doesn't require (or justify) adding an option to read_stata that supports something orthogonal to the dta file format.
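
For readers who want the workaround spelled out, here is a minimal sketch of the zipfile.ZipFile approach described above. The helper names iter_dta_members and read_zipped_stata are hypothetical (they are not part of pandas), and the sketch assumes each archive member fits in memory.

```python
import io
import zipfile


def iter_dta_members(zip_source):
    """Yield (member name, BytesIO) for every .dta file in a zip archive.

    zip_source may be a path or a binary file-like object.
    Hypothetical helper, not a pandas function.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            # Skip the non-Stata files that often ship in the same archive.
            if name.lower().endswith(".dta"):
                yield name, io.BytesIO(archive.read(name))


def read_zipped_stata(zip_source):
    """Return {member name: DataFrame} for each .dta member of the archive."""
    import pandas as pd  # imported lazily; only this function needs pandas

    return {name: pd.read_stata(buf) for name, buf in iter_dta_members(zip_source)}
```

Calling read_zipped_stata on an archive path should return one DataFrame per .dta member, keyed by the member's name inside the archive.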

@ozak
Author

ozak commented Jun 13, 2019

Since all kinds of files can be zipped and are distributed that way, it seems strange to implement this for one kind of file and not others. In any case, you are right that any zip file can be unzipped and loaded into a pandas data frame using zipfile.

@ozak ozak closed this as completed Jun 13, 2019
@MaxGhenis

MaxGhenis commented Dec 1, 2019

This distinction also seems strange to me; Stata files are commonly zipped (e.g. the Federal Reserve's Survey of Consumer Finances, where zipping reduces file size by 90%), and since it's possible, it seems more convenient to offer the same interface provided to read_csv (even if the underlying mechanism will differ).

I asked how to do this and got a couple answers using zipfile.ZipFile: https://stackoverflow.com/questions/59122596/read-a-zipped-stata-file-from-url-into-pandas

Adding the functionality to read_stata would save six lines of code when pulling the zip file from a URL.
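
For context, the extra code in question looks roughly like the sketch below. It is not the exact Stack Overflow answer; the function names open_first_dta and read_stata_from_zip_url are hypothetical, and it assumes the archive is small enough to hold in memory and that the first .dta member is the one wanted.

```python
import io
import zipfile
from urllib.request import urlopen


def open_first_dta(zip_bytes):
    """Return the first .dta member of an in-memory zip archive as a BytesIO."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".dta"):
                return io.BytesIO(archive.read(name))
    raise ValueError("archive contains no .dta member")


def read_stata_from_zip_url(url):
    """Download a zip archive and load its first .dta member with pandas."""
    import pandas as pd  # lazy import: only needed when a DataFrame is requested

    with urlopen(url) as response:
        zip_bytes = response.read()
    return pd.read_stata(open_first_dta(zip_bytes))


# Example call (requires network access), using the URL from this thread:
# df = read_stata_from_zip_url("https://www.federalreserve.gov/econres/files/scfp2016s.zip")
```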

@jzwinck
Contributor

jzwinck commented Dec 2, 2019

I agree, it's not nice that read_csv and read_json and read_pickle all understand compression but read_stata does not. Zipped Stata files are popular, and there's a zipsave package to create and load them in their native habitat: https://ideas.repec.org/c/boc/bocode/s446702.html

Pandas should support this. @WillAyd would you consider re-opening this?

@WillAyd WillAyd reopened this Dec 2, 2019
bashtage added a commit to bashtage/pandas that referenced this issue May 5, 2020
Add standard compression options to stata exporters

closes pandas-dev#26599
@jreback jreback added this to the 1.1 milestone May 11, 2020
@MaxGhenis

Was this added to 1.1? It's not working for me:

In [1]: import pandas as pd

In [2]: res = pd.read_stata('https://www.federalreserve.gov/econres/files/scfp2016s.zip')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-d0194b601001> in <module>
----> 1 res = pd.read_stata('https://www.federalreserve.gov/econres/files/scfp2016s.zip')

~/anaconda3/lib/python3.7/site-packages/pandas/io/stata.py in read_stata(filepath_or_buffer, convert_dates, convert_categoricals, index_col, convert_missing, preserve_dtypes, columns, order_categoricals, chunksize, iterator)                                              
   1919         columns=columns,
   1920         order_categoricals=order_categoricals,
-> 1921         chunksize=chunksize,
   1922     )
   1923

~/anaconda3/lib/python3.7/site-packages/pandas/io/stata.py in __init__(self, path_or_buf, convert_dates, convert_categoricals, index_col, convert_missing, preserve_dtypes, columns, order_categoricals, chunksize)                                                           
   1078             self.path_or_buf = BytesIO(contents)
   1079
-> 1080         self._read_header()
   1081         self._setup_dtype()
   1082

~/anaconda3/lib/python3.7/site-packages/pandas/io/stata.py in _read_header(self)
   1110             self._read_new_header()
   1111         else:
-> 1112             self._read_old_header(first_char)
   1113
   1114         self.has_string_data = len([x for x in self.typlist if type(x) is int]) > 0

~/anaconda3/lib/python3.7/site-packages/pandas/io/stata.py in _read_old_header(self, first_char)
   1314         self.format_version = struct.unpack("b", first_char)[0]
   1315         if self.format_version not in [104, 105, 108, 111, 113, 114, 115]:
-> 1316             raise ValueError(_version_error.format(version=self.format_version))
   1317         self._set_encoding()
   1318         self.byteorder = (

ValueError: Version of given Stata file is 80. pandas supports importing versions 105, 108, 111 (Stata 7SE), 113 (Stata 8/9), 114 (Stata 10/11), 115 (Stata 12), 117 (Stata 13), 118 (Stata 14/15/16),and 119 (Stata 15/16, over 32,767 variables).                           

In [3]: res = pd.read_stata('https://www.federalreserve.gov/econres/files/scfp2016s.zip', compression='zip')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-cfb1214b74db> in <module>
----> 1 res = pd.read_stata('https://www.federalreserve.gov/econres/files/scfp2016s.zip', compression='zip')

TypeError: read_stata() got an unexpected keyword argument 'compression'

In [4]: pd.__version__
Out[4]: '1.1.0'
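
A note on the first traceback: the reported version 80 is consistent with the reader treating the first byte of the zip archive as a Stata format version. Zip archives begin with the bytes PK, and the byte value of "P" is 80. The sketch below reproduces that arithmetic with the same struct.unpack("b", ...) call that appears in the traceback; the in-memory archive and its member name are made up for illustration.

```python
import io
import struct
import zipfile

# Build a tiny zip archive in memory (the member name is a placeholder).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("data.dta", b"placeholder")

# Interpret the first byte the way _read_old_header does in the traceback.
first_byte = buf.getvalue()[:1]
version = struct.unpack("b", first_byte)[0]
print(first_byte, version)  # b'P' 80
```

So the ValueError most likely means the zip container itself was parsed as if it were a .dta file, not that the enclosed file has an unsupported version.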

@MaxGhenis

Friendly ping, I got the same error messages as above in pandas 1.2.0.

@bashtage
Contributor

@MaxGhenis it has not been added to StataReader, so this is the expected outcome.

@MaxGhenis

@bashtage would adding that be a separate issue? Or is there another way to read a zipped stata file now?

@jreback
Contributor

jreback commented Jan 26, 2021

pls create a new issue

bashtage added a commit to bashtage/pandas that referenced this issue Jan 27, 2021
Add support for reading compressed dta files directly

xref pandas-dev#26599
bashtage added a commit to bashtage/pandas that referenced this issue Feb 2, 2021
Add support for reading compressed dta files directly

xref pandas-dev#26599
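
With a pandas release that includes the reader change referenced above, a round trip through a zipped file might look like the following sketch. It assumes that read_stata accepts a compression keyword in the installed version (to my knowledge this arrived in pandas 1.3.0, but treat the version as an assumption) and that to_stata accepts it as of the 1.1 milestone mentioned earlier. The file name and data are made up.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({"x": [1.5, 2.5], "y": [10.0, 20.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dta.zip")
    # Writer side: compression option on the Stata exporter.
    frame.to_stata(path, write_index=False, compression="zip")
    # Reader side: raises TypeError on versions without the reader change,
    # as shown in the traceback earlier in this thread.
    roundtrip = pd.read_stata(path, compression="zip")

print(roundtrip)
```

On older versions, the zipfile-based workaround discussed earlier in the thread remains the way to read such files.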
7 participants