Compression keyword for Stata and others? #26599

ozak · 2019-05-31T21:26:46Z

Hi,

I was trying to open zipped Stata files and thought one could do it as in read_csv, using the compression='zip' keyword option. Is this not implemented? I saw #15644 and its follow-ups, but at the time the discussion seems to have been about output files, not input files. In many cases providers distribute Stata files in zip format, which in my case may mean thousands of zipped archives containing Stata and other files. #12103 seems to have added the functionality more generally, but at least in pd.__version__ == '0.23.4' read_stata still complains when passed the compression keyword. I do not see the option in pandas/io/stata.py, although it seems to be there in common.py. Any pointers? I am happy to contribute if I can figure out where and how this is implemented for other formats.

WillAyd · 2019-06-02T02:28:20Z

Related issue #21640

bashtage · 2019-06-12T23:40:10Z

Compression in the io functions is for native compression. CSV files can be gzipped, so this looks like a native (and common) compression. Stata dta files do not support compression. All you need to do is use zipfile.ZipFile to traverse the .dta files, and then you can read them into DataFrames using read_stata. This doesn't require (or justify) adding an option to read_stata to support something that is orthogonal to the dta file format.
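
A minimal sketch of the zipfile.ZipFile approach described in this comment, for readers who want to see it spelled out. The helper name read_stata_from_zip and its member argument are hypothetical and not part of pandas; the sketch assumes the chosen archive member is an ordinary .dta file small enough to hold in memory.

```python
import io
import zipfile

import pandas as pd


def read_stata_from_zip(zip_source, member=None, **kwargs):
    """Read one .dta file out of a zip archive into a DataFrame.

    `zip_source` is a path or a binary file-like object. `member` names the
    file inside the archive; if omitted, the archive must contain exactly
    one .dta file. Hypothetical helper, not part of pandas.
    """
    with zipfile.ZipFile(zip_source) as archive:
        if member is None:
            # Archives often bundle documentation alongside the data,
            # so only entries ending in .dta are considered.
            dta_names = [
                name for name in archive.namelist()
                if name.lower().endswith(".dta")
            ]
            if len(dta_names) != 1:
                raise ValueError(
                    f"expected exactly one .dta file, found {dta_names}"
                )
            member = dta_names[0]
        # Copy the member into memory so read_stata gets a seekable buffer.
        buffer = io.BytesIO(archive.read(member))
    # Extra keyword arguments are forwarded to read_stata unchanged.
    return pd.read_stata(buffer, **kwargs)
```

When an archive holds several .dta files, passing member explicitly, or looping over archive.namelist(), is probably the simpler route.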

ozak · 2019-06-13T00:22:54Z

Since all kinds of files can be zipped and are distributed like that, it seems strange to implement it in one kind and not others. In any case, you are right that any zip file can be unzipped and loaded into a pandas data frame using zipfile.

MaxGhenis · 2019-12-01T05:43:06Z

This distinction also seems strange to me; Stata files are commonly zipped (e.g. the Federal Reserve's Survey of Consumer Finances, where zipping reduces filesize by 90%), and since it's possible, it seems more convenient to offer the same interface provided to read_csv (even if the underlying mechanism will differ).

I asked how to do this and got a couple answers using zipfile.ZipFile: https://stackoverflow.com/questions/59122596/read-a-zipped-stata-file-from-url-into-pandas

Adding the functionality to read_stata would save six lines of code when pulling the zip file from a URL.
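
For context, the URL case might look roughly like the sketch below. It is not the code from the Stack Overflow answers; read_stata_zip_url is a hypothetical name, the 30-second timeout is an arbitrary choice, and the whole archive is assumed to fit in memory.

```python
import io
import urllib.request
import zipfile

import pandas as pd


def read_stata_zip_url(url, member=None, timeout=30):
    """Download a zip archive and load one .dta member (hypothetical helper)."""
    # Fetch the whole archive; zipfile needs a seekable object, which a
    # streaming HTTP response is not.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        archive_bytes = response.read()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        if member is None:
            dta_names = [
                name for name in archive.namelist()
                if name.lower().endswith(".dta")
            ]
            if not dta_names:
                raise ValueError("archive contains no .dta file")
            # Assumption: when several .dta files exist, take the first.
            member = dta_names[0]
        dta_bytes = archive.read(member)
    return pd.read_stata(io.BytesIO(dta_bytes))
```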

jzwinck · 2019-12-02T06:38:29Z

I agree, it's not nice that read_csv and read_json and read_pickle all understand compression but read_stata does not. Zipped Stata files are popular, and there's a zipsave package to create and load them in their native habitat: https://ideas.repec.org/c/boc/bocode/s446702.html

Pandas should support this. @WillAyd would you consider re-opening this?

Add standard compression optons to stata exporters closes pandas-dev#26599

MaxGhenis · 2020-08-22T03:22:07Z

Was this added to 1.1? It's not working for me:

In [1]: import pandas as pd

In [2]: res = pd.read_stata('https://www.federalreserve.gov/econres/files/scfp2016s.zip')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-d0194b601001> in <module>
----> 1 res = pd.read_stata('https://www.federalreserve.gov/econres/files/scfp2016s.zip')

~/anaconda3/lib/python3.7/site-packages/pandas/io/stata.py in read_stata(filepath_or_buffer, convert_dates, convert_categoricals, index_col, convert_missing, preserve_dtypes, columns, order_categoricals, chunksize, iterator)                                              
   1919         columns=columns,
   1920         order_categoricals=order_categoricals,
-> 1921         chunksize=chunksize,
   1922     )
   1923

~/anaconda3/lib/python3.7/site-packages/pandas/io/stata.py in __init__(self, path_or_buf, convert_dates, convert_categoricals, index_col, convert_missing, preserve_dtypes, columns, order_categoricals, chunksize)                                                           
   1078             self.path_or_buf = BytesIO(contents)
   1079
-> 1080         self._read_header()
   1081         self._setup_dtype()
   1082

~/anaconda3/lib/python3.7/site-packages/pandas/io/stata.py in _read_header(self)
   1110             self._read_new_header()
   1111         else:
-> 1112             self._read_old_header(first_char)
   1113
   1114         self.has_string_data = len([x for x in self.typlist if type(x) is int]) > 0

~/anaconda3/lib/python3.7/site-packages/pandas/io/stata.py in _read_old_header(self, first_char)
   1314         self.format_version = struct.unpack("b", first_char)[0]
   1315         if self.format_version not in [104, 105, 108, 111, 113, 114, 115]:
-> 1316             raise ValueError(_version_error.format(version=self.format_version))
   1317         self._set_encoding()
   1318         self.byteorder = (

ValueError: Version of given Stata file is 80. pandas supports importing versions 105, 108, 111 (Stata 7SE), 113 (Stata 8/9), 114 (Stata 10/11), 115 (Stata 12), 117 (Stata 13), 118 (Stata 14/15/16),and 119 (Stata 15/16, over 32,767 variables).                           

In [3]: res = pd.read_stata('https://www.federalreserve.gov/econres/files/scfp2016s.zip', compression='zip')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-cfb1214b74db> in <module>
----> 1 res = pd.read_stata('https://www.federalreserve.gov/econres/files/scfp2016s.zip', compression='zip')

TypeError: read_stata() got an unexpected keyword argument 'compression'

In [4]: pd.__version__
Out[4]: '1.1.0'

MaxGhenis · 2021-01-26T05:38:29Z

Friendly ping, I got the same error messages as above in pandas 1.2.0.

bashtage · 2021-01-26T08:04:14Z

@MaxGhenis it has not been added to StataReader, so this is the expected outcome.

MaxGhenis · 2021-01-26T15:34:01Z

@bashtage would adding that be a separate issue? Or is there another way to read a zipped stata file now?

jreback · 2021-01-26T18:17:49Z

pls create a new issue

Add support for reading compressed dta files directly xref pandas-dev#26599
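
Once both changes referenced in this thread are available (#34013 for the exporters and #39432 for read_stata and StataReader), a round trip could look like the sketch below. It assumes a pandas release that includes both; the thread shows that 1.1.0 and 1.2.0 do not accept compression in read_stata, and the exact release that first does is not stated here. The function name roundtrip_zipped_stata is hypothetical.

```python
import pandas as pd


def roundtrip_zipped_stata(frame, path):
    """Write `frame` to a zipped .dta file at `path`, then read it back.

    Assumes a pandas version in which both to_stata and read_stata accept
    the `compression` keyword.
    """
    # Write the frame as a .dta file inside a zip archive.
    frame.to_stata(path, write_index=False, compression="zip")
    # Read the archive back directly, without calling zipfile by hand.
    return pd.read_stata(path, compression="zip")
```

Like the other pandas readers mentioned above, the keyword also accepts "infer", which picks the codec from the file extension, so a path ending in .zip should not need it spelled out.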

WillAyd added the IO Stata read_stata, to_stata label Jun 2, 2019

ozak closed this as completed Jun 13, 2019

MaxGhenis mentioned this issue Dec 2, 2019

Use more efficient approach for read_stata_zip function. PolicyEngine/microdf#48

Merged

WillAyd reopened this Dec 2, 2019

mroeschke added the Enhancement label Apr 3, 2020

bashtage added a commit to bashtage/pandas that referenced this issue May 5, 2020

ENH: Add compression to stata exporters

3553924

Add standard compression optons to stata exporters closes pandas-dev#26599

bashtage mentioned this issue May 5, 2020

ENH: Add compression to stata exporters #34013

Merged

5 tasks

bashtage added a commit to bashtage/pandas that referenced this issue May 5, 2020

ENH: Add compression to stata exporters

f1c87bf

Add standard compression optons to stata exporters closes pandas-dev#26599

jreback added this to the 1.1 milestone May 11, 2020

jreback closed this as completed in #34013 May 12, 2020

MaxGhenis mentioned this issue Aug 22, 2020

Remove read_stata_zip once pandas adds it natively PolicyEngine/microdf#95

Open

bashtage added a commit to bashtage/pandas that referenced this issue Jan 27, 2021

ENH: Add compression to read_stata and StataReader

1f52991

Add support for reading compressed dta files directly xref pandas-dev#26599

bashtage mentioned this issue Jan 27, 2021

ENH: Add compression to read_stata and StataReader #39432

Merged

4 tasks

bashtage added a commit to bashtage/pandas that referenced this issue Feb 2, 2021

ENH: Add compression to read_stata and StataReader

87403ad

Add support for reading compressed dta files directly xref pandas-dev#26599
