Trouble writing to_stata with a GzipFile #21041
Comments
Thanks! Could you narrow down your example to a minimal example? It's hard to see exactly what the problem is with that long of an input. http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports |
Sorry about that! I reorganized things above, hopefully for the better. |
Thanks for the update. Agreed something is going on here. Any interest in debugging further? It'll all be in https://github.com/pandas-dev/pandas/blob/master/pandas/io/stata.py probably. I can help narrow it down further if you need. |
I can give it a look, but it will take me a bit. If you (or anyone else reading this) want to get this working sooner, please do! |
I can see this is never going to work with the new-format Stata dta writer, since writing a dta file requires rewriting some values in an area of the file called the map. It might work with the old format since it is pretty linear. The docstring should be updated to reflect that it only really works with file objects, not general file-like objects. |
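To illustrate why rewriting the map is a problem for compressed streams, here is a minimal standard-library sketch. It does not use the real dta layout: `write_with_backpatch` is a hypothetical stand-in for any writer that reserves a slot near the start of the file and seeks back to fill it in later. A `GzipFile` opened for writing cannot seek backward, so one possible workaround is to build the bytes in a seekable buffer first and compress afterwards.

```python
import gzip
import io
import struct


def write_with_backpatch(fh, payload):
    """Hypothetical writer (not the real dta map): reserve an 8-byte slot,
    write the payload, then seek back and record where the payload ended."""
    fh.write(struct.pack("<q", 0))   # placeholder for the end offset
    fh.write(payload)
    end = fh.tell()
    fh.seek(0)                       # needs a stream that can seek backward
    fh.write(struct.pack("<q", end))
    fh.seek(end)


def write_gzipped(payload):
    """Build the bytes in a seekable buffer, then compress them in one pass."""
    buf = io.BytesIO()
    write_with_backpatch(buf, payload)
    out = io.BytesIO()
    with gzip.GzipFile(fileobj=out, mode="wb") as gz:
        gz.write(buf.getvalue())
    return out.getvalue()


# Writing straight into a GzipFile fails at the backward seek.
try:
    with gzip.GzipFile(fileobj=io.BytesIO(), mode="wb") as gz:
        write_with_backpatch(gz, b"data")
    direct_ok = True
except OSError:
    direct_ok = False

# Buffering first round-trips cleanly.
roundtrip = gzip.decompress(write_gzipped(b"data"))
```

The cost of buffering is that the whole uncompressed file is held in memory before compression starts.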
It appears that ndarray.tofile, which is used to write the data, does not work correctly with gzip files.
and then
|
Ah, that map business is messy. |
Unfortunately, BytesIO doesn't work with NumPy tofile either. TempFile is destroyed immediately on close and so neither of these would allow gzipped files to be cleanly written to disk without an intermediate dta file. |
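For reference, one way to sidestep `tofile` is to serialize the array with `ndarray.tobytes()` and pass the result to the file-like object's own `write` method. This is a minimal sketch of that idea, not necessarily what the eventual patch does; the array contents and dtype here are made up for illustration.

```python
import gzip
import io

import numpy as np

# Illustrative data only; little-endian 32-bit integers.
arr = np.arange(5, dtype="<i4")

out = io.BytesIO()
with gzip.GzipFile(fileobj=out, mode="wb") as gz:
    # write() goes through the compressor, unlike tofile()
    gz.write(arr.tobytes())

# Decompress and reinterpret the bytes to confirm nothing was lost.
restored = np.frombuffer(gzip.decompress(out.getvalue()), dtype="<i4")
```

Unlike `tofile`, this makes a full in-memory copy of the array's bytes before writing.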
Enable support for general file-like objects when exporting stata files closes pandas-dev#21041
@bashtage, you can get around the deletion with:
import tempfile
import subprocess
import shutil
import pandas as pd
df = pd.DataFrame({
    'a': [1],
    'b': [1.5],
    'c': ["z"]})
# delete=False keeps the file on disk after close()
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".dta")
df.to_stata(tmp.name)
tmp.close()
# gzip replaces tmp.name with tmp.name + ".gz"
subprocess.run(['gzip', tmp.name])
# some_file is the desired destination path
shutil.move(tmp.name + ".gz", some_file) |
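The same temporary-file workaround can be sketched without shelling out to the `gzip` binary, using only the standard library. The helper name `gzip_via_tempfile` and its `write_plain` parameter are hypothetical; `write_plain` is assumed to be any callable that writes an uncompressed file to the path it is given.

```python
import gzip
import os
import shutil
import tempfile


def gzip_via_tempfile(write_plain, dest):
    """Call write_plain(path) on a temporary file, then gzip the result to dest.

    Both names are illustrative; write_plain could be df.to_stata.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".dta")
    os.close(fd)                 # the writer reopens the file by name
    try:
        write_plain(tmp_path)
        with open(tmp_path, "rb") as src, gzip.open(dest, "wb") as dst:
            shutil.copyfileobj(src, dst)   # stream the bytes through gzip
    finally:
        os.remove(tmp_path)      # clean up the intermediate file
```

With the DataFrame from the comment above, this could be called as `gzip_via_tempfile(df.to_stata, some_file)`. An intermediate dta file is still written to disk, but it is removed even if the writer raises.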
Could use something like:
with current Pandas. The patch fixes this issue so that a standard gzip can be used. It should be in 0.23.1 |
Enable support for general file-like objects when exporting stata files closes pandas-dev#21041 (cherry picked from commit f91e28c)
Problem description
When writing a Stata dataset to a GzipFile, the written dataset is all zero/blank.
I think Pandas would ideally write the correct information to the GzipFile Stata output or, if that's not an easy change, might consider raising an error when the user tries to write to a GzipFile.
Expected Output
I expected to read back the same data I tried to write, or to get an error when writing.
Here's the table I tried to write to the GzipFile (df in the code):
Here's the table that gets read back (df_from_gzip in the code):
I think this is an error in writing, rather than in reading back, because Stata reads the same all-zeros table.
Code Sample
Other fun facts
bz2.BZ2File and lzma.LZMAFile refuse to write dta files, with the error "UnsupportedOperation: Seeking is only supported on files open for reading"
read_stata; opening the files in Stata itself gives the same results.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-20-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: None
pip: 9.0.3
setuptools: 39.1.0
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: 0.9.0
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None