to_stata + read_stata results in NaNs (close to double precision limit) #14618

Closed
mverleg opened this issue Nov 8, 2016 · 15 comments
Labels: Bug; IO Stata (read_stata, to_stata); Numeric Operations (Arithmetic, Comparison, and Logical operations)

mverleg commented Nov 8, 2016

Explanation

Saving and loading data as stata results in a lot of NaNs.

I think the code & output is pretty self-explanatory, otherwise please ask.

I've not been able to test this on other systems yet.

If this is somehow expected behaviour, maybe a bigger warning would be in order.

A small, complete example of the issue

	from numpy.random import RandomState
	from pandas import DataFrame, read_stata
	
	pth = '/tmp/demo.dta'
	rs = RandomState(seed=123456789)
	# Uniform random values spanning almost the full float64 range (up to +/- the double maximum).
	data = (2 * rs.rand(1000, 400).astype('float64') - 1) * 1.7976931348623157e+308
	
	colnames = tuple('c{0:03d}'.format(k) for k in range(data.shape[1]))
	frame = DataFrame(data=data, columns=colnames)
	# .dta is a binary format, so open the file in binary mode for writing and reading.
	with open(pth, 'wb') as fh:
		frame.to_stata(fh)
	
	with open(pth, 'rb') as fh:
		frame2 = read_stata(fh)
	
	print(frame2.tail())

Expected Output

     index          c000           c001           c002           c003  \
995    995  1.502566e+308  1.019238e+308 -1.169342e+308  6.845363e+307
996    996 -3.418435e+307 -8.113486e+307  2.544741e+306  5.771775e+307
997    997  1.507324e+308  4.610183e+307 -1.016633e+308 -1.632862e+308
998    998 -8.138620e+307  6.312126e+307 -6.557370e+307  6.342690e+307
999    999 -1.179032e+308  1.554709e+308 -1.175680e+308  1.921731e+307

              c004           c005           c006           c007  \
995  1.611898e+308 -5.171776e+307 -8.918000e+307 -5.322720e+307
996  3.693405e+307 -1.480267e+308  1.586053e+308  7.489689e+306
997  1.060605e+308 -6.826590e+307  1.644990e+308 -1.379562e+308
998  1.379642e+308  1.005632e+307 -1.206948e+308 -1.198931e+308
999 -5.965607e+307  8.844623e+307  2.727894e+307 -5.433995e+307

              c008           c009      ...                 c390  \
995 -6.580851e+306  1.284482e+308      ...       -1.770789e+308
996 -9.312612e+307 -1.778315e+308      ...        7.410784e+307
997 -9.415141e+307  9.058828e+307      ...       -5.451829e+305
998  1.651712e+308  4.435415e+307      ...        5.220773e+307
999 -1.747738e+308 -1.603248e+308      ...        1.415798e+307

              c391           c392           c393           c394  \
995  7.360232e+307 -3.850417e+307  1.453624e+308  5.690363e+307
996 -6.943490e+307  1.047268e+308  4.026712e+307  9.161669e+305
997  4.406343e+306  1.617739e+308  4.218585e+307  1.573892e+307
998 -2.390131e+307 -6.649416e+307  6.548489e+307  1.000078e+307
999 -1.239203e+308 -5.038284e+307 -1.340608e+307 -1.193758e+308

              c395           c396           c397           c398           c399
995  8.371989e+307  3.491895e+307  7.344525e+307 -9.260950e+307  1.032120e+308
996  9.200510e+307 -1.729595e+308  4.021503e+307  2.274318e+307  5.856302e+307
997 -7.624901e+307 -1.206386e+308 -6.164537e+306 -7.634148e+307 -1.462809e+308
998 -9.399560e+307  9.697224e+307 -6.963726e+307 -1.655656e+308  1.513218e+308
999 -1.476121e+308  1.187603e+308  1.402195e+308 -1.584051e+308 -1.232190e+308

[5 rows x 401 columns]

Actual Output

     index           c000           c001           c002           c003  \
995    995            NaN            NaN -1.169342e+308  6.845363e+307
996    996 -3.418435e+307 -8.113486e+307  2.544741e+306  5.771775e+307
997    997            NaN  4.610183e+307 -1.016633e+308 -1.632862e+308
998    998 -8.138620e+307  6.312126e+307 -6.557370e+307  6.342690e+307
999    999 -1.179032e+308            NaN -1.175680e+308  1.921731e+307

              c004           c005           c006           c007  \
995            NaN -5.171776e+307 -8.918000e+307 -5.322720e+307
996  3.693405e+307 -1.480267e+308            NaN  7.489689e+306
997            NaN -6.826590e+307            NaN -1.379562e+308
998            NaN  1.005632e+307 -1.206948e+308 -1.198931e+308
999 -5.965607e+307  8.844623e+307  2.727894e+307 -5.433995e+307

              c008      ...                 c390           c391  \
995 -6.580851e+306      ...       -1.770789e+308  7.360232e+307
996 -9.312612e+307      ...        7.410784e+307 -6.943490e+307
997 -9.415141e+307      ...       -5.451829e+305  4.406343e+306
998            NaN      ...        5.220773e+307 -2.390131e+307
999 -1.747738e+308      ...        1.415798e+307 -1.239203e+308

              c392           c393           c394           c395  \
995 -3.850417e+307            NaN  5.690363e+307  8.371989e+307
996            NaN  4.026712e+307  9.161669e+305            NaN
997            NaN  4.218585e+307  1.573892e+307 -7.624901e+307
998 -6.649416e+307  6.548489e+307  1.000078e+307 -9.399560e+307
999 -5.038284e+307 -1.340608e+307 -1.193758e+308 -1.476121e+308

              c396           c397           c398           c399
995  3.491895e+307  7.344525e+307 -9.260950e+307            NaN
996 -1.729595e+308  4.021503e+307  2.274318e+307  5.856302e+307
997 -1.206386e+308 -6.164537e+306 -7.634148e+307 -1.462809e+308
998            NaN -6.963726e+307 -1.655656e+308            NaN
999            NaN            NaN -1.584051e+308 -1.232190e+308

[5 rows x 401 columns]

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8

pandas: 0.18.1
nose: None
pip: 9.0.1
setuptools: 26.1.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Contributor

jreback commented Nov 8, 2016

these look like out-of-bounds floats (or very close to the limit)
can you actually generate this with Stata?
i am not sure how much precision it has anyhow

Author

mverleg commented Nov 8, 2016

Yeah it's close to the limit, but should be within an IEEE double precision float I think. I don't know if Stata deals with those somehow differently.
EDIT: but yes, it's based on the magnitude of numbers. It's the biggest ones that become NaNs, and in the visible output it's exclusively positive ones.

mverleg changed the title from "to_stata + read_stata results in NaNs" to "to_stata + read_stata results in NaNs (close to double precision limit)" Nov 8, 2016
Contributor

jreback commented Nov 8, 2016

the floats might have overflowed and so when round tripped they have an undefined behavior

welcome to have a look

though using numbers close to the limit can easily cause issues - any particular reason you are trying to do this?

Author

mverleg commented Nov 8, 2016

If they're overflowing (which seems likely), it happens in either to_stata or read_stata. The values haven't overflowed in the script itself; I compared other formats (npy, csv, etc.), which don't give NaNs.

I'm using it for a benchmark, so I'm using random data that uses the full range to test compression. I guess most people don't use such data and I don't urgently need it, so this is probably a low-priority bug. But it seems like a bug nonetheless.

Contributor

jreback commented Nov 8, 2016

csv is not a valid comparison; the floats get stringified

sure could be either in to or from stata - i'll mark it but would take a community pull request to fix

jreback added the Bug, Numeric Operations (Arithmetic, Comparison, and Logical operations), IO Stata (read_stata, to_stata), and Difficulty Intermediate labels Nov 8, 2016
jreback added this to the Next Major Release milestone Nov 8, 2016
Contributor

jreback commented Nov 8, 2016

cc @bashtage

Author

mverleg commented Nov 8, 2016

Yeah CSV doesn't win the benchmark :-)

Thanks, I'll use smaller data for now, hope no one else runs into it!

Contributor

bashtage commented Nov 8, 2016

Stata has a maximum value for doubles and uses the very largest values as codes for missing values (., .a, ..., .z), so they cannot be stored as ordinary numbers.

From the dta spec:

                minimum nonmissing    -1.798e+308 (-1.fffffffffffffX+3ff)
                maximum nonmissing    +8.988e+307 (+1.fffffffffffffX+3fe)
                code for .                        (+1.0000000000000X+3ff)
                code for .a                       (+1.0010000000000X+3ff)
                code for .b                       (+1.0020000000000X+3ff)
                ...
                code for .z                       (+1.01a0000000000X+3ff)
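
The hex notation in the spec excerpt maps onto Python's `float.fromhex`, which may make the thresholds easier to see. This is only a sketch: the constant names are made up for illustration, and the spec's `X+3fe` is read here as a hexadecimal exponent (0x3fe = 1022, 0x3ff = 1023).

```python
import sys

# Values from the dta spec excerpt, rewritten in Python hex-float syntax.
# The constant names are illustrative and are not pandas or Stata identifiers.
STATA_DOUBLE_MIN = float.fromhex('-0x1.fffffffffffffp+1023')  # minimum nonmissing
STATA_DOUBLE_MAX = float.fromhex('0x1.fffffffffffffp+1022')   # maximum nonmissing
STATA_MISSING_DOT = float.fromhex('0x1.0p+1023')              # code for "."
STATA_MISSING_A = float.fromhex('0x1.001p+1023')              # code for ".a"

# The negative limit is the ordinary float64 limit ...
print(STATA_DOUBLE_MIN == -sys.float_info.max)
# ... but the positive limit sits just below 2**1023, roughly half of it.
print(STATA_DOUBLE_MAX < STATA_MISSING_DOT < STATA_MISSING_A)
print(STATA_MISSING_DOT == 2.0 ** 1023)
```

Read this way, any positive double of about 8.988e+307 or more falls in the range reserved for missing-value codes, which matches the observation that only the largest positive values turned into NaN.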

Contributor

bashtage commented Nov 8, 2016

Would probably be best to warn/error when values like these are encountered for float and double. Right now integers are promoted to a larger type if possible to avoid this issue.
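
A rough sketch of what such a check could look like for doubles. `check_stata_double_range` is a hypothetical helper written for this illustration, not the pandas implementation, and it assumes NaN is acceptable on input because it is written as a missing value.

```python
import numpy as np

# Largest double Stata treats as a regular value, per the dta spec quoted above.
STATA_DOUBLE_MAX = float.fromhex('0x1.fffffffffffffp+1022')


def check_stata_double_range(values):
    """Raise ValueError if `values` cannot be stored as nonmissing Stata doubles.

    Hypothetical helper for illustration only.
    """
    arr = np.asarray(values, dtype='float64')
    if np.isinf(arr).any():
        raise ValueError('infinite values cannot be written to a Stata file')
    # NaN is ignored here (assumption: it is written as a missing value).
    finite = arr[~np.isnan(arr)]
    if finite.size and finite.max() > STATA_DOUBLE_MAX:
        raise ValueError(
            'values above {0!r} collide with Stata missing-value codes'
            .format(STATA_DOUBLE_MAX))
    return arr


check_stata_double_range([1.0, -1.7e308, float('nan')])  # passes silently
```

Only the upper bound needs a test, since the spec's minimum nonmissing double equals the float64 minimum.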

Contributor

bashtage commented Nov 8, 2016

I don't think there is any promise to correctly round trip to_stata/read_stata, especially for edge cases. The most important cases are to read data saved by Stata with read_stata and generate files Stata will correctly read in with to_stata.

Contributor

bashtage commented Nov 8, 2016

Also, for performance measurement, at its core to_stata uses ndarray.tofile. You should just use this rather than going through to_stata. tofile is very fast since it just dumps the memory contents of an ndarray to disk and is usually limited by disk read/write speed.
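
For reference, a minimal round trip with `ndarray.tofile` and `numpy.fromfile` on the same kind of data as in the report. The temporary path is an arbitrary choice for this sketch; the key point is that the file holds raw bytes only, so dtype and shape have to be supplied again when reading.

```python
import os
import tempfile

import numpy as np

rs = np.random.RandomState(seed=123456789)
# Same construction as in the report: values spread over the float64 range.
data = (2 * rs.rand(1000, 400) - 1) * np.finfo('float64').max

path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
data.tofile(path)  # raw memory dump: no header, shape, dtype, or byte order

# dtype and shape are not stored in the file, so the reader must know them.
restored = np.fromfile(path, dtype='float64').reshape(data.shape)
print(np.array_equal(data, restored))
```

Because no value is reinterpreted along the way, the extreme values survive this round trip unchanged.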

Author

mverleg commented Nov 9, 2016

Ah I guess it's related to the encoding thing. It's probably good to use NaNs rather than just returning .a as a number. So the current behaviour is, in a sense, the desirable way.

A warning would be useful though, if the performance penalty is worth it, which I'm not sure of.

Also thanks for the benchmark hint.
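
To make the effect on the benchmark data concrete, the following sketch estimates how many of the generated values land in the missing-value range. It assumes that every double at or above the code for `.` is read back as NaN, which is consistent with the output in the report but is not taken from the pandas source.

```python
import numpy as np

# Smallest double reserved as a missing-value code (the code for "."), per the spec.
STATA_MISSING_DOT = float.fromhex('0x1.0p+1023')

rs = np.random.RandomState(seed=123456789)
data = (2 * rs.rand(1000, 400) - 1) * np.finfo('float64').max

# Assumption: anything at or above the "." code comes back as NaN.
would_be_nan = data >= STATA_MISSING_DOT

print(would_be_nan.mean())              # share of affected cells
print((data[would_be_nan] > 0).all())   # only positive values are affected
```

Since the values are uniform over roughly (-1.8e+308, 1.8e+308) and the cutoff is about half the positive maximum, around a quarter of the cells should be affected.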

bashtage added a commit to bashtage/pandas that referenced this issue Nov 10, 2016
Add explicit error checking for out-of-range doubles when writing Stata files

closes pandas-dev#14618
bashtage added a commit to bashtage/pandas that referenced this issue Nov 10, 2016
Add explicit error checking for out-of-range doubles when writing Stata files
Upcasts float32 to float64 if out-of-range values encountered

closes pandas-dev#14618
@bashtage
Contributor

@jreback I assume #14631 is causing these segfaults. Otherwise I think I'm happy with it.

bashtage added a commit to bashtage/pandas that referenced this issue Nov 11, 2016
Add explicit error checking for out-of-range doubles when writing Stata files
Upcasts float32 to float64 if out-of-range values encountered

closes pandas-dev#14618
bashtage added a commit to bashtage/pandas that referenced this issue Nov 14, 2016
Add explicit error checking for out-of-range doubles when writing Stata files
Upcasts float32 to float64 if out-of-range values encountered

closes pandas-dev#14618
@bashtage
Contributor

I think this is ready unless you see something.

bashtage added a commit to bashtage/pandas that referenced this issue Nov 15, 2016
Add explicit error checking for out-of-range doubles when writing Stata files
Upcasts float32 to float64 if out-of-range values encountered
Tests for infinite values and raises if found

closes pandas-dev#14618
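
The float32 upcast mentioned in the commit message could be sketched roughly as follows. `upcast_if_out_of_range` is a hypothetical name, the code is not taken from the referenced commit, and the float limit (+1.fffffeX+7e, about 1.701e+38) is my reading of the dta spec rather than something quoted in this thread.

```python
import numpy as np

# Assumed largest nonmissing Stata float (+1.fffffeX+7e in the spec's notation).
STATA_FLOAT_MAX = float.fromhex('0x1.fffffep+126')


def upcast_if_out_of_range(column):
    """Return `column` as float64 if it is float32 and exceeds the float limit.

    Hypothetical helper for illustration only.
    """
    arr = np.asarray(column)
    if arr.dtype == np.float32 and arr.size:
        # nanmax ignores NaN, which is assumed to be written as a missing value.
        if np.nanmax(arr) > STATA_FLOAT_MAX:
            return arr.astype(np.float64)
    return arr


small = upcast_if_out_of_range(np.array([1.0, 2.0], dtype='float32'))
large = upcast_if_out_of_range(np.array([1.0, 3e38], dtype='float32'))
print(small.dtype, large.dtype)
```

A value such as 3e38 fits in a float32 but lies above the assumed Stata float limit, so storing the column as a double keeps it as a regular number; a range check for doubles would still be needed afterwards.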
Author

mverleg commented Nov 16, 2016

Couldn't get cython to work to test it, but the source looks good!

bashtage added a commit to bashtage/pandas that referenced this issue Nov 17, 2016
Add explicit error checking for out-of-range doubles when writing Stata files
Upcasts float32 to float64 if out-of-range values encountered
Tests for infinite values and raises if found

closes pandas-dev#14618
jreback changed the milestone from Next Major Release to 0.19.2 Nov 17, 2016
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this issue Dec 14, 2016
Add explicit error checking for out-of-range doubles when writing Stata files
Upcasts float32 to float64 if out-of-range values encountered
Tests for infinite values and raises if found

closes pandas-dev#14618
closes pandas-dev#14637

(cherry picked from commit fe555db)