to_stata + read_stata results in NaNs (close to double precision limit) #14618
These look like out-of-bounds floats (or very close to the limit).
Yeah, it's close to the limit, but it should still be within IEEE double precision, I think. I don't know whether Stata handles such values differently.
The floats might have overflowed, in which case the round-tripped values have undefined behavior. You're welcome to have a look, though using numbers close to the limit can easily cause issues. Any particular reason you are trying to do this?
If they're overflowing (which seems likely), it happens in either to_stata or read_stata; the values haven't overflowed in the script itself, since other formats (npy, csv, etc.) round-trip the same data without NaNs. I'm using it for a benchmark, so I generate random data that spans the full range to test compression. I guess most people don't use such data, and I don't urgently need it, so this is probably a low-priority bug. But it seems like a bug nonetheless.
CSV is not a valid comparison: the floats get stringified. Sure, it could be in either the to- or from-Stata direction. I'll mark it, but it would take a community pull request to fix.
cc @bashtage
Yeah, CSV doesn't win the benchmark :-) Thanks, I'll use smaller data for now; hope no one else runs into it!
Stata has a maximum value for doubles and uses the very largest values to indicate coded (missing) values. From the dta spec: …
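The quoted spec table didn't survive in this copy; the relevant figures are roughly -1.798e+308 to +8.988e+307 for doubles, which is narrower than the IEEE 754 float64 range. A quick check (the Stata cap below is that assumed figure, not a constant taken from pandas):

```python
import numpy as np

STATA_DOUBLE_MAX = 8.988e307          # assumed cap from the dta spec
ieee_max = np.finfo(np.float64).max   # ~1.7976931348623157e+308

# Any finite IEEE double above STATA_DOUBLE_MAX has no representation in
# a dta file; it falls into the missing-value code range and reads back
# as NaN.
print(ieee_max)
print(STATA_DOUBLE_MAX < ieee_max)    # True
```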
Would probably be best to warn/error when values like these are encountered for float and double. Right now integers are promoted to a larger type, if possible, to avoid this issue.
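A minimal sketch of such a check, along the lines of what the linked commits describe (the constants and the helper name are illustrative, not pandas' actual implementation):

```python
import numpy as np

# Illustrative Stata double range, per the spec discussion above.
STATA_DOUBLE_MIN = -1.798e308
STATA_DOUBLE_MAX = +8.988e307

def validate_for_stata(col: np.ndarray) -> np.ndarray:
    """Hypothetical pre-write check: upcast float32 and raise on values
    Stata cannot represent, instead of silently producing missing codes."""
    if col.dtype == np.float32:
        # Promote so the double-range comparison below is exact.
        col = col.astype(np.float64)
    if np.isinf(col).any():
        raise ValueError("infinite values cannot be written to a dta file")
    out_of_range = (col < STATA_DOUBLE_MIN) | (col > STATA_DOUBLE_MAX)
    if out_of_range.any():
        raise ValueError(
            "values outside Stata's double range "
            "(-1.798e+308 to +8.988e+307) cannot be written"
        )
    return col
```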
I don't think there is any promise to correctly round-trip to_stata/read_stata, especially for edge cases. The most important cases are to read data saved by Stata with …
Also, for performance measurement, at its core to_stata uses …
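For timing just the write path, something like the following works (a rough sketch; to_stata in pandas of this era takes a file path, and the frame shape here is arbitrary):

```python
import time

import numpy as np
import pandas as pd

# Arbitrary benchmark frame; adjust the shape to taste.
df = pd.DataFrame(np.random.standard_normal((1_000_000, 5)),
                  columns=list("abcde"))

start = time.perf_counter()
df.to_stata("bench.dta", write_index=False)
print(f"to_stata: {time.perf_counter() - start:.2f} s")
```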
Ah, I guess it's related to the encoding thing. It's probably good to use NaNs rather than just returning … A warning would be useful though, if the performance penalty is worth it; I'm not sure it is. Also, thanks for the benchmark hint.
Add explicit error checking for out-of-range doubles when writing Stata files. Closes pandas-dev#14618.
Add explicit error checking for out-of-range doubles when writing Stata files. Upcasts float32 to float64 if out-of-range values are encountered. Closes pandas-dev#14618.
I think this is ready unless you see something.
Add explicit error checking for out-of-range doubles when writing Stata files. Upcasts float32 to float64 if out-of-range values are encountered. Tests for infinite values and raises if found. Closes pandas-dev#14618.
Couldn't get Cython to work to test it, but the source looks good!
Add explicit error checking for out-of-range doubles when writing Stata files. Upcasts float32 to float64 if out-of-range values are encountered. Tests for infinite values and raises if found. Closes pandas-dev#14618. Closes pandas-dev#14637. (cherry picked from commit fe555db)
Explanation
Saving and loading data as Stata results in a lot of NaNs.
I think the code and output are pretty self-explanatory; otherwise, please ask.
I've not been able to test this on other systems yet.
If this is somehow expected behaviour, maybe a bigger warning would be in order.
A small, complete example of the issue
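The original snippet wasn't preserved in this copy; a minimal reproduction consistent with the report might look like the following (file name, seed, and sizes are illustrative, and this only reproduces on a pandas version predating the fix above):

```python
import numpy as np
import pandas as pd

# Random doubles spanning nearly the full IEEE float64 range.
rng = np.random.RandomState(0)
vals = (rng.random_sample(1000) * 2.0 - 1.0) * 1.7e308
df = pd.DataFrame({"x": vals})

df.to_stata("roundtrip.dta", write_index=False)
back = pd.read_stata("roundtrip.dta")

# Values above Stata's maximum double (~8.988e+307) read back as NaN.
print("NaNs after round trip:", back["x"].isnull().sum())
```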
Expected Output
The round-tripped frame matches the original (no NaNs).
Actual Output
A large fraction of the values come back as NaN.
Output of pd.show_versions():
pandas: 0.18.1
nose: None
pip: 9.0.1
setuptools: 26.1.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None