ENH: Explicit range checking when writing Stata #14637

Closed
wants to merge 1 commit into from

Conversation

bashtage
Contributor

Add explicit error checking for out-of-range doubles when writing Stata files

closes #14618

DOUBLE_MAX = struct.unpack('<d', b'\x00\x00\x00\x00\x00\x00\xe0\x7f')[0]
for col in data:
    if data[col].dtype == np.double:
        value = data[col].max()
Contributor

Use pandas.types.common.is_floating_dtype (or float_dtype; I forgot which).

@bashtage
Contributor Author

@jreback That doesn't work since it is True for np.float32.
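
For readers following the diff excerpt above, here is a minimal sketch of what the byte string in `DOUBLE_MAX` decodes to. It uses only the standard library and copies the bytes verbatim from the excerpt. Reading the result as the first value Stata treats as missing is an interpretation based on the discussion in this thread, not something the excerpt itself states.

```python
import struct
import sys

# Little-endian bytes copied verbatim from the diff excerpt in this thread.
DOUBLE_MAX = struct.unpack('<d', b'\x00\x00\x00\x00\x00\x00\xe0\x7f')[0]

# The bit pattern has a zero mantissa, so the value is an exact power of two.
print(DOUBLE_MAX == 2.0 ** 1023)

# It is finite and sits below the largest IEEE 754 double, so an ordinary
# float64 can hold values that lie at or above this cutoff.
print(DOUBLE_MAX < sys.float_info.max)
```

Because the cutoff is below `sys.float_info.max`, a float64 column can contain finite values that fall outside it, which is the case this PR sets out to detect.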

@bashtage bashtage force-pushed the stata-max-value-check branch from b6f6432 to db89413 Compare November 10, 2016 23:48
@bashtage
Contributor Author

Changed the approach and also added check for float32 column range with upcast if needed.
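
A rough sketch of the float32 range decision described above, using only the standard library. The function name `storage_type` is hypothetical, and the float32 upper limit used here (bytes `ff ff ff 7e`, little-endian) is an assumption that does not appear anywhere in this thread. The actual change operates on DataFrame columns and is not reproduced here.

```python
import math
import struct

# ASSUMPTION: upper limit for values stored as a Stata "float".
# This constant is not quoted in the thread; it is used for illustration only.
FLOAT32_MAX = struct.unpack('<f', b'\xff\xff\xff\x7e')[0]


def storage_type(values):
    """Hypothetical helper: pick 'float' or 'double' storage for a column.

    NaN entries are ignored because they are written as missing values.
    """
    present = [v for v in values if not math.isnan(v)]
    if present and max(present) > FLOAT32_MAX:
        # The largest value does not fit the narrower type, so upcast.
        return 'double'
    return 'float'


print(storage_type([1.0, 2.5, float('nan')]))
print(storage_type([1.0, 3e38]))
```

Checking only the column maximum mirrors the `data[col].max()` call in the earlier diff excerpt.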

@jorisvandenbossche jorisvandenbossche added Bug IO Stata read_stata, to_stata labels Nov 11, 2016
@bashtage bashtage force-pushed the stata-max-value-check branch 2 times, most recently from 90d65fe to af41353 Compare November 14, 2016 11:23
@codecov-io

codecov-io commented Nov 14, 2016

Current coverage is 85.28% (diff: 100%)

Merging #14637 into master will increase coverage by <.01%

@@             master     #14637   diff @@
==========================================
  Files           140        140          
  Lines         50693      50706    +13   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43235      43247    +12   
- Misses         7458       7459     +1   
  Partials          0          0          

Powered by Codecov. Last update 726efc7...55a98f5

@@ -80,3 +80,5 @@ Performance Improvements

Bug Fixes
~~~~~~~~~

- Explicit check in ``to_stata`` and ````StataWriter `` for out-of-range values when writing doubles (:issue:`14618`)
Contributor

move to 0.19.2; too many quotes on StataWriter

@@ -1234,6 +1233,37 @@ def test_stata_111(self):
        original = original[['y', 'x', 'w', 'z']]
        tm.assert_frame_equal(original, df)

    def test_out_of_range_double(self):
        # GH 14618
        df = DataFrame({'ColumnOk': [0.0,
Contributor

Can you throw some infs (and -infs) in here as well (unless that screws up the test)?

@bashtage
Contributor Author

Need to check behavior of infs/-inf, as well as NaN in Stata. It might support these values.

@bashtage bashtage force-pushed the stata-max-value-check branch from af41353 to f057d03 Compare November 15, 2016 12:31
@@ -80,3 +80,5 @@ Performance Improvements

Bug Fixes
~~~~~~~~~

Contributor

you can move to 0.19.2

Add explicit error checking for out-of-range doubles when writing Stata files
Upcasts float32 to float64 if out-of-range values encountered
Tests for infinite values and raises if found

closes pandas-dev#14618
@bashtage bashtage force-pushed the stata-max-value-check branch from f057d03 to 55a98f5 Compare November 17, 2016 11:24
@jreback
Contributor

jreback commented Nov 17, 2016

So infs are not allowed in Stata at all?

@jreback jreback added this to the 0.19.2 milestone Nov 17, 2016
@bashtage
Contributor Author

+inf is a missing value and appears in Stata the same as NaN (denoted with a .). Basically, Stata always uses the largest representable numbers as missing values, and everything above the upper cutoff for a double is a missing value. I think users who wish to express a missing value in Stata should be forced to use NaN, which exports fine.

-inf is allowed
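
The rule described above (NaN exports as missing, -inf is allowed, +inf and anything past the upper cutoff should raise) could be sketched roughly as follows. This is a standalone illustration, not the code merged in this PR. `check_column` is a hypothetical name, the error messages are made up, and treating the constant from the earlier diff excerpt as the first out-of-range value is an assumption.

```python
import math
import struct

# ASSUMPTION: the constant from the diff excerpt is the first value Stata
# would read back as missing, so anything at or above it is rejected.
UPPER_CUTOFF = struct.unpack('<d', b'\x00\x00\x00\x00\x00\x00\xe0\x7f')[0]


def check_column(name, values):
    """Hypothetical check: raise ValueError if a column cannot be written."""
    # NaN is the supported way to express a missing value, so skip it.
    present = [v for v in values if not math.isnan(v)]
    if not present:
        return
    largest = max(present)
    if math.isinf(largest):
        # Only +inf can be the maximum here; -inf never trips this branch.
        raise ValueError(f"Column {name} has a maximum value of infinity")
    if largest >= UPPER_CUTOFF:
        raise ValueError(f"Column {name} has a maximum value ({largest}) "
                         f"outside the supported range")


check_column('ColumnOk', [0.0, float('nan'), float('-inf')])  # passes silently
```

Only the maximum is inspected, so -inf and NaN pass through untouched, matching the behavior described in the comment above.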

@jreback
Contributor

jreback commented Nov 17, 2016

ok, that's fine then.

@jreback jreback closed this in fe555db Nov 17, 2016
@jreback
Contributor

jreback commented Nov 17, 2016

thanks!

jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Dec 14, 2016
Add explicit error checking for out-of-range doubles when writing Stata files
Upcasts float32 to float64 if out-of-range values encountered
Tests for infinite values and raises if found

closes pandas-dev#14618
closes pandas-dev#14637

(cherry picked from commit fe555db)
@bashtage bashtage deleted the stata-max-value-check branch January 24, 2017 21:30
Labels
Bug IO Stata read_stata, to_stata
Development

Successfully merging this pull request may close these issues.

to_stata + read_stata results in NaNs (close to double precision limit)
4 participants