-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
to_stata uint16 #7397
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
to_stata uint16 #7397
Conversation
you need to add some tests in io/tests/test_stata.py you create a frame with several dtypes then round trip and compare with the original these tests should fail (with uint16 dtypes) before your fix and pass after |
@@ -990,11 +990,11 @@ def _dtype_to_stata_type(dtype): | |||
return chr(255) | |||
elif dtype == np.float32: | |||
return chr(254) | |||
elif dtype == np.int32: | |||
elif dtype in (np.int32, np.uint32): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not safe and can lead to data loss. Only supported Stata data types should be used here, and Stata does not support unsigned types. The correct method is to first perform all casting and then use only the case, Stata-safe data types when writing the data type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change should not be necessary since this function is called after
_cast_to_stata_types
which cannot return any uint
types.
I'm not sure this is a good idea since Stata doesn't support any unsigned data types. In particular some of the code which uses the numpy datatypes in this commit will result in data loss. The only place where uint* handling should be added is in the casting code. Once the columns have been cast to supported Stata datatypes then only the supported Stata datatypes should be used when writing the Stata data types (e.g. |
@@ -230,13 +230,13 @@ def _cast_to_stata_types(data): | |||
ws = '' | |||
for col in data: | |||
dtype = data[col].dtype | |||
if dtype == np.int8: | |||
if dtype in (np.int8, np.uint8): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the correct behavior for this function. This function ensures that all datatypes after it is run have a trivial mapping to Stata data types. It would be simplest to simply upcast uint
s to the next largest int
which is always safe and then the other changes in the commit are not needed.
Something simple like
if dtype==np.uint8:
data[col] = data[col].astype(np.int16)
elif dtype==np.uint16:
data[col] = data[col].astype(np.int32)
elif dtype in (np.uint32, np.uint64):
# either convert to int32 if max is small enough or float64, warning
Also noticed a clear bug in (my) the code https://github.com/dmsul/pandas/blob/tostata-uint16/pandas/io/stata.py#L244 should be
The first comparison has |
@dmsul ok, then this needs to be changed to coerce |
Simple changes to io/stata.py to write unsigned integers to Stata files. Sorry about the poor commit messages--I'm still new to git and wasn't able to correct it cleanly.
(See issue #7365)