BUG: Fixed incorrect type in integer conversion in to_stata #6335
Conversation
Types used for integer conversion were always half the size they should be. This produced a bug when exporting data tables with long integer data (np.int64).
Added a test for incorrect integer conversion from int16, int32 and int64.
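The failure mode described here is easy to demonstrate in isolation (a standalone illustration, not the pandas code itself): a 64-bit integer that does not fit in the half-sized type wraps around silently when narrowed.

```python
import numpy as np

# A value that needs 64 bits wraps silently when cast to a 32-bit
# integer -- the same class of corruption the PR description reports.
value = np.int64(2**32 + 1)
narrowed = value.astype(np.int32)

print(value)     # 4294967297
print(narrowed)  # 1 -- the high bits are gone
```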
typo in your test...pls test locally!
This should probably be rejected. I had a look at http://www.stata.com/help.cgi?dta#variabletypes and there does not appear to be any int64 type. What's the correct thing?
I think the former is probably the correct solution.
hmm...int64 is kind of a fundamental type in pandas/numpy land. The issue only really arises if the value cannot fit in an int32, right?
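The downcast-if-it-fits idea can be sketched as follows. This is a rough illustration, not the merged fix: the helper name is hypothetical, and it uses NumPy's two's-complement bounds even though Stata's valid ranges are slightly narrower because Stata reserves the top codes for missing values.

```python
import numpy as np

def smallest_stata_int(col):
    """Hypothetical helper: pick the narrowest integer dtype that can
    hold every value in `col`, falling back to float64 (Stata double)
    for int64 data that cannot fit, since Stata has no 8-byte integer."""
    lo, hi = col.min(), col.max()
    for candidate in (np.int8, np.int16, np.int32):
        info = np.iinfo(candidate)
        if info.min <= lo and hi <= info.max:
            return candidate
    return np.float64  # no int64 in the dta format

print(smallest_stata_int(np.array([1, 200, 3000])))            # int16
print(smallest_stata_int(np.array([2**40], dtype=np.int64)))   # float64
```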
can u look at the current test failures?
I have been working backward on this and I'm still not really sure what a correct fix looks like. As you say, this is a real problem for a DataFrame since many columns are naturally int64 (e.g. a plain index). I tried removing all references to int64 and all that did was more-or-less break everything.

I've found another issue in the writer when it iterates across rows of the DataFrame, which produces a Series. This is a problem with mixed numeric data since the data is first transformed using the largest data type, including coercing integer types to float if the row contains a float, before being written. I can't imagine that the loss of fidelity this can induce would be desirable. This shows what is effectively happening inside the writer:

```python
from pandas import Series, DataFrame
import numpy as np

original = np.int64(2**63 - 2**32)
s1 = Series(original, dtype=np.int64)
s2 = Series(0.1, dtype=np.float32)
df = DataFrame({'long': s1, 'single': s2})

# No problem in the DataFrame
print df['long'] - original

# Coercing int64 to float32 creates a problem
print df.iloc[0]['long'] - original
```
The conversion is silently happening at https://github.com/bashtage/pandas/blob/stata-export-datatype/pandas/io/stata.py#L1144 since the rows of the DataFrame are being iterated over.
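The precision loss from this coercion can be shown directly (a standalone illustration, independent of the pandas internals): float32 has a 24-bit significand, so any integer above 2**24 may be rounded when an int64 value passes through float32.

```python
import numpy as np

# 2**25 + 1 needs 26 significant bits, one more than float32 can hold,
# so the round-trip through float32 silently drops the low bit.
original = np.int64(2**25 + 1)
as_float32 = np.float32(original)

print(int(original))    # 33554433
print(int(as_float32))  # 33554432 -- the low bit was lost
```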
@jreback I have now tested it on Stata and the dta produced with pandas 0.13.1 is not a correct Stata file. This simple code outputs and inputs fine in pandas, but the variables have the wrong values in Stata:

```python
s1 = pd.Series(1, dtype=np.int8)
s2 = pd.Series(2**8 + 1, dtype=np.int16)
s3 = pd.Series(2**16 + 1, dtype=np.int32)
s4 = pd.Series(2**32 + 1, dtype=np.int64)
df = pd.DataFrame({'s1': s1, 's2': s2, 's3': s3, 's4': s4})
df.to_stata('test.dta')
```

Running the command `list` returns
I ran a couple of other experiments with other dtypes (e.g. float64) and they also do not load the correct values into Stata 12.1.
ok, feel free to make a new version that works with out-of-size values. I was asking whether the more normal case works, e.g. regular sized values.
Those are all normal-sized values. s1 is a byte with the value 1; s2 an int16 with the value 257. Stata sees them with the correct type but the wrong values. Even the trivial dta

```python
df = pd.DataFrame(np.array([[1.0]]), columns=['s1'])
df.to_stata('test.dta')
print df.dtypes
```

does not contain the correct values when loaded into Stata.
so the test data files are wrong? are they not valid stata files? since these can be roundtripped you are saying that the read is compensating somehow? (for the writing issues)
is this a version issue? I recall very specific versions of formats that this supports (or doesn't support). @jseabold are you seeing the same thing here?
That seems to be the case. I need to do some more systematic investigation which may take a little while. It appears that in some cases the reader and writer are internally compatible, but not always compatible with what Stata wants.
ok compat is obviously important
I can have a look if needed. I didn't write the writer, so I'm less familiar with it. I can also point you to (or provide you with) the technical documentation if needed. It comes with Stata since v. 11.x or so. You are correct though that Stata doesn't have an int64. Only:

```
(type, min, max, bytes)
byte, -127, 100, 1
```
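The full set of Stata integer ranges (from the published dta format documentation) can be sketched as a lookup table; the helper function below is hypothetical, not code from this PR. The upper bounds sit below the two's-complement maximum because Stata reserves the top codes of each type for missing values.

```python
# (min, max, bytes) for each Stata integer type, per the dta spec.
STATA_INT_RANGES = {
    'byte': (-127, 100, 1),
    'int': (-32767, 32740, 2),
    'long': (-2147483647, 2147483620, 4),
}

def stata_int_type(value):
    """Hypothetical helper: smallest Stata integer type holding `value`.
    Large integers must become Stata doubles since there is no int64."""
    for name, (lo, hi, _nbytes) in STATA_INT_RANGES.items():
        if lo <= value <= hi:
            return name
    return 'double'

print(stata_int_type(1))       # byte
print(stata_int_type(2**16))   # long
print(stata_int_type(2**40))   # double
```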
Adding my comment from #6327 here. Ripping out the int64 stuff is the way to go (except for handling it appropriately on the writer side of things). The reader shouldn't know about it, and it shouldn't be in dtype_maps. Those aren't suggestions, but determined from the stata binary format. Cf. the statsmodels implementation without int64. https://github.com/statsmodels/statsmodels/blob/master/statsmodels/iolib/foreign.py#L289
FIX: Remove unnecessary, potentially precision-degrading cast to Series when writing data
ENH: Added function to cast columns from NumPy data types to Stata data types
FIX: Corrected tests for correct Stata datatypes
The extreme values of float and double (Stata; pandas equivalents: float32 and float64) were not correct. This resulted in incorrect truncation. The handling of missing values has been improved and code to convert missing values in any format has been added. The improvement differentiates between valid ranges for data and missing values. Additional issues were found when handling missing dates, where missing dates (NaT) were converted to non-missing dates when written. A test has been added for extreme numeric values as well as missing values.
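The missing-value conversion described above can be sketched like this. It is an assumption-laden illustration, not the merged pandas code: the helper name is invented, and the sentinel codes (one above each type's largest valid value, per the dta format notes) are 101 for byte, 32741 for int, and 2147483621 for long.

```python
import numpy as np

# Reserved code for the system missing value '.' in each Stata
# integer type: one above the largest valid data value.
MISSING_CODE = {'byte': 101, 'int': 32741, 'long': 2147483621}

def encode_missing(values, stata_type):
    """Hypothetical helper: replace NaN with the Stata missing code
    before the column is written out as integers."""
    out = np.asarray(values, dtype=np.float64).copy()
    out[np.isnan(out)] = MISSING_CODE[stata_type]
    return out.astype(np.int64)

print(encode_missing([1.0, np.nan, 3.0], 'byte'))  # [  1 101   3]
```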
Added test for 114 files
Renamed Stata data files to include file format
Final PEP8 fixes
if u can squash this down a bit would be great
I think it would benefit from some more squashing, so please do. Kevin
There is still at least one issue that hasn't been fixed, and I don't think is practical to fix now. Most Stata dates do not contain times, so using datetime64[ns] really limits the range. One of the tests has the year 2 (as in 2 AD) in it, but it cannot be converted to a datetime because of the limited range of ns. Once this is resolved, it would make more sense to use [D], [M] or [Y], corresponding to the Stata datetime formats (%td, %tm, %ty), which would provide a more accurate range. Not sure if a more general datetime64 is planned though.
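The ns-resolution limits mentioned here are easy to check, and a Stata %td value is just a day count from the Stata epoch 1960-01-01. A small sketch (standalone, not the PR's conversion code):

```python
import pandas as pd

# datetime64[ns] can only span roughly 1677-2262, so year 2 AD is
# unrepresentable as a Timestamp.
print(pd.Timestamp.min)  # year 1677
print(pd.Timestamp.max)  # year 2262

# Stata %td stores the number of days since 1960-01-01.
epoch = pd.Timestamp('1960-01-01')
td_value = (pd.Timestamp('1970-01-01') - epoch).days
print(td_value)  # 3653
```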
http://pandas-docs.github.io/pandas-docs-travis/gotchas.html#timestamp-limitations datetimes were settled long ago |
pls add release notes and a v0.14.0 entry in the API section referencing the issue that this closes
I rebased and squashed, see here: jreback@42e3e17. Just give it a once-over. I am not sure how you got in that state. You should investigate using
Would it be best if I just merged it down to a single commit, with a nice commit message? My git strategy is, at best, not very good.
Already did, that's the commit I posted. It was a bit non-trivial to do this actually, but done now.
FWIW, the pandas strategy of hide all commit messages isn't preferred everywhere ;). Sometimes a nice, informative commit message about why a particular line changed can save much headbanging against the desk later... It's admittedly a delicate balance and mostly personal/team preference though.
merged via 689d765. FYI most of your commit messages are there. It's actually pretty informative. Thanks for the effort!
Yes, many thanks for catching this and most importantly, fixing it.
Re: squashed commits. The commit messages are there in the full merge commit, but not from git log or git blame IIUC.
@jseabold right, I usually put logical changes in a small number of commits: rebase, squash, merge. I find it makes pieces independent and pretty easy to look at.
Simple fix for bug #6327. All conversions were off by a factor of 2.
Added test for this case.