BUG: Fixed incorrect type in integer conversion in to_stata #6335
Conversation
Types used for integer conversion were always half the size they should be. This produced a bug when exporting data tables with long integer data (np.int64).
Added a test for incorrect integer conversion from int16, int32 and int64.
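The failure mode described here is easy to demonstrate in isolation (a standalone illustration, not the pandas code itself): a 64-bit integer that does not fit in the half-sized type wraps around silently when narrowed.

```python
import numpy as np

# A value that needs 64 bits wraps silently when cast to a 32-bit
# integer -- the same class of corruption the PR description reports.
value = np.int64(2**32 + 1)
narrowed = value.astype(np.int32)

print(value)     # 4294967297
print(narrowed)  # 1 -- the high bits are gone
```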
typo in your test...pls test locally!
This should probably be rejected. I had a look at http://www.stata.com/help.cgi?dta#variabletypes and there does not appear to be any int64 type. What's the correct thing?
I think the former is probably the correct solution.
hmm...int64 is kind of a fundamental type in pandas/numpy land. The issue only really arises if the value cannot fit in an int32, right?
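The downcast-if-it-fits idea can be sketched as follows. This is a rough illustration, not the merged fix: the helper name is hypothetical, and it uses NumPy's two's-complement bounds even though Stata's valid ranges are slightly narrower because Stata reserves the top codes for missing values.

```python
import numpy as np

def smallest_stata_int(col):
    """Hypothetical helper: pick the narrowest integer dtype that can
    hold every value in `col`, falling back to float64 (Stata double)
    for int64 data that cannot fit, since Stata has no 8-byte integer."""
    lo, hi = col.min(), col.max()
    for candidate in (np.int8, np.int16, np.int32):
        info = np.iinfo(candidate)
        if info.min <= lo and hi <= info.max:
            return candidate
    return np.float64  # no int64 in the dta format

print(smallest_stata_int(np.array([1, 200, 3000])))            # int16
print(smallest_stata_int(np.array([2**40], dtype=np.int64)))   # float64
```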
can u look at the current test failures?
I have been working backward on this and I'm still not really sure what a correct fix looks like. As you say, this is a real problem for a DataFrame since many columns are naturally int64 (e.g. a plain index). I tried removing all references to int64 and all that did was more-or-less break everything.

I've found another issue in the writer when it iterates across rows of the DataFrame, which produces a Series. This is a problem with mixed numeric data since the data is first transformed using the largest data type, including coercing integer types to float if the row contains a float, before being written. I can't imagine that the loss of fidelity this can induce would be desirable. This shows what is effectively happening inside the writer:

```python
from pandas import Series, DataFrame
import numpy as np

original = np.int64(2**63 - 2**32)
s1 = Series(original, dtype=np.int64)
s2 = Series(0.1, dtype=np.float32)
df = DataFrame({'long': s1, 'single': s2})

# No problem in the DataFrame
print df['long'] - original

# Coercing int64 to float32 creates a problem
print df.iloc[0]['long'] - original
```
The conversion is silently happening at https://github.com/bashtage/pandas/blob/stata-export-datatype/pandas/io/stata.py#L1144 since the rows of the DataFrame are being iterated over.
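The precision loss from this coercion can be shown directly (a standalone illustration, independent of the pandas internals): float32 has a 24-bit significand, so any integer above 2**24 may be rounded when an int64 value passes through float32.

```python
import numpy as np

# 2**25 + 1 needs 26 significant bits, one more than float32 can hold,
# so the round-trip through float32 silently drops the low bit.
original = np.int64(2**25 + 1)
as_float32 = np.float32(original)

print(int(original))    # 33554433
print(int(as_float32))  # 33554432 -- the low bit was lost
```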
@jreback I have now tested it on Stata and the dta produced with pandas 0.13.1 is not a correct Stata file. This simple code outputs and inputs fine in pandas, but the variables have the wrong values in Stata:

```python
s1 = pd.Series(1, dtype=np.int8)
s2 = pd.Series(2**8 + 1, dtype=np.int16)
s3 = pd.Series(2**16 + 1, dtype=np.int32)
s4 = pd.Series(2**32 + 1, dtype=np.int64)
df = pd.DataFrame({'s1': s1, 's2': s2, 's3': s3, 's4': s4})
df.to_stata('test.dta')
```

Running the command `list` returns
I ran a couple of other experiments with other dtypes (e.g. float64) and they also do not load the correct values into Stata 12.1.
ok, feel free to make a new version that works with out-of-size values. I was asking whether the more normal case works, e.g. regular sized values.
Those are all normal-sized values. s1 is a byte with the value 1; s2 an int16 with the value 257. Stata sees them with the correct type but the wrong values. Even the trivial dta

```python
df = pd.DataFrame(np.array([[1.0]]), columns=['s1'])
df.to_stata('test.dta')
print df.dtypes
```

does not contain the correct values when loaded into Stata.
so the test data files are wrong? are they not valid stata files? since these can be roundtripped you are saying that the read is compensating somehow? (for the writing issues)
is this a version issue? I recall very specific versions of formats that this supports (or doesn't support). @jseabold are you seeing the same thing here?
That seems to be the case. I need to do some more systematic investigation which may take a little while. It appears that in some cases the reader and writer are internally compatible, but not always compatible with what Stata wants.
ok compat is obviously important
I can have a look if needed. I didn't write the writer, so I'm less familiar with it. I can also point you to (or provide you with) the technical documentation if needed. It comes with Stata since v. 11.x or so. You are correct though that Stata doesn't have an int64. Only:

```
(type, min, max, bytes)
byte, -127, 100, 1
```
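The full set of Stata integer ranges (from the published dta format documentation) can be sketched as a lookup table; the helper function below is hypothetical, not code from this PR. The upper bounds sit below the two's-complement maximum because Stata reserves the top codes of each type for missing values.

```python
# (min, max, bytes) for each Stata integer type, per the dta spec.
STATA_INT_RANGES = {
    'byte': (-127, 100, 1),
    'int': (-32767, 32740, 2),
    'long': (-2147483647, 2147483620, 4),
}

def stata_int_type(value):
    """Hypothetical helper: smallest Stata integer type holding `value`.
    Large integers must become Stata doubles since there is no int64."""
    for name, (lo, hi, _nbytes) in STATA_INT_RANGES.items():
        if lo <= value <= hi:
            return name
    return 'double'

print(stata_int_type(1))       # byte
print(stata_int_type(2**16))   # long
print(stata_int_type(2**40))   # double
```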
Adding my comment from #6327 here. Ripping out the int64 stuff is the way to go (except for handling it appropriately on the writer side of things). The reader shouldn't know about it, and it shouldn't be in dtype_maps. Those aren't suggestions, but determined from the stata binary format. Cf. the statsmodels implementation without int64. https://github.com/statsmodels/statsmodels/blob/master/statsmodels/iolib/foreign.py#L289
FIX: Remove unnecessary, potentially precision-degrading cast to Series when writing data
ENH: Added function to cast columns from NumPy data types to Stata data types
FIX: Corrected tests for correct Stata datatypes
The extreme values of float and double (Stata; pandas equivalents: float32 and float64) were not correct. This resulted in incorrect truncation. The handling of missing values has been improved and code to convert missing values in any format has been added. The improvement differentiates between valid ranges for data and missing values. Additional issues were found when handling missing dates, where missing dates (NaT) were converted to non-missing dates when written. A test has been added for extreme numeric values as well as missing values.
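The missing-value conversion described above can be sketched like this. It is an assumption-laden illustration, not the merged pandas code: the helper name is invented, and the sentinel codes (one above each type's largest valid value, per the dta format notes) are 101 for byte, 32741 for int, and 2147483621 for long.

```python
import numpy as np

# Reserved code for the system missing value '.' in each Stata
# integer type: one above the largest valid data value.
MISSING_CODE = {'byte': 101, 'int': 32741, 'long': 2147483621}

def encode_missing(values, stata_type):
    """Hypothetical helper: replace NaN with the Stata missing code
    before the column is written out as integers."""
    out = np.asarray(values, dtype=np.float64).copy()
    out[np.isnan(out)] = MISSING_CODE[stata_type]
    return out.astype(np.int64)

print(encode_missing([1.0, np.nan, 3.0], 'byte'))  # [  1 101   3]
```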
Added test for 114 files
Renamed Stata data files to include file format
Final PEP8 fixes
if u can squash this down a bit would be great
I think it would benefit from some more squashing, so please do. Kevin
There is still at least one issue that hasn't been fixed, and I don't think is practical to fix now. Most Stata dates do not contain times, so using datetime64[ns] really limits the range. One of the tests has the year 2 (as in 2 AD) in it, but it cannot be converted to a datetime because of the limited range of ns. Once this is resolved, it would make more sense to use [D], [M] or [Y], corresponding to the Stata datetime formats (%td, %tm, %ty), which would provide a more accurate range. Not sure if a more general datetime64 is planned though.
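The ns-resolution limits mentioned here are easy to check, and a Stata %td value is just a day count from the Stata epoch 1960-01-01. A small sketch (standalone, not the PR's conversion code):

```python
import pandas as pd

# datetime64[ns] can only span roughly 1677-2262, so year 2 AD is
# unrepresentable as a Timestamp.
print(pd.Timestamp.min)  # year 1677
print(pd.Timestamp.max)  # year 2262

# Stata %td stores the number of days since 1960-01-01.
epoch = pd.Timestamp('1960-01-01')
td_value = (pd.Timestamp('1970-01-01') - epoch).days
print(td_value)  # 3653
```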
http://pandas-docs.github.io/pandas-docs-travis/gotchas.html#timestamp-limitations datetimes were settled long ago |
pls add release notes and a v0.14.0 entry in the API section referencing the issue that this closes
I rebased and squashed, see here: jreback@42e3e17. Just give it a once-over. I am not sure how you got in that state. You should investigate using
Would it be best if I just merged it down to a single commit, with a nice commit message? My git strategy is, at best, not very good.
Already did, that's the commit I posted. It was a bit non-trivial to do this actually, but done now.
FWIW, the pandas strategy of hide all commit messages isn't preferred everywhere ;). Sometimes a nice, informative commit message about why a particular line changed can save much headbanging against the desk later... It's admittedly a delicate balance and mostly personal/team preference though.
merged via 689d765. FYI most of your commit messages are there. It's actually pretty informative. Thanks for the effort!
Yes, many thanks for catching this and most importantly, fixing it.
Re: squashed commits. The commit messages are there in the full merge commit, but not from git log or git blame IIUC.
@jseabold right, I usually put logical changes in a small number of commits: rebase, squash, merge. I find it makes pieces independent and pretty easy to look at.
Simple fix for bug #6327. All conversions were off by a factor of 2.
Added test for this case.