BUG: is StataReader supposed to assign the index? #3641

jreback · 2013-05-18T00:51:48Z

This is the example in io.rst for Stata (in current master)
coming from PR #3270 and issue #1512

In [5]: from pandas.io.stata import StataWriter

In [6]: df = DataFrame(randn(10,2),columns=list('AB'))

In [7]: writer = StataWriter('stata.dta',df)

In [8]: writer.write_file()

I am not sure if you have enough information saved to know that this needs a
df.set_index('index')
?

In [10]: from pandas.io.stata import StataReader

In [11]: reader = StataReader('stata.dta')

In [12]: reader.data()
Out[12]: 
   index         A         B
0      0  0.818436 -0.616332
1      1 -0.673509  2.209445
2      2 -0.074915  0.444409
3      3  0.984456 -1.397691
4      4 -0.402488  0.884691
5      5  0.407234  0.499808
6      6 -0.041578  0.724288
7      7 -0.110134  0.707406
8      8  0.986992  1.281154
9      9 -1.491163 -0.686034

jreback · 2013-05-18T00:52:22Z

@jseabold this is more your bailiwick

jseabold · 2013-05-18T01:05:58Z

Doesn't to_csv and read_csv etc. roundtrip like this by default too? I don't much like it. I always have to do to_csv(..., index=False).

jreback · 2013-05-18T01:16:37Z

I guess maybe an option is needed as well (like index_col) or something
unless the dta files have the capability of storing 'extra' info (e.g. which column is the index)?

In [1]: df = DataFrame(randn(10,2),columns=list('AB'))

In [2]: df.to_csv('test.csv')

In [3]: !cat 'test.csv'
,A,B
0,1.8463757795202977,0.02226833755852145
1,-0.6325377289799654,-0.5583895530222927
2,-0.6994280301891974,0.07043313224696199
3,2.274301594583385,-0.48431060255714375
4,0.7158408190886268,0.9393355471247178
5,-2.4732861513401976,1.2004168943281102
6,-0.9549642212403157,0.3892774288330209
7,-0.11883101280710488,-0.3369373441794749
8,-0.1821050662028544,-2.183465690070748
9,2.5845633802048282,0.21888726534636566

In [4]: pd.read_csv('test.csv',index_col=0)
Out[4]: 
          A         B
0  1.846376  0.022268
1 -0.632538 -0.558390
2 -0.699428  0.070433
3  2.274302 -0.484311
4  0.715841  0.939336
5 -2.473286  1.200417
6 -0.954964  0.389277
7 -0.118831 -0.336937
8 -0.182105 -2.183466
9  2.584563  0.218887
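
For comparison, the to_csv(..., index=False) route mentioned earlier in the thread can be sketched as follows. This is only an illustration under assumptions: it uses io.StringIO instead of a file on disk, and the variable names are made up for the example.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))

# Default: the index is written as an unnamed first column.
with_index = df.to_csv()
# index=False: only the data columns A and B are written.
without_index = df.to_csv(index=False)

# Reading back without index_col turns the saved index into a data column.
roundtrip_default = pd.read_csv(io.StringIO(with_index))
roundtrip_no_index = pd.read_csv(io.StringIO(without_index))

print(list(roundtrip_default.columns))
print(list(roundtrip_no_index.columns))
```

Either index_col=0 on the reader or index=False on the writer avoids the extra column; the two just put the decision at different ends of the round trip.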

jreback · 2013-05-19T16:18:26Z

alright, will move this to 0.12, in case we want to add the enhancement (index_col and/or index=False)

hmgaudecker · 2013-05-22T14:16:15Z

IIUC, Stata data files should have a 'sorted by' field somewhere, but I don't think it is used anywhere at present inside the StataReader/StataWriter machinery. Since sorting seems to be used infrequently in the Stata community and does not map directly into the index concept of Pandas, I would suggest not to fiddle with that attribute.

Adding both of the suggested enhancements at the Python level would be a very good thing, though. Columns are always named in Stata, so the array of cases to think through would be much smaller than in read_csv.

jreback · 2013-05-22T14:27:25Z

@hmgaudecker this doesn't have anything to do with Stata per se; it is more about the interface to pandas, in that the index comes out as a column. I am not sure if there is a way to record that the index should be set after recreation in pandas. (It is easy enough to do set_index('index'), but round-tripping is currently impossible.)
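
A minimal sketch of the manual set_index('index') workaround, assuming the frame that comes back from the reader looks like the output at the top of this issue. The read_back frame here is a hand-built stand-in, not an actual StataReader result.

```python
import numpy as np
import pandas as pd

# Stand-in for the frame returned by the reader: the original index
# has come back as an ordinary column named 'index'.
read_back = pd.DataFrame(
    {
        "index": np.arange(10),
        "A": np.random.randn(10),
        "B": np.random.randn(10),
    }
)

# Promote the column back to the index by hand.
restored = read_back.set_index("index")
# The frame that was written had an unnamed index, so drop the name again.
restored.index.name = None

print(restored.head())
```

The user has to know that the column called 'index' was the index; nothing in the file itself says so.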

hmgaudecker · 2013-05-22T14:34:14Z

That was my point, maybe a bit too short: If Stata had the same concept of an index as Pandas, the natural way would be to use it in both directions (I thought that was what you meant by whether there was a way to record this). But it doesn't. The closest thing it has would be the 'sorted by' thing, but for datasets saved by the average Stata user, one wouldn't want to infer (by default) that this is the index.

jreback · 2013-05-22T14:51:06Z

@hmgaudecker fair enough. To be consistent I actually think that the param index_col should be added; it is up to the user to specify whether the file was saved with an index (though I don't think there is an equivalent as_index=False option on the writing side), so user beware for now. Maybe users of this can see what they need/want.

hmgaudecker · 2013-05-22T15:34:41Z

I think the concept "saved with an index" just doesn't apply to Stata files. For the round trip, sure - but that is probably more of a use case for tests (important enough, but maybe not for the defaults) than anything in real life.

I see two typical cases.

1. The Stata file does not have anything like an index, or I couldn't care less about it. Say I just want to run OLS regressions and the order does not matter. Reading the file into Pandas, the index should be automatically set to [0, 1, 2, ...]. Writing it to Stata again, I really do not want to have Pandas' index saved along with my data. So I'd like to pass an index=False attribute to the writer, like for to_csv.
2. The Stata dataset has some variables that clearly are best seen as the index in Pandas (person identifiers, dates, whatever). Ideally, I would like to tell the reader about this immediately rather than setting it ex post, which gives the use case for the index_col attribute to the reader. However, these columns I definitely do not want to lose when writing to Stata again.

So I think the current "defaults" (i.e. writer with index=True and reader with index_col=None) are right, but it would be very useful to add the two options.
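
The two cases can be sketched with the options that later pandas releases expose, namely write_index on DataFrame.to_stata and index_col on pd.read_stata. These are not the StataWriter/StataReader signatures discussed in this thread, and the column names and temporary file path below are made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"person_id": np.arange(5), "wage": np.random.randn(5)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dta")

    # Case 1: the index carries no information, so do not write it.
    df.to_stata(path, write_index=False)
    plain = pd.read_stata(path)  # gets a fresh 0..n-1 index

    # Case 2: an existing Stata variable is promoted to the index on read.
    indexed = pd.read_stata(path, index_col="person_id")

print(list(plain.columns))
print(indexed.index.name, list(indexed.columns))
```

With the writer default left at writing the index and the reader default at index_col=None, both options stay opt-in, which matches the defaults argued for above.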

Leaves the question of what to do upon writing if the Index (or its constituents in case of a MultiIndex) does not have a name (required for columns in Stata). I would vote for throwing an error, better be explicit than implicit.

jreback · 2013-05-22T15:45:31Z

Your options look right. If there is no name, I believe it's just set as 'index' now.

hmgaudecker · 2013-05-22T15:50:26Z

The above looks like it. But that behaviour could be confusing to the casual Pandas and regular Stata user. And probably throw a strange error if a column named index happens to exist...
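
One way to be explicit about both problems (an unnamed index, and a name that clashes with an existing column) is a check before writing. The sketch below is a hypothetical helper, not something pandas provides, and the name check_index_names is invented for the example.

```python
import pandas as pd


def check_index_names(df):
    """Hypothetical pre-write check (not part of pandas).

    Returns the index level names, or raises ValueError if a level has
    no name or if a name collides with an existing column.
    """
    names = list(df.index.names)
    if any(name is None for name in names):
        raise ValueError(
            "index level without a name cannot become a Stata variable"
        )
    clashes = [name for name in names if name in df.columns]
    if clashes:
        raise ValueError(
            f"index name(s) clash with existing columns: {clashes}"
        )
    return names


named = pd.DataFrame(
    {"wage": [1.0, 2.0]}, index=pd.Index([10, 11], name="person_id")
)
print(check_index_names(named))
```

Raising here, rather than silently falling back to 'index', keeps the failure close to its cause.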

jreback · 2013-09-22T01:06:04Z

Closed: this needs to be user-defined, as Stata is not a complete serialization format.

jreback mentioned this issue May 22, 2013

ENH: Fix for #1512, added StataReader and StataWriter to pandas.io.parsers #3270

Merged

jreback closed this as completed Sep 22, 2013
