BUG: is StataReader supposed to assign the index? #3641

Closed
jreback opened this issue May 18, 2013 · 12 comments
Labels: API Design, Enhancement, IO Data

@jreback
Contributor

jreback commented May 18, 2013

This is the example in io.rst for Stata (in current master)
coming from PR #3270 and issue #1512

In [5]: from pandas.io.stata import StataWriter

In [6]: df = DataFrame(randn(10,2),columns=list('AB'))

In [7]: writer = StataWriter('stata.dta',df)

In [8]: writer.write_file()

I am not sure if you have enough information saved to know that this needs a df.set_index('index')?

In [10]: from pandas.io.stata import StataReader

In [11]: reader = StataReader('stata.dta')

In [12]: reader.data()
Out[12]: 
   index         A         B
0      0  0.818436 -0.616332
1      1 -0.673509  2.209445
2      2 -0.074915  0.444409
3      3  0.984456 -1.397691
4      4 -0.402488  0.884691
5      5  0.407234  0.499808
6      6 -0.041578  0.724288
7      7 -0.110134  0.707406
8      8  0.986992  1.281154
9      9 -1.491163 -0.686034
@jreback
Contributor Author

jreback commented May 18, 2013

@jseabold this is more your bailiwick

@jseabold
Contributor

Doesn't to_csv and read_csv etc. roundtrip like this by default too? I don't much like it. I always have to do to_csv(..., index=False).

@jreback
Contributor Author

jreback commented May 18, 2013

I guess an option (like index_col) is needed as well, unless the dta files have the capability of storing 'extra' info (e.g. which column is the index)?

In [1]: df = DataFrame(randn(10,2),columns=list('AB'))

In [2]: df.to_csv('test.csv')

In [3]: !cat 'test.csv'
,A,B
0,1.8463757795202977,0.02226833755852145
1,-0.6325377289799654,-0.5583895530222927
2,-0.6994280301891974,0.07043313224696199
3,2.274301594583385,-0.48431060255714375
4,0.7158408190886268,0.9393355471247178
5,-2.4732861513401976,1.2004168943281102
6,-0.9549642212403157,0.3892774288330209
7,-0.11883101280710488,-0.3369373441794749
8,-0.1821050662028544,-2.183465690070748
9,2.5845633802048282,0.21888726534636566

In [4]: pd.read_csv('test.csv',index_col=0)
Out[4]: 
          A         B
0  1.846376  0.022268
1 -0.632538 -0.558390
2 -0.699428  0.070433
3  2.274302 -0.484311
4  0.715841  0.939336
5 -2.473286  1.200417
6 -0.954964  0.389277
7 -0.118831 -0.336937
8 -0.182105 -2.183466
9  2.584563  0.218887

@jreback
Contributor Author

jreback commented May 19, 2013

Alright, will move this to 0.12, in case we want to add the enhancement (index_col and/or index=False).

@hmgaudecker

IIUC, Stata data files should have a 'sorted by' field somewhere, but I don't think it is used anywhere at present inside the StataReader/StataWriter machinery. Since sorting seems to be used infrequently in the Stata community and does not map directly into the index concept of Pandas, I would suggest not to fiddle with that attribute.

Adding both of the suggested enhancements at the Python level would be a very good thing, though. Columns are always named in Stata, so the array of cases to think through would be much smaller than in read_csv.

@jreback
Contributor Author

jreback commented May 22, 2013

@hmgaudecker this doesn't have anything to do with Stata per se; it is more about the interface to pandas. The index comes out as a column, and I am not sure if there is a way to record that the index should be set after recreation in pandas. (It is easy enough to do set_index('index'), but automatic round-tripping is currently impossible.)
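
For reference, a minimal sketch of the manual workaround mentioned above. It does not touch a .dta file: `reset_index()` is used here only as a stand-in (an assumption for illustration) for what the reader currently returns, namely a frame whose old index arrives as a column named 'index'.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))

# Stand-in for the reader's output: the old index shows up as a regular
# column called "index" next to A and B.
as_read = original.reset_index()

# The user has to know that "index" was the index and restore it by hand.
restored = as_read.set_index("index")
restored.index.name = None  # the original index was unnamed
```

Nothing in the frame itself says that 'index' used to be the index, which is why the step has to be supplied by the user.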

@hmgaudecker

That was my point, maybe a bit too short: If Stata had the same concept of an index as Pandas, the natural way would be to use it in both directions (I thought that was what you meant by whether there was a way to record this). But it doesn't. The closest thing it has would be the 'sorted by' thing, but for datasets saved by the average Stata user, one wouldn't want to infer (by default) that this is the index.

@jreback
Contributor Author

jreback commented May 22, 2013

@hmgaudecker fair enough. To be consistent, I actually think that the param index_col should be added; it is up to the user to specify whether the file was saved with an index (though I don't think there is an equivalent as_index=False option on the writing side), so user beware for now. Maybe users of this can see what they need/want.

@hmgaudecker

I think the concept "saved with an index" just doesn't apply to Stata files. For the round trip, sure - but that is probably more of a use case for tests (important enough, but maybe not for the defaults) than anything in real life.

I see two typical cases.

  1. The Stata file does not have anything like an index, or I couldn't care less about it. Say I just want to run OLS regressions and the order does not matter. Reading the file into Pandas, the index should be automatically set to [0, 1, 2, ...]. Writing it to Stata again, I really do not want to have Pandas' index saved along with my data. So I'd like to pass an index=False attribute to the writer, like for to_csv.
  2. The Stata dataset has some variables that clearly are best seen as the index in Pandas (person identifiers, dates, whatever). Ideally, I would like to tell the reader about this immediately rather than setting it ex post, which gives the use case for the index_col attribute to the reader. However, these columns I definitely do not want to lose when writing to Stata again.

So I think the current "defaults" (i.e. writer with index=True and reader with index_col=None) are right, but it would be very useful to add the two options.
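
The two cases can be sketched as follows. This assumes a later pandas release in which the writer option is spelled `write_index` on `DataFrame.to_stata` and the reader option is `index_col` on `pd.read_stata`; neither existed in the StataReader/StataWriter code under discussion here, and the temporary file path is only for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stata.dta")

    # Case 1: the index carries no information, so keep it out of the file.
    df.to_stata(path, write_index=False)
    plain = pd.read_stata(path)

    # Case 2: the index is written as a column (named "index" when the
    # index has no name) and the reader is told which column to use.
    df.to_stata(path, write_index=True)
    indexed = pd.read_stata(path, index_col="index")
```

In the second case the restored index keeps the name 'index', so the frame is close to, but not identical with, the one that was written.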

Leaves the question of what to do upon writing if the Index (or its constituents in case of a MultiIndex) does not have a name (required for columns in Stata). I would vote for throwing an error, better be explicit than implicit.

@jreback
Contributor Author

jreback commented May 22, 2013

Your options look right. If there is no name, I believe it's just set as 'index' now.

@hmgaudecker

The above looks like it. But that behaviour could be confusing to the casual Pandas and regular Stata user. And probably throw a strange error if a column named index happens to exist...
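
A rough sketch of the explicit check suggested above, written as plain Python. `index_column_names` is a hypothetical helper for illustration, not part of pandas: it refuses unnamed index levels and names that collide with existing columns instead of falling back to a generated name such as 'index'.

```python
def index_column_names(index_names, columns):
    """Return the column names to use for the index levels when writing.

    Hypothetical helper, not a pandas function. Raises ValueError if an
    index level has no name or if a name is already used by a column.
    """
    names = list(index_names)
    for position, name in enumerate(names):
        if name is None:
            raise ValueError(f"index level {position} has no name")
    clashes = sorted(set(names) & set(columns))
    if clashes:
        raise ValueError(f"index names clash with columns: {clashes}")
    return names


# A named MultiIndex passes through unchanged.
ok = index_column_names(["person_id", "year"], ["A", "B"])
```

With a check like this, an unnamed index or a pre-existing column called 'index' fails early with a clear message rather than producing a confusing error later in the writer.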

@jreback
Contributor Author

jreback commented Sep 22, 2013

Closed: the index needs to be user-defined, as Stata is not a complete serialization format.

jreback closed this as completed Sep 22, 2013