Bug: Export to Stata NaN not converted to "." #6684

Closed
ozak opened this issue Mar 21, 2014 · 9 comments · Fixed by #6685
Labels: Bug, IO Data, Missing-data

@ozak

ozak commented Mar 21, 2014

Hi,

I noticed that when exporting data to Stata, NaN values are not always converted to Stata missing values but are instead left blank. This confuses Stata: neither the destring command nor replace value=. if value==. solves the problem.

As an example, I downloaded the World Development Indicators and used the following commands to export National Savings to a CSV file and a Stata file:

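import pandas as pd
import os

# Load the 'Data' sheet of the World Development Indicators workbook
dfwdi = pd.read_excel('WDI.xlsx', 'Data')
dfwdi.columns
# Keep only the rows for the savings indicator
dfout = dfwdi.ix[dfwdi['Indicator Code'] == 'NY.GDS.TOTL.ZS']
dfout
# Build savyr<year> names for the year columns
cols = ['savyr' + str(i) for i in xrange(1960, dfwdi.columns.values[-1] + 1)]
dfout.reset_index(inplace=True, drop=True)
# Write the same frame to CSV and to Stata, without the index
dfout.to_csv('sav.csv', index=False)
dfout.to_stata('sav.dta', write_index=False)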

If you import the data into Stata (I am using v.13) and run the following commands, things fail.

use "sav.dta", clear

* Correct number of missing values
summ savyr2000
reg savyr2010 savyr2000

* Correct countries identified as missing
tab code if savyr==.

* Replace missing values with "."
* One cannot replace the missing values not presented as "."
replace savyr2010=. if savyr==""
* Use "." to identify
replace savyr2010=. if savyr==.

* Perform analysis again
summ savyr2000
reg savyr2010 savyr2000

* Still fails

As you can see, Stata does not perform the analysis, even though it correctly recognizes the missing values; not all of them are presented as ".". If one imports the CSV version into Stata and runs the same initial commands, it works fine.

import delimited "sav.csv"

* Correct number of missing values
summ savyr2000
reg savyr2010 savyr2000

Furthermore, for some reason the index is still present in the Stata file, even though I used the write_index=False option.

I am using Enthought's Canopy distribution on OSX Mavericks with Pandas '0.13.1'. Haven't tried on other Python dists.
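For anyone who wants to check this without WDI.xlsx, a smaller round trip might look like the sketch below. The column values and the temporary file path are made up for illustration; only to_stata(..., write_index=False) and the savyr-style column names come from the report above. It writes a frame containing NaN, reads it back with pd.read_stata, and keeps the result so the missing values and the column list can be inspected.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for the WDI extract: two year columns with gaps.
df = pd.DataFrame(
    {
        "savyr2000": [21.5, np.nan, 18.0],
        "savyr2010": [np.nan, 25.2, 19.4],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sav.dta")
    # Same call as in the report: no index column should be written.
    df.to_stata(path, write_index=False)
    roundtrip = pd.read_stata(path)

# Where the values come back as missing, and which columns were written.
missing_2000 = roundtrip["savyr2000"].isna().tolist()
written_columns = list(roundtrip.columns)
```

If the export behaves as intended, missing_2000 marks only the second row and written_columns has no extra index column. This only checks what pandas reads back, so opening the file in Stata is still the real test.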

@jreback
Contributor

jreback commented Mar 21, 2014

cc @bashtage

we have tests for this?

@ozak can you try on master? There have been many fixes to Stata reading/writing.

@ozak
Author

ozak commented Mar 21, 2014

@jreback can you explain a little more what you'd like me to do...I am still a newbie in the GitHub problem solving scheme.

@bashtage
Contributor

This will need to be compared to master in pandas (the pre-release of 0.14). I did a lot of work around missing values, and there were (iirc) some issues regarding some data types (e.g. doubles).

@jreback
Contributor

jreback commented Mar 21, 2014

@ozak I was suggesting building with master from the main repo

http://pandas.pydata.org/developers.html#working-with-the-code

then you could help explore where the error is

@bashtage
Contributor

@ozak @jreback There is definitely something wrong in master. The NaNs are appearing as 1.#QNAN. After a bit of looking, it seems that Stata does not support NaNs and expects a missing value rather than a NaN. This should be simple to fix, at least ignoring performance considerations.
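To make the idea concrete, here is a rough stdlib-only sketch of swapping NaN for a missing-value sentinel before writing a double column. It is not necessarily how the patch does it. It assumes the .dta format encodes the system missing value "." for doubles as the bit pattern 0x7fe0000000000000 (that is, 2**1023), and nan_to_stata_missing is a hypothetical helper name, not a pandas function.

```python
import math
import struct

# Assumed encoding of Stata's system missing value "." for a double column:
# little-endian bytes of 0x7fe0000000000000, which equals 2**1023.
STATA_MISSING_DOUBLE = struct.unpack("<d", b"\x00\x00\x00\x00\x00\x00\xe0\x7f")[0]


def nan_to_stata_missing(values):
    """Return a copy of values with every NaN replaced by the sentinel."""
    return [STATA_MISSING_DOUBLE if math.isnan(v) else v for v in values]


# Hypothetical column with one gap.
column = [12.5, float("nan"), 30.1]
encoded = nan_to_stata_missing(column)
```

A real writer would do the replacement vectorized on the whole column rather than in a Python loop, which is where the performance considerations come in.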

jreback added this to the 0.14.0 milestone Mar 21, 2014
@jreback
Contributor

jreback commented Mar 21, 2014

ok will mark as a bug

@bashtage
Contributor

Writing last test for patch now.

@ozak
Author

ozak commented Mar 22, 2014

Wow that was fast! I guess this means this is solved?

@bashtage
Contributor

@ozak Once the referenced patch gets pulled into master, master (and later 0.14) will not have this issue.
