Support writing unicode characters in df.to_stata() #23573
Comments
I think I can add to this issue with the following ... The current scheme seems to be that pd.to_stata knows it does not support writing unicode, so if it finds unicode it'll helpfully throw an error. However, I believe I've found some characters that sneak through without raising that error.
Code Sample, a copy-pastable example if possible
import pandas as pd
# Make demonstration data. This data contains characters that should
# cause Pandas to throw an error when using df.to_stata().
bad_txt_sneaking_through = ''' Multiline text that sneaks by
Here is one __�__
Another one __·__ Another one __½__
Bad bad bad __Á__ Bad bad bad __¦__
Still more __é__ Still more __§__ Still more __®__ '''
data_list = []
data_list.append(['First Record', bad_txt_sneaking_through])
data_list.append(['Second Record', 'This one will be fine'])
# Make DataFrame from demonstration data.
df = pd.DataFrame(data_list, columns=['RecNum', 'Txt'])
# Write data frame to Stata data file. Shouldn't write but does.
# This file will not open in Stata.
df.to_stata('Demo_ShouldNotWork.dta', version=117, convert_strl=['Txt'])
# Write first record which has the offending characters.
# This file will not open in Stata.
df[0:1].to_stata('Demo_DoesNotWork.dta', version=117, convert_strl=['Txt'])
# Write second record which has no offending characters.
# This file will open in Stata.
df[1:2].to_stata('Demo_DoesWork.dta', version=117, convert_strl=['Txt'])
# Define function that tests the diagnosis (bad character count)
def make_it_work(bad_text):
    # Keep only characters that encode to a single byte in UTF-8 (i.e. ASCII).
    ret_txt = ''
    for item in bad_text:
        ret_txt += item if len(item.encode(encoding='utf_8')) == 1 else ''
    return ret_txt
df['Txt'] = df['Txt'].apply(make_it_work)
# Write data frame to Stata data file. This time it should write and does.
# This file will open in Stata.
df.to_stata('Demo_ShouldWork.dta', version=117, convert_strl=['Txt'])
Problem description
When writing Stata data files, Pandas usually (and helpfully) throws an error if there are non-Latin-1 characters in a StrL data field. However, when I was working with a large dataset I scraped from the web, I managed to write a data file without getting an error from Pandas. All was going well, but Stata was unable to read the file. With some assistance from Stata technical support, I believe the correct diagnosis was an issue with undercounting the total number of characters in the StrL. Stata technical support indicated to me that Unicode characters will throw off the count. At first I thought this shouldn't be a problem because I thought ... In troubleshooting and documenting the issue I believe the function ...
Expected Output
Thus I would vote in favor of future developments finding a method to throw an error for such characters, which for now seem to be sneaking through. Here is code and output that produces what would be more helpful. Alternatively, an enhancement as @kylebarron suggested that would accommodate Unicode would also be an appropriate solution.
bad_txt_not_sneaking_through = '''Bad text that does not sneak through...
Here you go ► '''
data_list = []
data_list.append(['First Record', bad_txt_not_sneaking_through])
data_list.append(['Second Record', 'This one will be fine'])
df = pd.DataFrame(data_list, columns=['RecNum', 'Txt'])
df.to_stata('Demo_ShouldNotWork.dta', version=117, convert_strl=['Txt'])
Output (abridged):
Output of
|
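The undercounting diagnosis mentioned in the problem description can be illustrated outside of pandas. The following is only a sketch of the general mismatch between character counts and byte counts, not a reproduction of what the pandas writer does; the sample string and variable names are made up for illustration. In Latin-1 every character occupies one byte, so the two counts agree, while in UTF-8 they diverge as soon as a non-ASCII character appears.

```python
# Hypothetical sample text with two non-ASCII characters that exist in Latin-1.
sample = "Caf\u00e9 \u00bd"  # 'Café ½'

char_count = len(sample)                       # number of code points
latin1_bytes = len(sample.encode("latin-1"))   # one byte per character
utf8_bytes = len(sample.encode("utf-8"))       # two bytes for each non-ASCII character here

print(char_count, latin1_bytes, utf8_bytes)
```

A length field filled with the character count would therefore be too small if the payload were written as UTF-8, which is one way a reader could end up misaligned.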
All of those characters are included in the Latin-1 encoding, and seem to work fine in Stata. If you want to test it yourself, run this:
set obs 10
gen x = ""
replace x = "Here is one __�__" in 1
replace x = "Another one __·__" in 2
replace x = "Another one __½__" in 3
replace x = "Bad bad bad __Á__" in 4
replace x = "Bad bad bad __¦__" in 5
saveold test.dta, version(13)
use test.dta, clear
So those characters should work with version 117, and if they don't it's a bug. |
@adamrossnelson You are actually experiencing a different bug. The files that don't work with Stata fail because they are only partially written (at least in master). A patch is needed either to clean up writes that fail (OK and easy to implement, but can be difficult to get right), or to possibly reorder the steps so that all of the data checks happen before the file is created (a better solution, but it may need a lot of work). |
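The second option, running every data check before the file is created, could look roughly like the sketch below. This is not the pandas implementation: `find_unencodable` and `write_checked` are hypothetical helpers, and the one-string-per-line output is a stand-in for the real dta layout. The only point is the ordering, validation first and file creation second.

```python
import os
import tempfile


def find_unencodable(values, encoding="latin-1"):
    """Return (row index, character) for the first unencodable character in each string."""
    problems = []
    for i, text in enumerate(values):
        for ch in text:
            try:
                ch.encode(encoding)
            except UnicodeEncodeError:
                problems.append((i, ch))
                break
    return problems


def write_checked(path, values, encoding="latin-1"):
    """Hypothetical writer: validate all values, then create the file."""
    problems = find_unencodable(values, encoding)
    if problems:
        # Raise before open() is called, so no partial file is left behind.
        rows = [i for i, _ in problems]
        raise ValueError(f"rows {rows} cannot be encoded as {encoding}")
    with open(path, "wb") as fh:
        for text in values:
            fh.write(text.encode(encoding) + b"\n")


# Usage: the bad row is rejected and the target file is never created.
target = os.path.join(tempfile.mkdtemp(), "demo.bin")
try:
    write_checked(target, ["fine", "bad \u25ba"])
except ValueError:
    print("rejected before the file was created")
print(os.path.exists(target))
```

The alternative mentioned above, cleaning up after a failed write, would instead wrap the write in `try`/`except` and remove the partial file in the handler.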
@bashtage ... would you suggest starting a new issue? I'd be happy to do that to help track. |
Not worth it. I have the PR ready. |
@bashtage I might be mistaken, but I felt that the bug was that those characters should be able to be written with the current Pandas writer. The |
� copies and pastes as U+FFFD which is not supported in Latin-1. |
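Whether a given character belongs to Latin-1 can be checked directly in Python. The sketch below assumes the characters in the thread are the code points listed in the comment (the page rendering may have altered them) and uses a hypothetical helper, `in_latin1`.

```python
def in_latin1(ch):
    """Return True if the character can be encoded as Latin-1."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# Assumed code points: U+FFFD, then the other sample characters
# (middle dot, one half, A acute, broken bar, e acute, section sign,
# registered sign) and the black right-pointing pointer U+25BA.
chars = ["\ufffd", "\u00b7", "\u00bd", "\u00c1", "\u00a6",
         "\u00e9", "\u00a7", "\u00ae", "\u25ba"]

report = {f"U+{ord(ch):04X}": in_latin1(ch) for ch in chars}
for code_point, ok in report.items():
    print(code_point, ok)
```

Under those assumptions only U+FFFD and U+25BA fall outside Latin-1; the remaining characters all lie in the U+00A0 to U+00FF range that Latin-1 covers.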
You can run this to sanitize it:
It seems that this file is not loadable, which suggests there is a bug in multiline strl encoding. I suppose the first step is to determine whether strl supports multiline strings in version 117. |
I don't believe the original issue of this thread (unicode Stata file writing support) is actually resolved. @adamrossnelson posted an example which showed both 1) the need for unicode write support and 2) a bug with the current code. #24337 fixed the bug with the current code; unicode write support is a larger endeavor and is still uncompleted. |
Yeah, it should be reopened. It closed an issue in this issue, rather than the issue. |
@bashtage is this actionable? |
Yes. Someone could write a format 118 or 119 writer that supports unicode. |
The spec is available. It is non-trivial, since strings need to be UTF-8 encoded, but we use NumPy arrays internally, which store strings as UTF-32 blobs. |
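NumPy's fixed-width unicode dtype stores each code point in four bytes, which is the UTF-32 layout referred to above. As a rough illustration of the size difference and of the re-encoding step a 118/119 writer would need, the sketch below uses Python's own `utf-32-le` codec as a stand-in for NumPy's internal buffer so that it runs without NumPy; the sample text is arbitrary.

```python
text = "\u65e5\u672c"  # two arbitrary CJK characters

utf32_blob = text.encode("utf-32-le")  # fixed width: 4 bytes per code point
utf8_blob = text.encode("utf-8")       # variable width: 3 bytes each for these characters

print(len(text), len(utf32_blob), len(utf8_blob))

# Converting between the two layouts means decoding and re-encoding each string.
reencoded = utf32_blob.decode("utf-32-le").encode("utf-8")
print(reencoded == utf8_blob)
```

Because UTF-8 is variable width, byte lengths cannot be derived from the array's item size and would have to be computed per string after encoding.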
Code Sample, a copy-pastable example if possible
I picked an arbitrary CJK character to test this with.
Problem description
It would be possible to write Unicode strings to a Stata file by implementing a writer according to version 118 of the dta format. I'd be interested in trying to submit a PR for this. (Edit: I don't use Stata anymore)
Expected Output
Stata file written to disk.
Output of pd.show_versions()