ERR: validate encoding on to_stata #15723

ozak · 2017-03-17T17:48:10Z

It seems pandas on Python 3.5 has an encoding-related problem. For example, the following generates a corrupt output file:

import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([1,2,3,4]), columns=['var1'])
df1.to_stata('corrupt.dta', write_index=False, encoding='utf8')

while

df1.to_stata('not-corrupt.dta', write_index=False)

generates a correct file. I imagine this may be due to use of encoding and the difference in the treatment between python 2 and python 3, which breaks compatibility of scripts across python versions. I guess it would be nice if it does not take this option into account on python 3, unless the error is caused by something else.

jreback · 2017-03-17T18:00:48Z

In [8]: df1 = pd.DataFrame(np.array([1,2,3,4]), columns=['var1'])
   ...: df1.to_stata('corrupt.dta', write_index=False, encoding='latin1')

In [9]: pd.read_stata('corrupt.dta')
Out[9]: 
   var1
0     1
1     2
2     3
3     4

doc-string

Signature: df1.to_stata(fname, convert_dates=None, write_index=True, encoding='latin-1', byteorder=None, time_stamp=None, data_label=None, variable_labels=None)
Docstring:
A class for writing Stata binary dta files from array-like objects

Parameters
----------
fname : str or buffer
    String path of file-like object
convert_dates : dict
    Dictionary mapping columns containing datetime types to stata
    internal format to use when writing the dates. Options are 'tc',
    'td', 'tm', 'tw', 'th', 'tq', 'ty'. Column can be either an integer
    or a name. Datetime columns that do not have a conversion type
    specified will be converted to 'tc'. Raises NotImplementedError if
    a datetime column has timezone information
write_index : bool
    Write the index to Stata dataset.
encoding : str
    Default is latin-1. Unicode is not supported
byteorder : str
    Can be ">", "<", "little", or "big". default is `sys.byteorder`
time_stamp : datetime
    A datetime to use as file creation date.  Default is the current
    time.
dataset_label : str
    A label for the data set.  Must be 80 characters or smaller.
variable_labels : dict
    Dictionary containing columns as keys and variable labels as
    values. Each label must be 80 characters or smaller.

I would say this is technically correct: passing a unicode encoding is invalid. But I think we should simply reject such encodings rather than actually write an invalid file. Want to do a PR for this? (I am not sure which encodings Stata can actually support; any idea?)

ozak · 2017-03-17T18:09:03Z

I am a bit confused now. From the docs it should fail in both python 2 and 3, but in python 2

df1.to_stata('corrupt.dta', write_index=False, encoding='utf8')

generates the correct file. So, I would think it may be better to just ignore the option in python 3, but keep it in python 2.

I think stata 14 now supports UTF8 as a default, but it actually may be more general, not 100% sure (see here). I'll try to find some time to write a PR. I'll see what the code actually does. I'll let you know.

jreback · 2017-03-17T18:25:05Z

cc @bashtage any ideas here?

bashtage · 2017-03-17T20:38:57Z

There is no UTF8 in Stata. Only ASCII and the simple 8-bit encoding Latin-1.

bashtage · 2017-03-17T20:40:30Z

By "in Stata" I mean in to_stata. Adding UTF8 is a major effort since the current format is all fixed width, and I don't see much of a case for making the effort.
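To illustrate the fixed-width point above, here is a minimal standard-library sketch. The helper `pack_fixed` and the 4-byte field width are assumptions made up for this example; they are not Stata's actual layout or pandas code. It only shows that the same text occupies a different number of bytes under Latin-1 and UTF-8, which matters once every field has a fixed byte width.

```python
# Illustration only: FIELD_WIDTH and pack_fixed are hypothetical and do not
# reflect the real dta layout or the pandas writer.
FIELD_WIDTH = 4


def pack_fixed(text, encoding, width=FIELD_WIDTH):
    """Encode text and pad it with NUL bytes to a fixed byte width."""
    raw = text.encode(encoding)
    if len(raw) > width:
        raise ValueError(
            "%r needs %d bytes in %s, field holds %d"
            % (text, len(raw), encoding, width)
        )
    return raw.ljust(width, b"\x00")


# Latin-1 uses one byte per character, so three characters fit.
print(pack_fixed(u"áÖñ", "latin-1"))

# UTF-8 uses two bytes for each of these characters, so the same text
# no longer fits in the same field.
try:
    pack_fixed(u"áÖñ", "utf-8")
except ValueError as exc:
    print(exc)
```

A writer that sizes fields by character count while emitting UTF-8 bytes would run into this kind of mismatch.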

jreback · 2017-03-17T20:49:21Z

@bashtage so we should then validate encoding='ascii'|'latin1'|None as the only allowed encodings at all.
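A rough sketch of what such a check could look like, using only the standard library. The function name `validate_stata_encoding` is hypothetical and this is not the code that was merged into pandas; it assumes that `None` should fall back to the Latin-1 default and that aliases such as 'latin1' and 'latin-1' should be treated as the same codec.

```python
import codecs

# Hypothetical helper, not the pandas implementation. The allowed set is
# built from canonical codec names so that aliases compare equal.
_ALLOWED = {codecs.lookup(name).name for name in ("ascii", "latin-1")}
_DEFAULT = codecs.lookup("latin-1").name


def validate_stata_encoding(encoding):
    """Return the canonical codec name, or raise ValueError if unsupported."""
    if encoding is None:
        return _DEFAULT
    try:
        canonical = codecs.lookup(encoding).name
    except LookupError:
        raise ValueError("Unknown encoding: %r" % (encoding,))
    if canonical not in _ALLOWED:
        raise ValueError(
            "Only ascii and latin-1 encodings are supported, got %r" % (encoding,)
        )
    return canonical


print(validate_stata_encoding("latin1") == validate_stata_encoding("latin-1"))
try:
    validate_stata_encoding("utf8")
except ValueError as exc:
    print(exc)
```

Raising early like this means an unsupported encoding is rejected before any bytes are written, instead of producing a file that cannot be read back.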

ozak · 2017-03-17T20:51:38Z

So why doesn't it generate a corrupt file in python 2, but does in python 3? I have the same pandas version in both, so it may not be pandas specific?

jreback · 2017-03-17T20:54:51Z

Because it's actually encoding it in PY3 with the passed-in encoding (utf8), rather than the default of latin1.

ozak · 2017-03-17T21:05:06Z

So in PY2 it is not encoded as UTF8 even when giving the option? How is it saved then? What in pandas is affecting the IO to Stata that corrupts the file? Given that Stata14 uses UTF8 as the default it should not have an issue opening UTF8 encoded files.

ozak · 2017-03-17T21:07:26Z

Just noticed one more thing

import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([u'á',u'Ö']), columns=['var1'])
df1.to_stata('corrupt3.dta', write_index=False, encoding='utf8')

df = pd.read_stata('corrupt3.dta')
df == df1

generates a usable file in both PY2 and PY3. Still, the data is wrong as seen in the example.

jreback · 2017-03-17T21:07:45Z

So in PY2 it is not encoded as UTF8 even when giving the option?

Not really sure it is actually encoding as utf8 internally. Lots of things in py2 are wonky; it probably just happens to work.

jreback · 2017-03-17T21:08:27Z

In any event, it seems you have a bunch of test cases! It seems easy enough to simply validate the encoding that is passed and raise if it's not valid.

ozak · 2017-03-17T21:09:09Z

Indeed.

I guess my example shows that PY2 is not encoding at all.

Ensure StataReader and StataWriter have the correct encoding. Standardized default encoding to 'latin-1'

closes pandas-dev#15723

Author: Kevin Sheppard

Closes pandas-dev#15768 from bashtage/limit-stata-encoding and squashes the following commits:

8278be7 [Kevin Sheppard] BUG: Fix limited key range on 32-bit platforms
2f02697 [Kevin Sheppard] BUG: Enforce correct encoding in stata

TiesdeKok · 2017-09-05T12:14:01Z

I would like to report that with the current encoding options it is not possible for Chinese users to interact with Pandas due to their Stata files containing Chinese characters. This has been reported by the user @lianyonghui123 on my ipystata package: TiesdeKok/ipystata#28

bashtage · 2017-09-05T12:34:28Z

There is no Unicode support in the pandas stata functions. They only support the fixed encoding latin1. It would be a relatively large rewrite to add Unicode support.
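As a small illustration of why Latin-1 cannot carry Chinese text, the sketch below checks whether a string is representable in a given codec. The helper `can_encode` is a made-up name for this example and is not part of pandas.

```python
def can_encode(text, encoding="latin-1"):
    """Return True if every character in text exists in the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(can_encode(u"áÖ"))    # accented Western European letters
print(can_encode(u"中文"))  # Chinese characters
```

Latin-1 only covers code points up to U+00FF, so the second call reports that the text cannot be encoded.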


jreback added the Difficulty Intermediate, Error Reporting (Incorrect or improved errors from pandas), IO Stata (read_stata, to_stata), and Unicode (Unicode strings) labels Mar 17, 2017

jreback added this to the Next Major Release milestone Mar 17, 2017

jreback changed the title ~~Pandas generates corrupt Stata files in python 3.5 on OSX~~ ERR: validate encoding on to_stata Mar 17, 2017

bashtage added a commit to bashtage/pandas that referenced this issue Mar 21, 2017

BIG: Enforce correc encoding in stata

f549481

Ensure StataReader and StataWriter have the correct encoding. Standardized default encoding to 'latin-1' closes pandas-dev#15723

bashtage mentioned this issue Mar 21, 2017

BUG: Enforce correct encoding in stata #15768

Closed

4 tasks

bashtage added a commit to bashtage/pandas that referenced this issue Mar 21, 2017

BUG: Enforce correct encoding in stata

2f02697

Ensure StataReader and StataWriter have the correct encoding. Standardized default encoding to 'latin-1' closes pandas-dev#15723

jreback modified the milestones: 0.20.0, Next Major Release Mar 21, 2017

jreback closed this as completed in 1c9d46a Mar 21, 2017

TiesdeKok mentioned this issue Sep 5, 2017

Please make ipystata+jupyter display Chinese characters properly~~ TiesdeKok/ipystata#28

Closed
