DataFrame.to_records dtype shouldn't use unicode for every column #16358

TomAugspurger · 2017-05-15T13:22:59Z

Code Sample, a copy-pastable example if possible

As of 0.20, DataFrame.to_records uses the unicode type for all dtype identifiers on python 2.

In [36]: pd.DataFrame({u'c/\u03c3': [1, 2], 'c/s': [3, 4]}).to_records()
Out[36]:
rec.array([(0, 3, 1), (1, 4, 2)],
          dtype=[(u'index', '<i8'), (u'c/s', '<i8'), (u'c/\u03c3', '<i8')])

This caused some issues for statsmodels, since they go to_records().dtype -> np.dtype, which doesn't like unicode identifiers on python2 (statsmodels/statsmodels#3658 (comment))

I think the correct behavior is to just use whatever the user has. So the output from above should be

In [36]: pd.DataFrame({u'c/\u03c3': [1, 2], 'c/s': [3, 4]}).to_records()
Out[36]:
rec.array([(0, 3, 1), (1, 4, 2)],
          dtype=[('index', '<i8'), ('c/s', '<i8'), (u'c/\u03c3', '<i8')])

so the python2 str column (which is actually bytes) should just be 'c/s', not u'c/s'.

The thing pandas has to decide is how to handle

the default 'index' when df.index.name is None
non-string columns like numbers

I think the least-surprising option there is to use str(), so on py2 that will be bytes, and on py3 it will be unicode. Not sure if it will cause problems elsewhere though.
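To make the rule above concrete, here is a minimal sketch written for Python 3. `record_field_names` is a hypothetical helper for illustration only, not a pandas function. It keeps string labels as the user supplied them, passes anything else through str(), and falls back to 'index' when the index has no name. On Python 3 there is only one text type, so the bytes/unicode split discussed here does not show up; on py2 the same str() call would return bytes.

```python
def record_field_names(index_name, columns):
    """Hypothetical helper: choose field names for a record array."""
    def as_name(label):
        # Keep string labels exactly as the user supplied them;
        # pass anything else (ints, floats, ...) through str().
        return label if isinstance(label, str) else str(label)

    # An unnamed index falls back to the default name 'index'.
    index_field = "index" if index_name is None else as_name(index_name)
    return [index_field] + [as_name(col) for col in columns]


names = record_field_names(None, ["c/s", "c/\u03c3", 0, 1.5])
print(names)
```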

xref #13462 and #11879

cc @AlexisMignon


jreback · 2017-05-16T01:39:39Z

originally we were using str in both py2 & py3. we switched to text_type which is unicode in py2 and str in py3.

where exactly did this come up?

TomAugspurger · 2017-05-16T12:01:40Z

where exactly did this come up?

statsmodels was essentially doing

# python 2
In [10]: np.dtype(pd.DataFrame(columns=['a', 'b']).to_records().dtype.descr)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-bf0f86491f9a> in <module>()
----> 1 np.dtype(pd.DataFrame(columns=['a', 'b']).to_records().dtype.descr)

TypeError: data type not understood

here

That's not 100% correct, since if a user actually does have unicode columns, pandas probably should use unicode, and statsmodels will have to work around it.

In [11]: np.dtype(pd.DataFrame(columns=[u'a', u'b']).to_records().dtype.descr)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-e90f8e27ae39> in <module>()
----> 1 np.dtype(pd.DataFrame(columns=[u'a', u'b']).to_records().dtype.descr)

TypeError: data type not understood

It seems like this would be best solved by NumPy accepting unicode identifiers on py2 (I still think pandas should export whatever the user has though).
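For reference, the round trip in question looks roughly like this on Python 3, where field names are always native str and the call succeeds. This is only a sketch: it uses a plain NumPy structured array as a stand-in for the to_records() output (an assumption, to keep the example NumPy-only), and it does not reproduce the py2 failure.

```python
import numpy as np

# Stand-in for DataFrame.to_records(): a structured array with the same
# field names as the example at the top of the issue.
rec = np.zeros(2, dtype=[("index", "<i8"), ("c/s", "<i8"), ("c/\u03c3", "<i8")])

# The statsmodels-style round trip: dtype -> descr -> np.dtype.
descr = rec.dtype.descr
rebuilt = np.dtype(descr)

print(descr)
print(rebuilt == rec.dtype)
```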

AlexisMignon · 2017-05-16T13:43:19Z

From memory, I proposed the changes in pandas because using str in py2 raised an error when column names were unicode strings with unicode characters. While doing this I ran into a bug in numpy, which has 2 ways to build a np.dtype:

One uses a list of tuples (name, format); this is the most documented way, shown in all examples
The other uses a dictionary {"names": ..., "formats": ...}

The first one does not accept unicode names, whereas the second accepts them.

This is what is done here
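A short sketch of the two construction forms, run on Python 3, where both accept any str name and produce the same dtype. The field names are borrowed from the example at the top of the issue; the py2-only difference in unicode handling is not reproduced here.

```python
import numpy as np

names = ["index", "c/s", "c/\u03c3"]
formats = ["<i8", "<i8", "<i8"]

# Form 1: list of (name, format) tuples.
dt_from_list = np.dtype(list(zip(names, formats)))

# Form 2: dict with parallel 'names' and 'formats' lists.
dt_from_dict = np.dtype({"names": names, "formats": formats})

print(dt_from_list == dt_from_dict)
```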

A way to keep compatibility with all projects using pandas might be to have a _to_string function that always returns a str in py2 and does all the needed conversions between str, unicode and others.

TomAugspurger · 2017-05-16T13:55:05Z

A way to keep compatibility with all projects using pandas might be to have a _to_string function that always returns a str in py2 and does all the needed conversions between str, unicode and others.

Yeah, I'm trying to work around this in statsmodels and it's pretty tricky to get right that far down.

Is it fair to say that for DataFrame.to_records().dtype

In python2 the names should always be python 2 strs (bytes)
In python3 the names should always be python 3 strs (unicode)

In that case, I think the solution is to encode every column name as bytes on python2 (we can just choose and document that we're using utf-8).
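A rough sketch of what that encoding step could look like. It is written for Python 3, where bytes is the type that Python 2 called str. `encode_name_utf8` is a hypothetical name for illustration, not an existing pandas helper, and stringifying non-string labels before encoding is an assumption carried over from the str() suggestion earlier in the thread.

```python
def encode_name_utf8(name):
    """Hypothetical helper: return a column label as UTF-8 bytes."""
    if isinstance(name, bytes):
        # Already bytes (a Python 2 str would land here): leave it alone.
        return name
    # Text and non-string labels: stringify, then encode as UTF-8.
    return str(name).encode("utf-8")


encoded = [encode_name_utf8(n) for n in ["index", "c/s", "c/\u03c3", 0]]
print(encoded)
```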


WillAyd · 2019-09-06T21:43:57Z

Looks like a Py2 thing - closing

TomAugspurger added the 2/3 Compat label May 15, 2017

jreback added the Unicode Unicode strings label May 16, 2017

TomAugspurger mentioned this issue Jul 28, 2017

"data type not understood" with numpy 1.12.1 statsmodels/statsmodels#3841

Closed

kif added a commit to kif/pyFAI that referenced this issue Jan 21, 2019

This bug prevents the use of pyFAI with numpy 1.12 or older.

1a4873e

Described elsewhere in: pandas-dev/pandas#16358 Recent versions of numpy/cython solve the issue.

WillAyd closed this as completed Sep 6, 2019
