DataFrame.to_records dtype shouldn't use unicode for every column #16358

Closed
TomAugspurger opened this issue May 15, 2017 · 5 comments
Labels
Unicode Unicode strings

@TomAugspurger

TomAugspurger commented May 15, 2017

Code Sample, a copy-pastable example if possible

As of 0.20, DataFrame.to_records uses the unicode type for all of the dtype field names on Python 2.

In [36]: pd.DataFrame({u'c/\u03c3': [1, 2], 'c/s': [3, 4]}).to_records()
Out[36]:
rec.array([(0, 3, 1), (1, 4, 2)],
          dtype=[(u'index', '<i8'), (u'c/s', '<i8'), (u'c/\u03c3', '<i8')])

This caused some issues for statsmodels, since they pass to_records().dtype to np.dtype, which doesn't accept unicode field names on Python 2 (statsmodels/statsmodels#3658 (comment)).

I think the correct behavior is to just use whatever the user has. So the output from above should be

In [36]: pd.DataFrame({u'c/\u03c3': [1, 2], 'c/s': [3, 4]}).to_records()
Out[36]:
rec.array([(0, 3, 1), (1, 4, 2)],
          dtype=[('index', '<i8'), ('c/s', '<i8'), (u'c/\u03c3', '<i8')])

so the python2 str column (which is actually bytes) should just be 'c/s', not u'c/s'.

The thing pandas has to decide is how to handle

  1. the default 'index' when df.index.name is None
  2. non-string columns like numbers

I think the least surprising option there is to use str(), so on py2 the name will be bytes, and on py3 it will be unicode. Not sure if it will cause problems elsewhere though.
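A minimal sketch of the str() rule suggested above, written for Python 3. The helper name `to_field_name` and its `default` parameter are hypothetical and are not part of pandas; it only illustrates how the default index name and non-string labels would be coerced.

```python
def to_field_name(label, default="index"):
    """Hypothetical helper: coerce a column or index label to the native str type."""
    # A missing index name falls back to the default field name.
    if label is None:
        return default
    # str() yields bytes on Python 2 and unicode text on Python 3.
    return str(label)


labels = [None, "c/s", 0, 1.5]
names = [to_field_name(label) for label in labels]
print(names)  # ['index', 'c/s', '0', '1.5']
```

Note that on Python 2, calling str() on a unicode label containing non-ASCII characters raises UnicodeEncodeError, so this rule alone would not cover a column such as u'c/\u03c3'.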

xref #13462 and #11879

cc @AlexisMignon

@jreback

jreback commented May 16, 2017

originally we were using str in both py2 & py3. we switched to text_type which is unicode in py2 and str in py3.

where exactly did this come up?

@jreback jreback added the Unicode Unicode strings label May 16, 2017
@TomAugspurger

> where exactly did this come up?

statsmodels was essentially doing

# python 2
In [10]: np.dtype(pd.DataFrame(columns=['a', 'b']).to_records().dtype.descr)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-bf0f86491f9a> in <module>()
----> 1 np.dtype(pd.DataFrame(columns=['a', 'b']).to_records().dtype.descr)

TypeError: data type not understood

here

That's not 100% correct, since if a user actually does have unicode columns, pandas probably should use unicode, and statsmodels will have to work around it:

In [11]: np.dtype(pd.DataFrame(columns=[u'a', u'b']).to_records().dtype.descr)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-e90f8e27ae39> in <module>()
----> 1 np.dtype(pd.DataFrame(columns=[u'a', u'b']).to_records().dtype.descr)

TypeError: data type not understood

It seems like this would be best solved by NumPy accepting unicode identifiers on py2 (I still think pandas should export whatever the user has though).
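For comparison, here is a sketch of the same round trip on Python 3, where str is already unicode and the field names are accepted. It assumes pandas and NumPy are installed and uses a small non-empty frame rather than the empty one from the tracebacks above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# dtype of the record array produced by to_records()
rec_dtype = df.to_records().dtype

# Rebuild the dtype from its description, as statsmodels was doing.
rebuilt = np.dtype(rec_dtype.descr)

print(rec_dtype.names)       # ('index', 'a', 'b')
print(rebuilt == rec_dtype)  # True
```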

@AlexisMignon

From memory, I proposed the changes in pandas because using str in py2 raised an error when column names were unicode strings containing non-ASCII characters. While doing this I ran into a bug in numpy, which has 2 ways to build a np.dtype:

  • One uses a list of tuples (name, format); this is the most documented way and is shown in all examples
  • The other uses a dictionary {"names": , "formats": }

The first one does not accept unicode names, whereas the second accepts them.
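The two construction forms mentioned above can be sketched as follows. This assumes NumPy on Python 3, where both forms accept the same names and produce equal dtypes; the py2 difference described in this comment is not reproduced here.

```python
import numpy as np

# Form 1: a list of (name, format) tuples.
dt_list = np.dtype([("a", "<i8"), ("b", "<i8")])

# Form 2: a dict with parallel "names" and "formats" lists.
dt_dict = np.dtype({"names": ["a", "b"], "formats": ["<i8", "<i8"]})

print(dt_list.names)       # ('a', 'b')
print(dt_list == dt_dict)  # True
```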

This is what is done here

A way to keep compatibility with all projects using pandas might be to have a _to_string function that always returns a str in py2 and does all the needed conversions between str, unicode and others.

@TomAugspurger

> A way to keep compatibility with all projects using pandas might be to have a _to_string function that always returns a str in py2 and does all the needed conversions between str, unicode and others.

Yeah, I'm trying to work around this in statsmodels and it's pretty tricky to get right that far down.

Is it fair to say that for DataFrame.to_records().dtype

  • In python2 the names should always be python 2 strs (bytes)
  • In python3 the names should always be python 3 strs (unicode)

In that case, I think the solution is to encode every column name as bytes on python2 (we can just choose and document that we're using utf-8).
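A minimal sketch of the UTF-8 encoding idea, written so that it runs on Python 3. The helper `encode_names` is hypothetical and not a pandas function; it only shows what the encoded field names would look like.

```python
def encode_names(names, encoding="utf-8"):
    """Hypothetical helper: encode text labels to bytes, leaving bytes untouched."""
    return [
        name if isinstance(name, bytes) else str(name).encode(encoding)
        for name in names
    ]


encoded = encode_names(["index", "c/s", "c/\u03c3"])
print(encoded)  # [b'index', b'c/s', b'c/\xcf\x83']
```

The encoding is reversible, so a consumer that knows the documented choice of UTF-8 can recover the original label with `name.decode("utf-8")`.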

kif added a commit to kif/pyFAI that referenced this issue Jan 21, 2019
Described elsewhere in:

pandas-dev/pandas#16358

Recent versions of numpy/cython solve the issue.
@WillAyd

WillAyd commented Sep 6, 2019

Looks like a Py2 thing - closing

@WillAyd WillAyd closed this as completed Sep 6, 2019