DataFrame.to_records dtype shouldn't use unicode for every column #16358
Comments
originally we were using where exactly did this come up? |
statsmodels was essentially doing

# python 2
In [10]: np.dtype(pd.DataFrame(columns=['a', 'b']).to_records().dtype.descr)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-bf0f86491f9a> in <module>()
----> 1 np.dtype(pd.DataFrame(columns=['a', 'b']).to_records().dtype.descr)
TypeError: data type not understood

That's not 100% correct, since if a user actually does have unicode columns, pandas probably should use unicode, and statsmodels will have to work around it.

In [11]: np.dtype(pd.DataFrame(columns=[u'a', u'b']).to_records().dtype.descr)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-11-e90f8e27ae39> in <module>()
----> 1 np.dtype(pd.DataFrame(columns=[u'a', u'b']).to_records().dtype.descr)
TypeError: data type not understood

It seems like this would be best solved by NumPy accepting unicode identifiers on py2 (I still think pandas should export whatever the user has, though). |
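For reference, here is a minimal sketch of the round trip described above (to_records().dtype.descr fed back into np.dtype), written for Python 3, where field names are native str and the call is not expected to raise. The DataFrame contents and variable names are illustrative assumptions, not the statsmodels code.

```python
import numpy as np
import pandas as pd

# Illustrative frame; explicit dtypes keep the result platform-independent.
df = pd.DataFrame({
    "a": np.array([1, 2], dtype="int64"),
    "b": np.array([0.5, 1.5], dtype="float64"),
})

records = df.to_records()        # record array, index included by default
descr = records.dtype.descr      # list of (name, format) tuples
rebuilt = np.dtype(descr)        # the step that raised TypeError on Python 2

field_names = rebuilt.names
print(field_names)
```

On Python 2 with pandas 0.20 the same np.dtype(descr) call is the one shown failing in the tracebacks above, because the names in descr were unicode objects.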
From memory, I proposed changes in pandas because using str in py2 was raising an error when column names were unicode strings with unicode characters. While doing this I ran into a bug in numpy, which has 2 ways to build a np.dtype:
The first one does not accept unicode names, whereas the second accepts them. This is what is done here. A way to keep compatibility with all projects using pandas might be to have a _to_string function that always returns a str in py2 and does all the needed conversions between str, unicode and others. |
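A rough sketch of what such a conversion helper could look like, written for Python 3 rather than py2. The name to_native_str, its encoding parameter, and the handling of non-string labels are all assumptions for illustration; nothing here is an existing pandas function.

```python
def to_native_str(name, encoding="utf-8"):
    """Hypothetical helper: coerce a column label to the native str type."""
    if isinstance(name, str):
        return name                   # already the native string type
    if isinstance(name, bytes):
        return name.decode(encoding)  # bytes label: decode with the chosen codec
    return str(name)                  # other labels (ints, tuples, ...): use str()

labels = [to_native_str(label) for label in ["a", b"c/s", 3]]
print(labels)
```

On py2 the direction would be reversed (unicode labels encoded to the bytes-based str), which is where the choice of encoding becomes a real decision.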
Yeah, I'm trying to work around this in statsmodels and it's pretty tricky to get right that far down. Is it fair to say that for
In that case, I think the solution is to encode every column name as bytes on python2 (we can just choose and document that we're using utf-8). |
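To make the utf-8 suggestion concrete, here is a small Python 3 sketch of encoding column names to bytes and decoding them back. The sample names are made up, and this only illustrates the encoding step, not how pandas would apply it inside to_records.

```python
# Illustrative column names, including one non-ASCII label.
names = ["a", "c/s", "é"]

# Encode every name to bytes with one fixed, documented codec.
encoded = [name.encode("utf-8") for name in names]

# Decoding with the same codec recovers the original labels exactly.
decoded = [raw.decode("utf-8") for raw in encoded]

print(encoded)
print(decoded)
```

Fixing the codec up front is what makes the conversion reversible for downstream libraries that need the original labels back.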
Described elsewhere in: pandas-dev/pandas#16358. Recent versions of numpy/cython solve the issue.
Looks like a Py2 thing - closing |
Code Sample, a copy-pastable example if possible
As of 0.20, DataFrame.to_records will use the unicode type for all the dtype identifiers on python 2.

This caused some issues for statsmodels, since they go to_records().dtype -> np.dtype, which doesn't like unicode identifiers on python2 (statsmodels/statsmodels#3658 (comment)).

I think the correct behavior is to just use whatever the user has. So the output from above should be

so the python2 str column (which is actually bytes) should just be 'c/s', not u'c/s'.

The thing pandas has to decide is how to handle 'index' when df.index.name is None.

I think the least surprising option there is to use str(), so on py2 that will be bytes, and on py3 it will be unicode. Not sure if it will cause problems elsewhere though.

xref #13462 and #11879
cc @AlexisMignon
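As a small illustration of the 'index' question, the sketch below shows which field names to_records produces on a current pandas under Python 3, depending on whether the index has a name. The frame and the name "idx" are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# Unnamed index: the index field falls back to the literal name 'index'.
default_names = df.to_records().dtype.names

# Named index: the index name is used as the field name instead.
named_names = df.rename_axis("idx").to_records().dtype.names

# index=False drops the index field entirely.
no_index_names = df.to_records(index=False).dtype.names

print(default_names, named_names, no_index_names)
```

On Python 3 all of these names are native str, so the bytes versus unicode choice discussed above only matters on py2.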