UnicodeEncodeError from DataFrame.to_records #11879


Closed
kynnjo opened this issue Dec 21, 2015 · 7 comments
Labels: Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), Unicode (Unicode strings)
Milestone: 0.20.0

Comments

kynnjo commented Dec 21, 2015

The DataFrame.to_records method fails with a UnicodeEncodeError for unicode column names that contain non-ASCII characters.

(This issue is related to #680. The example below extends the example given in that issue.)

In [322]: df = pandas.DataFrame({u'c/\u03c3':[1,2,3]})

In [323]: df
Out[323]: 
   c/σ
0    1
1    2
2    3

In [324]: df.to_records()
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-324-6d3142e97d2d> in <module>()
----> 1 df.to_records()

/redacted/python2.7/site-packages/pandas/core/frame.pyc in to_records(self, index, convert_datetime64)
   1013             elif index_names[0] is None:
   1014                 index_names = ['index']
-> 1015             names = index_names + lmap(str, self.columns)
   1016         else:
   1017             arrays = [self[c].get_values() for c in self.columns]

UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c3' in position 2: ordinal not in range(128)
jreback (Contributor) commented Dec 23, 2015

You are referring to a VERY old issue, FYI. Please show pd.show_versions(). This is a bug in any event, so pull requests are welcome.

I think this should be: lmap(compat.text_type, self.columns)
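
For context, a minimal self-contained illustration of why that change would help under Python 2 (a sketch, not the actual patch; text_type here stands in for pandas.compat.text_type):

import sys

# Stand-in for pandas.compat.text_type: unicode on Python 2, str on Python 3.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821

index_names = ['index']
columns = [u'c/\u03c3']

# str(u'c/\u03c3') raises UnicodeEncodeError under Python 2 (ascii codec),
# while text_type() returns the column name unchanged.
names = index_names + [text_type(c) for c in columns]
print(names)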

jreback added the Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), Unicode (Unicode strings), and Difficulty Novice labels Dec 23, 2015
jreback added this to the Next Major Release milestone Dec 23, 2015
kynnjo (Author) commented Dec 28, 2015

If you can't be bothered to verify the code I posted, then just delete the issue. I don't give a damn.

jreback (Contributor) commented Dec 28, 2015

@kynnjo I did reproduce it right after you posted; that's why I marked it as a bug.
I asked nicely for you to post the diagnostics. I even included what I think the fix is.

We don't appreciate rude behavior. Please use respectful language.

kynnjo (Author) commented Dec 28, 2015

just delete the issue and we're done

jreback (Contributor) commented Dec 28, 2015

I actually find this a valid issue. Thank you for reporting. Don't you wish to see pandas improved and others helped?

gliptak (Contributor) commented May 28, 2016

This works on current HEAD:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({u'c/\u03c3':[1,2,3]})

In [3]: df
Out[3]: 
   c/σ
0    1
1    2
2    3

In [4]: df.to_records()
Out[4]: 
rec.array([(0, 1), (1, 2), (2, 3)], 
          dtype=[('index', '<i8'), ('c/σ', '<i8')])

Please consider closing.

jreback (Contributor) commented May 29, 2016

This still fails on Python 2:

In [1]: df = pandas.DataFrame({u'c/\u03c3':[1,2,3]})

In [2]: df.to_records()
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-2-6d3142e97d2d> in <module>()
----> 1 df.to_records()

/Users/jreback/pandas/pandas/core/frame.pyc in to_records(self, index, convert_datetime64)
   1063             elif index_names[0] is None:
   1064                 index_names = ['index']
-> 1065             names = lmap(str, index_names) + lmap(str, self.columns)
   1066         else:
   1067             arrays = [self[c].get_values() for c in self.columns]

UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c3' in position 2: ordinal not in range(128)
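
One possible workaround for users stuck on Python 2 with an affected pandas version (a sketch, not the eventual fix) is to encode the column names to UTF-8 byte strings before converting:

import pandas as pd

df = pd.DataFrame({u'c/\u03c3': [1, 2, 3]})

# Rename the unicode columns to UTF-8 byte strings so that the internal
# lmap(str, self.columns) call never sees non-ASCII unicode text.
safe = df.rename(columns=lambda c: c.encode('utf-8'))
print(safe.to_records().dtype.names)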

jreback modified the milestones: 0.20.0, Next Major Release Dec 30, 2016
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
NexediGitlab pushed a commit to Nexedi/erp5 that referenced this issue Feb 23, 2023
Pandas 0.20.0 introduced a bug fix [1] which changed the behaviour of
'DataFrame.to_records()': the resulting record array's dtype names are now
unicode if the data frame's column names were unicode. Before this bug fix
the dtype names were always str, whether the column names were str or unicode.

Unfortunately NumPy unpickling breaks if dtype names are unicode [2]. Since
many of our data frame columns are unicode, loading arrays often fails.
On Python 3 this is no longer a problem, so until then we fix it by
introducing a simple monkey patch to pandas which essentially reverts the
mentioned bug fix.

[1] pandas-dev/pandas#11879
[2] Small example to reproduce this error:

import os

import numpy as np
import pandas as pd

# Build a record array whose dtype names come from a unicode column name.
r = pd.DataFrame({u'A': [1, 2, 3]}).to_records()
a = np.ndarray(shape=r.shape, dtype=r.dtype.fields)
p = "t"

# Remove any stale file from a previous run.
try:
    os.remove(p)
except OSError:
    pass

# Saving works, but loading fails on Python 2 when the dtype names are unicode.
with open(p, 'wb') as f:
    np.save(f, a)
with open(p, 'rb') as f:
    np.load(f)
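
A rough sketch of what such a monkey patch might look like (hypothetical, Python 2 only; it wraps DataFrame.to_records and re-encodes unicode dtype names back to byte strings, rather than reproducing the actual Nexedi patch):

import pandas as pd

_orig_to_records = pd.DataFrame.to_records

def _to_records_bytes_names(self, *args, **kwargs):
    # Call the original method, then rename any unicode dtype fields to
    # UTF-8 byte strings so that np.save/np.load round-trips on Python 2.
    rec = _orig_to_records(self, *args, **kwargs)
    rec.dtype.names = tuple(
        name.encode('utf-8') if not isinstance(name, bytes) else name
        for name in rec.dtype.names
    )
    return rec

pd.DataFrame.to_records = _to_records_bytes_names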