BUG: .filter with unicode labels when can't encode #13101

Closed
griai opened this issue May 6, 2016 · 2 comments · Fixed by #18238

griai commented May 6, 2016

The change in #10506 breaks .filter if the DataFrame contains unicode column names with non-ASCII characters.

import pandas as pd
df = pd.DataFrame({u'a': [1, 2, 3], u'ä': [4, 5, 6]})
df.filter(regex=u'a')

raises:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-10-9de5a19c260e> in <module>()
----> 1 df.filter(regex=u'a')

C:\Users\...\AppData\Local\Continuum\32bit\Anaconda\envs\test\lib\site-packages\pandas\core\generic.pyc in filter(self, items, like, regex, axis)
   2013             matcher = re.compile(regex)
   2014             return self.select(lambda x: matcher.search(str(x)) is not None,
-> 2015                                axis=axis_name)
   2016         else:
   2017             raise TypeError('Must pass either `items`, `like`, or `regex`')

C:\Users\...\AppData\Local\Continuum\32bit\Anaconda\envs\test\lib\site-packages\pandas\core\generic.pyc in select(self, crit, axis)
   1545         if len(axis_values) > 0:
   1546             new_axis = axis_values[
-> 1547                 np.asarray([bool(crit(label)) for label in axis_values])]
   1548         else:
   1549             new_axis = axis_values

C:\Users\...\AppData\Local\Continuum\32bit\Anaconda\envs\test\lib\site-packages\pandas\core\generic.pyc in <lambda>(x)
   2012         elif regex:
   2013             matcher = re.compile(regex)
-> 2014             return self.select(lambda x: matcher.search(str(x)) is not None,
   2015                                axis=axis_name)
   2016         else:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
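
For context, the same error can be reproduced without pandas. On Python 2, calling str() on a unicode object implicitly encodes it with the ASCII codec. The sketch below makes that encoding step explicit, so it behaves the same way on Python 2 and Python 3. The helper name can_encode_ascii is hypothetical and used only for illustration.

```python
# -*- coding: utf-8 -*-

def can_encode_ascii(label):
    """Return True if `label` survives an explicit ASCII encode."""
    try:
        label.encode('ascii')
        return True
    except UnicodeEncodeError:
        # Raised for any character outside range(128), e.g. u'\xe4'.
        return False

labels = [u'a', u'\xe4']  # u'\xe4' is the column name from the report
results = [can_encode_ascii(label) for label in labels]
print(results)
```

Only the second label fails to encode, which matches the traceback pointing at u'\xe4' in position 0.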

griai changed the title from "Edit #10506 breaks if the DataFrame contains unicode column names with non-ASCII characters." to "BUG: Edit #10506 breaks if the DataFrame contains unicode column names with non-ASCII characters." May 6, 2016

jreback commented May 6, 2016

xref #10384

Yeah, str(x) will try to encode, so it is probably easiest either to catch this error (and pass the label through unchanged if it cannot be encoded), or to stringify only integers (but that leaves out things like float columns).

So I think the former is OK. Want to do a PR?

It would need some tests for other column label types as well (e.g. the tests should loop through all of the index types).
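
A minimal sketch of the first option (catch the encoding error and pass the label through), assuming plain Python lists in place of a pandas index. The names label_to_text and filter_labels are hypothetical, not pandas functions, and this is not necessarily how the eventual fix was implemented.

```python
import re

def label_to_text(label):
    """Stringify a label, passing it through unchanged if it cannot be encoded."""
    try:
        return str(label)
    except UnicodeEncodeError:
        # Python 2 only: str() on a non-ASCII unicode label fails here,
        # so hand the unicode object to the regex as-is.
        return label

def filter_labels(labels, regex):
    """Keep the labels whose text form matches `regex`."""
    matcher = re.compile(regex)
    return [label for label in labels
            if matcher.search(label_to_text(label)) is not None]

# Mixed label types: unicode strings, an integer and a float.
labels = [u'a', u'\xe4', 1, 2.5]
selected = filter_labels(labels, u'a')
print(selected)
```

Integer and float labels are still stringified, so a regex such as u'1' keeps matching numeric column names, while non-ASCII labels no longer raise.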

jreback added this to the 0.18.2 milestone May 6, 2016
jreback changed the title from "BUG: Edit #10506 breaks if the DataFrame contains unicode column names with non-ASCII characters." to "BUG: .filter with unicode labels when can't encode" May 6, 2016

griai commented May 9, 2016

I don't have a git environment installed at the moment, so unfortunately I cannot open the pull request.
I would support the pass-through solution for labels that cannot be encoded, since it is the easiest and a fairly general fix (although this fallback mechanism might seem a bit opaque).
