astype(unicode) does not work as expected #7758

Closed
fulmicoton opened this issue Jul 15, 2014 · 11 comments
Labels: Bug, Dtype Conversions (unexpected or buggy dtype conversions), Enhancement, Unicode (Unicode strings)


@fulmicoton
Contributor

astype("unicode") seems to call str, so the following code throws:

import pandas
df = pandas.DataFrame({"somecol": [u"適当"]})
df["somecol"].astype("unicode")

It raises:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

@jreback
Contributor

jreback commented Jul 15, 2014

you can do: df['somecol'].values.astype('unicode')

what are you doing with this?

pandas keeps all string-likes as object dtype so this is really only for external usage
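
A minimal sketch of the difference described above, assuming Python 3 (where str is already the Unicode type, so the original UnicodeEncodeError does not reproduce) and a reasonably recent pandas and NumPy. The variable names are illustrative only, and to_numpy() stands in for .values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"somecol": ["適当"]})

# Casting the underlying NumPy array gives a fixed-width unicode ndarray,
# not a pandas Series.
arr = df["somecol"].to_numpy().astype(str)
print(type(arr).__name__, arr.dtype.kind)  # ndarray U

# Wrapping the array back in a Series: pandas picks its own string storage,
# so the fixed-width NumPy dtype is only useful for external consumers.
back = pd.Series(arr)
print(back.dtype)
```

The last print has no expected output here because the resulting dtype depends on the pandas version.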

@fulmicoton
Contributor Author

I have a method that detects whether a column should be treated as a category based on its type and cardinality. Columns that are considered categories are cast to unicode objects.

I know how to work around this issue, but I thought I should report what I thought was a bug.

Let me know if you need more information.

@jreback
Contributor

jreback commented Jul 15, 2014

ok, this could be more informative, but it's fundamentally an issue. This would return a numpy array (and NOT a Series); wrapping it back in a Series would simply recast it and lose the cast to unicode.

I think that is a bit odd though. What do you think should happen?

@fulmicoton
Contributor Author

Ideally, I would have wanted the cast to work like Python's unicode() function.
That is: returned objects are always of the "unicode" type.

  • Unicode objects are left unchanged.
  • Numbers are stringified into unicode strings.
  • str objects are decoded using the default encoding and a unicode object is returned.

Does that make sense in Pandas?
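
The three rules above can be sketched as follows. This is only an illustration under stated assumptions: it targets Python 3, where str plays the role of unicode and bytes the role of the old str; to_text is a hypothetical helper, not a pandas API; and the utf-8 default is an assumption (Python 2's unicode() defaulted to ASCII).

```python
import pandas as pd


def to_text(value, encoding="utf-8"):
    """Hypothetical helper (not part of pandas): coerce one value to text."""
    if isinstance(value, str):
        return value                   # text is left unchanged
    if isinstance(value, bytes):
        return value.decode(encoding)  # byte strings are decoded
    return str(value)                  # numbers and other objects are stringified


# An object-dtype column mixing text, numbers and encoded bytes.
s = pd.Series(["適当", 42, 3.5, "適当".encode("utf-8")], dtype=object)
converted = s.map(to_text)
print(converted.tolist())  # ['適当', '42', '3.5', '適当']
```

Making the encoding an explicit parameter mirrors the suggestion that the third bullet may need an encoding argument.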

@cpcloud
Member

cpcloud commented Jul 15, 2014

@fulmicoton Why do you need to convert to unicode? Do you have things that are convertible to unicode but aren't already converted? Can you give a more detailed example that illustrates why you need to do this? I think I'm just missing something.

@jreback
Contributor

jreback commented Jul 15, 2014

This could all be done, I think (may need to allow an encoding argument for your 3rd bullet).
Keep in mind that current pandas does not have a unicode type per se (str and unicode are stored as object dtype), but it's really not a big deal, as when a unicode dtype is presented it can simply be inferred.

here's a picture of the internal structure:

In [16]: df
Out[16]: 
  somecol
0      適当

In [17]: df._data
Out[17]: 
BlockManager
Items: Index([u'somecol'], dtype='object')
Axis 1: Int64Index([0], dtype='int64')
ObjectBlock: slice(0, 1, 1), 1 x 1, dtype: object

In [18]: df._data.blocks[0]
Out[18]: ObjectBlock: slice(0, 1, 1), 1 x 1, dtype: object

In [19]: df._data.blocks[0].values
Out[19]: array([[u'\u9069\u5f53']], dtype=object)

In [20]: pd.lib.infer_dtype(df._data.blocks[0].values)
Out[20]: 'unicode'
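
For readers on current versions, a rough equivalent of the inference step, assuming Python 3 and a pandas release that exposes the public pandas.api.types.infer_dtype (pd.lib above is an internal module of the 2014 code base). On Python 3, text values are reported as 'string' rather than 'unicode'. The sample columns are made up for illustration.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Both columns are stored as object dtype; the kind of value is inferred
# by inspecting the elements.
text_col = pd.Series(["適当", "テスト"], dtype=object)
int_col = pd.Series([1, 2, 1], dtype=object)

print(infer_dtype(text_col))  # string
print(infer_dtype(int_col))   # integer
```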

jreback added this to the 0.15.0 milestone Jul 15, 2014

@jreback
Contributor

jreback commented Jul 15, 2014

@fulmicoton interested in doing a pull-request for this?

@fulmicoton
Contributor Author

@cpcloud I just have a piece of code that tries to coerce a bunch of columns marked as categorical into unicode strings. Some of them are already unicode; some of them have been detected as int but have such a low cardinality that I want to handle them as categories.
They get dummified afterwards, so it's important that they all end up as unicode strings at one point or another.
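
A small sketch of that workflow, assuming Python 3 and a recent pandas. The column names, the sample data and the categorical_cols list (standing in for the output of the detection step) are all hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "grade": [1, 2, 1, 2],              # detected as int, but low cardinality
    "color": ["赤", "青", "赤", "青"],    # already text
})

# Hypothetical result of the type/cardinality detection step.
categorical_cols = ["grade", "color"]

# Coerce every categorical column to text, then one-hot encode ("dummify").
df = df.astype({col: str for col in categorical_cols})
dummies = pd.get_dummies(df, columns=categorical_cols)
print(sorted(dummies.columns))
```

Casting first means the integer column and the text column are treated uniformly by get_dummies, which names each indicator column as the original column name, an underscore, and the value.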

@fulmicoton
Contributor Author

@jreback I'll take a look at that tonight.

@jreback
Contributor

jreback commented Jul 15, 2014

@fulmicoton you might want to explore this as well (just merged in): http://pandas-docs.github.io/pandas-docs-travis/categorical.html. Probably not a lot of tests for unicode (but it should work)

fulmicoton added a commit to fulmicoton/pandas that referenced this issue Jul 15, 2014
Just calls numpy.unicode on all the values.
Seems to work alright on python2 and python3.
@fulmicoton
Contributor Author

Here is the pull request. I didn't have to use infer_dtype, so I hope I didn't do anything wrong.

fulmicoton added a commit to fulmicoton/pandas that referenced this issue Jul 15, 2014