Unicode char as delimiter won't use C engine #14065
Comments
Copy-pastable example (python3)
|
By the way, for your first example, I think you want |
Apologies for not including a cut and paste example. Actually, I do need the delimiter with two backslashes. In fact, your copy-and-paste example doesn't work for me as written:
With the doubled form:
And with the single form:
|
That's odd, I wonder if we escape something improperly... Compare these two:
In [22]: '\u00A7'
Out[22]: '§'
In [24]: '\\u00A7'
Out[24]: '\\u00A7'
When you have the double |
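To make the difference between the two spellings concrete, here is a minimal sketch using only the standard library. It shows that the single-backslash form is one character while the doubled form is six literal characters that happen to be a valid regular expression for `§`. The variable names are illustrative only. Since pandas documents that separators longer than one character are interpreted as regular expressions and handled by the Python engine, this would be consistent with the doubled form parsing correctly but not with the C engine.

```python
import re

single = '\u00A7'    # escape is interpreted by Python: one character, the section sign
double = '\\u00A7'   # backslash is escaped: six literal characters \ u 0 0 A 7

print(len(single), repr(single))
print(len(double), repr(double))

# Interpreting the escape sequence in the doubled form recovers the single character.
recovered = double.encode('ascii').decode('unicode_escape')

# As a regular expression, the doubled form matches the section sign,
# because Python's re module understands \uXXXX escapes.
fields = re.split(double, 'a§b')
print(recovered == single, fields)
```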
So those work:
Right - I don't think I've seen this behavior with unicode outside of pandas, but I rarely venture into unicode. |
The reason for the error is that the data is getting encoded as
>>> data = "a§b\n1§2\n3§4"
>>> data.encode('utf-8')
b'a\xc2\xa7b\n1\xc2\xa72\n3\xc2\xa74'
However, Now technically, we should be splitting by
In the long run, the solution would be to somehow support multi-char delimiters (tricky since we parse byte by byte with the C engine). In the short term, I think we should check the separator to see if it would be multi-char when encoded, and if so, raise an error. Thoughts? |
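The byte-level problem described above can be illustrated with plain `bytes` operations. This is only a sketch of the idea, not the C tokenizer itself: it assumes that a parser working byte by byte can only compare against a single delimiter byte, and shows what splitting on just the first byte of the encoded separator leaves behind.

```python
data = "a§b\n1§2\n3§4"
encoded = data.encode('utf-8')

# The one-character separator becomes two bytes in UTF-8.
sep = '§'.encode('utf-8')
print(sep, len(sep))

# Splitting on only the first byte leaves a stray continuation byte
# (0xa7) at the start of the next field, which is not valid UTF-8 on its own.
rows_single_byte = [line.split(sep[:1]) for line in encoded.split(b'\n')]

# Splitting on the full two-byte sequence gives the intended fields.
rows_full_sep = [line.split(sep) for line in encoded.split(b'\n')]

print(rows_single_byte)
print(rows_full_sep)
```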
Trying to detect such separators and raising an informative message sounds fine IMO |
The system file encoding can cause a separator to be encoded as more than one character even though it may be provided as one character. Multi-char separators are not supported by the C engine, so we need to catch this case. Closes gh-14065.
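A check along these lines might look like the sketch below. The function names `encoded_sep_length` and `validate_sep_for_c_engine` are hypothetical and this is not the code pandas actually uses; it only assumes that the relevant test is whether the separator occupies more than one byte once encoded with the given encoding (falling back to the system file encoding).

```python
import sys


def encoded_sep_length(sep, encoding=None):
    """Return how many bytes `sep` occupies once encoded (hypothetical helper)."""
    encoding = encoding or sys.getfilesystemencoding()
    return len(sep.encode(encoding))


def validate_sep_for_c_engine(sep, encoding=None):
    """Raise if `sep` would be multi-byte when encoded (hypothetical helper)."""
    n_bytes = encoded_sep_length(sep, encoding)
    if n_bytes > 1:
        raise ValueError(
            "separator {!r} encodes to {} bytes; a byte-by-byte parser "
            "cannot treat it as a single-character delimiter".format(sep, n_bytes)
        )
    return sep


# A comma is one byte in UTF-8, so it passes.
validate_sep_for_c_engine(',', encoding='utf-8')

# The section sign is two bytes in UTF-8, so it is rejected.
try:
    validate_sep_for_c_engine('§', encoding='utf-8')
    rejected = False
except ValueError:
    rejected = True
```

Note that the result depends on the encoding: the same `§` is a single byte in Latin-1, so a check of this kind has to be done against the encoding actually in effect.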
I have the following code:
dfEL = pd.read_csv(IN_PATH, delimiter='\\u00A7', encoding='utf-8')
which I've also tried with other ways of writing the delimiter, e.g.:
dfEL = pd.read_csv(IN_PATH, delimiter='§', encoding='utf-8')
These other methods don't work, and generate a
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 10: unexpected end of data
The first method, though, won't use the C engine:
Shouldn't this be considered a single character and still use the (I presume faster) C engine?
Sample data - there's a lot of messiness in the rightmost column, which is why an unusual separator was used: