BUG: convert_dtypes incorrectly converts byte strings to strings in 1.3+ #43183
Comments
Thanks @nwilliams-lw - from git bisect it looks like this was caused by #41622, cc @jbrockmendel
Best guess looking at #41622 is that
Should I take a stab at it?
I am trying to solve the bug.
This issue is present on pandas 1.3.1.
In the same code, print(converted_dtypes['data'][0]) gives b'binary-string', which makes sense. I fail to understand what the point of decoding is.
I got it: it gives b'binary-string' but changes the dtype assignment, so we need to rectify that.
Actually bytes data type is not supported by numpy and pandas. So, I will assign the data type to bytes.
A pandas string array is an extension array that uses Python objects for the memory model. I don't think the string array should be allowed to contain bytes objects, so somewhere we are allowing the conversion from object dtype to EA string dtype without an appropriate check. I think the following should probably not be allowed either.
I agree with your point @simonjayhawkins, but this is different from the convert_dtypes issue mentioned. I can open another PR and we can have a conversation about solving this.
Code Sample, a copy-pastable example
This test will pass on pandas 1.2.5 but fail on 1.3 onwards (checked 1.3, 1.3.2).
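The original code sample did not survive in this copy of the issue. A minimal sketch of a reproduction, assuming a one-row DataFrame whose data column holds a bytes value (the column name comes from the report; the variable names and the check itself are hypothetical), might look like this:

```python
import pandas as pd

# Hypothetical reproduction: only the column name "data" and the
# byte-string value are taken from the report.
df = pd.DataFrame({"data": [b"binary-data"]})
converted = df.convert_dtypes()

first = converted["data"][0]
print(pd.__version__, converted["data"].dtype, repr(first))

# Per the report, the value should still be a bytes object after
# convert_dtypes(); on the affected 1.3 releases it reportedly is not.
bytes_preserved = isinstance(first, bytes)
print("bytes preserved:", bytes_preserved)
```

Running it under 1.2.5 and under an affected 1.3 release should make the difference described in the report visible.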
Problem description
Error on 1.3 onwards:
The issue is that convert_dtypes() will convert the data column values to the string b'binary-data', almost like it has had str(val) called on it. On 1.2.5 it is left correctly as a byte array.
Expected Output
Test should pass. Byte strings should not be converted to string.
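The str(val) comparison in the problem description can be checked in plain Python, without pandas. This is only an illustration of why the converted value looks the way it does; the variable names are made up for the example:

```python
raw = b"binary-data"

# str() on a bytes object returns its repr, including the b'' wrapper,
# rather than decoding the content.
as_repr = str(raw)

# Decoding is the explicit way to turn bytes into text.
as_text = raw.decode("utf-8")

print(repr(as_repr))
print(repr(as_text))
```

The first value is a text string that merely looks like a bytes literal, which matches the symptom described above.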
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 5f648bf
python : 3.8.11.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:31 PDT 2021; root:xnu-7195.141.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.3.2
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.1.3
setuptools : 57.4.0
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.1 (dt dec pq3 ext lo64)
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.07.0
fastparquet : 0.7.0
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : 2021.07.0
scipy : None
sqlalchemy : 1.4.23
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None