BUG: convert_dtypes incorrectly converts byte strings to strings in 1.3+ #43183

nwilliams-lw · 2021-08-23T15:30:56Z

Code Sample, a copy-pastable example

This test will pass on pandas 1.2.5 but fail on 1.3 onwards (checked 1.3, 1.3.2).

import pandas as pd


def test_convert_dtypes_on_byte_string():
    byte_str = b'binary-string'
    dataframe = pd.DataFrame(
        data={
            "data": byte_str,
        },
        index=[0]
    )
    converted_dtypes = dataframe.convert_dtypes()
    assert converted_dtypes['data'][0].decode('ascii') == "binary-string"

Problem description

Error on 1.3 onwards:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 10, in test_convert_dtypes_on_byte_string
AttributeError: 'str' object has no attribute 'decode'

The issue is that convert_dtypes() converts the values in the data column to the str "b'binary-string'", as if str(val) had been called on each value. On 1.2.5 the value is correctly left as a bytes object.
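
The symptom can be reproduced without pandas. The following is a minimal sketch in plain Python, assuming the conversion behaves like str(val) as described above; it does not show what pandas does internally.

```python
# Plain-Python illustration only; no pandas involved.
byte_str = b"binary-string"

# str() on a bytes object returns its repr, not the decoded text.
as_text = str(byte_str)
print(repr(as_text))

# The result is a str, which has no .decode() method, hence the AttributeError.
has_decode = hasattr(as_text, "decode")
print(has_decode)

# Decoding the original bytes object gives the text the test expects.
decoded = byte_str.decode("ascii")
print(decoded)
```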

Expected Output

Test should pass. Byte strings should not be converted to string.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 5f648bf
python : 3.8.11.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:31 PDT 2021; root:xnu-7195.141.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.3.2
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.1.3
setuptools : 57.4.0
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.1 (dt dec pq3 ext lo64)
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.07.0
fastparquet : 0.7.0
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : 2021.07.0
scipy : None
sqlalchemy : 1.4.23
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None


MarcoGorelli · 2021-08-23T16:10:31Z

Thanks @nwilliams-lw - from git bisect it looks like this was caused by #41622 , cc @jbrockmendel

jbrockmendel · 2021-08-23T20:44:41Z

Best guess looking at #41622 is that is_string_dtype(inferred_dtype) check returns True even if the inferred_dtype is bytes
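
One way to probe this guess from the public API is sketched below. It assumes the inferred dtype for an object column of bytes is the label returned by pandas.api.types.infer_dtype; the actual code path inside convert_dtypes is not shown in this thread and may differ.

```python
import numpy as np
from pandas.api.types import infer_dtype, is_string_dtype

# An object-dtype array holding a single bytes value, as in the report.
values = np.array([b"binary-string"], dtype=object)

# infer_dtype returns a string label describing the values.
inferred = infer_dtype(values, skipna=True)
print(inferred)

# NumPy resolves the name "bytes" to its byte-string dtype, whose kind is "S".
bytes_kind = np.dtype(inferred).kind
print(bytes_kind)

# Print what is_string_dtype reports for that label on the installed version.
print(is_string_dtype(inferred))
```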

PurnashisHazra · 2021-08-23T20:46:23Z

Should I take a stab at it?

shubham11941140 · 2021-08-24T10:10:18Z

I am trying to solve the bug.

shubham11941140 · 2021-08-24T10:16:13Z

This issue is also present on pandas 1.3.1.

shubham11941140 · 2021-08-24T10:23:35Z

In the same code, print(converted_dtypes['data'][0]) gives b'binary-string', which makes sense. I don't understand what the point of decoding is.

shubham11941140 · 2021-08-24T10:28:25Z

I get it now: it prints b'binary-string', but the type of the value has changed, so we need to fix that.

shubham11941140 · 2021-08-24T13:38:56Z

Actually bytes data type is not supported by numpy and pandas. So, I will assign the data type to bytes

simonjayhawkins · 2021-08-25T18:48:18Z

> Best guess looking at #41622 is that is_string_dtype(inferred_dtype) check returns True even if the inferred_dtype is bytes

A pandas string array is an extension array that uses Python objects for the memory model. I don't think the string array should be allowed to contain bytes objects.

So somewhere we are allowing the conversion from object dtype to the EA string dtype without an appropriate check.

I think the following should probably not be allowed either.

byte_str = b"binary-string"
df = pd.DataFrame(
    data={
        "data": byte_str,
    },
    index=[0],
)
df.astype("string")
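
A user-side version of such a check could look like the sketch below. The helper name astype_string_checked is hypothetical and not part of pandas; it only illustrates the idea of refusing the cast when the values are bytes, and is not the fix that was merged.

```python
import pandas as pd
from pandas.api.types import infer_dtype


def astype_string_checked(series: pd.Series) -> pd.Series:
    """Hypothetical helper: cast to "string" unless the values are bytes."""
    if infer_dtype(series, skipna=True) == "bytes":
        raise TypeError("refusing to cast bytes values to the string dtype")
    return series.astype("string")


# Text values are cast as usual.
text = astype_string_checked(pd.Series(["binary-string"]))

# Bytes values are rejected instead of being turned into their repr.
try:
    astype_string_checked(pd.Series([b"binary-string"]))
    rejected = False
except TypeError:
    rejected = True
print(rejected)
```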

shubham11941140 · 2021-09-12T05:10:31Z

I agree with your point, @simonjayhawkins, but this is different from the convert_dtypes issue reported here. I can open another PR and we can discuss how to solve it there.

nwilliams-lw added the Bug and Needs Triage labels Aug 23, 2021

MarcoGorelli added the Regression label and removed the Bug and Needs Triage labels Aug 23, 2021

mroeschke added the Bug and Dtype Conversions labels Aug 23, 2021

shubham11941140 mentioned this issue Aug 24, 2021

BUG: convert_dtypes incorrectly converts byte strings to strings in 1.3+ #43199

Merged

simonjayhawkins added this to the 1.3.3 milestone Aug 25, 2021

simonjayhawkins added the Strings label Aug 25, 2021

simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021

simonjayhawkins modified the milestones: 1.3.4, 1.3.5 Oct 16, 2021

jreback modified the milestones: 1.3.5, 1.3.4 Oct 16, 2021

simonjayhawkins closed this as completed in #43199 Oct 17, 2021
