
BUG: convert_dtypes incorrectly converts byte strings to strings in 1.3+ #43183

Closed
nwilliams-lw opened this issue Aug 23, 2021 · 10 comments · Fixed by #43199
Labels
Bug, Dtype Conversions (unexpected or buggy dtype conversions), Regression (functionality that used to work in a prior pandas version), Strings (string extension data type and string data)

@nwilliams-lw

Code Sample, a copy-pastable example

This test will pass on pandas 1.2.5 but fail on 1.3 onwards (checked 1.3, 1.3.2).

import pandas as pd


def test_convert_dtypes_on_byte_string():
    byte_str = b'binary-string'
    dataframe = pd.DataFrame(
        data={
            "data": byte_str,
        },
        index=[0]
    )
    converted_dtypes = dataframe.convert_dtypes()
    assert converted_dtypes['data'][0].decode('ascii') == "binary-string"

Problem description

Error on 1.3 onwards:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 10, in test_convert_dtypes_on_byte_string
AttributeError: 'str' object has no attribute 'decode'

The issue is that convert_dtypes() converts the values in the data column to the string "b'binary-string'", as if str(val) had been called on each value. On 1.2.5 the values are correctly left as byte strings.
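
To illustrate the symptom in plain Python, without pandas: calling str() on a bytes object does not decode it but returns its repr-style text, which matches the values seen after conversion. This is only a sketch of the observed effect, not a claim about what pandas does internally.

```python
byte_str = b"binary-string"

# str() on bytes does not decode; it returns the repr-style text.
stringified = str(byte_str)
print(stringified)  # b'binary-string'

# Decoding is the operation that yields the actual text.
decoded = byte_str.decode("ascii")
print(decoded)  # binary-string

# The stringified value is a str, so it has no .decode method,
# which is the AttributeError shown in the traceback.
print(hasattr(stringified, "decode"))  # False
```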

Expected Output

Test should pass. Byte strings should not be converted to string.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 5f648bf
python : 3.8.11.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:31 PDT 2021; root:xnu-7195.141.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.3.2
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.1.3
setuptools : 57.4.0
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.1 (dt dec pq3 ext lo64)
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.07.0
fastparquet : 0.7.0
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : 2021.07.0
scipy : None
sqlalchemy : 1.4.23
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@nwilliams-lw added the Bug and Needs Triage labels on Aug 23, 2021
@MarcoGorelli
Member

Thanks @nwilliams-lw. From git bisect, it looks like this was caused by #41622. cc @jbrockmendel

@MarcoGorelli added the Regression label and removed the Bug and Needs Triage labels on Aug 23, 2021
@jbrockmendel
Member

Best guess looking at #41622 is that is_string_dtype(inferred_dtype) check returns True even if the inferred_dtype is bytes
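
A quick way to probe this guess, as a sketch that assumes pandas is installed and that `infer_dtype` and `is_string_dtype` from `pandas.api.types` behave the same way in your release (this may vary by version). It only inspects the two public helpers; it does not reproduce the code path in #41622.

```python
from pandas.api.types import infer_dtype, is_string_dtype

# infer_dtype returns a label describing the values of an object array.
inferred = infer_dtype([b"binary-string"])
print(inferred)

# Passing that label to is_string_dtype resolves it to a NumPy bytes dtype,
# which the check treats as string-like (assumption: version dependent).
treated_as_string = is_string_dtype(inferred)
print(treated_as_string)
```

If the second value is True, a check that relies on is_string_dtype alone cannot tell bytes apart from text.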

@PurnashisHazra

Should I take a stab at it?

@mroeschke added the Bug and Dtype Conversions labels on Aug 23, 2021
@shubham11941140
Contributor

I am trying to solve the bug.

@shubham11941140
Contributor

This issue is also present on pandas 1.3.1.

@shubham11941140
Contributor

In the same code, print(converted_dtypes['data'][0]) gives b'binary-string', which makes sense. I fail to understand the point of decoding.

@shubham11941140
Contributor

I got it: it prints b'binary-string', but the type of the value has changed to str, so we need to rectify that.

@shubham11941140
Contributor

Actually, the bytes data type is not supported by numpy and pandas, so I will assign the data type to bytes.

@simonjayhawkins
Member

simonjayhawkins commented Aug 25, 2021

> Best guess looking at #41622 is that is_string_dtype(inferred_dtype) check returns True even if the inferred_dtype is bytes

A pandas string array is an extension array that uses Python objects for the memory model. I don't think the string array should be allowed to contain bytes objects.

So somewhere we are allowing the conversion from object dtype to the EA string dtype without an appropriate check.

I think the following should probably not be allowed either.

import pandas as pd

byte_str = b"binary-string"
df = pd.DataFrame(
    data={
        "data": byte_str,
    },
    index=[0],
)
df.astype("string")
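
One way to picture the kind of check this comment asks for, as a pure-Python sketch rather than pandas internals: the helper `all_text` below is hypothetical (it is not a pandas function) and accepts a column for string conversion only when every non-missing value is a str, so bytes objects are rejected.

```python
def all_text(values):
    """Hypothetical guard: True only if every non-missing value is a str."""
    return all(isinstance(v, str) for v in values if v is not None)


text_column = ["a", "b", None]
bytes_column = [b"binary-string"]

print(all_text(text_column))   # True
print(all_text(bytes_column))  # False
```

A conversion routine could call such a guard before switching an object column to a string dtype and leave the column untouched when it returns False.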

@simonjayhawkins added the Strings label on Aug 25, 2021
@simonjayhawkins modified the milestones: 1.3.3, 1.3.4 on Sep 11, 2021
@shubham11941140
Contributor

I agree with your point, @simonjayhawkins, but this is different from the convert_dtypes issue mentioned. I can open another PR and we can discuss how to solve this there.

@simonjayhawkins modified the milestones: 1.3.4, 1.3.5 on Oct 16, 2021
@jreback modified the milestones: 1.3.5, 1.3.4 on Oct 16, 2021
8 participants