BUG: Join changes dtypes #36665

AlbertoEAF · 2020-09-26T12:53:50Z

Hello,

Problem description

I have two pandas dataframes that I want to join using pandas (1.1.12).

However, when I join them, "alerts", in the table B gets its type changed from int64 to float64 ("alerts_cards"):

(Pdb) A.dtypes
threshold_step      int64
precision         float64
alerts              int64
alerted_money     float64
dtype: object

(Pdb) B.dtypes
threshold_step      int64
precision         float64
alerts              int64
dtype: object

(Pdb) A.join(B, on="threshold_step", rsuffix="_cards", sort=True).dtypes
threshold_step            int64
precision               float64
alerts                    int64
alerted_money           float64
threshold_step_cards    float64
precision_cards         float64
alerts_cards            float64
dtype: object

I usually then remove the join key ("threshold_step_cards"), but now I'm noticing it became a float as well.

The join key column has the same entries in both tables (all integers in range 0 to 100) and there are no NaN's in my dataframes.

Expected Output

The join should preserve the datatypes of the original columns. I spent several hours looking for a bug in my code until with the debugger I found it came from this pandas join.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 2a7d332
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-118-generic
Version : #119-Ubuntu SMP Tue Sep 8 12:30:01 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : pt_PT.UTF-8

pandas : 1.1.2
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.2.0.post20200511
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.16
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None

Thank you

The text was updated successfully, but these errors were encountered:

AlbertoEAF · 2020-09-26T13:26:42Z

Update:

I had to do:

A.set_index("threshold_step", inplace=True)
B.set_index("threshold_step", inplace=True)
stats_df = A.join(B, rsuffix="_cards", sort=True).reset_index()

Otherwise it would use the index even though I set on as the join column.

I'm reading the docs and it says this about join's on parameter:
on: str, list of str, or array-like, optional Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index.

Meaning there's no way to join by column names on both sides? One must always set the index?

dimithras · 2020-09-26T14:50:28Z

Could you please provide a code sample defining A and B? I tried reproducing the issue yet I get expected results.

>>> A = pd.DataFrame({'threshold_step' : [0,1], 'precision' : [0.0,1.1], 'alerts' : [0,1], 'alerted_money' : [0.0,0.1]})
>>> B = pd.DataFrame({'threshold_step' : [0,1], 'precision' : [0.0,1.1], 'alerts' : [0,1]})

>>> print(A.dtypes)
threshold_step      int64
precision         float64
alerts              int64
alerted_money     float64
dtype: object

>>> print(B.dtypes)
threshold_step      int64
precision         float64
alerts              int64
dtype: object

>>> print(A.join(B, on="threshold_step", rsuffix="_cards", sort=True).dtypes)
threshold_step            int64
precision               float64
alerts                    int64
alerted_money           float64
threshold_step_cards      int64
precision_cards         float64
alerts_cards              int64
dtype: object

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : d9fff27
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.0
pip : 20.2.3
setuptools : 40.8.0
Cython : 0.29.21
pytest : 6.0.1
hypothesis : 5.36.0
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.4
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.0
numexpr : None
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.19
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None

dimithras · 2020-09-26T14:56:51Z

Tried it also on pandas 1.1.2, getting expected output.

rhshadrach · 2020-09-26T15:17:57Z

@AlbertoEAF It looks like you're wanting to use merge instead of join. pandas join is specifically for joining on other's index, never a column. When you specify the on argument, you are joining that column of the caller to the index of other.

Likely this created missing values (NaN), which coerced the dtype to being a float.

AlbertoEAF · 2020-09-26T16:01:36Z

Hello,

There are no NaN's, however as extra-processing for normalization of the output shape I do add extra rows to fill all the integers in a range (let's say 100) and then fill the table with the statistics for all those points. This is that pre-processing step at which the indexes go out of sync:

import pandas as pd
import numpy as np

def extend_table_to_all_thresholds(df, score_int_field, threshold_steps):
    """Adds rows even for thresholds with no data."""
    missing_threshold_steps = np.where(
        np.isin(
            np.arange(threshold_steps + 1),
            df[score_int_field],
            invert=True
    ))[0]
    df = pd.concat([df, pd.DataFrame(data=missing_threshold_steps, columns=[score_int_field])])
    df.fillna(0, inplace=True)
    return df

# A and B don't start with all the same thresholds:
A = pd.DataFrame({'threshold_step' : [0,4], 'precision' : [0.0,1.1], 'alerts' : [0,1], 'alerted_money' : [0.0,0.1]})
B = pd.DataFrame({'threshold_step' : [0,1], 'precision' : [0.0,1.1], 'alerts' : [0,1]})

# Then I fill more steps in (which results in broken indices it seems). Comment this and it works:
A = extend_table_to_all_thresholds(A, "threshold_step", 100)
B = extend_table_to_all_thresholds(B, "threshold_step", 100)

# Join doesn't work without setting indices:
print(A.join(B, on="threshold_step", rsuffix="_cards", sort=True).dtypes)

# But merge doesn't either:
print(A.merge(B, on="threshold_step", sort=True).dtypes)

This "extend thresholds" call results in repeated indices (don't know how thats possible):

which then breaks the join unles I set "threshold_step" as index for both tables.

However, not only does merge not allow using rsuffix to disambiguate my columns, but also results in broken dtypes.

rhshadrach · 2020-09-26T16:16:11Z

After the call to concat within extend_table_to_all_thresholds, there are null values within the alerts column of df. This coerces the dtype to be float. Filling in these values with 0, they still remain floats. You can use .astype(int) after the fillna to make them integers again.

AlbertoEAF · 2020-09-26T16:25:48Z

You're right @rhshadrach, my bad, but in my real code I actually generate those columns after not before and the only way it works is with the indices.
Even so, when I make the "dummy" example similar to my code I don't see the bug here and I cannot send you the real code for it needs a lot of data and it's a huge script. Well I guess we can close this as I'm not being able to reproduce it here with minimal code 🤷 .

Regarding the index in the image above though, does it make sense to be repeated? 0, 1, 0, 1, 2, .. ?

Thanks for your time @rhshadrach @dimithras ;)

rhshadrach · 2020-09-26T16:32:04Z

It looks like the index [0, 1, 0, 1, 2, ...] is from using concat. This will stack the two DataFrames on top of either other, indices and all. The first [0, 1] is from A, the rest of the indices are from missing_threshold_steps.

rhshadrach · 2020-09-26T16:33:43Z

Closing for now. If you can reproduce the issue, reply here and we can reopen.

AlbertoEAF added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 26, 2020

AlbertoEAF changed the title ~~BUG: Join changes type~~ BUG: Join changes dtypes Sep 26, 2020

dsaxton added Needs Info Clarification about behavior needed to assess issue and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 26, 2020

rhshadrach added Usage Question and removed Needs Info Clarification about behavior needed to assess issue labels Sep 26, 2020

rhshadrach closed this as completed Sep 26, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Join changes dtypes #36665

BUG: Join changes dtypes #36665

AlbertoEAF commented Sep 26, 2020 •

edited

Loading

INSTALLED VERSIONS

AlbertoEAF commented Sep 26, 2020

dimithras commented Sep 26, 2020

INSTALLED VERSIONS

dimithras commented Sep 26, 2020

rhshadrach commented Sep 26, 2020

AlbertoEAF commented Sep 26, 2020

rhshadrach commented Sep 26, 2020

AlbertoEAF commented Sep 26, 2020

rhshadrach commented Sep 26, 2020

rhshadrach commented Sep 26, 2020

BUG: Join changes dtypes #36665

BUG: Join changes dtypes #36665

Comments

AlbertoEAF commented Sep 26, 2020 • edited Loading

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

AlbertoEAF commented Sep 26, 2020

dimithras commented Sep 26, 2020

Output of pd.show_versions()

INSTALLED VERSIONS

dimithras commented Sep 26, 2020

rhshadrach commented Sep 26, 2020

AlbertoEAF commented Sep 26, 2020

rhshadrach commented Sep 26, 2020

AlbertoEAF commented Sep 26, 2020

rhshadrach commented Sep 26, 2020

rhshadrach commented Sep 26, 2020

AlbertoEAF commented Sep 26, 2020 •

edited

Loading

Output of `pd.show_versions()`

Output of `pd.show_versions()`