BUG: Join changes dtypes #36665
Comments
Update: I had to do:

A.set_index("threshold_step", inplace=True)
B.set_index("threshold_step", inplace=True)
stats_df = A.join(B, rsuffix="_cards", sort=True).reset_index()

Otherwise it would use the index even though I set on="threshold_step". I'm reading the docs and this is what they say about join's on parameter. Meaning there's no way to join by column names on both sides? One must always set the index?
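For reference, a minimal sketch of two ways to line the frames up by column without permanently re-indexing them, assuming A and B both still carry "threshold_step" as a regular column. DataFrame.join always aligns the other frame on its index, and its on parameter only names columns of the calling frame, which is why setting on alone was not enough:

>>> # Option 1: keep join, but move the key into B's index just for this call
>>> A.join(B.set_index("threshold_step"), on="threshold_step", rsuffix="_cards", sort=True)
>>> # Option 2: merge joins column-to-column on both sides directly
>>> A.merge(B, on="threshold_step", suffixes=("", "_cards"), sort=True)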
Could you please provide a code sample defining A and B? I tried reproducing the issue yet I get the expected results.

>>> A = pd.DataFrame({'threshold_step' : [0,1], 'precision' : [0.0,1.1], 'alerts' : [0,1], 'alerted_money' : [0.0,0.1]})
>>> B = pd.DataFrame({'threshold_step' : [0,1], 'precision' : [0.0,1.1], 'alerts' : [0,1]})
>>> print(A.dtypes)
threshold_step int64
precision float64
alerts int64
alerted_money float64
dtype: object
>>> print(B.dtypes)
threshold_step int64
precision float64
alerts int64
dtype: object
>>> print(A.join(B, on="threshold_step", rsuffix="_cards", sort=True).dtypes)
threshold_step int64
precision float64
alerts int64
alerted_money float64
threshold_step_cards int64
precision_cards float64
alerts_cards int64
dtype: object

Output of pd.show_versions()
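Worth noting why this reproduction gives the expected result: with the default RangeIndex, B's index happens to coincide with its "threshold_step" column, so on="threshold_step" lines up by coincidence. join matches A["threshold_step"] against B's index, not against B["threshold_step"]. A small illustration (B_shifted is a made-up name for this sketch, reusing the A and B defined above):

>>> list(B.index), list(B["threshold_step"])
([0, 1], [0, 1])
>>> # break the coincidence: B keeps the same rows but gets a different index
>>> B_shifted = B.set_index(B.index + 10)
>>> # now nothing matches, NaN is introduced, and the *_cards columns come back as float64
>>> A.join(B_shifted, on="threshold_step", rsuffix="_cards").dtypes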
Tried it also on pandas 1.1.2, getting the expected output.
@AlbertoEAF It looks like you're wanting to use … Likely this created missing values (NaN), which coerced the dtype to float.
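As background on the coercion itself (a generic illustration, not tied to the reporter's data): NumPy-backed int64 columns have no slot for missing values, so any operation that introduces NaN upcasts them to float64; the nullable Int64 extension dtype is the usual way to keep integers through missing values:

>>> s = pd.Series([0, 1], dtype="int64")
>>> s.reindex([0, 1, 2]).dtype          # the new position is NaN, so the dtype is upcast
dtype('float64')
>>> s.astype("Int64").reindex([0, 1, 2]).dtype   # nullable Int64 stays integer
Int64Dtype()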
Hello,

There are no NaN's. However, as extra processing for normalization of the output shape, I do add extra rows to fill all the integers in a range (let's say 100) and then fill the table with the statistics for all those points. This is the pre-processing step at which the indexes go out of sync:

import pandas as pd
import numpy as np
def extend_table_to_all_thresholds(df, score_int_field, threshold_steps):
"""Adds rows even for thresholds with no data."""
missing_threshold_steps = np.where(
np.isin(
np.arange(threshold_steps + 1),
df[score_int_field],
invert=True
))[0]
df = pd.concat([df, pd.DataFrame(data=missing_threshold_steps, columns=[score_int_field])])
df.fillna(0, inplace=True)
return df
# A and B don't start with all the same thresholds:
A = pd.DataFrame({'threshold_step' : [0,4], 'precision' : [0.0,1.1], 'alerts' : [0,1], 'alerted_money' : [0.0,0.1]})
B = pd.DataFrame({'threshold_step' : [0,1], 'precision' : [0.0,1.1], 'alerts' : [0,1]})
# Then I fill more steps in (which results in broken indices it seems). Comment this and it works:
A = extend_table_to_all_thresholds(A, "threshold_step", 100)
B = extend_table_to_all_thresholds(B, "threshold_step", 100)
# Join doesn't work without setting indices:
print(A.join(B, on="threshold_step", rsuffix="_cards", sort=True).dtypes)
# But merge doesn't either:
print(A.merge(B, on="threshold_step", sort=True).dtypes)

This "extend thresholds" call results in repeated indices (I don't know how that's possible), which then breaks the join unless I set "threshold_step" as the index for both tables. However, not only does …
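On the repeated 0, 1, 0, 1, 2, … index: pd.concat keeps each input's own labels unless told otherwise, so the rows appended by the helper above arrive with a fresh 0-based RangeIndex of their own. A small sketch with made-up frames (left and extra are illustrative names); passing ignore_index=True, or calling reset_index(drop=True) afterwards, rebuilds one clean RangeIndex:

>>> left = pd.DataFrame({"threshold_step": [0, 4]})
>>> extra = pd.DataFrame({"threshold_step": [1, 2]})
>>> list(pd.concat([left, extra]).index)                     # both inputs keep their labels
[0, 1, 0, 1]
>>> list(pd.concat([left, extra], ignore_index=True).index)  # one clean RangeIndex instead
[0, 1, 2, 3]

Note that even with a clean index, join's on= still matches against the other frame's index, so joining by column on both sides still needs the merge or set_index approach shown earlier.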
After the call to …
You're right @rhshadrach, my bad, but in my real code I actually generate those columns after, not before, and the only way it works is with the indices. Regarding the index in the image above though, does it make sense for it to be repeated? 0, 1, 0, 1, 2, …? Thanks for your time @rhshadrach @dimithras ;)
It looks like the index …
Closing for now. If you can reproduce the issue, reply here and we can reopen.
Hello,
Problem description
I have two pandas dataframes that I want to join using pandas (1.1.2).
However, when I join them, the column "alerts" in table B gets its type changed from int64 to float64 ("alerts_cards"):
I usually then remove the join key ("threshold_step_cards"), but now I'm noticing it became a float as well.
The join key column has the same entries in both tables (all integers in range 0 to 100) and there are no NaN's in my dataframes.
Expected Output
The join should preserve the datatypes of the original columns. I spent several hours looking for a bug in my code until, with the debugger, I found it came from this pandas join.
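For comparison, a minimal sketch of the behaviour the report expects (illustrative frames, not the original data): when every key exists on both sides and no missing values are introduced, a column-to-column merge preserves the int64 dtypes:

>>> left = pd.DataFrame({"threshold_step": range(101), "alerts": range(101)})
>>> right = pd.DataFrame({"threshold_step": range(101), "alerts": range(101)})
>>> left.merge(right, on="threshold_step", suffixes=("", "_cards")).dtypes
threshold_step    int64
alerts            int64
alerts_cards      int64
dtype: object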
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 2a7d332
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-118-generic
Version : #119-Ubuntu SMP Tue Sep 8 12:30:01 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : pt_PT.UTF-8
pandas : 1.1.2
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.2.0.post20200511
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.16
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None
Thank you