Skip to content

Left join on an integer column casts the column to float if it is in one of the indexes #9958

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
bembom opened this issue Apr 20, 2015 · 11 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Duplicate Report Duplicate issue or pull request Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@bembom
Copy link

bembom commented Apr 20, 2015

When I left-join two data frames on an int64 column, this column is cast to a float if it is the index of one of the two data frames:

>>> id = [1, 143349558619761320]
>>> a = pd.DataFrame({'a': [1, 2]}, index=pd.Int64Index(id, name='id'))
>>> b = pd.DataFrame({'id': id[1], 'b': [2]})
>>> merge_index = pd.merge(a, b, left_index=True, right_on='id', how='left')
>>> print(merge_index.dtypes)
a       int64
b     float64
id    float64
dtype: object

In this case this is a problem because casting back to an int changes the value of this column:

>>> int(float(143349558619761320))
143349558619761312

If the int64 column is not in the index, the column is not cast to a float:

>>> merge_column = pd.merge(a.reset_index(), b, on='id', how='left')
>>> print(merge_column.dtypes)
id      int64
a       int64
b     float64
dtype: object

I understand that integer columns get cast to a float if NaNs are introduced (like column b here), but in this case the final column contains no missing values, so casting to a float can be avoided.

Output from pd.show_versions():

INSTALLED VERSIONS

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-5-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.0
nose: 1.3.4
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.0.0
sphinx: None
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None

@midnightradio
Copy link

I just met exactly the same problem with having casted float64 after left joining with two int64 fields which one of them is an index of left dataframe.

And here's output from show_versions.

INSTALLED VERSIONS

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-49-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.2
nose: 1.3.6
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: None
IPython: 3.1.0
sphinx: 1.3.1
patsy: None
dateutil: 2.4.2
pytz: 2015.2
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: 0.9.1
apiclient: 1.4.0
sqlalchemy: 1.0.8
pymysql: None
psycopg2: None

@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Dtype Conversions Unexpected or buggy dtype conversions Difficulty Intermediate labels Oct 8, 2015
@jreback jreback added this to the Next Major Release milestone Oct 8, 2015
@jiamo
Copy link

jiamo commented Apr 27, 2016

merge has the same error:

testb = pd.DataFrame([{"a":2,"c":2}])
testa = pd.DataFrame([{"a":1,"b":2}])
result = pd.merge(testa, testb, on=["a"], how="outer")

@wendykan
Copy link

Just hit the same problem. However, if how='inner, it's not an issue.

@PedroMDuarte
Copy link

Just encountered the same issue. Here is how I fixed it

# old code - casts 'eCountryID' to float64
merged = pd.merge(
    analyzed, ecountry_data[['CountryCode', 'eCountryID']],
    how='left', on=['CountryCode'])


# new code
merged_country_id = pd.merge(
    analyzed, ecountry_data[['CountryCode', 'eCountryID']],
    how='inner', on=['CountryCode']).astype(object)

no_country_id = analyzed[pd.isnull(analyzed['CountryCode'])]

merged = pd.concat([merged_country_id, no_country_id])

@jreback
Copy link
Contributor

jreback commented Jun 12, 2016

this is a dupe of #8596, closed by #13170 which will be in 0.18.2

see http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#merging-changes

@jreback jreback closed this as completed Jun 12, 2016
@jreback jreback added the Duplicate Report Duplicate issue or pull request label Jun 12, 2016
@flipdazed
Copy link

Should a new issue be opened for the outer join issue highlighted by @jiamo ?

@jreback
Copy link
Contributor

jreback commented Jun 1, 2017

@flipdazed this looks ok

In [1]: testb = pd.DataFrame([{"a":2,"c":2}])
   ...: testa = pd.DataFrame([{"a":1,"b":2}])
   ...: result = pd.merge(testa, testb, on=["a"], how="outer")
   ...: 

In [2]: testa
Out[2]: 
   a  b
0  1  2

In [3]: testb
Out[3]: 
   a  c
0  2  2

In [4]: result
Out[4]: 
   a    b    c
0  1  2.0  NaN
1  2  NaN  2.0

In [5]: result.dtypes
Out[5]: 
a      int64
b    float64
c    float64
dtype: object

@verahan
Copy link

verahan commented Jul 24, 2017

How to compare result['a'][1] and result['c'][1] after outer merge? they should be equal.

In [1]: testb = pd.DataFrame([{"a":2,"c":2}])
 ...: testa = pd.DataFrame([{"a":1,"b":2}])
 ...: result = pd.merge(testa, testb, on=["a"], how="outer")
 ...: 

In [2]: testa
Out[2]: 
   a  b
0  1  2

In [3]: testb
Out[3]: 
   a  c
0  2  2

In [4]: result
Out[4]: 
   a    b    c
0  1  2.0  NaN
1  2  NaN  2.0

@teoguso
Copy link

teoguso commented Jul 7, 2018

Edit: What I see happening is actually a join casting ints to floats if the result of the join contains NaN. I'm not 100% sure, but I think this is the expected behavior. Sorry for the confusion.

I see this still happening in 0.23.2. Did it sneak in again? (Left join with int index as described above)

@willis-richard
Copy link

willis-richard commented Jul 20, 2018

It is happening for me in version 0.23.3, using outer join.

Edit: I agree with teoguso's edit, the casting only occurs if the join subsequently contains NaNs. That explains the casting then.

Edit2: https://pandas.pydata.org/pandas-docs/stable/missing_data.html explains the casting rules for anyone (such as myself) who was unaware of them.

@teoguso
Copy link

teoguso commented Jul 20, 2018

@Muff2n please see my edit above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Duplicate Report Duplicate issue or pull request Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

10 participants