Description
When I left-join two data frames on an int64 key, the key is cast to float64 in the result if it is the index of one of the two data frames:
>>> id = [1, 143349558619761320]
>>> a = pd.DataFrame({'a': [1, 2]}, index=pd.Int64Index(id, name='id'))
>>> b = pd.DataFrame({'id': id[1], 'b': [2]})
>>> merge_index = pd.merge(a, b, left_index=True, right_on='id', how='left')
>>> print(merge_index.dtypes)
a       int64
b     float64
id    float64
dtype: object
This is a problem here because the conversion to float64 is lossy for this value, so casting back to an int does not restore the original id:
>>> int(float(143349558619761320))
143349558619761312
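For context, a float64 has a 53-bit significand, so not every integer above 2**53 can be represented exactly. The following is only a small sketch of that round-trip check in plain Python; the helper name `survives_float_round_trip` is made up for illustration and is not part of pandas:

```python
# The id from the example above; it is larger than 2**53.
value = 143349558619761320


def survives_float_round_trip(n):
    """Return True if n is unchanged after int -> float -> int."""
    return int(float(n)) == n


print(2 ** 53)                             # 9007199254740992
print(survives_float_round_trip(2 ** 53))  # True
print(survives_float_round_trip(value))    # False
print(int(float(value)) - value)           # -8
```

Any id that fails this check would be silently changed by the cast to float64.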
If the int64 column is not in the index, the column is not cast to a float:
>>> merge_column = pd.merge(a.reset_index(), b, on='id', how='left')
>>> print(merge_column.dtypes)
id      int64
a       int64
b     float64
dtype: object
I understand that integer columns get cast to float when NaNs are introduced (like column b here), but in this case the resulting id column contains no missing values, so the cast to float is unnecessary and could be avoided.
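Until this is addressed, a possible workaround is to move the index into a regular column before merging, as in the column-based merge above, and restore it afterwards. This is only a sketch: `left_merge_on_index_key` is a hypothetical helper name, and it assumes the left frame's index is named after the key column of the right frame. It uses `pd.Index` rather than `pd.Int64Index` so that it should not depend on one particular pandas version.

```python
import pandas as pd


def left_merge_on_index_key(left, right, key):
    """Left-join `left` (indexed by `key`) with `right` (which has a column `key`).

    Hypothetical helper: the index is turned into a column first, so the
    merge happens column-to-column and the key keeps its integer dtype.
    """
    merged = pd.merge(left.reset_index(), right, on=key, how='left')
    # Put the key back into the index, as it was in `left`.
    return merged.set_index(key)


ids = [1, 143349558619761320]
a = pd.DataFrame({'a': [1, 2]}, index=pd.Index(ids, name='id'))
b = pd.DataFrame({'id': [ids[1]], 'b': [2]})

result = left_merge_on_index_key(a, b, 'id')
print(result.index.dtype)
print(list(result.index))
```

Column b is still expected to become float64 because the left join introduces a NaN there; only the key is protected by this approach.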
Output from pd.show_versions():
INSTALLED VERSIONS
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-5-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.16.0
nose: 1.3.4
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.0.0
sphinx: None
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None