Integers coerced to float on Series construction from dictionary. #8211

Closed
ssanderson opened this issue Sep 8, 2014 · 4 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Usage Question

Comments

@ssanderson
Contributor

When constructing a Series from a dictionary that contains floats, all integer values in the dictionary are coerced to float, unless the dictionary also contains an integer value greater than sys.maxint.

Minimal repro case:

import pandas as pd
import sys

if __name__ == '__main__':

    pd.show_versions()

    # Works as expected.
    t0 = {'A': 1,
          'B': 10}

    # B gets coerced to float.
    t1 = {'A': 1.5,
          'B': 10}

    # B gets coerced to float.
    t2 = {'A': 1.5,
          'B': 10,
          'C': sys.maxint}

    # B is not coerced if the dictionary contains a value > maxint.
    t3 = {'A': 1.5,
          'B': 10,
          'C': sys.maxint + 1}

    test_0 = pd.Series(t0)
    test_1 = pd.Series(t1)
    test_2 = pd.Series(t2)
    test_3 = pd.Series(t3)

    for test in test_0, test_1, test_2, test_3:
        print "Type at index B is %s" % type(test.B)

Output:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Darwin
OS-release: 13.1.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.1
nose: 1.3.3
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.13.2
statsmodels: 0.5.0
IPython: 2.2.0
sphinx: 1.2.1
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.4
bottleneck: None
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: None
xlrd: 0.9.3
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.8
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

Type at index B is <type 'numpy.int64'>
Type at index B is <type 'numpy.float64'>
Type at index B is <type 'numpy.float64'>
Type at index B is <type 'int'>

Stepping through the code, the coercion looks like it's happening in the call to lib.fast_multiget at https://github.com/pydata/pandas/blob/v0.14.1/pandas/core/series.py#L191.

@shoyer
Member

shoyer commented Sep 8, 2014

For a Series to be performant, it needs to contain data with a homogeneous type. So this is indeed working as intended.

If you really want to faithfully preserve type distinctions like float/int, you can pass a numpy.ndarray with dtype=object directly into the Series constructor. But beware that everything you do with that series that involves arithmetic will be much slower (roughly 20x-100x slower, if I recall correctly).
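A minimal sketch of this suggestion in a modern pandas (the original thread ran on pandas 0.14.1 / Python 2; the index labels here mirror the repro above). Wrapping the values in an object ndarray sidesteps numeric dtype inference, so the Python int survives:

```python
import numpy as np
import pandas as pd

# Wrap the values in an object ndarray before building the Series, so
# pandas stores the raw Python objects instead of inferring float64.
values = np.array([1.5, 10], dtype=object)
s = pd.Series(values, index=['A', 'B'])

print(s.dtype)       # object
print(type(s['B']))  # <class 'int'> -- not coerced to float
```

The trade-off shoyer notes applies: arithmetic on an object-dtype Series falls back to per-element Python operations rather than vectorized numpy code.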

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Usage Question labels Sep 8, 2014
@jreback jreback closed this as completed Sep 8, 2014
@immerrr
Contributor

immerrr commented Sep 8, 2014

I tend to think this question is rather about the fact that in Python 2.X:

In [3]: type(sys.maxint)
Out[3]: int

In [4]: type(sys.maxint + 1)
Out[4]: long

and the fact that pandas handles the two cases inconsistently.
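Python 3 has no separate long type, but the same inconsistency can still be reproduced there: an integer too large for any 64-bit dtype forces pandas back to object dtype, which skips the float coercion (a sketch in a modern pandas; the dict keys mirror the repro above):

```python
import pandas as pd

# Without an oversized int, everything is unified to float64.
s_small = pd.Series({'A': 1.5, 'B': 10})

# 2**64 fits neither int64 nor uint64, so inference falls back to
# object dtype and 'B' keeps its exact Python int value.
s_big = pd.Series({'A': 1.5, 'B': 10, 'C': 2**64})

print(s_small.dtype)      # float64
print(s_big.dtype)        # object
print(type(s_big['B']))   # <class 'int'>
```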

@ssanderson
Contributor Author

Sorry, I should have put this in the post, but the truncation behavior occurs even when dtype=object is explicitly passed.

I fully expect that sys.maxint and sys.maxint + 1 should have different types (that's the point of maxint). What's surprising is that the presence of a long in the input dictionary determines the truncation behavior of other elements in the passed collection of values, though I suppose it makes sense if the inferred datatype of long is object. At any rate, the coercion behavior when object is explicitly passed still seems incorrect?

I also realize that mathematical operations on a heterogeneously-typed Series will be slower. The values here are loaded out of a database, and the particular values being truncated are integer representations of Timestamps, which will eventually become a DatetimeIndex on a DataFrame built from a sequence of these Series objects.
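This use case is exactly where float coercion bites: float64 has a 53-bit mantissa, so any integer above 2**53 (about 9e15, i.e. every plausible nanosecond timestamp) is silently rounded. A sketch with a made-up nanosecond value:

```python
import pandas as pd

# float64 cannot represent integers above 2**53 exactly, so a round-trip
# through float silently changes a nanosecond-resolution timestamp.
ts = 1_410_134_400_000_000_123  # hypothetical ns-since-epoch value
assert int(float(ts)) != ts     # precision lost

# Kept as an exact int, the value converts to a Timestamp losslessly.
idx = pd.to_datetime([ts], unit='ns')
print(idx[0].value)  # exact nanosecond value preserved
```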

@jreback
Contributor

jreback commented Sep 8, 2014

This is related to the maxint rollover on osx (only), see here: #3922

Not sure what, if anything, can be done about this.

OS X is just weird here and doesn't behave properly (in its Python implementation).
