Integers coerced to float on Series construction from dictionary. #8211

Closed
ssanderson opened this issue Sep 8, 2014 · 4 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Usage Question

Comments

@ssanderson
Contributor

When constructing a Series from a dictionary that contains floats, all integer values in the dictionary are coerced to float, unless the dictionary also contains an integer value greater than sys.maxint.

Minimal repro case:

import pandas as pd
import sys

if __name__ == '__main__':

    pd.show_versions()

    # Works as expected.
    t0 = {'A': 1,
          'B': 10}

    # B gets coerced to float.
    t1 = {'A': 1.5,
          'B': 10}

    # B gets coerced to float.
    t2 = {'A': 1.5,
          'B': 10,
          'C': sys.maxint}

    # B is not coerced if the dictionary contains a value > maxint.
    t3 = {'A': 1.5,
          'B': 10,
          'C': sys.maxint + 1}

    test_0 = pd.Series(t0)
    test_1 = pd.Series(t1)
    test_2 = pd.Series(t2)
    test_3 = pd.Series(t3)

    for test in test_0, test_1, test_2, test_3:
        print "Type at index B is %s" % type(test.B)

Output:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Darwin
OS-release: 13.1.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.1
nose: 1.3.3
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.13.2
statsmodels: 0.5.0
IPython: 2.2.0
sphinx: 1.2.1
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.4
bottleneck: None
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: None
xlrd: 0.9.3
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.8
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

Type at index B is <type 'numpy.int64'>
Type at index B is <type 'numpy.float64'>
Type at index B is <type 'numpy.float64'>
Type at index B is <type 'int'>

Stepping through the code, the coercion looks like it's happening in the call to lib.fast_multiget at https://github.com/pydata/pandas/blob/v0.14.1/pandas/core/series.py#L191.

@shoyer
Member

shoyer commented Sep 8, 2014

For a Series to be performant, it needs to contain data with a homogeneous type. So this is indeed working as intended.

If you really want to faithfully preserve type distinctions like float/int, you can pass a numpy.ndarray with dtype=object directly into the Series constructor. But beware that everything you do with that series that involves arithmetic will be much slower (roughly 20x-100x slower, if I recall correctly).
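A minimal sketch of this suggestion in a modern pandas (the original thread ran on pandas 0.14.1 / Python 2; the index labels here mirror the repro above). Wrapping the values in an object ndarray sidesteps numeric dtype inference, so the Python int survives:

```python
import numpy as np
import pandas as pd

# Wrap the values in an object ndarray before building the Series, so
# pandas stores the raw Python objects instead of inferring float64.
values = np.array([1.5, 10], dtype=object)
s = pd.Series(values, index=['A', 'B'])

print(s.dtype)       # object
print(type(s['B']))  # <class 'int'> -- not coerced to float
```

The trade-off shoyer notes applies: arithmetic on an object-dtype Series falls back to per-element Python operations rather than vectorized numpy code.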

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Usage Question labels Sep 8, 2014
@jreback jreback closed this as completed Sep 8, 2014
@immerrr
Contributor

immerrr commented Sep 8, 2014

I tend to think this question is rather about the fact that in Python 2.X:

In [3]: type(sys.maxint)
Out[3]: int

In [4]: type(sys.maxint + 1)
Out[4]: long

and the fact that pandas handles the two cases inconsistently.
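Python 3 has no separate long type, but the same inconsistency can still be reproduced there: an integer too large for any 64-bit dtype forces pandas back to object dtype, which skips the float coercion (a sketch in a modern pandas; the dict keys mirror the repro above):

```python
import pandas as pd

# Without an oversized int, everything is unified to float64.
s_small = pd.Series({'A': 1.5, 'B': 10})

# 2**64 fits neither int64 nor uint64, so inference falls back to
# object dtype and 'B' keeps its exact Python int value.
s_big = pd.Series({'A': 1.5, 'B': 10, 'C': 2**64})

print(s_small.dtype)      # float64
print(s_big.dtype)        # object
print(type(s_big['B']))   # <class 'int'>
```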

@ssanderson
Contributor Author

Sorry, I should have put this in the post, but the truncation behavior occurs even when dtype=object is explicitly passed.

I fully expect that sys.maxint and sys.maxint + 1 should have different types (that's the point of maxint). What's surprising is that the presence of a long in the input dictionary determines the truncation behavior of other elements in the passed collection of values, though I suppose it makes sense if the inferred datatype of long is object. At any rate, the coercion behavior when object is explicitly passed still seems incorrect?

I also realize that mathematical operations on a heterogeneously-typed Series will be slower. The values here are loaded out of a database, and the particular values being truncated are integer representations of Timestamps, which will eventually become a DatetimeIndex on a DataFrame built from a sequence of these Series objects.
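This use case is exactly where float coercion bites: float64 has a 53-bit mantissa, so any integer above 2**53 (about 9e15, i.e. every plausible nanosecond timestamp) is silently rounded. A sketch with a made-up nanosecond value:

```python
import pandas as pd

# float64 cannot represent integers above 2**53 exactly, so a round-trip
# through float silently changes a nanosecond-resolution timestamp.
ts = 1_410_134_400_000_000_123  # hypothetical ns-since-epoch value
assert int(float(ts)) != ts     # precision lost

# Kept as an exact int, the value converts to a Timestamp losslessly.
idx = pd.to_datetime([ts], unit='ns')
print(idx[0].value)  # exact nanosecond value preserved
```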

@jreback
Contributor

jreback commented Sep 8, 2014

This is related to the maxint rollover on osx (only), see here: #3922

Not sure what, if anything, can be done about this.

OS X is just weird here and doesn't behave properly (in its Python implementation).
