BUG: Wrong dtype using range in DataFrame constructor on Windows #16804

Closed
fjetter opened this issue Jun 30, 2017 · 6 comments · Fixed by #17840
Labels
Compat (pandas objects compatibility with Numpy or Python functions), Dtype Conversions (unexpected or buggy dtype conversions)
Milestone: 0.21.0
Comments

@fjetter
Member

fjetter commented Jun 30, 2017

When using range (Python 3.5) in the DataFrame constructor, I get different dtypes depending on the system I'm running on:

>>> import pandas as pd
>>> from collections import OrderedDict
>>> data = OrderedDict([
...    ('a', range(5)),
...    ('b', [-10, -5, 0, 5, 10])
... ])
>>> df = pd.DataFrame(data)
>>> df.info()

Problem description

On Unix:

RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
a    5 non-null int64
b    5 non-null int64
dtypes: int64(2)

On Windows:

RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
a    5 non-null int32
b    5 non-null int64
dtypes: int32(1), int64(1)

The problem surfaced in an Arrow PR unit test:
apache/arrow#790
Windows tests on appveyor:
https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/build/1.0.2169
Unix tests on travis:
https://travis-ci.org/apache/arrow/builds/248514835

Expected Output

All systems should produce the same dtypes, in this case int64.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Windows
OS-release: 2012ServerR2
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.0
scipy: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

I believe we're following the behavior of numpy here, where the size of int depends on the platform. From https://docs.scipy.org/doc/numpy-1.10.1/user/basics.types.html

Some types, such as int and intp, have differing bitsizes, dependent on the platforms (e.g. 32-bit vs. 64-bit machines). This should be taken into account when interfacing with low-level code (such as C or Fortran) where the raw memory is addressed.
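To illustrate the platform dependence (a minimal check, assuming only NumPy is installed): NumPy's default integer dtype tracks the platform's C long, which is why the same code yields different widths on different systems.

```python
import numpy as np

# NumPy's default integer dtype follows the platform's C long:
# 64-bit on most Unix systems, 32-bit on Windows builds of that era.
arr = np.array(range(5))
print(arr.dtype)      # int64 on 64-bit Unix, int32 on such Windows builds
print(np.dtype(int))  # the platform's default integer dtype
```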

@TomAugspurger
Contributor

Though that's not really a good explanation to the user for why a list would be int64 while a range is int32.

@chris-b1
Contributor

chris-b1 commented Jun 30, 2017

This is annoying, but I agree with @TomAugspurger that it's correct. Ultimately the range object is passed on to numpy, which expands it using the platform int dtype.

In [14]: np.array(range(100))
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

@chris-b1
Contributor

I suppose we could intercept it and force int64 like we do with lists.

@jreback
Contributor

jreback commented Jun 30, 2017

we structure our tests to not use range at all for this reason

however it is possible to change this by explicitly intercepting a range object in the Series constructor and introspecting its start/stop/step (we do this for RangeIndex already)

so we could mark this if someone wants to do it; it's not a bug, rather a numpy workaround
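The interception described above could look roughly like the sketch below. This is not pandas' actual internals; the helper name is made up, and it only demonstrates the idea of converting a range via np.arange with a fixed 64-bit dtype.

```python
import numpy as np

def range_to_ndarray(rng):
    # Hypothetical helper: build the array directly from the range's
    # start/stop/step with an explicit 64-bit dtype, sidestepping the
    # platform-dependent default int and the cost of materializing the
    # range element by element.
    return np.arange(rng.start, rng.stop, rng.step, dtype="int64")

arr = range_to_ndarray(range(5))
print(arr.dtype)  # int64 on every platform
```

A constructor taking this path would give identical dtypes on Unix and Windows, and (as the timings below show) is also much faster than letting numpy iterate the range.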

@chris-b1
Contributor

chris-b1 commented Jun 30, 2017

Yeah, I actually think this would be a good idea. Not a big deal, but there's also perf to be picked up, since numpy apparently expands the range like a list.

In [24]: r = range(1000000)

In [25]: %timeit np.array(r)
139 ms ± 755 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [27]: %timeit np.arange(r.start, r.stop, r.step)
1.36 ms ± 8.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@chris-b1 chris-b1 added Compat pandas objects compatibility with Numpy or Python functions, Difficulty Novice, Dtype Conversions unexpected or buggy dtype conversions labels Jun 30, 2017
@chris-b1 chris-b1 added this to the Next Major Release milestone Jun 30, 2017
@jreback jreback modified the milestones: Next Major Release, 0.21.0 Oct 18, 2017