BUG: Wrong dtype using range in DataFrame constructor on Windows #16804

Closed
fjetter opened this issue Jun 30, 2017 · 6 comments · Fixed by #17840
Labels
Compat (pandas objects compatibility with Numpy or Python functions), Dtype Conversions (unexpected or buggy dtype conversions)
Milestone: 0.21.0
Comments

@fjetter
Member

fjetter commented Jun 30, 2017

When using range (Python 3.5) in the DataFrame constructor, I get different dtypes depending on the system I'm running on:

>>> import pandas as pd
>>> from collections import OrderedDict
>>> data = OrderedDict([
...    ('a', range(5)),
...    ('b', [-10, -5, 0, 5, 10])
... ])
>>> df = pd.DataFrame(data)
>>> df.info()

Problem description

On Unix:

RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
a    5 non-null int64
b    5 non-null int64
dtypes: int64(2)

On Windows:

RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
a    5 non-null int32
b    5 non-null int64
dtypes: int32(1), int64(1)

The problem surfaced in an Arrow PR unit test:
apache/arrow#790
Windows tests on appveyor:
https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/build/1.0.2169
Unix tests on travis:
https://travis-ci.org/apache/arrow/builds/248514835

Expected Output

All systems should produce the same dtypes, in this case int64.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Windows
OS-release: 2012ServerR2
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.0
scipy: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

I believe we're following the behavior of numpy here, where the size of int depends on the platform. From https://docs.scipy.org/doc/numpy-1.10.1/user/basics.types.html

Some types, such as int and intp, have differing bitsizes, dependent on the platforms (e.g. 32-bit vs. 64-bit machines). This should be taken into account when interfacing with low-level code (such as C or Fortran) where the raw memory is addressed.
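To illustrate the platform dependence (a minimal check, assuming only NumPy is installed): NumPy's default integer dtype tracks the platform's C long, which is why the same code yields different widths on different systems.

```python
import numpy as np

# NumPy's default integer dtype follows the platform's C long:
# 64-bit on most Unix systems, 32-bit on Windows builds of that era.
arr = np.array(range(5))
print(arr.dtype)      # int64 on 64-bit Unix, int32 on such Windows builds
print(np.dtype(int))  # the platform's default integer dtype
```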

@TomAugspurger
Contributor

Though that's not really a good explanation to the user for why a list would be int64 while a range is int32.

@chris-b1
Contributor

chris-b1 commented Jun 30, 2017

This is annoying, but I agree with @TomAugspurger that it's correct. Ultimately the range object is passed on to numpy, which expands it using the platform int dtype.

In [14]: np.array(range(100))
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
       68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
       85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

@chris-b1
Contributor

I suppose we could intercept it and force int64 like we do with lists.

@jreback
Contributor

jreback commented Jun 30, 2017

we structure our tests to not use range at all for this reason

however it is possible to change this by explicitly intercepting a range object in the Series constructor and introspecting its start/stop/step (we do this for RangeIndex already)

so we could mark this if someone wants to do it; it's not a bug, rather a numpy workaround
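The interception described above could look roughly like the sketch below. This is not pandas' actual internals; the helper name is made up, and it only demonstrates the idea of converting a range via np.arange with a fixed 64-bit dtype.

```python
import numpy as np

def range_to_ndarray(rng):
    # Hypothetical helper: build the array directly from the range's
    # start/stop/step with an explicit 64-bit dtype, sidestepping the
    # platform-dependent default int and the cost of materializing the
    # range element by element.
    return np.arange(rng.start, rng.stop, rng.step, dtype="int64")

arr = range_to_ndarray(range(5))
print(arr.dtype)  # int64 on every platform
```

A constructor taking this path would give identical dtypes on Unix and Windows, and (as the timings below show) is also much faster than letting numpy iterate the range.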

@chris-b1
Contributor

chris-b1 commented Jun 30, 2017

Yeah, I actually think this would be a good idea. Not a big deal, but there's also perf to be picked up, since numpy apparently expands the range like a list.

In [24]: r = range(1000000)

In [25]: %timeit np.array(r)
139 ms ± 755 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [27]: %timeit np.arange(r.start, r.stop, r.step)
1.36 ms ± 8.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@chris-b1 chris-b1 added Compat pandas objects compatibility with Numpy or Python functions, Difficulty Novice, Dtype Conversions unexpected or buggy dtype conversions labels Jun 30, 2017
@chris-b1 chris-b1 added this to the Next Major Release milestone Jun 30, 2017
@jreback jreback modified the milestones: Next Major Release, 0.21.0 Oct 18, 2017