strange dtype behaviour as function of series length #7332


Closed
dsm054 opened this issue Jun 4, 2014 · 12 comments · Fixed by #7342
Labels: Bug, Performance (memory or execution speed)

@dsm054 (Contributor) commented Jun 4, 2014

Found this while tracking down what was going on with this question about performance.

First the case that makes sense:

>>> s = pd.Series(range(10**3), dtype=np.int32)
>>> s.dtype
dtype('int32')
>>> s.dtype.type
<type 'numpy.int32'>
>>> s.dtype.type in pd.lib._TYPE_MAP
True
>>> 
>>> orig_sum_type = (s+s).dtype.type
>>> orig_sum_type
<type 'numpy.int32'>
>>> orig_sum_type in pd.lib._TYPE_MAP
True

Now let's increase the length of the series.

>>> s = pd.Series(range(10**5), dtype=np.int32)
>>> s.dtype
dtype('int32')
>>> s.dtype.type
<type 'numpy.int32'>
>>> s.dtype.type in pd.lib._TYPE_MAP
True
>>> 
>>> new_sum_type = (s+s).dtype.type
>>> new_sum_type
<type 'numpy.int32'>
>>> new_sum_type in pd.lib._TYPE_MAP
False

.. wait, what?

>>> orig_sum_type, new_sum_type
(<type 'numpy.int32'>, <type 'numpy.int32'>)
>>> orig_sum_type == new_sum_type
False
>>> orig_sum_type is new_sum_type
False
>>> np.int32 is orig_sum_type
True
>>> np.int32 is new_sum_type
False

We've now got a new numpy.int32 type floating around, not equal to the one in numpy. The crossover seems to be at 10k:

>>> def find_first():
...         for i in range(1, 10**5):
...                 s = pd.Series(range(i), dtype=np.int32)
...                 if (s+s).dtype.type not in pd.lib._TYPE_MAP:
...                         return i
...         
>>> find_first()
10001

It seems to me that this failure to recognize the dtype in _TYPE_MAP prevents infer_dtype from taking its early exit when it sees an integer dtype, and that slows things down considerably.
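
For context, the 10,001-element crossover lines up with pandas' cutoff for dispatching arithmetic to numexpr, which would explain why short series never show the problem. A quick way to check; the module path below is an assumption for pandas of this era and has moved in later versions:

import pandas.computation.expressions as expr  # assumed location of the cutoff

# Series arithmetic is only routed through numexpr once the operands
# exceed this element count, matching the observed crossover at 10**4 + 1.
print(expr._MIN_ELEMENTS)  # expected: 10000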

@hayd added the Bug label on Jun 4, 2014
@jreback (Contributor) commented Jun 4, 2014

This is really odd; it happens with basically anything in _TYPE_MAP. I think the easiest fix is to key the map by name instead of by object. Maybe it's some kind of translation issue with the hashing (e.g. _TYPE_MAP is actually populated with numpy C-dtype definitions).
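
A minimal sketch of the proposed change, using illustrative dictionaries rather than the actual pandas internals: keying by the type object depends on the hash and identity of np.int32, while keying by the dtype's name only depends on a plain string.

import numpy as np

# Keyed by the scalar type object: lookups rely on hash(np.int32).
TYPE_MAP_BY_OBJECT = {np.int32: 'integer'}

# Keyed by the dtype name: lookups rely only on the string 'int32'.
TYPE_MAP_BY_NAME = {'int32': 'integer'}

def classify(dtype):
    # A distinct-but-equivalent type object (as numexpr can produce)
    # makes the object-keyed lookup miss...
    by_object = TYPE_MAP_BY_OBJECT.get(dtype.type, 'unknown')
    # ...while the name-keyed lookup still hits.
    by_name = TYPE_MAP_BY_NAME.get(dtype.name, 'unknown')
    return by_object, by_name

print(classify(np.dtype('int32')))  # ('integer', 'integer') for a normal dtype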

@dsm054 (Contributor, Author) commented Jun 4, 2014

For me it doesn't seem to happen in the same way with int64, but I agree that keying by name should work here.

The problem is that as long as a Series can wind up with dtypes that look like the numpy dtypes but aren't equal to them, it's hard to trust dtype checks anywhere in the code. Where we're only doing a fastpath check we should still get the right answer, but we could have unexpected and hard-to-track-down coercion issues elsewhere.
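
One way to keep such checks trustworthy is to compare np.dtype objects instead of relying on scalar-type identity: np.dtype equality is value-based, so it survives a duplicated type object. A small sketch of the idea, not a pandas API:

import numpy as np

def is_int32(dtype):
    # np.dtype equality compares the dtype's description, not object
    # identity, so it still works if a second numpy.int32 type exists.
    return dtype == np.dtype('int32')

# By contrast, `dtype.type is np.int32` assumes the process holds exactly
# one numpy.int32 type object, which is what this issue disproves.
print(is_int32(np.dtype('int32')))  # True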

@jreback (Contributor) commented Jun 4, 2014

Yep... going to bench and fix; should be pretty straightforward.

@jreback (Contributor) commented Jun 4, 2014

@dsm054 please give #7342 a try and see if you can validate the results.

@dsm054 (Contributor, Author) commented Jun 4, 2014

@jreback: will do, but I admit to still being a little puzzled about what's going on. I don't understand how we wind up with orig_sum_type != new_sum_type in these obscure circumstances; everything else flows from that.

I wasn't able to come up with a pure numpy demo, but I think there might be one.

(PS: note that my issues were with int32, not int64, so it's probably build-dependent.)

@jreback (Contributor) commented Jun 4, 2014

Great, I agree. Somehow the actual object np.int64 (or in your case np.int32) is DIFFERENT inside Cython from what is passed in.

@jreback (Contributor) commented Jun 4, 2014

Well, the hash on the np.dtype is DIFFERENT; really, really odd:

python test.py
type: <type 'numpy.int64'> : 8744395974726

type: <type 'numpy.object_'> : 8744395974518
type: <type 'numpy.int64'> : 8744395974726
type: <type 'numpy.int64'> : 8744395974726
type: <type 'numpy.object_'> : 8744395974518
type: <type 'numpy.int64'> : 8744395974752
diff --git a/pandas/src/inference.pyx b/pandas/src/inference.pyx
index 3aa71ad..fa6b554 100644
--- a/pandas/src/inference.pyx
+++ b/pandas/src/inference.pyx
@@ -61,6 +61,8 @@ try:
 except AttributeError:
     pass

+print("type: {type} : {id}".format(type=np.int64,id=hash(np.int64)))
+
 def infer_dtype(object _values):
     cdef:
         Py_ssize_t i, n
@@ -78,6 +80,7 @@ def infer_dtype(object _values):

     values = getattr(values, 'values', values)

+    print("type: {type} : {id}".format(type=values.dtype.type,id=hash(values.dtype.type)))
     val_name = values.dtype.name
     if val_name in _TYPE_MAP:
         return _TYPE_MAP[val_name]

more test.py

import pandas as pd
import numpy as np

N = 100000

df = pd.DataFrame(dict(A = np.arange(N), B = np.arange(N)))
df['A'] + df['B']

@jreback (Contributor) commented Jun 4, 2014

Hmm... I'll bet numexpr has a different hash for its int64 dtype (and that's the returned dtype, I think).

There may be no guarantees on that (although it IS odd).

@jreback (Contributor) commented Jun 4, 2014

@dsm054 should these be the same (the bottom result)?

In [20]: import numexpr as ne

In [21]: import numpy as np

In [22]: ne.__version__
Out[22]: '2.4'

In [23]: np.__version__
Out[23]: '1.8.1'

In [24]: a = np.arange(10,dtype='int64')

In [25]: b = np.arange(10,dtype='int64')

In [26]: result_ne = ne.evaluate('a+b')

In [27]: result_numpy = a+b

In [28]: (result_ne == result_numpy).all()
Out[28]: True

In [29]: result_ne.dtype.type
Out[29]: numpy.int64

In [30]: result_numpy.dtype.type
Out[30]: numpy.int64

In [31]: hash(result_ne.dtype.type)
Out[31]: 8768103730016

In [32]: hash(result_numpy.dtype.type)
Out[32]: 8768103729990

For floats, though, the hashes are the same:

In [1]: a = np.arange(10.)

In [2]: b = np.arange(10.)

In [4]: hash(ne.evaluate('a+b').dtype.type)
Out[4]: 8751212391216

In [5]: hash((a+b).dtype.type)
Out[5]: 8751212391216


@dsm054 (Contributor, Author) commented Jun 4, 2014

If it's a numexpr thing, that might explain why I couldn't find a purely numpy-based example.

And yes, I'd argue that the two objects should be equal, so as not to drive end users bonkers, and two equal objects have to have the same hash or dictionaries won't work. The dtype objects seem to be equal, even though the dtype.types aren't.
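
The dictionary point can be made concrete with a deliberately broken, purely hypothetical key class: once a == b no longer implies hash(a) == hash(b), dict membership silently fails, which is exactly the _TYPE_MAP symptom above.

import itertools

_ids = itertools.count()

class BadKey:
    """Hypothetical key that violates a == b => hash(a) == hash(b)."""
    def __init__(self):
        self._h = next(_ids)              # every instance hashes differently
    def __eq__(self, other):
        return isinstance(other, BadKey)  # yet all instances compare equal
    def __hash__(self):
        return self._h

d = {BadKey(): 'found'}
# The probe key is equal to the stored key but hashes differently, so the
# dict looks in the wrong bucket and never even calls __eq__:
print(BadKey() in d)  # False: the same failure mode as the second numpy.int32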
