strange dtype behaviour as function of series length #7332


Closed
dsm054 opened this issue Jun 4, 2014 · 12 comments · Fixed by #7342
Labels: Bug, Performance (memory or execution speed)

@dsm054 (Contributor) commented Jun 4, 2014

Found this while tracking down what was going on with this question about performance.

First the case that makes sense:

>>> s = pd.Series(range(10**3), dtype=np.int32)
>>> s.dtype
dtype('int32')
>>> s.dtype.type
<type 'numpy.int32'>
>>> s.dtype.type in pd.lib._TYPE_MAP
True
>>> 
>>> orig_sum_type = (s+s).dtype.type
>>> orig_sum_type
<type 'numpy.int32'>
>>> orig_sum_type in pd.lib._TYPE_MAP
True

Now let's increase the length of the series.

>>> s = pd.Series(range(10**5), dtype=np.int32)
>>> s.dtype
dtype('int32')
>>> s.dtype.type
<type 'numpy.int32'>
>>> s.dtype.type in pd.lib._TYPE_MAP
True
>>> 
>>> new_sum_type = (s+s).dtype.type
>>> new_sum_type
<type 'numpy.int32'>
>>> new_sum_type in pd.lib._TYPE_MAP
False

.. wait, what?

>>> orig_sum_type, new_sum_type
(<type 'numpy.int32'>, <type 'numpy.int32'>)
>>> orig_sum_type == new_sum_type
False
>>> orig_sum_type is new_sum_type
False
>>> np.int32 is orig_sum_type
True
>>> np.int32 is new_sum_type
False

We've now got a new numpy.int32 type floating around, not equal to the one in numpy. The crossover seems to be at 10k:

>>> def find_first():
...         for i in range(1, 10**5):
...                 s = pd.Series(range(i), dtype=np.int32)
...                 if (s+s).dtype.type not in pd.lib._TYPE_MAP:
...                         return i
...         
>>> find_first()
10001

It seems to me that this failure to recognize the dtype in _TYPE_MAP prevents infer_dtype from taking its early exit when it sees an integer dtype, and that slows things down considerably.
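
For context, the 10,001-element crossover lines up with pandas' cutoff for dispatching arithmetic to numexpr, which would explain why short series never show the problem. A quick way to check; the module path below is an assumption for pandas of this era and has moved in later versions:

import pandas.computation.expressions as expr  # assumed location of the cutoff

# Series arithmetic is only routed through numexpr once the operands
# exceed this element count, matching the observed crossover at 10**4 + 1.
print(expr._MIN_ELEMENTS)  # expected: 10000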

@hayd added the Bug label on Jun 4, 2014
@jreback (Contributor) commented Jun 4, 2014

This is really odd; it happens with basically anything in _TYPE_MAP. I think the easiest fix is to key the map by name instead of by object. Maybe it's some kind of translation issue with the hashing (e.g. _TYPE_MAP is actually populated with numpy C-dtype definitions).
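
A minimal sketch of the proposed change, using illustrative dictionaries rather than the actual pandas internals: keying by the type object depends on the hash and identity of np.int32, while keying by the dtype's name only depends on a plain string.

import numpy as np

# Keyed by the scalar type object: lookups rely on hash(np.int32).
TYPE_MAP_BY_OBJECT = {np.int32: 'integer'}

# Keyed by the dtype name: lookups rely only on the string 'int32'.
TYPE_MAP_BY_NAME = {'int32': 'integer'}

def classify(dtype):
    # A distinct-but-equivalent type object (as numexpr can produce)
    # makes the object-keyed lookup miss...
    by_object = TYPE_MAP_BY_OBJECT.get(dtype.type, 'unknown')
    # ...while the name-keyed lookup still hits.
    by_name = TYPE_MAP_BY_NAME.get(dtype.name, 'unknown')
    return by_object, by_name

print(classify(np.dtype('int32')))  # ('integer', 'integer') for a normal dtype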

@dsm054 (Contributor, Author) commented Jun 4, 2014

For me it doesn't seem to happen in the same way with int64, but I agree that keying by name should work here.

The problem is that as long as a Series can wind up with dtypes that look like the numpy dtypes but aren't equal to them, it's hard to trust dtype checks anywhere in the code. Where we're only doing a fastpath check we should still get the right answer, but we could have unexpected and hard-to-track-down coercion issues elsewhere.
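
One way to keep such checks trustworthy is to compare np.dtype objects instead of relying on scalar-type identity: np.dtype equality is value-based, so it survives a duplicated type object. A small sketch of the idea, not a pandas API:

import numpy as np

def is_int32(dtype):
    # np.dtype equality compares the dtype's description, not object
    # identity, so it still works if a second numpy.int32 type exists.
    return dtype == np.dtype('int32')

# By contrast, `dtype.type is np.int32` assumes the process holds exactly
# one numpy.int32 type object, which is what this issue disproves.
print(is_int32(np.dtype('int32')))  # True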

@jreback (Contributor) commented Jun 4, 2014

Yep... going to bench and fix; should be pretty straightforward.

@jreback (Contributor) commented Jun 4, 2014

@dsm054 please give #7342 a try and see if you can validate the results.

@dsm054 (Contributor, Author) commented Jun 4, 2014

@jreback: will do, but I admit to still being a little puzzled about what's going on. I don't understand how we wind up with orig_sum_type != new_sum_type in these obscure circumstances; everything else flows from that.

I wasn't able to come up with a pure numpy demo, but I think there might be one.

(PS: note that my issues were with int32, not int64, so it's probably build-dependent.)

@jreback (Contributor) commented Jun 4, 2014

Great, I agree. Somehow the actual object np.int64 (or in your case np.int32) is DIFFERENT inside Cython from what is passed in.

@jreback (Contributor) commented Jun 4, 2014

Well, the hash on the np.dtype is DIFFERENT; really, really odd:

python test.py
type: <type 'numpy.int64'> : 8744395974726

type: <type 'numpy.object_'> : 8744395974518
type: <type 'numpy.int64'> : 8744395974726
type: <type 'numpy.int64'> : 8744395974726
type: <type 'numpy.object_'> : 8744395974518
type: <type 'numpy.int64'> : 8744395974752
diff --git a/pandas/src/inference.pyx b/pandas/src/inference.pyx
index 3aa71ad..fa6b554 100644
--- a/pandas/src/inference.pyx
+++ b/pandas/src/inference.pyx
@@ -61,6 +61,8 @@ try:
 except AttributeError:
     pass

+print("type: {type} : {id}".format(type=np.int64,id=hash(np.int64)))
+
 def infer_dtype(object _values):
     cdef:
         Py_ssize_t i, n
@@ -78,6 +80,7 @@ def infer_dtype(object _values):

     values = getattr(values, 'values', values)

+    print("type: {type} : {id}".format(type=values.dtype.type,id=hash(values.dtype.type)))
     val_name = values.dtype.name
     if val_name in _TYPE_MAP:
         return _TYPE_MAP[val_name]

more test.py

import pandas as pd
import numpy as np

N = 100000

df = pd.DataFrame(dict(A = np.arange(N), B = np.arange(N)))
df['A'] + df['B']

@jreback (Contributor) commented Jun 4, 2014

Hmm... I'll bet numexpr has a different hash for its int64 dtype (and that's the returned dtype, I think).

There may be no guarantees on that (although it IS odd).

@jreback (Contributor) commented Jun 4, 2014

@dsm054 should these be the same (the bottom result)?

In [20]: import numexpr as ne

In [21]: import numpy as np

In [22]: ne.__version__
Out[22]: '2.4'

In [23]: np.__version__
Out[23]: '1.8.1'

In [24]: a = np.arange(10,dtype='int64')

In [25]: b = np.arange(10,dtype='int64')

In [26]: result_ne = ne.evaluate('a+b')

In [27]: result_numpy = a+b

In [28]: (result_ne == result_numpy).all()
Out[28]: True

In [29]: result_ne.dtype.type
Out[29]: numpy.int64

In [30]: result_numpy.dtype.type
Out[30]: numpy.int64

In [31]: hash(result_ne.dtype.type)
Out[31]: 8768103730016

In [32]: hash(result_numpy.dtype.type)
Out[32]: 8768103729990

For floats, though, the hashes are the same:

In [1]: a = np.arange(10.)

In [2]: b = np.arange(10.)

In [4]: hash(ne.evaluate('a+b').dtype.type)
Out[4]: 8751212391216

In [5]: hash((a+b).dtype.type)
Out[5]: 8751212391216


@dsm054 (Contributor, Author) commented Jun 4, 2014

If it's a numexpr thing, that might explain why I couldn't find a purely numpy-based example.

And yes, I'd argue that the two objects should be equal, so as not to drive end users bonkers, and two equal objects have to have the same hash or dictionaries won't work. The dtype objects seem to be equal, even though the dtype.types aren't.
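
The dictionary point can be made concrete with a deliberately broken, purely hypothetical key class: once a == b no longer implies hash(a) == hash(b), dict membership silently fails, which is exactly the _TYPE_MAP symptom above.

import itertools

_ids = itertools.count()

class BadKey:
    """Hypothetical key that violates a == b => hash(a) == hash(b)."""
    def __init__(self):
        self._h = next(_ids)              # every instance hashes differently
    def __eq__(self, other):
        return isinstance(other, BadKey)  # yet all instances compare equal
    def __hash__(self):
        return self._h

d = {BadKey(): 'found'}
# The probe key is equal to the stored key but hashes differently, so the
# dict looks in the wrong bucket and never even calls __eq__:
print(BadKey() in d)  # False: the same failure mode as the second numpy.int32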
