API Change for arithmetic methods for non-pandas objects #17767

TomAugspurger · 2017-10-03T17:46:49Z

Possible API breakage for dask in #16821

import dask.dataframe as dd
import pandas as pd

from dask.dataframe.utils import assert_eq


s = dd.core.Scalar({('s', 0): 10}, 's', 'i8')
pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                    'b': [7, 6, 5, 4, 3, 2, 1]})
result = (pdf + s).dtypes  # This now casts to object, used to retain int64
expected = pdf.dtypes
assert_eq(result, expected)

In master, result is now (object, object). Before it was (int64, int64).

I'm looking into #16821 to see if this was unintentional, and can be avoided.

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2017-10-03T19:05:54Z

So, IIUC, it comes down to the fact that

In [7]: from pandas.api.types import is_integer

In [8]: import dask.dataframe as dd

In [9]: is_integer(dd.core.Scalar(1, 'name', 'i8'))
Out[9]: False

Which I suppose means pandas is doing everything correctly here?

TomAugspurger · 2017-10-03T19:39:42Z

@jreback thoughts on this patch?

modified   pandas/core/internals.py
@@ -1880,7 +1880,8 @@ class IntBlock(NumericBlock):
             return (issubclass(tipo, np.integer) and
                     not issubclass(tipo, (np.datetime64, np.timedelta64)) and
                     self.dtype.itemsize >= element.dtype.itemsize)
-        return is_integer(element)
+        return (is_integer(element) or
+                hasattr(element, 'dtype') and element.dtype == 'int')
 
     def should_store(self, value):
         return is_integer_dtype(value) and value.dtype == self.dtype

AFAICT, there's nothing an object can do to duck type is_integer into passing (or is there?).

TomAugspurger · 2017-10-03T19:45:16Z

Actually, nvm. The better fix is to have the dask version take the if clause... something like

@@ -1874,7 +1874,7 @@ class IntBlock(NumericBlock):
     _can_hold_na = False
 
     def _can_hold_element(self, element):
-        if is_list_like(element):
+        if is_list_like(element) or hasattr(element, 'dtype'):
             element = np.array(element)
             tipo = element.dtype.type
             return (issubclass(tipo, np.integer) and

Or I could register dask.dataframe.core.Scalar with Iterable. I'll see which is less hacky.

See pandas-dev/pandas#17767 and pandas-dev/pandas#16821

TomAugspurger · 2017-10-03T20:15:54Z

So we essentially we do if isinstance(element, Iterable), since we're going to call np.array(element) to extract the dtype. dask.dataframe.core.Scalar is a bit odd in that it implements __array__, but not __iter__ (the only reason it implements __array__ in the first place was to make this stuff work anyway).

I'd propose something like

@@ -1874,9 +1874,14 @@ class IntBlock(NumericBlock):
     _can_hold_na = False
 
     def _can_hold_element(self, element):
-        if is_list_like(element):
+        tipo = None
+
+        if hasattr(element, 'dtype'):
+            tipo = element.dtype.type
+        elif is_list_like(element):
             element = np.array(element)
             tipo = element.dtype.type
+        if tipo:
             return (issubclass(tipo, np.integer) and
                     not issubclass(tipo, (np.datetime64, np.timedelta64)) and
                     self.dtype.itemsize >= element.dtype.itemsize)

jreback · 2017-10-03T20:32:15Z

I think used to have these hasattr(element, 'dtype') checks before. So not against putting them back (but would have to test this a bit I think).

Master: ```python >>> import dask.dataframe as dd >>> s = dd.core.Scalar({('s', 0): 10}, 's', 'i8') >>> pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7], ... 'b': [7, 6, 5, 4, 3, 2, 1]}) >>> (pdf + s).dtypes a object b object dtype: object Head: ``` >>> (pdf + s).dtypes a int64 b int64 dtype: object ``` This is more consistent with 0.20.3, while still most of the changes in pandas-dev#16821 Closes pandas-dev#17767

* Use argument dtype to inform coercion Master: ```python >>> import dask.dataframe as dd >>> s = dd.core.Scalar({('s', 0): 10}, 's', 'i8') >>> pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7], ... 'b': [7, 6, 5, 4, 3, 2, 1]}) >>> (pdf + s).dtypes a object b object dtype: object Head: ``` >>> (pdf + s).dtypes a int64 b int64 dtype: object ``` This is more consistent with 0.20.3, while still most of the changes in #16821 Closes #17767 * Compat for older numpy where bool(dtype) is False * Added timedelta

* Use argument dtype to inform coercion Master: ```python >>> import dask.dataframe as dd >>> s = dd.core.Scalar({('s', 0): 10}, 's', 'i8') >>> pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7], ... 'b': [7, 6, 5, 4, 3, 2, 1]}) >>> (pdf + s).dtypes a object b object dtype: object Head: ``` >>> (pdf + s).dtypes a int64 b int64 dtype: object ``` This is more consistent with 0.20.3, while still most of the changes in pandas-dev#16821 Closes pandas-dev#17767 * Compat for older numpy where bool(dtype) is False * Added timedelta

TomAugspurger added API Design Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations labels Oct 3, 2017

TomAugspurger added this to the 0.21.0 milestone Oct 3, 2017

TomAugspurger changed the title ~~API? Adding a~~ API Change for arithmetic methods for non-pandas objects Oct 3, 2017

TomAugspurger mentioned this issue Oct 3, 2017

Pandas 0.21 compatability dask/dask#2737

Merged

TomAugspurger added a commit to TomAugspurger/dask that referenced this issue Oct 3, 2017

COMPAT: Pandas 0.21.0 type coercion

466e51e

See pandas-dev/pandas#17767 and pandas-dev/pandas#16821

TomAugspurger mentioned this issue Oct 4, 2017

Use argument dtype to inform coercion #17779

Merged

TomAugspurger closed this as completed in #17779 Oct 5, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API Change for arithmetic methods for non-pandas objects #17767

API Change for arithmetic methods for non-pandas objects #17767

TomAugspurger commented Oct 3, 2017 •

edited

Loading

TomAugspurger commented Oct 3, 2017

TomAugspurger commented Oct 3, 2017

TomAugspurger commented Oct 3, 2017

TomAugspurger commented Oct 3, 2017

jreback commented Oct 3, 2017

API Change for arithmetic methods for non-pandas objects #17767

API Change for arithmetic methods for non-pandas objects #17767

Comments

TomAugspurger commented Oct 3, 2017 • edited Loading

TomAugspurger commented Oct 3, 2017

TomAugspurger commented Oct 3, 2017

TomAugspurger commented Oct 3, 2017

TomAugspurger commented Oct 3, 2017

jreback commented Oct 3, 2017

TomAugspurger commented Oct 3, 2017 •

edited

Loading