Skip to content

API Change for arithmetic methods for non-pandas objects #17767

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TomAugspurger opened this issue Oct 3, 2017 · 5 comments
Closed

API Change for arithmetic methods for non-pandas objects #17767

TomAugspurger opened this issue Oct 3, 2017 · 5 comments
Labels
API Design Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 3, 2017

Possible API breakage for dask in #16821

import dask.dataframe as dd
import pandas as pd

from dask.dataframe.utils import assert_eq


s = dd.core.Scalar({('s', 0): 10}, 's', 'i8')
pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                    'b': [7, 6, 5, 4, 3, 2, 1]})
result = (pdf + s).dtypes  # This now casts to object, used to retain int64
expected = pdf.dtypes
assert_eq(result, expected)

In master, result is now (object, object). Before it was (int64, int64).

I'm looking into #16821 to see if this was unintentional, and can be avoided.

@TomAugspurger TomAugspurger added API Design Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations labels Oct 3, 2017
@TomAugspurger TomAugspurger added this to the 0.21.0 milestone Oct 3, 2017
@TomAugspurger TomAugspurger changed the title API? Adding a API Change for arithmetic methods for non-pandas objects Oct 3, 2017
@TomAugspurger
Copy link
Contributor Author

So, IIUC, it comes down to the fact that

In [7]: from pandas.api.types import is_integer

In [8]: import dask.dataframe as dd

In [9]: is_integer(dd.core.Scalar(1, 'name', 'i8'))
Out[9]: False

Which I suppose means pandas is doing everything correctly here?

@TomAugspurger
Copy link
Contributor Author

@jreback thoughts on this patch?

modified   pandas/core/internals.py
@@ -1880,7 +1880,8 @@ class IntBlock(NumericBlock):
             return (issubclass(tipo, np.integer) and
                     not issubclass(tipo, (np.datetime64, np.timedelta64)) and
                     self.dtype.itemsize >= element.dtype.itemsize)
-        return is_integer(element)
+        return (is_integer(element) or
+                hasattr(element, 'dtype') and element.dtype == 'int')
 
     def should_store(self, value):
         return is_integer_dtype(value) and value.dtype == self.dtype

AFAICT, there's nothing an object can do to duck type is_integer into passing (or is there?).

@TomAugspurger
Copy link
Contributor Author

Actually, nvm. The better fix is to have the dask version take the if clause... something like

@@ -1874,7 +1874,7 @@ class IntBlock(NumericBlock):
     _can_hold_na = False
 
     def _can_hold_element(self, element):
-        if is_list_like(element):
+        if is_list_like(element) or hasattr(element, 'dtype'):
             element = np.array(element)
             tipo = element.dtype.type
             return (issubclass(tipo, np.integer) and

Or I could register dask.dataframe.core.Scalar with Iterable. I'll see which is less hacky.

TomAugspurger added a commit to TomAugspurger/dask that referenced this issue Oct 3, 2017
@TomAugspurger
Copy link
Contributor Author

So we essentially we do if isinstance(element, Iterable), since we're going to call np.array(element) to extract the dtype. dask.dataframe.core.Scalar is a bit odd in that it implements __array__, but not __iter__ (the only reason it implements __array__ in the first place was to make this stuff work anyway).

I'd propose something like

@@ -1874,9 +1874,14 @@ class IntBlock(NumericBlock):
     _can_hold_na = False
 
     def _can_hold_element(self, element):
-        if is_list_like(element):
+        tipo = None
+
+        if hasattr(element, 'dtype'):
+            tipo = element.dtype.type
+        elif is_list_like(element):
             element = np.array(element)
             tipo = element.dtype.type
+        if tipo:
             return (issubclass(tipo, np.integer) and
                     not issubclass(tipo, (np.datetime64, np.timedelta64)) and
                     self.dtype.itemsize >= element.dtype.itemsize)

@jreback
Copy link
Contributor

jreback commented Oct 3, 2017

I think used to have these hasattr(element, 'dtype') checks before. So not against putting them back (but would have to test this a bit I think).

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Oct 4, 2017
Master:

```python
>>> import dask.dataframe as dd

>>> s = dd.core.Scalar({('s', 0): 10}, 's', 'i8')
>>> pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
...                     'b': [7, 6, 5, 4, 3, 2, 1]})
>>> (pdf + s).dtypes
a    object
b    object
dtype: object

Head:

```
>>> (pdf + s).dtypes
a    int64
b    int64
dtype: object
```

This is more consistent with 0.20.3, while still most of the changes in
pandas-dev#16821

Closes pandas-dev#17767
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Oct 4, 2017
Master:

```python
>>> import dask.dataframe as dd

>>> s = dd.core.Scalar({('s', 0): 10}, 's', 'i8')
>>> pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
...                     'b': [7, 6, 5, 4, 3, 2, 1]})
>>> (pdf + s).dtypes
a    object
b    object
dtype: object

Head:

```
>>> (pdf + s).dtypes
a    int64
b    int64
dtype: object
```

This is more consistent with 0.20.3, while still most of the changes in
pandas-dev#16821

Closes pandas-dev#17767
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Oct 4, 2017
Master:

```python
>>> import dask.dataframe as dd

>>> s = dd.core.Scalar({('s', 0): 10}, 's', 'i8')
>>> pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
...                     'b': [7, 6, 5, 4, 3, 2, 1]})
>>> (pdf + s).dtypes
a    object
b    object
dtype: object

Head:

```
>>> (pdf + s).dtypes
a    int64
b    int64
dtype: object
```

This is more consistent with 0.20.3, while still most of the changes in
pandas-dev#16821

Closes pandas-dev#17767
TomAugspurger added a commit that referenced this issue Oct 5, 2017
* Use argument dtype to inform coercion

Master:

```python
>>> import dask.dataframe as dd

>>> s = dd.core.Scalar({('s', 0): 10}, 's', 'i8')
>>> pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
...                     'b': [7, 6, 5, 4, 3, 2, 1]})
>>> (pdf + s).dtypes
a    object
b    object
dtype: object

Head:

```
>>> (pdf + s).dtypes
a    int64
b    int64
dtype: object
```

This is more consistent with 0.20.3, while still most of the changes in
#16821

Closes #17767

* Compat for older numpy where bool(dtype) is False

* Added timedelta
ghost pushed a commit to reef-technologies/pandas that referenced this issue Oct 16, 2017
* Use argument dtype to inform coercion

Master:

```python
>>> import dask.dataframe as dd

>>> s = dd.core.Scalar({('s', 0): 10}, 's', 'i8')
>>> pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
...                     'b': [7, 6, 5, 4, 3, 2, 1]})
>>> (pdf + s).dtypes
a    object
b    object
dtype: object

Head:

```
>>> (pdf + s).dtypes
a    int64
b    int64
dtype: object
```

This is more consistent with 0.20.3, while still most of the changes in
pandas-dev#16821

Closes pandas-dev#17767

* Compat for older numpy where bool(dtype) is False

* Added timedelta
alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017
* Use argument dtype to inform coercion

Master:

```python
>>> import dask.dataframe as dd

>>> s = dd.core.Scalar({('s', 0): 10}, 's', 'i8')
>>> pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
...                     'b': [7, 6, 5, 4, 3, 2, 1]})
>>> (pdf + s).dtypes
a    object
b    object
dtype: object

Head:

```
>>> (pdf + s).dtypes
a    int64
b    int64
dtype: object
```

This is more consistent with 0.20.3, while still most of the changes in
pandas-dev#16821

Closes pandas-dev#17767

* Compat for older numpy where bool(dtype) is False

* Added timedelta
No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017
* Use argument dtype to inform coercion

Master:

```python
>>> import dask.dataframe as dd

>>> s = dd.core.Scalar({('s', 0): 10}, 's', 'i8')
>>> pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
...                     'b': [7, 6, 5, 4, 3, 2, 1]})
>>> (pdf + s).dtypes
a    object
b    object
dtype: object

Head:

```
>>> (pdf + s).dtypes
a    int64
b    int64
dtype: object
```

This is more consistent with 0.20.3, while still most of the changes in
pandas-dev#16821

Closes pandas-dev#17767

* Compat for older numpy where bool(dtype) is False

* Added timedelta
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Dtype Conversions Unexpected or buggy dtype conversions Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

No branches or pull requests

2 participants