REGR: comparison op with dask data structure fails #38946

Open
jorisvandenbossche opened this issue Jan 4, 2021 · 10 comments
Labels
Bug
Closing Candidate: May be closeable, needs more eyeballs
Compat: pandas objects compatibility with NumPy or Python functions
Numeric Operations: Arithmetic, Comparison, and Logical operations
Regression: Functionality that used to work in a prior pandas version

Comments

@jorisvandenbossche
Member

Using pandas 1.0.5 and latest dask 2020.12.0:

In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: df = pd.DataFrame({"x": ["a", "b", "c"] * 100}, dtype="category")
   ...: ddf = dd.from_pandas(df, npartitions=3)

In [4]: df.x                                                                                                                                                                                                       
Out[4]: 
0      a
1      b
2      c
3      a
4      b
      ..
295    b
296    c
297    a
298    b
299    c
Name: x, Length: 300, dtype: category
Categories (3, object): [a, b, c]

In [5]: ddf.x                                                                                                                                                                                                      
Out[5]: 
Dask Series Structure:
npartitions=3
0      category[known]
100                ...
200                ...
299                ...
Name: x, dtype: category
Dask Name: getitem, 6 tasks

In [6]: df.x == ddf.x                                                                                                                                                                                              
Out[6]: 
0      True
1      True
2      True
3      True
4      True
       ... 
295    True
296    True
297    True
298    True
299    True
Name: x, Length: 300, dtype: bool

In [9]: (df.x == ddf.x).all()                                                                                                                                                                                      
Out[9]: True

But with master (using same dask version), this gives:

In [3]: df.x == ddf.x
Out[3]: 
0      False
1      False
2      False
3      False
4      False
       ...  
295    False
296    False
297    False
298    False
299    False
Name: x, Length: 300, dtype: bool
jorisvandenbossche added the Regression label on Jan 4, 2021
jorisvandenbossche added this to the 1.2.1 milestone on Jan 4, 2021
@jorisvandenbossche
Member Author

I checked: this still worked in 1.1.x and started to fail in pandas 1.2.0.

@jorisvandenbossche
Member Author

Bisecting gives

40daf00 is the first bad commit
commit 40daf00
Author: jbrockmendel
Date: Fri Oct 2 15:43:23 2020 -0700

BUG: Categorical setitem, comparison with tuple category (#36623)

cc @jbrockmendel

@jorisvandenbossche
Member Author

Hmm, it seems that this previously only worked for categorical dtype. Testing the above example with int64 or object dtype instead, I get all False or an error, respectively, also on previously released versions of pandas. So probably not a priority for 1.2.x.

jorisvandenbossche added the Numeric Operations label on Jan 4, 2021
@jbrockmendel
Member

Looks like ddf is hashable, so we treat it as scalar-like. I think deep down the thing to do is return NotImplemented and let dask handle alignment.
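
To illustrate the suggestion above, here is a minimal pure-Python sketch of the deferral mechanism. `EagerSeries`, `LazySeries`, and `broadcast_as_scalar` are hypothetical stand-ins invented for this example; they are not pandas or dask code, and the real dispatch logic in pandas is more involved. The sketch only shows why broadcasting an unknown hashable object as a scalar yields all False, and how returning NotImplemented lets Python fall back to the other operand's `__eq__`.

```python
# Hypothetical stand-ins for illustration only; NOT pandas or dask classes.


def broadcast_as_scalar(values, other):
    """Compare each element against `other`, as if `other` were a scalar."""
    return [v == other for v in values]


class LazySeries:
    """Plays the role of a hashable, lazily evaluated container (like ddf.x)."""

    def __init__(self, values):
        self.values = list(values)

    def __hash__(self):
        return id(self)

    def __eq__(self, other):
        # The lazy library knows how to compare itself with an eager container.
        if isinstance(other, EagerSeries):
            return [a == b for a, b in zip(other.values, self.values)]
        return NotImplemented


class EagerSeries:
    """Plays the role of an in-memory container (like df.x)."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        if isinstance(other, EagerSeries):
            return [a == b for a, b in zip(self.values, other.values)]
        if isinstance(other, (str, int, float)):
            # Genuine scalars are broadcast elementwise.
            return broadcast_as_scalar(self.values, other)
        # Unknown type: defer, so Python tries the reflected other.__eq__(self).
        return NotImplemented


eager = EagerSeries(["a", "b", "c"])
lazy = LazySeries(["a", "b", "c"])

# Treating the lazy object as a scalar compares each string with the whole
# container; neither side can handle that, so Python falls back to identity.
naive = broadcast_as_scalar(eager.values, lazy)

# Deferring with NotImplemented hands the comparison to LazySeries.__eq__.
deferred = eager == lazy

print(naive)
print(deferred)
```

The open question in this design is how the eager side recognizes which operands it should defer to; the `isinstance` check on scalar types used here is only one possible heuristic.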

simonjayhawkins modified the milestones: 1.2.1, 1.2.2 on Jan 17, 2021
@simonjayhawkins
Member

moved to 1.2.2

@simonjayhawkins
Member

moved to 1.2.3

simonjayhawkins modified the milestones: 1.2.2, 1.2.3 on Feb 8, 2021
simonjayhawkins modified the milestones: 1.2.3, 1.2.4 on Mar 2, 2021
@jbrockmendel
Member

@TomAugspurger suggestions for what attribute to check to determine we should return NotImplemented?

simonjayhawkins modified the milestones: 1.2.4, 1.2.5 on Apr 12, 2021
@simonjayhawkins
Member

moved to 1.2.5

@simonjayhawkins
Member

> So probably not a priority for 1.2.x

removing milestone.

@jbrockmendel
Member

Following #48347, it is up to dask to define __pandas_priority__ to fix this
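
As a rough sketch of how a priority attribute can settle which operand handles a binary operation, the pure-Python example below mimics the idea with two hypothetical classes, `PriorityEager` and `PriorityLazy`. It does not import pandas or dask, the numeric priorities are placeholders chosen for illustration, and the check is an assumption about the general pattern rather than a copy of the pandas implementation.

```python
# Hypothetical classes for illustration only; NOT pandas or dask code.
# The priority numbers below are placeholders, not values taken from pandas.


class PriorityEager:
    """Eager container that defers to operands with a higher priority."""

    __pandas_priority__ = 3000  # illustrative value

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        other_priority = getattr(other, "__pandas_priority__", None)
        if other_priority is not None and other_priority > self.__pandas_priority__:
            # The other operand asked to take over: let Python call its __eq__.
            return NotImplemented
        if isinstance(other, PriorityEager):
            return [a == b for a, b in zip(self.values, other.values)]
        # Anything else is broadcast elementwise as a scalar.
        return [v == other for v in self.values]


class PriorityLazy:
    """Lazy container that opts in by declaring a higher priority."""

    __pandas_priority__ = 5000  # illustrative value, higher than PriorityEager

    def __init__(self, values):
        self.values = list(values)

    def __hash__(self):
        return id(self)

    def __eq__(self, other):
        if isinstance(other, PriorityEager):
            return [a == b for a, b in zip(other.values, self.values)]
        return NotImplemented


eager = PriorityEager(["a", "b", "c"])
lazy = PriorityLazy(["a", "b", "c"])

# PriorityEager.__eq__ returns NotImplemented, so PriorityLazy.__eq__ runs.
result = eager == lazy
print(result)
```

With this pattern the eager side needs no knowledge of the lazy library; the lazy library opts in by declaring the attribute, which matches the comment that the fix is up to dask.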

jbrockmendel added the Closing Candidate label on Aug 24, 2023

4 participants