BUG: CategoricalIndex.get_indexer with #45361

Closed
jbrockmendel opened this issue Jan 14, 2022 · 7 comments
Labels
Bug, Categorical (Categorical Data Type), Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

Comments

@jbrockmendel
Member

jbrockmendel commented Jan 14, 2022

import numpy as np
import pandas as pd

ci = pd.CategoricalIndex([1, 2, np.nan, 3])
other = pd.Index([2, 3, 4])

res = ci.get_indexer(other)

>>> res
array([1, 3, 2])

The 4 in other is getting mapped to the nan in ci. Best guess is that this is passing other to the Categorical constructor, which will return Categorical([2, 3, np.nan], dtype=ci.dtype) because 4 is not one of ci's categories. If correct, this would be avoided by #40996.
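The guess above can be checked in isolation. The following is a minimal sketch, assuming that coercing other to ci's dtype via the public pd.Categorical constructor resembles what get_indexer did internally; the issue itself only offers this as a best guess.

```python
import numpy as np
import pandas as pd

ci = pd.CategoricalIndex([1, 2, np.nan, 3])
other = pd.Index([2, 3, 4])

# Coerce `other` to ci's categorical dtype, as the issue guesses get_indexer does.
coerced = pd.Categorical(other, dtype=ci.dtype)

# A code of -1 marks a value that is not among ci's categories; such a value
# is displayed as NaN, which makes it indistinguishable from a real NaN.
print(coerced.codes)
print(coerced.isna())
```

If the 4 shows up as missing here, a later lookup has no way to tell it apart from the genuine NaN stored in ci.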

Expected Behavior

>>> res
array([1, 3, -1])
jbrockmendel added the Bug and Needs Triage labels Jan 14, 2022
@Shashank-Shet
Contributor

@jbrockmendel Shouldn't the expected output be array([1, 2, -1])?

@Shashank-Shet
Contributor

I think I understand what is going on here. Internally, [2, 3, 4] is treated as [2, 3, nan] since 4 is not a category in ci.
So instead of 4 being mapped to -1, the nan in its place is mapped to 2.

@Shashank-Shet
Contributor

Also, may I have permission to work on this?

@jbrockmendel
Member Author

PR would be welcome!

Shashank-Shet added a commit to Shashank-Shet/pandas that referenced this issue Jan 14, 2022
categorical_index_obj.get_indexer(target) yields incorrect results when
categorical_index_obj contains NaNs, and target does not. The reason
for this is that, if target contains elements which do not match any
category in categorical_index_obj, they are replaced by NaNs. In such
a situation, if categorical_index_obj also has NaNs, then the corresponding
elements in target are mapped to an index which is not -1.

eg:
ci = pd.CategoricalIndex([1, 2, np.nan, 3])
other = pd.Index([2, 3, 4])

ci.get_indexer(other)
In the implementation of get_indexer, other becomes [2, 3, NaN],
and that NaN is then mapped to index 2 in ci.
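The commit message describes the cause but not the code change. As a rough illustration of the idea, here is a sketch of a user-level wrapper that forces -1 for target values that are neither a category nor missing. The name get_indexer_strict is hypothetical, and this is not the change made in the pandas PR.

```python
import numpy as np
import pandas as pd


def get_indexer_strict(ci, target):
    """Hypothetical helper: like ci.get_indexer(target), but any target value
    that is neither a category of ci nor missing always maps to -1."""
    target = pd.Index(target)
    result = np.array(ci.get_indexer(target), copy=True)
    # These are the values that coercion to ci's dtype would turn into NaN.
    not_a_category = ~target.isin(ci.categories) & ~target.isna()
    result[not_a_category] = -1
    return result


ci = pd.CategoricalIndex([1, 2, np.nan, 3])
print(get_indexer_strict(ci, pd.Index([2, 3, 4])))
```

The mask is computed from the original target, before any coercion, so a value such as 4 is caught regardless of whether the underlying get_indexer call mapped it to the NaN position.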
@Shashank-Shet
Contributor

I have just submitted a PR.

Shashank-Shet added a commit to Shashank-Shet/pandas that referenced this issue Jan 14, 2022
Update:
np.isnan(target) was breaking the existing codebase.
As a solution, I have enclosed this line in a try-except block
mroeschke added the Categorical and Missing-data labels and removed the Needs Triage label Jan 14, 2022
Shashank-Shet added a commit to Shashank-Shet/pandas that referenced this issue Jan 21, 2022
Update: Replaced `np.isnan` with `pandas.core.dtypes.missing.isna`
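The thread does not show which input broke np.isnan. One plausible case, assumed here purely for illustration, is an object-dtype target: np.isnan raises TypeError on it, while pd.isna (the public alias of pandas.core.dtypes.missing.isna) returns a boolean mask.

```python
import numpy as np
import pandas as pd

# Assumed example input: an object-dtype array mixing strings and missing values.
values = np.array(["a", None, np.nan], dtype=object)

try:
    np.isnan(values)
    isnan_error = None
except TypeError as exc:
    # np.isnan is a float ufunc and is not defined for object arrays.
    isnan_error = exc

# pd.isna handles object arrays and treats both None and NaN as missing.
mask = pd.isna(values)
print(type(isnan_error).__name__, mask)
```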
@Shashank-Shet
Contributor

Shashank-Shet commented Jan 22, 2022

@jbrockmendel I just needed a clarification on this piece of code:

ci = pd.CategoricalIndex([1,2,3,np.nan])
ci.get_indexer([1,2,4,np.nan])

Should the output be [0,1,-1,-1]? Or [0,1,-1,3]?
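The two candidate answers correspond to two different rules for a NaN in the target. The pure-Python sketch below spells out both; reference_indexer and its match_nan flag are hypothetical names used only to make the question concrete, and the thread does not record which rule was chosen.

```python
import math


def reference_indexer(index_values, target_values, match_nan):
    """Hypothetical reference lookup, not pandas code.

    Returns the position of each target value in index_values, or -1 if absent.
    If match_nan is True, a NaN target matches the first NaN in index_values;
    otherwise a NaN target always maps to -1.
    """
    def is_nan(x):
        return isinstance(x, float) and math.isnan(x)

    positions = {}
    nan_position = -1
    for i, value in enumerate(index_values):
        if is_nan(value):
            if nan_position == -1:
                nan_position = i
        else:
            positions.setdefault(value, i)

    result = []
    for value in target_values:
        if is_nan(value):
            result.append(nan_position if match_nan else -1)
        else:
            result.append(positions.get(value, -1))
    return result


nan = float("nan")
print(reference_indexer([1, 2, 3, nan], [1, 2, 4, nan], match_nan=False))  # [0, 1, -1, -1]
print(reference_indexer([1, 2, 3, nan], [1, 2, 4, nan], match_nan=True))   # [0, 1, -1, 3]
```

Under either rule the 4 maps to -1; the only difference is whether an explicit NaN in the target is allowed to find the NaN in the index.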

jbrockmendel pushed a commit that referenced this issue Jan 28, 2022
* BUG: CategoricalIndex.get_indexer issue with NaNs (#45361)

* Added a testcase to verify output behaviour

* Made pre-commit changes

* Added a test case without NaNs

* Moved NaN test to avoid unnecessary execution

* Re-aligned test cases

* Removed try-except block

* Cleaned up base.py

* Add GH#45361 comment to code

* Added whatsnew entry

* Resolved merge conflict

* Moved whatsnew entry to indexing section
@jbrockmendel
Member Author

closed by #45373

phofl pushed a commit to phofl/pandas that referenced this issue Feb 14, 2022
…andas-dev#45373)

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this issue Jul 13, 2022
…andas-dev#45373)
