-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG (string dtype): comparison of string column to mixed object column fails #60228 (fixed) #60392
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this!
Added a few suggestions
pandas/core/ops/array_ops.py
Outdated
if (is_string_dtype(lvalues) and is_object_dtype(rvalues)) or ( | ||
is_object_dtype(lvalues) and is_string_dtype(rvalues) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking for string dtype for an array can be expensive in case the array is object dtype (at that point it will scan all values to check if they are strings). So we might want to try avoid that at this level.
I think we could handle the issue specifically for the ArrowExtensionArray itself (see the code I referenced in #60228 (comment))
pandas/core/ops/array_ops.py
Outdated
): | ||
if lvalues.dtype.name == "string" and rvalues.dtype == object: | ||
lvalues = lvalues.astype("string") | ||
rvalues = pd_array(rvalues, dtype="string") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need to do the casting the other way around. Instead of casting the object to string and then compare both as strings, I think we have to cast the string to object and compare both as object dtype.
The reason for this is that casting to string might actually convert values to strings, and then we are no longer doing the comparison for the original values.
>>> ser_string = pd.Series(["1", "b"])
>>> ser_mixed = pd.Series([1, "b"])
>>> ser_string == ser_mixed
0 False
1 True
dtype: bool
>>> ser_string == ser_mixed.astype("string")
0 True
1 True
dtype: bool
So if we would do that casting under the hood, the result would change in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And we should add this case to the tests!
|
||
def test_comparison_string_mixed_object(): | ||
# Issue https://github.com/pandas-dev/pandas/issues/60228 | ||
pd.options.future.infer_string = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to add this for CI, because we have a separate CI build that enables this option for the full test suite.
Now, this can still be useful to test locally, but the way you can do this is with setting an environment variable (on linux I can do PANDAS_FUTURE_INFER_STRING=1 pytest ...
to run the test with the option enabled.
Fix formatting of complex numbers with exponents
Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.21.3 to 2.22.0. - [Release notes](https://github.com/pypa/cibuildwheel/releases) - [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md) - [Commits](pypa/cibuildwheel@v2.21.3...v2.22.0) --- updated-dependencies: - dependency-name: pypa/cibuildwheel dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Matthew Roeschke <[email protected]>
* fix docstring api.types.is_re_compilable * fix lint error
…e" by "reshape" (pandas-dev#60420) Clarifying pandas.melt method documentation by replacing "massage" by "reshape" Meanwhile, "massage" is correct in a colloquial sense to mean transforming or reshaping data. This is far from accessible for a non-English speaker (as I am). Using the term `reshape` or `transform` is more meaningful while being accurate.
Co-authored-by: Matthew Roeschke <[email protected]>
…pyarrow]" while PyArrow is not installed (pandas-dev#60413) * Add test * Fix * Add note * Update pandas/tests/dtypes/test_common.py Co-authored-by: Matthew Roeschke <[email protected]> * update * Fix doc warning --------- Co-authored-by: Matthew Roeschke <[email protected]>
@TEARFEAR git workflow related comment: it seems you have picked up some unrelated commits. In general doing something like
should fix that (assuming |
(small ping to see if you have some time to update this PR) |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
@TEARFEAR would you have time to update this PR? (otherwise I am also happy to update it, as we would like to include this in the 2.3 release) |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
Changes :
comparison_op
function inpandas/core/ops/array_ops.py
:string
andobject
dtypes.string
andobject
types are casted to a common type (string
) before comparison.string
vsobject
with homogeneous and mixed types.pd.options.future.infer_string = True
).--
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.