Skip to content

BUG: Fixed DataFrame.__repr__ Type Error when column dtype is np.record (GH 48526) #48637

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Apr 21, 2023

Conversation

RaphSku
Copy link
Contributor

@RaphSku RaphSku commented Sep 19, 2022

@RaphSku RaphSku changed the title BUG: Fixed DataFrame.__repr__ Type Error when column dtype is np.record BUG: Fixed DataFrame.__repr__ Type Error when column dtype is np.record (GH 48526) Sep 19, 2022
@@ -292,7 +292,7 @@ def _isna_array(values: ArrayLike, inf_as_na: bool = False):
# "Union[ndarray[Any, Any], ExtensionArraySupportsAnyAll]", variable has
# type "ndarray[Any, dtype[bool_]]")
result = values.isna() # type: ignore[assignment]
elif is_string_or_object_np_dtype(values.dtype):
elif is_string_or_object_np_dtype(values.dtype) or isinstance(values, np.recarray):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm appears this function doesn't deal with output formatting? Why is the fixed specifically needed here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've put it there because when the dtype is a np.recarray, then result = np.isnan(values) in core/dtypes/missing.py will be executed and np.isnan cannot be evaluated on a np.recarray, it will throw the error which was also described in the issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix still appears like a band-aid fix. pandas doesn't have great support for np.recarrays and it doesn't look like _isna_string_dtype was specifically meant to deal with them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, then I have a look what one can do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke
Since _isna_string_dtype is not meant to deal with np.recarray, I have introduced a function that deals with that, can you have a look and tell me if this is okay? Thanks.

@mroeschke mroeschke added Output-Formatting __repr__ of pandas objects, to_string Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). labels Sep 19, 2022
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Oct 21, 2022
@RaphSku RaphSku requested a review from mroeschke October 22, 2022 11:29
@RaphSku
Copy link
Contributor Author

RaphSku commented Oct 22, 2022

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

still interested

@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Dec 6, 2022
@RaphSku
Copy link
Contributor Author

RaphSku commented Mar 10, 2023

@mroeschke
Can we re-open the pull request? I merged the main branch and your first and second review comment were already considered: I moved the test to pandas/tests/frame/test_repr_info.py and also checked against repr(df). I commented on your last review comment but never got an answer if my explanation was sufficient or not.
Thank you in advance.

@mroeschke mroeschke reopened this Mar 10, 2023
@mroeschke
Copy link
Member

Sure, reopened this pull request

@RaphSku
Copy link
Contributor Author

RaphSku commented Mar 10, 2023

@mroeschke
It passes the CI tests, but there is some connection problem with codecov:
"['error'] There was an error running the uploader: Error uploading to https://codecov.io: Error: There was an error fetching the storage URL during POST: 503 - upstream connect error or disconnect/reset before headers. reset reason: connection failure"

def _isna_recarray_dtype(values: np.recarray, inf_as_na: bool) -> npt.NDArray[np.bool_]:
result = np.zeros(values.shape, dtype=bool)
for i, record in enumerate(values):
result[i] = libmissing.isnaobj(np.array(record), inf_as_na=inf_as_na)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think isnaobj is well defined for a record. Do all elements in the record need to be na-like? Does the record itself need to be na-like?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right, isnaobj doesn't work on records.
I added in my latest commit 2 more tests to see if this is what you would expect from it and also refined the _isna_recarray_dtype method and made it return False if any value in the record is nan.
Can you have a look if this is reasonable? If you wish that I add tests for this function, I can add them, but could you tell me where I should place these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke What is your opinion on this?

result = np.zeros(values.shape, dtype=bool)
for i, record in enumerate(values):
does_record_contain_nan = np.zeros(len(record), dtype=bool)
if inf_as_na:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of if inf_as_na everywhere could you create a checker function initially like

if inf_as_na:
    is_missing = lambda x: np.isnan(x) or np.isinf(x)
else:
    is_missing = lambda x: np.isnan(x)

And use that in the loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke I rewrote the function to increase readbility. What do you think about the change?

result = repr(df)
assert result == expected

def test_to_records_with_na_record_value(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to have tests for inf as well here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke I added 2 tests for inf, is the expected result what you would also expect?


def test_to_records_with_inf_as_na_record(self):
# GH 48526
options.mode.use_inf_as_na = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the with option_context(...) context manager

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


def test_to_records_with_inf_record(self):
# GH 48526
options.mode.use_inf_as_na = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -1229,6 +1229,7 @@ Conversion
- Bug in :meth:`Series.to_numpy` converting to NumPy array before applying ``na_value`` (:issue:`48951`)
- Bug in :meth:`DataFrame.astype` not copying data when converting to pyarrow dtype (:issue:`50984`)
- Bug in :func:`to_datetime` was not respecting ``exact`` argument when ``format`` was an ISO8601 format (:issue:`12649`)
- Bug in :meth:`DataFrame.__repr__` incorrectly raising a ``TypeError`` when the dtype of a column is ``np.record`` (:issue:`48526`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to 2.1

Copy link
Contributor Author

@RaphSku RaphSku Apr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, done

result = np.zeros(values.shape, dtype=bool)
for i, record in enumerate(values):
does_record_contain_nan = np.zeros(len(record), dtype=bool)
for j, element in enumerate(record):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just call isnan and/or isinf on record?

Copy link
Contributor Author

@RaphSku RaphSku Apr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke I used isnan_all for the record and I rewrote a little bit the isinf part but I cannot use isinfdirectly on the record since it will throw a TypeError when the record value is a string. But I hope the way I rewrote the method still improved on the previous version.

@mroeschke mroeschke added this to the 2.1 milestone Apr 21, 2023
@mroeschke mroeschke merged commit c3f0aac into pandas-dev:main Apr 21, 2023
@mroeschke
Copy link
Member

Thanks for sticking with this @RaphSku

@RaphSku RaphSku deleted the frame_repr_bugfix branch April 21, 2023 18:20
Rylie-W pushed a commit to Rylie-W/pandas that referenced this pull request May 19, 2023
…rd (GH 48526) (pandas-dev#48637)

* BUG: Fixed repr type error when column in np.record

* Reallocated the test to proposed destination and test the repr of the frame

* Assert has gone missing due to merge conflict

* Added new function that handles np.recarrays in _isna_array

* Addes 2 more tests + refined _isna_recarray_dtype + formatting fixed

* Reworked isna_recarray_dtype method + Added test for inf value

* Pre-commit fix

* Fixing typing

* Fixed formatting

* Fixed pyright error

---------

Co-authored-by: RaphSku <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). Output-Formatting __repr__ of pandas objects, to_string Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: TypeError when printing dataframe with numpy.record dtype column
2 participants