BUG: Fix reading of multi_index dataframe with NaN #57070

Closed
wants to merge 3 commits into from

@CYHSM CYHSM commented Jan 25, 2024

See #56929 for some of the discussion around this bug. Instead of removing the additional header row as suggested in the issue, I still read all three rows as the header and fix the bug from there. The code that errors ends up with these headers: header: ['0', 'Unnamed: 1_level_2'], whereas the code where we explicitly set empty values to NaN before writing ends up with these: header: ['0', 'NaN']. The problem with the empty value stems from this check:

if name == "":
    if self.has_mi_columns:
        name = f"Unnamed: {i}_level_{level}"
    else:
        name = f"Unnamed: {i}"
which checks for an empty string and then assigns an Unnamed level name to it (this is where the path for the CSV with explicit NaN values diverges, as it is not assigned the Unnamed level name).
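
To make the divergence concrete, here is a minimal standalone sketch of the naming rule quoted above. The function name `placeholder_name` and its parameters are hypothetical stand-ins for the parser's internal state, not pandas API; only the branch logic is taken from the snippet.

```python
def placeholder_name(name: str, i: int, level: int, has_mi_columns: bool) -> str:
    """Mimic the quoted check: only an empty string gets an 'Unnamed' name.

    Hypothetical helper for illustration; not part of pandas.
    """
    if name == "":
        if has_mi_columns:
            return f"Unnamed: {i}_level_{level}"
        return f"Unnamed: {i}"
    return name


# An empty header cell in column 1 of header level 2 gets a placeholder name ...
empty_cell = placeholder_name("", i=1, level=2, has_mi_columns=True)
# ... while a cell holding the literal string "NaN" is left unchanged.
explicit_nan = placeholder_name("NaN", i=1, level=2, has_mi_columns=True)

print(empty_cell)    # Unnamed: 1_level_2
print(explicit_nan)  # NaN
```

The two results match the two header lists mentioned above: the empty cell becomes 'Unnamed: 1_level_2', while the explicit 'NaN' string never enters the branch.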

I think there are several ways of fixing this. In this PR I tried to make sure that the first header is treated like the second by simply changing the comparison in this line:

if (lc != unnamed_count and lc - ic > unnamed_count) or ic == 0:
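
To see how the quoted condition behaves, the sketch below evaluates it as a plain function. It assumes that `lc` is the number of entries in the header row, `ic` the number of index columns, and `unnamed_count` the number of placeholder ("Unnamed") names in that row. These meanings are inferred from the variable names and are not stated in this thread; the function name `quoted_condition` is hypothetical.

```python
def quoted_condition(lc: int, ic: int, unnamed_count: int) -> bool:
    """Evaluate the boolean expression exactly as quoted in this comment.

    Assumed meanings (not confirmed by the source):
      lc            - number of entries in the header row
      ic            - number of index columns
      unnamed_count - number of placeholder names in the row
    """
    return (lc != unnamed_count and lc - ic > unnamed_count) or ic == 0


# One index column, one placeholder among two entries: condition is False.
print(quoted_condition(lc=2, ic=1, unnamed_count=1))  # False
# One index column, no placeholders: condition is True.
print(quoted_condition(lc=2, ic=1, unnamed_count=0))  # True
# No index columns: the trailing `ic == 0` makes the condition True regardless.
print(quoted_condition(lc=2, ic=0, unnamed_count=2))  # True
```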

No other tests are failing after recompiling the code, so I am fairly confident that this fixes the issue and does not introduce others. However, it would still be good if someone with more parsing experience looked over it.

Edit: It seems some tests are breaking due to this, will check it out now.

@CYHSM CYHSM requested a review from WillAyd as a code owner January 25, 2024 12:53
@CYHSM CYHSM changed the title from "Fix bug when reading multi_index dataframe with NaN" to "BUG: Fix reading of multi_index dataframe with NaN" Jan 25, 2024
@simonjayhawkins simonjayhawkins added the labels Bug, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), and IO CSV (read_csv, to_csv) Feb 7, 2024
@mroeschke
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Feb 28, 2024
Development

Successfully merging this pull request may close these issues.

BUG: KeyError when loading csv with NaNs