BUG: Fix reading of multi_index dataframe with NaN #57070

CYHSM · 2024-01-25T12:53:50Z

closes BUG: KeyError when loading csv with NaNs #56929
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

See #56929 for some of the discussion around this bug. Instead of removing the additional header row as suggested in the issue, I fixed the bug after reading all three rows as a header. The code that errors ends up with these headers: header: ['0', 'Unnamed: 1_level_2']. The code where we explicitly set empty values to NaN before writing ends up with these: header: ['0', 'NaN']. The problem with the empty value stems from these lines:

pandas/pandas/_libs/parsers.pyx

Lines 694 to 698 in 7368686

    if name == "":
        if self.has_mi_columns:
            name = f"Unnamed: {i}_level_{level}"
        else:
            name = f"Unnamed: {i}"

This checks for an empty string and assigns an "Unnamed" placeholder to it. This is where the path for the csv with explicit NaN values diverges: its header cell is the string 'NaN' rather than an empty string, so it is not assigned the unnamed level.
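To make the divergence concrete, here is a minimal pure-Python sketch of the naming branch quoted above. The function name `placeholder_name` and its standalone signature are hypothetical, introduced only for illustration; the real logic lives inline in `parsers.pyx` and uses `self.has_mi_columns`.

```python
def placeholder_name(name: str, i: int, level: int, has_mi_columns: bool) -> str:
    """Mimic the quoted branch: only an empty header cell gets a placeholder.

    Hypothetical helper for illustration; not part of the pandas API.
    """
    if name == "":
        if has_mi_columns:
            # Multi-index columns: the placeholder records column and level.
            return f"Unnamed: {i}_level_{level}"
        # Single-level columns: the placeholder records only the column.
        return f"Unnamed: {i}"
    # Any non-empty cell, including the literal string "NaN", is kept as-is.
    return name


# Empty cell in column 1 at header level 2 of a multi-index header.
empty_cell = placeholder_name("", i=1, level=2, has_mi_columns=True)
# The same cell when "NaN" was written explicitly before saving.
nan_cell = placeholder_name("NaN", i=1, level=2, has_mi_columns=True)
print(empty_cell, "|", nan_cell)
```

The two results correspond to the two header variants mentioned in the description, which is why only the csv with the empty cell takes the "Unnamed" path.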

I think there are several ways of fixing this. In this PR I tried to make sure that the first header is treated like the second by simply changing the comparison in this line:

pandas/pandas/_libs/parsers.pyx

Line 749 in 7368686

if (lc != unnamed_count and lc - ic > unnamed_count) or ic == 0:
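For readers who want to see when the condition on that line holds, the sketch below restates it as a standalone predicate. The function name is hypothetical, and the parameter meanings are assumptions read off the variable names (`lc` as the number of cells in the header row, `ic` as the number of index columns, `unnamed_count` as the number of cells that received an "Unnamed" placeholder); the expression itself is copied verbatim from the quoted line.

```python
def header_condition(lc: int, ic: int, unnamed_count: int) -> bool:
    """Evaluate the boolean expression quoted from parsers.pyx.

    Hypothetical wrapper; parameter meanings are assumptions (see lead-in).
    """
    return (lc != unnamed_count and lc - ic > unnamed_count) or ic == 0


# Illustrative inputs only, not taken from the failing csv in the issue.
cases = [
    (3, 1, 1),  # some unnamed cells, one index column
    (3, 1, 3),  # every cell unnamed, one index column
    (3, 0, 3),  # every cell unnamed, no index column
]
for lc, ic, unnamed_count in cases:
    print(lc, ic, unnamed_count, header_condition(lc, ic, unnamed_count))
```

Note that the `ic == 0` term makes the expression true whenever there is no index column, regardless of how many cells are unnamed.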

No other tests fail after recompiling the code, so I am fairly confident that this fixes the issue without introducing others. However, it would still be good if someone with more parsing experience looked it over.

Edit: It seems some tests are breaking because of this change; I will check it out now.

mroeschke · 2024-02-28T17:46:33Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

CYHSM added 2 commits January 25, 2024 13:39

Fix for pandas issue pandas-dev#56929, including new test

0826212

after precommit change of files

79edb62

CYHSM requested a review from WillAyd as a code owner January 25, 2024 12:53

CYHSM changed the title ~~Fix bug when reading multi_index dataframe with NaN~~ BUG: Fix reading of multi_index dataframe with NaN Jan 25, 2024

Fix line too long pre-commit error

85f8fb9

simonjayhawkins added the Bug, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), and IO CSV (read_csv, to_csv) labels Feb 7, 2024

mroeschke closed this Feb 28, 2024
