Skip to content

BUG: Patch handling of keep_default_na=False #19260

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 18, 2018

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Jan 16, 2018

Patches very buggy behavior of keep_default_na=False whenever na_values is a dict

  • Respect keep_default_na for column that doesn't exist in na_values dictionary
  • Don't crash / break when na_value is a scalar in the na_values dictionary.

In addition, clarifies documentation on the handling of the keep keep_default_na parameter with respect to na_filter and na_values.

Closes #19227.

cc @neilser

@gfyoung gfyoung added Bug IO CSV read_csv, to_csv labels Jan 16, 2018
@gfyoung gfyoung added this to the 0.23.0 milestone Jan 16, 2018
the default NaN values are used for parsing.
* If `keep_default_na` is False, and `na_values` are not specified, only
the NaN values specified `na_values` are used for parsing.
* If `keep_default_na` is False, and `na_values` are not specified, no
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the same as the 3rd bullet?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that first line is a just copy-paste fail.

If na_values are specified and keep_default_na is ``False`` the default NaN
values are overridden, otherwise they're appended to.
Whether or not to include the default NaN values when parsing the data.
Provided that `na_filter` is True, depending on whether `na_values` is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the 2nd sentence is slightly confusing here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -224,6 +224,47 @@ def test_na_values_keep_default(self):
'seven']})
tm.assert_frame_equal(xp.reindex(columns=df.columns), df)

# see gh-19227: keep_default_na=False should be enforced
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you split this test up, getting a bit long

Copy link
Member Author

@gfyoung gfyoung Jan 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

If na_values are specified and keep_default_na is ``False`` the default NaN
values are overridden, otherwise they're appended to.
Whether or not to include the default NaN values when parsing the data.
Depending on whether `na_values` is passed in, the behavior is as follows:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@@ -3098,15 +3110,17 @@ def _clean_na_values(na_values, keep_default_na=True):
na_fvalues = set()
elif isinstance(na_values, dict):
na_values = na_values.copy() # Prevent aliasing.
if keep_default_na:
for k, v in compat.iteritems(na_values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be nice to have a comment or 2 to explain the logic here for future readers

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Done.

@codecov
Copy link

codecov bot commented Jan 16, 2018

Codecov Report

Merging #19260 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19260      +/-   ##
==========================================
+ Coverage   91.56%   91.56%   +<.01%     
==========================================
  Files         148      148              
  Lines       48874    48878       +4     
==========================================
+ Hits        44751    44755       +4     
  Misses       4123     4123
Flag Coverage Δ
#multiple 89.94% <100%> (ø) ⬆️
#single 41.69% <21.42%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.49% <100%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 04a6815...63faf8d. Read the comment docs.

(k, _floatify_na_values(v)) for k, v in na_values.items() # noqa
)

na_values[k] = v
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an anti-pattern to modify the dict that you are iterating. can you create you create a new one here?

Copy link
Member Author

@gfyoung gfyoung Jan 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : I realize that, but I think that's why up above, someone wrote na_values = na_values.copy(). The reference to the original input is destroyed and created a "new" dictionary. Do you just want me to change the assigned variable name?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah you should create a new empty dict and then assign to it (it doesn't have to be a copy). iterating and modifying is a no-no.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've decided to create a copy because at the bottom, we return na_values and na_fvalues regardless of the logic branch. However, I'll iterate over the old_na_values instead.

Patches very buggy behavior of keep_default_na=False
whenever na_values is a dict

* Respect keep_default_na for column that doesn't
exist in na_values dictionary
* Don't crash / break when na_value is a scalar in
the na_values dictionary.

In addition, clarifies documentation on behavior of
keep_default_na with respect to na_filter and na_values.

Closes pandas-devgh-19227.
@gfyoung
Copy link
Member Author

gfyoung commented Jan 17, 2018

@jreback : The OSX builds on Travis seemed to have stalled completely (all of the other tests have been done for a LONG time now).

@jreback jreback merged commit c9532f0 into pandas-dev:master Jan 18, 2018
@jreback
Copy link
Contributor

jreback commented Jan 18, 2018

thanks @gfyoung

@gfyoung gfyoung deleted the na-values-csv branch January 18, 2018 04:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants