Inconsistent Handling of na_values and converters in read_csv #13302
I have to say that decision to …

>>> data = """A B
-999 1.200
2 -999.000
3 4.500"""
>>> f = lambda x: x
>>> read_csv(StringIO(data), sep=' ', header=0,
...          na_values=[-999.0, -999], converters={'B': f})
     A         B
0  NaN     1.200
1  2.0  -999.000
2  3.0     4.500

Per the original test, that …
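One way to sidestep the ambiguity shown above is to not rely on `na_values` at all when a converter is involved. This is a sketch of a workaround (not the pandas fix): let the converter run, then replace the sentinel value afterwards, so the order of operations is explicit and engine-independent. The sentinel `-999` is taken from the snippet above.

```python
import numpy as np
import pandas as pd
from io import StringIO

data = "A B\n-999 1.200\n2 -999.000\n3 4.500"

# Workaround sketch: skip na_values in read_csv entirely and replace the
# sentinel after parsing/conversion, making the ordering explicit.
df = pd.read_csv(StringIO(data), sep=" ", header=0, converters={"B": float})
df = df.replace(-999, np.nan)
```

Because the replacement happens on the already-converted frame, both the plainly-parsed column `A` and the converter-handled column `B` end up with `NaN` where the sentinel appeared.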
xref #15144 (closed):

>>> data = 'A,B,C\n10,-1,10\n10,10,10'
>>> read_fwf(StringIO(data), converters={'B': lambda x: x / 10.}, na_values={'B': [-1]})
 A     B   C
10  -0.1  10
10   1.0  10

Expected:

 A     B   C
10   NaN  10
10   1.0  10

The current behavior is due to the fact that converters are applied BEFORE na_values are checked.
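Given that converters currently run before the `na_values` check, the expected output above can be reproduced by folding the sentinel check into the converter itself, since converters always receive the raw field text. This is a sketch assuming the desired order is "na_values first, converter second" (the helper name `convert_b` is made up, and `read_csv` is used here because the sample data is comma-separated):

```python
import math
import pandas as pd
from io import StringIO

data = "A,B,C\n10,-1,10\n10,10,10"

# Converter that performs the na_values check itself on the raw string
# before doing any arithmetic, so no sentinel slips through conversion.
def convert_b(raw):
    if raw == "-1":              # sentinel for missing data
        return math.nan
    return float(raw) / 10.0

df = pd.read_csv(StringIO(data), converters={"B": convert_b})
```

This gives `NaN` in the first row of `B` and `1.0` in the second, matching the "Expected" output regardless of which engine parses the file.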
This is an insidious problem with the C engine! In 0.24.2, I observe that implicit conversions also cause the parser to ignore the na_values option.

Note these are numeric values, but their representation probably requires an implicit converter to be called. The C engine will not locate NaNs unless the verbatim, per-character value is presented to it.

This situation highlights the shortcomings of picking a simple "before" or "after" order for applying converters. Consensus so far is to apply na_values before converters, but I suspect certain converters should always be applied first. Note the Python engine handles the situation identically (and "correctly") in all scenarios.
@patricktokeeffe @gfyoung I agree, this bug is insidious, very difficult to find, and highly confusing. Here is a snippet to trace it down; it took me a couple of hours.

This is my output:

Some values of my data are correctly interpreted as NaN and some are not. Apparently, the two dates I selected above are where the switch happens: before "200709010900", no value of -999 is interpreted as NaN, while after "200709010910", all -999 values are correctly interpreted as NaN. I checked the csv file manually and couldn't find any inconsistencies. However, once I change engine="python", it all works as expected. Am I missing something, or is this indeed a weird bug?
On master (commit 40b4bb4): I expect both to give the same output, though I believe the Python output is more correct because it respects na_values, unlike the C engine. I thought the simple fix would be to remove the `continue` statement here, but that causes test failures, so a more involved refactoring is probably needed to align the order of converter application, NaN value conversion, and dtype conversion.

IMO this should be added to #12686, as this is a difference in behaviour between the two engines.
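Until the order of converter application, NaN conversion, and dtype conversion is aligned between engines, the ordering can be enforced entirely by hand. This is a sketch of that idea under the "na_values first, converter second" ordering the thread favors: read every field as a raw string, then apply the sentinel check and the conversion in an explicit sequence (the helper `to_float` and the `na_strings` set are made-up names, and the sentinel spellings mirror the first snippet in this issue):

```python
import pandas as pd
from io import StringIO

data = "A B\n-999 1.200\n2 -999.000\n3 4.500"

# Engine-independent ordering done by hand: dtype=str disables all implicit
# numeric conversion, so every field arrives as its verbatim text.
raw = pd.read_csv(StringIO(data), sep=" ", header=0, dtype=str)

na_strings = {"-999", "-999.0", "-999.000"}

def to_float(field):
    # na_values check first, conversion second.
    return float("nan") if field in na_strings else float(field)

df = raw.apply(lambda col: col.map(to_float))
```

This is more verbose than `na_values` plus `converters`, but neither engine's internal ordering can affect the result, which makes it a useful cross-check when the two engines disagree.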
xref #5232