Inconsistent Handling of na_values and converters in read_csv #13302

Open

gfyoung opened this issue May 26, 2016 · 4 comments

Labels
Bug IO CSV read_csv, to_csv

Comments

@gfyoung
Member

gfyoung commented May 26, 2016

On master (commit 40b4bb4):

>>> from io import StringIO
>>> from pandas import read_csv
>>> data = """A
1
CAT
3"""
>>> f = lambda x: x
>>> read_csv(StringIO(data), na_values='CAT', converters={'A': f}, engine='c')
     A
0    1
1  CAT
2    3
>>> read_csv(StringIO(data), na_values='CAT', converters={'A': f}, engine='python')
     A
0    1
1  NaN
2    3

I expect both to give the same output, though I believe the Python engine's output is the more correct one because it respects na_values, unlike the C engine. I thought the simple fix would be to remove the continue statement here, but that causes test failures, so a more involved refactoring is probably needed to align the order of converter application, NaN value conversion, and dtype conversion.
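For illustration, here is a minimal sketch of the ordering the Python engine effectively uses, i.e. NA matching on the raw string tokens before the converter runs (hand-rolled, not the actual pandas internals; the names are made up):

import numpy as np

def postprocess(tokens, na_values, converter=None):
    # Match NA sentinels against the raw string tokens first...
    out = [np.nan if tok in na_values else tok for tok in tokens]
    # ...and only then run the user converter on the surviving values.
    if converter is not None:
        out = [v if v is np.nan else converter(v) for v in out]
    return out

postprocess(["1", "CAT", "3"], {"CAT"}, converter=lambda x: x)
# -> ['1', nan, '3']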

IMO this should be added to #12686, as this is a difference in behaviour between the two engines.

xref #5232

@gfyoung
Member Author

gfyoung commented May 28, 2016

I have to say, the decision to continue after applying a converter in the C engine is really buggy. It causes a lot of other post-processing to be skipped. Another example (adapted from the parser tests):

>>> """A B
-999 1.200
2 -999.000
3 4.500"""
>>> f = lambda x: x
>>> read_csv(StringIO(data), sep=' ', header=0,
             na_values=[-999.0, -999], converters={'B':f})
     A         B
0  NaN     1.200
1  2.0  -999.000
2  3.0     4.500

Per the original test, that -999.000 should become a NaN, but it doesn't: the NA post-processing is skipped entirely because of this innocuous converter.
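For reference, the frame the original test expects (written out by hand here, not actual parser output) would be:

     A      B
0  NaN  1.200
1  2.0    NaN
2  3.0  4.500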

@gfyoung
Member Author

gfyoung commented Jan 24, 2017

xref #15144 (closed):

>>> from io import StringIO
>>> from pandas import read_fwf
>>> data = 'A,B,C\n10,-1,10\n10,10,10'
>>> read_fwf(StringIO(data), converters={'B': lambda x: x / 10.}, na_values={'B': [-1]})
 A    B  C
10 -0.1 10
10    1 10

Expected:

 A    B  C
10  NaN 10
10    1 10

The current behavior is due to the fact that converters are applied BEFORE na_values are, but consensus appears to be that we should apply na_values first.
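Reproducing the two orderings by hand makes the mechanism obvious (illustrative plain Python, not the parser's actual code):

na_vals = {-1}
tokens = ["-1", "10"]

# Converter first (current behavior): -1 becomes -0.1, which no longer
# matches na_values, so the NaN substitution never fires.
[float(t) / 10. for t in tokens]
# -> [-0.1, 1.0]

# na_values first (expected): -1 is caught before the converter runs.
[float("nan") if int(t) in na_vals else float(t) / 10. for t in tokens]
# -> [nan, 1.0]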

@patricktokeeffe
Contributor

This is an insidious problem with the C engine! In 0.24.2, I observe that implicit conversions also cause the parser to ignore the na_values parameter:

import io, pandas as pd
data = """\
Time,ConcRT(ug/m3),ConcHR(ug/m3),Flow(lpm),WS(m/s),WD(Deg),AT(C),RH(%),BP(mmHg),FT(C),FRH(%),BV(V),PM,Status
2018-06-27 21:00:00,+000012,+000007,+16.7,00.0,000,+022.6,025,704,+030.1,012,12.5,0,00000
2018-06-27 22:00:00,+000003,+000009,+16.7,00.0,000,+021.2,033,705,+028.6,015,12.5,0,00000
2018-06-27 23:00:00,+099999,+099999,+00.0,00.0,000,+019.6,038,705,+024.0,015,12.5,0,00001

"""

Note these are numeric values, but their representation probably requires an implicit converter to be called.

The C engine will not locate NaNs unless the verbatim per-character value is presented to na_values:

pd.read_csv(io.StringIO(data), na_values=['99999'])
  Time ConcRT(ug/m3) ConcHR(ug/m3) Flow(lpm) WS(m/s) WD(Deg) AT(C) RH(%) BP(mmHg) FT(C) FRH(%) BV(V) PM Status
2018-06-27 21:00:00 12 7 16.7 0.0 0 22.6 25 704 30.1 12 12.5 0 0
2018-06-27 22:00:00 3 9 16.7 0.0 0 21.2 33 705 28.6 15 12.5 0 0
2018-06-27 23:00:00 99999 99999 0.0 0.0 0 19.6 38 705 24.0 15 12.5 0 1
pd.read_csv(io.StringIO(data), na_values=[99999])
Time ConcRT(ug/m3) ConcHR(ug/m3) Flow(lpm) WS(m/s) WD(Deg) AT(C) RH(%) BP(mmHg) FT(C) FRH(%) BV(V) PM Status
2018-06-27 21:00:00 12 7 16.7 0.0 0 22.6 25 704 30.1 12 12.5 0 0
2018-06-27 22:00:00 3 9 16.7 0.0 0 21.2 33 705 28.6 15 12.5 0 0
2018-06-27 23:00:00 99999 99999 0.0 0.0 0 19.6 38 705 24.0 15 12.5 0 1
pd.read_csv(io.StringIO(data), na_values=['+099999'])
Time ConcRT(ug/m3) ConcHR(ug/m3) Flow(lpm) WS(m/s) WD(Deg) AT(C) RH(%) BP(mmHg) FT(C) FRH(%) BV(V) PM Status
2018-06-27 21:00:00 12.0 7.0 16.7 0.0 0 22.6 25 704 30.1 12 12.5 0 0
2018-06-27 22:00:00 3.0 9.0 16.7 0.0 0 21.2 33 705 28.6 15 12.5 0 0
2018-06-27 23:00:00 NaN NaN 0.0 0.0 0 19.6 38 705 24.0 15 12.5 0 1

This situation highlights the shortcomings of picking a simple 'before' or 'after' for applying converters. The consensus so far is to apply na_values before converters, but I suspect certain converters should always be applied first.
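To make the ambiguity concrete (plain Python, nothing pandas-specific; purely illustrative):

raw = "+099999"
# Matching before parsing: only the verbatim token matches.
raw == "+099999"       # True
raw == "99999"         # False
# Matching after parsing: the numeric value matches regardless of formatting.
float(raw) == 99999.0  # True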


Note the Python engine handles the situation identically (and "correctly") in all scenarios:

pd.read_csv(io.StringIO(data), na_values=['99999'], engine='python')
Time ConcRT(ug/m3) ConcHR(ug/m3) Flow(lpm) WS(m/s) WD(Deg) AT(C) RH(%) BP(mmHg) FT(C) FRH(%) BV(V) PM Status
2018-06-27 21:00:00 12.0 7.0 16.7 0.0 0 22.6 25 704 30.1 12 12.5 0 0
2018-06-27 22:00:00 3.0 9.0 16.7 0.0 0 21.2 33 705 28.6 15 12.5 0 0
2018-06-27 23:00:00 NaN NaN 0.0 0.0 0 19.6 38 705 24.0 15 12.5 0 1
pd.read_csv(io.StringIO(data), na_values=[99999], engine='python')
Time ConcRT(ug/m3) ConcHR(ug/m3) Flow(lpm) WS(m/s) WD(Deg) AT(C) RH(%) BP(mmHg) FT(C) FRH(%) BV(V) PM Status
2018-06-27 21:00:00 12.0 7.0 16.7 0.0 0 22.6 25 704 30.1 12 12.5 0 0
2018-06-27 22:00:00 3.0 9.0 16.7 0.0 0 21.2 33 705 28.6 15 12.5 0 0
2018-06-27 23:00:00 NaN NaN 0.0 0.0 0 19.6 38 705 24.0 15 12.5 0 1
pd.read_csv(io.StringIO(data), na_values=['+099999'], engine='python')
Time ConcRT(ug/m3) ConcHR(ug/m3) Flow(lpm) WS(m/s) WD(Deg) AT(C) RH(%) BP(mmHg) FT(C) FRH(%) BV(V) PM Status
2018-06-27 21:00:00 12.0 7.0 16.7 0.0 0 22.6 25 704 30.1 12 12.5 0 0
2018-06-27 22:00:00 3.0 9.0 16.7 0.0 0 21.2 33 705 28.6 15 12.5 0 0
2018-06-27 23:00:00 NaN NaN 0.0 0.0 0 19.6 38 705 24.0 15 12.5 0 1

@mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@fynnwz

fynnwz commented Sep 15, 2023

@patricktokeeffe @gfyoung I agree, this bug is insidious, very difficult to find, and highly confusing. Here is a snippet to track it down; it took me a couple of hours.

import requests
import zipfile
from io import BytesIO
import pandas as pd

zip_url = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/10_minutes/solar/historical/10minutenwerte_SOLAR_00704_20020801_20091231_hist.zip"
response = requests.get(zip_url)
response.raise_for_status()

# Open the .zip file in memory
with zipfile.ZipFile(BytesIO(response.content)) as z:
    # Extract the .txt file containing the data
    txt_file_name = z.namelist()[0]
    with z.open(txt_file_name) as file:
        df = pd.read_csv(file, sep=";", na_values="-999", engine="c")
        
print(df.loc[df['MESS_DATUM'] == 200709010900]['GS_10'])
print(df.loc[df['MESS_DATUM'] == 200709010910]['GS_10'])

This is my output:

262143   -999.0
Name: GS_10, dtype: float64
262144   NaN
Name: GS_10, dtype: float64

Some values in my data are correctly interpreted as NaN and some are not. Apparently the switch happens exactly between the two dates I selected above: before 200709010900, no -999 value is interpreted as NaN, while from 200709010910 onward, every -999 value is correctly interpreted as NaN. I checked the CSV file manually and couldn't find any inconsistencies. However, once I change to engine="python", it all works as expected. Am I missing something, or is this indeed a weird bug?
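A possible workaround (an assumption on my part, not an official recommendation) is to drop na_values at parse time and mask the sentinel afterwards, which sidesteps the C engine's inconsistent NA matching entirely:

import pandas as pd

# `file` is the zip member handle from the snippet above.
# Parse without na_values, then replace the sentinel ourselves.
df = pd.read_csv(file, sep=";", engine="c")
df["GS_10"] = df["GS_10"].replace(-999, float("nan"))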
