Inconsistent Handling of na_values and converters in read_csv #13302

Open

gfyoung opened this issue May 26, 2016 · 4 comments

Labels
Bug IO CSV read_csv, to_csv

Comments

@gfyoung
Member

gfyoung commented May 26, 2016

On master (commit 40b4bb4):

>>> from io import StringIO
>>> from pandas import read_csv
>>> data = """A
1
CAT
3"""
>>> f = lambda x: x
>>> read_csv(StringIO(data), na_values='CAT', converters={'A': f}, engine='c')
     A
0    1
1  CAT
2    3
>>> read_csv(StringIO(data), na_values='CAT', converters={'A': f}, engine='python')
     A
0    1
1  NaN
2    3

I expect both to give the same output, though I believe the Python engine's output is the more correct one because it respects na_values, unlike the C engine. I thought the simple fix would be to remove the continue statement here, but that causes test failures, so a more involved refactoring is probably needed to align the order of converter application, NaN value conversion, and dtype conversion.
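For illustration, here is a minimal sketch of the ordering the Python engine effectively uses, i.e. NA matching on the raw string tokens before the converter runs (hand-rolled, not the actual pandas internals; the names are made up):

import numpy as np

def postprocess(tokens, na_values, converter=None):
    # Match NA sentinels against the raw string tokens first...
    out = [np.nan if tok in na_values else tok for tok in tokens]
    # ...and only then run the user converter on the surviving values.
    if converter is not None:
        out = [v if v is np.nan else converter(v) for v in out]
    return out

postprocess(["1", "CAT", "3"], {"CAT"}, converter=lambda x: x)
# -> ['1', nan, '3']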

IMO this should be added to #12686, as this is a difference in behaviour between the two engines.

xref #5232

@gfyoung
Member Author

gfyoung commented May 28, 2016

I have to say, the decision to continue after applying a converter in the C engine is really buggy. It causes a lot of other post-processing to be skipped. Another example (adapted from the parser tests):

>>> """A B
-999 1.200
2 -999.000
3 4.500"""
>>> f = lambda x: x
>>> read_csv(StringIO(data), sep=' ', header=0,
             na_values=[-999.0, -999], converters={'B':f})
     A         B
0  NaN     1.200
1  2.0  -999.000
2  3.0     4.500

Per the original test, that -999.000 should become a NaN, but it doesn't: the NA post-processing is skipped entirely because of this innocuous converter.
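For reference, the frame the original test expects (written out by hand here, not actual parser output) would be:

     A      B
0  NaN  1.200
1  2.0    NaN
2  3.0  4.500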

@gfyoung
Member Author

gfyoung commented Jan 24, 2017

xref #15144 (closed):

>>> from io import StringIO
>>> from pandas import read_fwf
>>> data = 'A,B,C\n10,-1,10\n10,10,10'
>>> read_fwf(StringIO(data), converters={'B': lambda x: x / 10.}, na_values={'B': [-1]})
 A    B  C
10 -0.1 10
10    1 10

Expected:

 A    B  C
10  NaN 10
10    1 10

The current behavior is due to the fact that converters are applied BEFORE na_values are, but consensus appears to be that we should apply na_values first.
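Reproducing the two orderings by hand makes the mechanism obvious (illustrative plain Python, not the parser's actual code):

na_vals = {-1}
tokens = ["-1", "10"]

# Converter first (current behavior): -1 becomes -0.1, which no longer
# matches na_values, so the NaN substitution never fires.
[float(t) / 10. for t in tokens]
# -> [-0.1, 1.0]

# na_values first (expected): -1 is caught before the converter runs.
[float("nan") if int(t) in na_vals else float(t) / 10. for t in tokens]
# -> [nan, 1.0]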

@patricktokeeffe
Contributor

This is an insidious problem with the C engine! In 0.24.2, I observe that implicit conversions also cause the parser to ignore the na_values parameter:

import io, pandas as pd
data = """\
Time,ConcRT(ug/m3),ConcHR(ug/m3),Flow(lpm),WS(m/s),WD(Deg),AT(C),RH(%),BP(mmHg),FT(C),FRH(%),BV(V),PM,Status
2018-06-27 21:00:00,+000012,+000007,+16.7,00.0,000,+022.6,025,704,+030.1,012,12.5,0,00000
2018-06-27 22:00:00,+000003,+000009,+16.7,00.0,000,+021.2,033,705,+028.6,015,12.5,0,00000
2018-06-27 23:00:00,+099999,+099999,+00.0,00.0,000,+019.6,038,705,+024.0,015,12.5,0,00001

"""

Note these are numeric values, but their representation probably requires an implicit converter to be called.

The C engine will not locate NaNs unless the verbatim per-character value is presented to na_values:

pd.read_csv(io.StringIO(data), na_values=['99999'])
  Time ConcRT(ug/m3) ConcHR(ug/m3) Flow(lpm) WS(m/s) WD(Deg) AT(C) RH(%) BP(mmHg) FT(C) FRH(%) BV(V) PM Status
2018-06-27 21:00:00 12 7 16.7 0.0 0 22.6 25 704 30.1 12 12.5 0 0
2018-06-27 22:00:00 3 9 16.7 0.0 0 21.2 33 705 28.6 15 12.5 0 0
2018-06-27 23:00:00 99999 99999 0.0 0.0 0 19.6 38 705 24.0 15 12.5 0 1
pd.read_csv(io.StringIO(data), na_values=[99999])
Time ConcRT(ug/m3) ConcHR(ug/m3) Flow(lpm) WS(m/s) WD(Deg) AT(C) RH(%) BP(mmHg) FT(C) FRH(%) BV(V) PM Status
2018-06-27 21:00:00 12 7 16.7 0.0 0 22.6 25 704 30.1 12 12.5 0 0
2018-06-27 22:00:00 3 9 16.7 0.0 0 21.2 33 705 28.6 15 12.5 0 0
2018-06-27 23:00:00 99999 99999 0.0 0.0 0 19.6 38 705 24.0 15 12.5 0 1
pd.read_csv(io.StringIO(data), na_values=['+099999'])
Time ConcRT(ug/m3) ConcHR(ug/m3) Flow(lpm) WS(m/s) WD(Deg) AT(C) RH(%) BP(mmHg) FT(C) FRH(%) BV(V) PM Status
2018-06-27 21:00:00 12.0 7.0 16.7 0.0 0 22.6 25 704 30.1 12 12.5 0 0
2018-06-27 22:00:00 3.0 9.0 16.7 0.0 0 21.2 33 705 28.6 15 12.5 0 0
2018-06-27 23:00:00 NaN NaN 0.0 0.0 0 19.6 38 705 24.0 15 12.5 0 1

This situation highlights the shortcomings of picking a simple 'before' or 'after' for applying converters. The consensus so far is to apply na_values before converters, but I suspect certain converters should always be applied first.
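To make the ambiguity concrete (plain Python, nothing pandas-specific; purely illustrative):

raw = "+099999"
# Matching before parsing: only the verbatim token matches.
raw == "+099999"       # True
raw == "99999"         # False
# Matching after parsing: the numeric value matches regardless of formatting.
float(raw) == 99999.0  # True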


Note the Python engine handles the situation identically (and "correctly") in all scenarios:

pd.read_csv(io.StringIO(data), na_values=['99999'], engine='python')
Time ConcRT(ug/m3) ConcHR(ug/m3) Flow(lpm) WS(m/s) WD(Deg) AT(C) RH(%) BP(mmHg) FT(C) FRH(%) BV(V) PM Status
2018-06-27 21:00:00 12.0 7.0 16.7 0.0 0 22.6 25 704 30.1 12 12.5 0 0
2018-06-27 22:00:00 3.0 9.0 16.7 0.0 0 21.2 33 705 28.6 15 12.5 0 0
2018-06-27 23:00:00 NaN NaN 0.0 0.0 0 19.6 38 705 24.0 15 12.5 0 1
pd.read_csv(io.StringIO(data), na_values=[99999], engine='python')
Time ConcRT(ug/m3) ConcHR(ug/m3) Flow(lpm) WS(m/s) WD(Deg) AT(C) RH(%) BP(mmHg) FT(C) FRH(%) BV(V) PM Status
2018-06-27 21:00:00 12.0 7.0 16.7 0.0 0 22.6 25 704 30.1 12 12.5 0 0
2018-06-27 22:00:00 3.0 9.0 16.7 0.0 0 21.2 33 705 28.6 15 12.5 0 0
2018-06-27 23:00:00 NaN NaN 0.0 0.0 0 19.6 38 705 24.0 15 12.5 0 1
pd.read_csv(io.StringIO(data), na_values=['+099999'], engine='python')
Time ConcRT(ug/m3) ConcHR(ug/m3) Flow(lpm) WS(m/s) WD(Deg) AT(C) RH(%) BP(mmHg) FT(C) FRH(%) BV(V) PM Status
2018-06-27 21:00:00 12.0 7.0 16.7 0.0 0 22.6 25 704 30.1 12 12.5 0 0
2018-06-27 22:00:00 3.0 9.0 16.7 0.0 0 21.2 33 705 28.6 15 12.5 0 0
2018-06-27 23:00:00 NaN NaN 0.0 0.0 0 19.6 38 705 24.0 15 12.5 0 1

@mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@fynnwz

fynnwz commented Sep 15, 2023

@patricktokeeffe @gfyoung I agree, this bug is insidious, very difficult to find, and highly confusing. Here is a snippet to track it down; it took me a couple of hours.

import requests
import zipfile
from io import BytesIO
import pandas as pd

zip_url = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/10_minutes/solar/historical/10minutenwerte_SOLAR_00704_20020801_20091231_hist.zip"
response = requests.get(zip_url)
response.raise_for_status()

# Open the .zip file in memory
with zipfile.ZipFile(BytesIO(response.content)) as z:
    # Extract the .txt file containing the data
    txt_file_name = z.namelist()[0]
    with z.open(txt_file_name) as file:
        df = pd.read_csv(file, sep=";", na_values="-999", engine="c")
        
print(df.loc[df['MESS_DATUM'] == 200709010900]['GS_10'])
print(df.loc[df['MESS_DATUM'] == 200709010910]['GS_10'])

This is my output:

262143   -999.0
Name: GS_10, dtype: float64
262144   NaN
Name: GS_10, dtype: float64

Some values in my data are correctly interpreted as NaN and some are not. Apparently the switch happens exactly between the two dates I selected above: before 200709010900, no -999 value is interpreted as NaN, while from 200709010910 onward, every -999 value is correctly interpreted as NaN. I checked the CSV file manually and couldn't find any inconsistencies. However, once I change to engine="python", it all works as expected. Am I missing something, or is this indeed a weird bug?
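A possible workaround (an assumption on my part, not an official recommendation) is to drop na_values at parse time and mask the sentinel afterwards, which sidesteps the C engine's inconsistent NA matching entirely:

import pandas as pd

# `file` is the zip member handle from the snippet above.
# Parse without na_values, then replace the sentinel ourselves.
df = pd.read_csv(file, sep=";", engine="c")
df["GS_10"] = df["GS_10"].replace(-999, float("nan"))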
