QST: handle lines with less separators than main lines in read_csv #48728

LuisBL · 2022-09-23T06:26:15Z

Research

I have searched the [pandas] tag on StackOverflow for similar questions.
I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/73820090/make-pandas-read-csv-to-not-add-lines-with-less-columns-delimiters-than-the-main

Question about pandas

The on_bad_lines=warn question me. Pandas team added the functionality to directly handle lines with more separators than the main lines, that's why it don't seems strange to me that some other pandas option could handle in the same way lines with less separators than main lines

Using pandas.read_csv with on_bad_lines='warn' option the case there is too many
columns delimiters work well, bad lines are not loaded and stderr catch the bad lines
numbers:

    import pandas as pd
    from io import StringIO
    data = StringIO("""
    nom,f,nb
    bat,F,52
    cat,M,66,
    caw,F,15
    dog,M,66,,
    fly,F,61
    ant,F,21""")
    df = pd.read_csv(data, sep=',', on_bad_lines='warn')

    b'Skipping line 4: expected 3 fields, saw 4\nSkipping line 6: expected 3 fields, saw 5\n'

    df.head(10)
    #    nom  f  nb
    # 0  bat  F  52
    # 1  caw  F  15
    # 2  fly  F  61
    # 3  ant  F  21

But in case the number of delimiter (here sep=,) is less than the main, the line
is added adding NaN.:

    import pandas as pd
    from io import StringIO
    data = StringIO("""
    nom,f,nb
    bat,F,52
    catM66,
    caw,F,15
    dog,M66
    fly,F,61
    ant,F,21""")
    df = pd.read_csv(data, sep=',', on_bad_lines='warn', dtype=str)
    df.head(10)

    #       nom    f   nb
    # 0     bat    F   52
    # 1  catM66  NaN  NaN            <==
    # 2     caw    F   15
    # 3     dog  M66  NaN            <==
    # 4     fly    F   61
    # 5     ant    F   21

Is there a way to make read_csv to not add lines with less columns delimiters than
the main lines ?

The text was updated successfully, but these errors were encountered:

phofl · 2022-09-29T14:01:44Z

Hi, thanks for your report.
Could you check, if we have open issues about this? Pretty sure that I've seen something like this already

DJviolin · 2022-11-02T08:43:40Z

@phofl I also experience this problem, I also searched similar opened issues, but I couldn't find either.

This missing feature is the reason why I still need to use AWK, to pre-process my CSV files:

{
    if (NF != 11) next; # if column count is not equal to 11, then remove it
}

This feature is a MUST in a tool that's processing CSV.

Also, usecols together with on_bad_lines breaks the later one: #40049

phofl · 2022-11-02T08:48:34Z

Contributions are welcome

brki · 2024-09-06T14:28:40Z

The standard csv DictReader has this approach:

If a row has more fields than fieldnames, the remaining data is put in a list and stored with the fieldname specified by restkey (which defaults to None). If a non-blank row has fewer fields than fieldnames, the missing values are filled-in with the value of restval (which defaults to None).

I've used this like this to detect row that have too few columns:

last_field_name = 'foo'
missing_value = object()
for row in csv.DictReader(fp, restval=missing_value):
    if row[last_field_name] == missing_value:
        print(f"Row with too few columns: {row}")

Perhaps something like that could be easily added to pandas?

LuisBL added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Sep 23, 2022

simonjayhawkins added the IO CSV read_csv, to_csv label Feb 6, 2024

mroeschke added Enhancement and removed Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Jul 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

QST: handle lines with less separators than main lines in read_csv #48728

QST: handle lines with less separators than main lines in read_csv #48728

LuisBL commented Sep 23, 2022

phofl commented Sep 29, 2022

DJviolin commented Nov 2, 2022 •

edited

Loading

phofl commented Nov 2, 2022

brki commented Sep 6, 2024

QST: handle lines with less separators than main lines in read_csv #48728

QST: handle lines with less separators than main lines in read_csv #48728

Comments

LuisBL commented Sep 23, 2022

Research

Link to question on StackOverflow

Question about pandas

phofl commented Sep 29, 2022

DJviolin commented Nov 2, 2022 • edited Loading

phofl commented Nov 2, 2022

brki commented Sep 6, 2024

DJviolin commented Nov 2, 2022 •

edited

Loading