Skip to content

QST: handle lines with less separators than main lines in read_csv #48728

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
2 tasks done
LuisBL opened this issue Sep 23, 2022 · 4 comments
Open
2 tasks done

QST: handle lines with less separators than main lines in read_csv #48728

LuisBL opened this issue Sep 23, 2022 · 4 comments
Labels
Enhancement IO CSV read_csv, to_csv

Comments

@LuisBL
Copy link

LuisBL commented Sep 23, 2022

Research

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/73820090/make-pandas-read-csv-to-not-add-lines-with-less-columns-delimiters-than-the-main

Question about pandas

The on_bad_lines=warn question me. Pandas team added the functionality to directly handle lines with more separators than the main lines, that's why it don't seems strange to me that some other pandas option could handle in the same way lines with less separators than main lines

Using pandas.read_csv with on_bad_lines='warn' option the case there is too many
columns delimiters work well, bad lines are not loaded and stderr catch the bad lines
numbers:

    import pandas as pd
    from io import StringIO
    data = StringIO("""
    nom,f,nb
    bat,F,52
    cat,M,66,
    caw,F,15
    dog,M,66,,
    fly,F,61
    ant,F,21""")
    df = pd.read_csv(data, sep=',', on_bad_lines='warn')

    b'Skipping line 4: expected 3 fields, saw 4\nSkipping line 6: expected 3 fields, saw 5\n'

    df.head(10)
    #    nom  f  nb
    # 0  bat  F  52
    # 1  caw  F  15
    # 2  fly  F  61
    # 3  ant  F  21

But in case the number of delimiter (here sep=,) is less than the main, the line
is added adding NaN.:

    import pandas as pd
    from io import StringIO
    data = StringIO("""
    nom,f,nb
    bat,F,52
    catM66,
    caw,F,15
    dog,M66
    fly,F,61
    ant,F,21""")
    df = pd.read_csv(data, sep=',', on_bad_lines='warn', dtype=str)
    df.head(10)

    #       nom    f   nb
    # 0     bat    F   52
    # 1  catM66  NaN  NaN            <==
    # 2     caw    F   15
    # 3     dog  M66  NaN            <==
    # 4     fly    F   61
    # 5     ant    F   21

Is there a way to make read_csv to not add lines with less columns delimiters than
the main lines ?

@LuisBL LuisBL added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Sep 23, 2022
@phofl
Copy link
Member

phofl commented Sep 29, 2022

Hi, thanks for your report.
Could you check, if we have open issues about this? Pretty sure that I've seen something like this already

@DJviolin
Copy link

DJviolin commented Nov 2, 2022

@phofl I also experience this problem, I also searched similar opened issues, but I couldn't find either.

This missing feature is the reason why I still need to use AWK, to pre-process my CSV files:

{
    if (NF != 11) next; # if column count is not equal to 11, then remove it
}

This feature is a MUST in a tool that's processing CSV.

Also, usecols together with on_bad_lines breaks the later one: #40049

@phofl
Copy link
Member

phofl commented Nov 2, 2022

Contributions are welcome

@simonjayhawkins simonjayhawkins added the IO CSV read_csv, to_csv label Feb 6, 2024
@mroeschke mroeschke added Enhancement and removed Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Jul 16, 2024
@brki
Copy link

brki commented Sep 6, 2024

The standard csv DictReader has this approach:

If a row has more fields than fieldnames, the remaining data is put in a list and stored with the fieldname specified by restkey (which defaults to None). If a non-blank row has fewer fields than fieldnames, the missing values are filled-in with the value of restval (which defaults to None).

I've used this like this to detect row that have too few columns:

last_field_name = 'foo'
missing_value = object()
for row in csv.DictReader(fp, restval=missing_value):
    if row[last_field_name] == missing_value:
        print(f"Row with too few columns: {row}")

Perhaps something like that could be easily added to pandas?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO CSV read_csv, to_csv
Projects
None yet
Development

No branches or pull requests

6 participants