BUG: read_csv accepts bad lines when option usecols is used #40049


Open · 2 of 3 tasks
lodo1995 opened this issue Feb 25, 2021 · 4 comments
Labels
Deprecate Functionality to remove in pandas IO CSV read_csv, to_csv Needs Discussion Requires discussion from core team before further action

Comments

lodo1995 commented Feb 25, 2021

  • I have checked that this issue has not already been reported (couldn't find it)

  • I have confirmed this bug exists on the latest version of pandas (current version is 1.2.2 on conda)

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

Bad CSV file:

A,B,C
foo,1,2
bar,bar,3,4
baz,5,6

Python code:

import pandas as pd

df_all = pd.read_csv('bad.csv', error_bad_lines=False, warn_bad_lines=False)
print(df_all)
#      A  B  C
# 0  foo  1  2
# 1  baz  5  6

df_sub = pd.read_csv('bad.csv', error_bad_lines=False, warn_bad_lines=False, usecols=['A', 'B'])
print(df_sub)
#      A    B
# 0  foo    1
# 1  bar  bar
# 2  baz    5

Problem description

As can be seen in the example above, when not using usecols, read_csv correctly ignores the line that is too long.
However, this is not the case when usecols is used.

Expected Output

Using usecols should not affect which lines are discarded due to being too long.
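Concretely, df_sub would then be expected to match the df_all result restricted to the requested columns, with the malformed line dropped:

print(df_sub)
#      A  B
# 0  foo  1
# 1  baz  5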

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 7d32926db8f7541c356066dcadabf854487738de
python           : 3.8.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.3.0-64-generic
Version          : #58-Ubuntu SMP Fri Jul 10 19:33:51 UTC 2020
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.2.2
numpy            : 1.20.1
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 21.0.1
setuptools       : 49.6.0.post20210108
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.4
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.6.0
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
@lodo1995 lodo1995 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 25, 2021
alexprincel (Contributor) commented Feb 27, 2021

This appears to be intentional, given the following code in pandas.io.parsers.python_parser

    def _rows_to_cols(self, content):
        col_len = self.num_original_columns

        if self._implicit_index:
            col_len += len(self.index_col)

        max_len = max(len(row) for row in content)

        # Check that there are no rows with too many
        # elements in their row (rows with too few
        # elements are padded with NaN).
        if max_len > col_len and self.index_col is not False and self.usecols is None:

            footers = self.skipfooter if self.skipfooter else 0
            bad_lines = []

While I understand your point of view, changing this could silently alter the results of existing code. It is also possible to get the first result by indexing the returned dataframe with the column names instead of passing usecols, as shown below.
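A minimal sketch of that workaround, reusing the bad.csv file from the report above:

import pandas as pd

# read everything (the bad line is dropped), then select the columns
df_sub = pd.read_csv('bad.csv', error_bad_lines=False, warn_bad_lines=False)[['A', 'B']]
print(df_sub)
#      A  B
# 0  foo  1
# 1  baz  5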

If there is consensus that this should be fixed, I could submit a PR. The fix could be as simple as removing the check on self.usecols and adding tests.
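Roughly, that would mean relaxing the condition quoted above (a sketch only, assuming nothing else relies on the usecols check):

# in pandas.io.parsers.python_parser._rows_to_cols
if max_len > col_len and self.index_col is not False:
    ...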

lodo1995 (Author) commented:

I understand that this would be a breaking change. However, I believe the current behaviour is illogical and dangerous, and therefore should be fixed. If the core team decides against the change, I would suggest at least adding a clear warning to the documentation. My reasons follow.

First, the current behaviour is illogical. There is no reason why usecols should affect error handling: filtering columns and rejecting malformed rows are orthogonal concerns.

Second, the current behaviour defeats developer expectations. Using usecols should be equivalent to indexing the dataframe returned by read_csv. The documentation encourages developers to replace the latter with the former for better performance, yet the two are not always equivalent.

This last fact is what makes it dangerous. Moving from read_csv with error_bad_lines=False followed by indexing, to read_csv with error_bad_lines=False and usecols, can silently change the contents of the dataframe. Unless type errors surface later, the change could go unnoticed indefinitely.

Even without error_bad_lines=False there are risks. A developer used to read_csv rejecting any malformed CSV might assume that if a CSV is accepted, it must be well-formed and the resulting dataframe therefore reliable. With usecols, that is not necessarily the case; and again, if no type errors appear in later code, the issue might go unnoticed.

Honestly, when opening this bug, I thought the behaviour difference might be an optimization: perhaps the parser does not parse entire rows at all, considering only the columns specified by usecols, and is therefore unable to verify row length. However, the snippet shown in the previous comment implies, as @alexprincel noted, that the different behaviour is deliberate rather than optimization-driven. In my opinion, this removes the only good justification for it.

@MarcoGorelli MarcoGorelli added the IO CSV read_csv, to_csv label Feb 28, 2021
@lithomas1 lithomas1 added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 11, 2021
lithomas1 (Member) commented Mar 15, 2021

@lodo1995 @alexprincel This is actually mentioned in the user guide (https://pandas.pydata.org/docs/user_guide/io.html#handling-bad-lines) and would need a deprecation cycle if it were to be removed. I don't think it would be appropriate to remove this now, given that it is the only way to read in lines with more fields than usual. If #5686 is implemented, that would remove the need for usecols to have this functionality, and it could be deprecated afterwards.
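The pattern the user guide documents is roughly the following (a sketch with made-up data):

import pandas as pd
from io import StringIO

# the third line has an extra field; with usecols it is read
# (truncated to the first three columns) instead of rejected
data = 'a,b,c\n1,2,3\n4,5,6,7\n8,9,10'
print(pd.read_csv(StringIO(data), usecols=[0, 1, 2]))
#    a  b   c
# 0  1  2   3
# 1  4  5   6
# 2  8  9  10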

@mroeschke mroeschke added Deprecate Functionality to remove in pandas and removed Bug labels Aug 19, 2021
rotten commented Apr 11, 2022

The problem is that depending on where stray commas appear in the row, we could end up with bad data in the dataframe because usecols will pull the wrong fields.

What we need to do then, whenever we want to use usecols, is to read the CSV twice. The first pass generates a list of bad rows to pass into the skiprows argument; the second pass can then safely use usecols.

Unfortunately it isn't practical to use pd.read_csv() itself for the bad-row detection: the on_bad_lines='error' option halts, and the on_bad_lines='warn' option writes to stderr, which is hard to capture and parse into a list of row numbers for the skiprows argument. Loading the whole CSV into memory just to detect bad lines, only to throw it away immediately afterwards, also doesn't make sense. So we fall back to a function like the one below, which builds the list of bad rows before the read_csv() call with usecols:

import csv
from typing import List

def get_bad_rows(csv_file_name: str) -> List[int]:
    '''Identify rows in a csv file with the wrong number of fields'''

    print(f'\tChecking {csv_file_name} for bad rows')
    bad_row_list = []

    with open(csv_file_name, 'r', encoding='latin-1') as csv_file:

        # pandas row numbers start counting at 0
        # the row numbers include the header if it exists
        row_counter = 0
        number_of_columns = 0
        csv_reader = csv.reader(csv_file)

        for row in csv_reader:
            if number_of_columns == 0:
                # assume the first row has the correct number of columns
                # (whether or not it is a header)
                number_of_columns = len(row)
            if len(row) != number_of_columns:
                bad_row_list.append(row_counter)
            row_counter += 1

    print(f'\t\tFound {len(bad_row_list)} bad rows.')
    return bad_row_list
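For example, combining it with read_csv (using the bad.csv from the original report):

import pandas as pd

bad_rows = get_bad_rows('bad.csv')  # [2] for the file above
df_sub = pd.read_csv('bad.csv', usecols=['A', 'B'], skiprows=bad_rows)
print(df_sub)
#      A  B
# 0  foo  1
# 1  baz  5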

I'm open to ideas for faster ways to discover the bad rows before we use read_csv, if we can't discover them while using read_csv.
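For what it's worth, a sketch assuming pandas 1.4 or later, where on_bad_lines accepts a callable with engine='python'. This avoids the stderr-capturing problem, though it still parses the whole file, and per this issue the callable only fires when usecols is not passed:

import pandas as pd

bad_lines = []

def record_bad_line(line):
    # receives each malformed line as a list of fields;
    # returning None tells the parser to skip the line
    bad_lines.append(line)
    return None

df = pd.read_csv('bad.csv', engine='python', on_bad_lines=record_bad_line)
df_sub = df[['A', 'B']]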
