BUG: read_csv accepts bad lines when option usecols is used #40049
Comments
This appears to be intentional, given the following code in pandas.io.parsers.python_parser
While I understand your point of view on this, changing it could silently alter the results of existing code. In this specific context it is also possible to get the first result by indexing with the column names instead of using usecols. If there is consensus that this should be fixed, I could submit a PR. The fix could be as simple as removing the check on self.usecols and adding tests.
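The workaround mentioned above (parse every column, then index by column name) could look roughly like the sketch below. The CSV text and the helper name read_then_select are made up for illustration. The on_bad_lines="skip" argument assumes pandas 1.3 or later; on the 1.2.x releases discussed in this issue the corresponding option was error_bad_lines=False.

```python
import io

import pandas as pd

# Hypothetical data: the third line has one field more than the header.
CSV_TEXT = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"


def read_then_select(text, columns):
    """Parse all columns so over-long lines are skipped, then keep only `columns`."""
    # No usecols here: the parser sees every field and can reject the long line.
    # Assumption: pandas >= 1.3 (on_bad_lines); older versions use error_bad_lines=False.
    frame = pd.read_csv(io.StringIO(text), on_bad_lines="skip")
    # Narrow to the wanted columns only after the bad line has been dropped.
    return frame[columns]


selected = read_then_select(CSV_TEXT, ["a", "b"])
print(selected)
```

The trade-off is that every column is parsed and held in memory before the selection, which is exactly the cost usecols is normally used to avoid.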
I understand that this would be a breaking change. However, I believe the current behaviour is illogical and dangerous, and therefore should be fixed. If the core team decides against the change, I would suggest at least adding a clear warning in the documentation. I'll now list my reasons to argue this.
First, the current behaviour is illogical. From a logical perspective, there is no reason why
Second, the current behaviour defeats developer expectations. Using
This last fact leads to danger. Moving from
Even without
Honestly, when opening this bug, I thought the behaviour difference might have to do with optimizations. I thought that maybe the parser doesn't even parse an entire row, and instead only considers the columns specified by
@lodo1995 @alexprincel This is actually mentioned in the user guide (https://pandas.pydata.org/docs/user_guide/io.html#handling-bad-lines) and would need deprecation if it were to be removed. I don't think it would be appropriate to remove this now given that it is the only way to read in lines with more fields than usual. I think if #5686 is implemented, this would remove the need for usecols to have this functionality and this could be deprecated after.
The problem is that depending on where stray commas appear in the row, we could end up with bad data in the dataframe because What we need to do then, whenever we want to use Unfortunately it isn't practical to use the
I'm game for ideas for faster ways to discover the bad rows before we use
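One simple (not necessarily fast) way to discover bad rows up front is to scan the file with the standard-library csv module and record every row whose field count differs from the header. This is only a sketch: the helper name find_bad_rows and the sample data are hypothetical, and it assumes no quoted field contains an embedded newline, so that row indices match physical line numbers.

```python
import csv
import io


def find_bad_rows(lines, expected_fields=None, delimiter=","):
    """Return 0-based indices of rows whose field count differs from the expected one.

    If `expected_fields` is None, the first row (assumed to be the header)
    defines the expected count and is not itself checked.
    """
    bad = []
    for index, row in enumerate(csv.reader(lines, delimiter=delimiter)):
        if expected_fields is None:
            expected_fields = len(row)  # take the width from the header row
            continue
        if len(row) != expected_fields:
            bad.append(index)
    return bad


# Hypothetical data: row index 2 carries a stray extra field.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n")
bad_rows = find_bad_rows(sample)
print(bad_rows)
```

Under the same no-embedded-newlines assumption, the resulting indices could then be passed to read_csv through its skiprows parameter, which accepts a list of 0-indexed line numbers, so that the subsequent usecols read never sees the malformed lines.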
I have checked that this issue has not already been reported (couldn't find it)
I have confirmed this bug exists on the latest version of pandas (current version is 1.2.2 on conda)
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
Bad CSV file:
Python code:
Problem description
As can be seen in the example above, when not using usecols, read_csv correctly ignores the line that is too long. However, this is not the case when usecols is used.
Expected Output
Using usecols should not affect which lines are discarded due to being too long.
Output of pd.show_versions()