usecols doesn't help with unclean csv's #9549
There were a couple of issues w.r.t. this that have been fixed in master, the upcoming 0.16.0 release. Want to give it a try? Otherwise, pls …
```
!cat pandastrial.csv
A,B,C
1,2,3
3,4,5
1,2,4,5,1,6
1,2,3,,,1,
1,2,3
5,6,7
```
```
In [3]: df = pd.read_csv('pandastrial.csv', error_bad_lines=False)
Skipping line 4: expected 3 fields, saw 6
Skipping line 5: expected 3 fields, saw 7

In [5]: df = pd.read_csv('pandastrial.csv', usecols=['A','B','C'])
CParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 6
```

Basically, I know that there is an issue with the other columns and I only trust three of them. How do I tell pandas to just read the first three values and truncate the rest?

INSTALLED VERSIONS
commit: None
pandas: 0.15.2
I came up with this hack.

```python
import pandas as pd

def clean_csv(filename, n_clean_cols, maxcols=20):
    # Over-allocate column slots so ragged rows parse without errors.
    df = pd.read_csv(filename, names=range(maxcols), index_col=False)
    # The real header landed in the first data row; use it for the names.
    rename_dict = {c: df[c][0] for c in df.columns}
    df.rename(columns=rename_dict, inplace=True)
    # Drop the header row (no longer needed) and keep only the trusted columns.
    df = df.ix[1:, :n_clean_cols]
    return df
```
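For reference, a call to the hack above might look like this (using the sample file from earlier in the thread):

```python
# Keep only the three trusted columns of the sample file.
df = clean_csv('pandastrial.csv', n_clean_cols=3)
```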
@jreback ?
This looks like a good solution to me. I don't think …
I added a `process_bad_lines` option to the C parser for times when you want to get as much data out of a CSV file with bad lines as possible without pre-processing the file. The docstring mentions the danger of using this option. Using it outputs WARNING lines for the bad lines that are processed:

```
pd.read_csv(StringIO(data), process_bad_lines=True)
Out[8]: …
```

It's still easy to get bad data from bad lines with this option, so it's not a cure for bad data. If this is an option that would be useful, I can clean up the code and create a pull request.
How does this differ from my example above?
`error_bad_lines` and `warn_bad_lines` skip over bad rows; `process_bad_lines` attempts to extract data from the bad rows. So the dataframe in your example with `error_bad_lines=False` contains data from lines 1, 2, 5, 6, while the dataframe with `process_bad_lines=True` has data from all 6 lines. I believe this is what the OP was looking for. `process_bad_lines` would be useful when you know the data in the columns you are interested in is OK even on lines in the file that contain extra columns you don't care about.
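Since `process_bad_lines` was never merged, here is a minimal sketch that emulates the "extract what you can" semantics described above using the standard `csv` module; the function name and the field count of 3 are assumptions based on the sample file:

```python
import csv
import io

import pandas as pd

def read_keeping_bad_lines(filename, n_fields):
    # Normalize every row to exactly n_fields before pandas sees the data:
    # truncate long rows, pad short ones with empty strings.
    buf = io.StringIO()
    writer = csv.writer(buf)
    with open(filename, newline='') as f:
        for row in csv.reader(f):
            writer.writerow((row + [''] * n_fields)[:n_fields])
    buf.seek(0)
    return pd.read_csv(buf)

df = read_keeping_bad_lines('pandastrial.csv', 3)  # data from all 6 lines
```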
Why not add a new arg then instead?
OK, so something like this for the bad_lines options?

```
error_bad_lines : True, False, "process", default True
```
I think I would change things a bit. Why don't we make a new argument (and deprecate …)?

The default would be …

Too confusing? (Or less confusing?) (Obviously not all of these apply to the C/Python parsers, but that is a separate issue.)
@selasley Yes, that is what I wanted. My work involved checking CSVs which, over the years, some people had edited and added comments to, etc. But the data was clean, the parts I wanted at least. I finally used the hack I mentioned earlier, but it was a little slow and dirty. How did you implement it initially?

@jreback Your last comment makes sense. I didn't find it that confusing. Maybe something other than …
@harshnisar I am modifying the Python, Cython, and C files for the parser and tokenizer, so `bad_lines` will eventually be available in a release version of pandas. What do you think about this?

```
bad_lines : string, {'error', 'skip', or 'keep'}, default 'error'
```

I vote to keep `warn_bad_lines` in order to turn warnings on or off with the skip and keep options.
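For illustration only, usage of this proposed option would presumably look like the sketch below; `bad_lines` was discussed in this thread but never merged into pandas:

```python
# Hypothetical, proposed API -- not a real pandas parameter.
df = pd.read_csv('pandastrial.csv', bad_lines='keep')  # salvage bad rows
df = pd.read_csv('pandastrial.csv', bad_lines='skip')  # like error_bad_lines=False
```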
@selasley Why would you need …? Maybe enumerate what the other 3 do, e.g. …
@selasley The heuristic way would be to take the most frequent row length as the default, but is this overkill? Also, keeping the warning as a separate argument seems more intuitive. I vote for that too, if it counts :)
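A minimal sketch of that heuristic, assuming a plain-text CSV (the function name is made up here):

```python
import csv
from collections import Counter

def modal_field_count(filename):
    # Use the most common number of fields per row as the "clean" width.
    with open(filename, newline='') as f:
        counts = Counter(len(row) for row in csv.reader(f))
    return counts.most_common(1)[0][0]
```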
@jreback @harshnisar I'm still working on the code. For example, I would like to make `header=False` work as well as `skiprows=1` when they are used with `bad_lines='keep'`.
@selasley I think the addition described here for processing bad lines would be very useful. Do you still want to make a PR for this? Or just push your code to a branch on your fork so somebody else could use it?
Closing this in favor of #15122; the discussion here was a bit buried under the original, related-but-different issue report.
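For readers arriving later: a hook along these lines did eventually ship. In pandas >= 1.4, `on_bad_lines` accepts a callable (Python engine only) that receives each bad line as a list of fields and returns the fields to keep, or None to skip the line. A sketch using the sample file from this thread:

```python
import pandas as pd

df = pd.read_csv(
    'pandastrial.csv',
    engine='python',
    on_bad_lines=lambda fields: fields[:3],  # keep only the trusted columns
)
```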
So I have a lot of CSVs which are clean up to, say, 7 columns and have no missing values, but have strings at random places starting after column 7.

I know the data is clean up to column 7 only. So when I pass usecols and list the 7 columns, I want pandas to ignore the other columns, probably truncating the rest of each row while reading too. Shouldn't that be the functionality?

I don't want to skip over bad lines.

Is there a way I can force pandas to read only 7 columns, expect 7 fields per row while reading, and hence not raise an exception?
Another method is to use `names = range(35)`, an arbitrarily large number. But then I lose the real headers in my file and can't tell what the columns mean. These columns are not fixed across files.
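One way to keep the real headers while over-allocating column slots is to read the header row separately; a sketch assuming a hypothetical `data.csv` with 7 trusted columns, as in this report:

```python
import csv

import pandas as pd

# Grab the real header first, then parse the ragged body with dummy names.
with open('data.csv', newline='') as f:
    header = next(csv.reader(f))

df = pd.read_csv('data.csv', names=range(35), skiprows=1, index_col=False)
df = df.iloc[:, :7]       # keep only the 7 trusted columns
df.columns = header[:7]   # restore their real names
```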
Edit: This is my first issue report in a huge Python package. Please bear with me if I didn't follow protocol.