The behaviour of on_bad_lines='warn' puzzles me. The pandas team added functionality to directly handle lines with more separators than the main lines, so it would not seem strange to me if some other pandas option could handle lines with fewer separators than the main lines in the same way.
Using pandas.read_csv with the on_bad_lines='warn' option, the case where there are too many
column delimiters works well: the bad lines are not loaded, and their line numbers are
reported on stderr:
    import pandas as pd
    from io import StringIO

    data = StringIO("""
    nom,f,nb
    bat,F,52
    cat,M,66,
    caw,F,15
    dog,M,66,,
    fly,F,61
    ant,F,21""")
    df = pd.read_csv(data, sep=',', on_bad_lines='warn')
    # b'Skipping line 4: expected 3 fields, saw 4\nSkipping line 6: expected 3 fields, saw 5\n'
    df.head(10)
    #    nom  f  nb
    # 0  bat  F  52
    # 1  caw  F  15
    # 2  fly  F  61
    # 3  ant  F  21
But when a line has fewer delimiters (here sep=',') than the main lines, the line
is still loaded, with the missing fields filled with NaN:
    import pandas as pd
    from io import StringIO

    data = StringIO("""
    nom,f,nb
    bat,F,52
    catM66,
    caw,F,15
    dog,M66
    fly,F,61
    ant,F,21""")
    df = pd.read_csv(data, sep=',', on_bad_lines='warn', dtype=str)
    df.head(10)
    #       nom    f   nb
    # 0     bat    F   52
    # 1  catM66  NaN  NaN  <==
    # 2     caw    F   15
    # 3     dog  M66  NaN  <==
    # 4     fly    F   61
    # 5     ant    F   21
Is there a way to make read_csv skip lines that have fewer column delimiters than
the main lines?
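One possible workaround, sketched here under stated assumptions rather than as a pandas feature, is to pre-filter the rows with the standard-library csv module and hand only the well-formed ones to pandas. The helper name read_csv_strict is hypothetical. Note that it skips every row whose field count differs from the header (too many as well as too few), and that all values come back as strings, much like dtype=str.

```python
import csv
from io import StringIO

import pandas as pd


def read_csv_strict(buffer, sep=","):
    """Hypothetical helper: keep only rows whose field count matches the header."""
    reader = csv.reader(buffer, delimiter=sep)
    header = None
    good_rows, skipped = [], []
    for row in reader:
        if not row:                      # blank line: ignore it
            continue
        if header is None:               # first non-blank line is the header
            header = row
        elif len(row) == len(header):    # well-formed row
            good_rows.append(row)
        else:                            # too few or too many fields
            skipped.append((reader.line_num, len(row)))
    return pd.DataFrame(good_rows, columns=header), skipped


data = StringIO("""
nom,f,nb
bat,F,52
catM66,
caw,F,15
dog,M66
fly,F,61
ant,F,21""")

df, skipped = read_csv_strict(data)
for line_no, n_fields in skipped:
    print(f"Skipping line {line_no}: expected {len(df.columns)} fields, saw {n_fields}")
```

This reads the whole input in Python before building the DataFrame, so it gives up the speed of the C parser; for large files that trade-off may matter.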
If a row has more fields than fieldnames, the remaining data is put in a list and stored with the fieldname specified by restkey (which defaults to None). If a non-blank row has fewer fields than fieldnames, the missing values are filled-in with the value of restval (which defaults to None).
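I've used it like this to detect rows that have too few columns:

    import csv

    last_field_name = 'foo'   # name of the last column in the header
    missing_value = object()  # sentinel that marks fields absent from a short row
    # fp is an already-open CSV file
    for row in csv.DictReader(fp, restval=missing_value):
        if row[last_field_name] is missing_value:
            print(f"Row with too few columns: {row}")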
Perhaps something like that could be easily added to pandas?
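As a rough illustration of how the DictReader idea above could feed a DataFrame today, the sketch below applies it to data modelled on the question. The names MISSING and EXTRA are assumptions made for the example, and the sample has no leading blank line because DictReader takes the very first line as the header.

```python
import csv
from io import StringIO

import pandas as pd

# Sample data modelled on the question: one row with an extra field,
# one row with a missing field.
raw = StringIO(
    "nom,f,nb\n"
    "bat,F,52\n"
    "cat,M,66,\n"
    "caw,F,15\n"
    "dog,M66\n"
    "fly,F,61\n"
)

MISSING = object()   # sentinel used as restval (fills fields missing from short rows)
EXTRA = "_extra"     # hypothetical restkey name (collects surplus fields of long rows)

reader = csv.DictReader(raw, restkey=EXTRA, restval=MISSING)
good_rows, bad_lines = [], []
for row in reader:
    too_many = EXTRA in row
    too_few = any(value is MISSING for value in row.values())
    if too_many or too_few:
        bad_lines.append(reader.line_num)   # remember where the bad row was
    else:
        good_rows.append(row)

# Build the DataFrame from the well-formed rows only; values stay strings.
df = pd.DataFrame(good_rows, columns=reader.fieldnames)
print(df)
print("Skipped lines:", bad_lines)
```

Using a fresh object() as the sentinel, compared with `is`, avoids confusing a genuinely empty field with a missing one.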
Research
I have searched the [pandas] tag on StackOverflow for similar questions.
I have asked my usage related question on StackOverflow.
Link to question on StackOverflow
https://stackoverflow.com/questions/73820090/make-pandas-read-csv-to-not-add-lines-with-less-columns-delimiters-than-the-main