import pandas as pd
from numpy import dtype

### Create the sample data
with open('data.csv', 'w+') as file:
    file.write("ID,X1,X2,X3\n")
    file.write("0,1,Amigo,3\n")
    # This row has an extra comma, so it parses as five fields instead of four
    file.write("1,1,Inimigo, amor,9\n")
    file.write("2,1,Cowboy,42\n")

dtypes = {"ID": dtype("int64"),
          "X1": dtype("int64"),
          "X2": dtype("O"),
          "X3": dtype("int64")}

print("Load df with no params: ", end="")
try:
    df = pd.read_csv("data.csv")
    print("Success")
except Exception:
    print("Fail")

print("Load df with error bad lines: ", end="")
try:
    df = pd.read_csv("data.csv", error_bad_lines=False)
    print("Success")
except Exception:
    print("Fail")

print("Load df with error bad lines and dtypes: ", end="")
try:
    df = pd.read_csv("data.csv", error_bad_lines=False, dtype=dtypes)
    print("Success")
except Exception:
    print("Fail")
Problem description
error_bad_lines is pretty useful for dealing with undesired commas inside the data that split a single column into two. But when dtype is defined, read_csv checks the type of each column before skipping a problematic row, so the type check fails on that row.
I'd argue that the row should be skipped before the dtype check, because a problematic row has its values shifted into the wrong columns, so its dtypes are messed up.
Expected Output
It should skip the problematic row even when dtype is passed as a parameter.
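As a possible workaround in the meantime, rows with the wrong number of fields can be dropped before pandas ever sees them. The sketch below is only an illustration using the standard library's csv module; the helper name drop_malformed_rows is made up for this example and is not part of pandas. It keeps only the rows whose field count matches the header's.

```python
import csv
import io


def drop_malformed_rows(csv_text):
    """Hypothetical helper: keep only rows whose field count matches the header.

    Returns the cleaned CSV text and the 1-based line numbers that were skipped.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)

    skipped = []
    for row in reader:
        if len(row) == len(header):
            writer.writerow(row)
        else:
            # Too many or too few fields, e.g. a stray comma inside a value
            skipped.append(reader.line_num)
    return out.getvalue(), skipped


# Same sample data as in the report above
sample = (
    "ID,X1,X2,X3\n"
    "0,1,Amigo,3\n"
    "1,1,Inimigo, amor,9\n"
    "2,1,Cowboy,42\n"
)

clean_text, skipped = drop_malformed_rows(sample)
print(clean_text)
print("Skipped lines:", skipped)
```

The cleaned text could then be loaded with pd.read_csv(io.StringIO(clean_text), dtype=dtypes), so the dtype check only runs on well-formed rows.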
When submitting a bug report, please try to make the reproducible example as minimal as possible. FWIW I've done that below, but I am not getting any errors on the master branch. I see you are using pandas 0.20.3; have you at least tried the latest version, 0.22?
Output of pd.show_versions()
pandas: 0.22.0
pytest: 3.0.5
pip: 9.0.1
setuptools: 36.5.0
Cython: None
numpy: 1.14.2
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.14
pymysql: None
psycopg2: 2.7.3.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: 0.1.4
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Update: I've updated my pandas version.