CSV files with header break with delim_whitespace and skiprows using the C-engine #18692
@gfyoung can you have a look?
@rgieseke : Thanks for reporting this! Handling malformed rows is just not an easy question, and I agree that we should handle this case. #12900 was a design choice on our part to ensure that quoted lines got fully skipped, even if they had line terminators within them. In your case, because your quoted line never properly terminates, […] I'm surprised to see that the Python engine is okay with this. Line-skipping behaviour is hard to follow there because it is masked away in the Python standard library itself. My feeling is that we can add a second parameter called […]
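For context, the #12900 behaviour mentioned above can be sketched as follows (a minimal illustration, not taken from the thread): since that change, the C engine honours quoting while skipping rows, so a skipped row whose quoted field contains a line terminator consumes all of its physical lines.

```python
import pandas as pd
from io import StringIO

# A quoted field containing a line terminator spans two physical lines.
data = '"first\nrecord"\n1\n2\n'

# Since #12900, the C engine treats the quoted record as a single logical
# row, so skiprows=[0] skips both physical lines, leaving only 1 and 2.
df = pd.read_csv(StringIO(data), header=None, skiprows=[0])
print(df[0].tolist())
```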
@gfyoung Thanks for the quick feedback! Here is a more real-world example from the data where I encountered the problem (simplified; this is Fortran namelist metadata followed by whitespace-separated columns). It's not really malformed, it just has a space within the quotes:

```python
import pandas as pd
from io import StringIO

data = """&THISFILE_SPECIFICATIONS
THISFILE_UNITS="K ",
/
YEARS GLOBAL
1765 0.00000000E+00
"""

pd.read_csv(StringIO(data), skiprows=3, delim_whitespace=True)
```

If there were newlines in the header to be skipped, wouldn't it be okay to treat them as plain newlines? If one has a weird header one wants skipped, one would need to check anyway whether the first line is correctly identified.
Indeed, this example is harder to explain away, since it isn't particularly malformed in this case… have a look at the CParser code to see where the discrepancy is arising.

Not sure I fully understand your question here.
Sorry, that was hard to parse… I hadn't thought of the use case of actually wanting to skip a number of rows. I only ever use skiprows to get rid of meta information in header lines.
Ah, gotcha. In any case, FWIW, if you remove the […]
I've experienced a similar issue when using skiprows to skip "corrupt" lines in CSV files. Tested with pandas 0.22.0 and 0.19.2. Perhaps adding an argument to disable parsing of quotes in skipped rows is an option to fix this?

Example:

```python
import pandas as pd
from io import StringIO
import csv

pd.__version__

data = """1 2 3
a"b "
4 5 6
"""
print(data)

# C engine: breaks on the unbalanced quote in the skipped row
pd.read_csv(StringIO(data), header=None, delim_whitespace=True, skiprows=[1])
# Python engine: works
pd.read_csv(StringIO(data), header=None, delim_whitespace=True, skiprows=[1], engine="python")
# C engine with quoting disabled: works
pd.read_csv(StringIO(data), header=None, delim_whitespace=True, skiprows=[1], quoting=csv.QUOTE_NONE)
```
Code Sample, a copy-pastable example if possible
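(The original code sample did not survive extraction; the snippet below is a reconstruction based on the simplified example given later in the thread, so treat it as an approximation of the reporter's sample.)

```python
import pandas as pd
from io import StringIO

# Fortran-namelist-style metadata followed by whitespace-separated columns;
# note the space inside the quoted "K " value.
data = """&THISFILE_SPECIFICATIONS
THISFILE_UNITS="K ",
/
YEARS GLOBAL
1765 0.00000000E+00
"""

# Breaks with the C engine (pandas 0.21.0):
# pd.read_csv(StringIO(data), skiprows=3, delim_whitespace=True)

# Works with the Python engine:
df = pd.read_csv(StringIO(data), skiprows=3, delim_whitespace=True,
                 engine="python")
print(df)
```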
Problem description

When there is a quote char with a space before it, and delim_whitespace=True and skiprows are used, reading a CSV file breaks with […]

Expected Output

It should simply skip the header rows.

When using the Python engine it works, so this seems to be a problem with the C-based parser, possibly related to the behaviour introduced in #12900.

My real data has something like […]

I also tested this with current master.

Output of pd.show_versions()
```
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-40-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.21.0
pytest: 3.3.1
pip: 9.0.1
setuptools: 38.2.4
Cython: 0.27.3
numpy: 1.13.3
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
```