read_fwf does not use comment character if colspecs argument does not include first column #14135

Closed
nograpes opened this issue Sep 2, 2016 · 5 comments · Fixed by #44245
Labels
good first issue IO CSV read_csv, to_csv Needs Tests Unit test(s) needed to prevent regressions

@nograpes
nograpes commented Sep 2, 2016

When reading fixed-width files with the read_fwf function, you can specify a comment character using the comment argument. I expected that all lines beginning with the comment character would be ignored. However, if none of the entries in colspecs covers the first column of the file, the comment character does not appear to be used.

Code Sample

import io
import pandas as pd

# Two input files, first line is comment, second line is data.
# Second file has a column (with the letter A) 
# that I don't want at start of data.
string = "#\n1K\n"
off_string = "#\nA1K\n"

# When using skiprows to skip commented row, both work.
pd.read_fwf(io.StringIO(string), colspecs = [(0,1), (1,2)], skiprows = 1, header = None)
#    0  1
#0  1  K

pd.read_fwf(io.StringIO(off_string), colspecs = [(1,2), (2,3)], skiprows = 1, header = None)
#    0  1
#0  1  K

# If a comment character is specified, it only works when the colspecs 
# includes the column with the comment character.
pd.read_fwf(io.StringIO(string), colspecs = [(0,1), (1,2)], comment = '#', header = None)
#    0  1
#0  1  K

pd.read_fwf(io.StringIO(off_string), colspecs = [(1,2), (2,3)], comment = '#', header = None)
#      0    1
#0  NaN  NaN
#1  1.0    K

Expected Output

I expect the final call to produce the same result as the other three, that is:

pd.read_fwf(io.StringIO(off_string), colspecs = [(1,2), (2,3)], comment = '#', header = None)

Should result in

#    0  1
#0  1  K

I first reported this on StackOverflow.
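
Besides skiprows, one possible workaround is to drop the comment lines from the text before it reaches read_fwf. The sketch below is a minimal illustration using only the standard library; strip_comment_lines is a hypothetical helper name, not part of pandas, and it only handles comments that start in the first column of a line.

```python
import io


def strip_comment_lines(text, comment="#"):
    """Hypothetical helper: return `text` without lines that start with `comment`."""
    kept = [
        line
        for line in text.splitlines(keepends=True)
        if not line.startswith(comment)
    ]
    return "".join(kept)


off_string = "#\nA1K\n"
buffer = io.StringIO(strip_comment_lines(off_string))
# `buffer` now holds only the data line and could be handed to pd.read_fwf
# with colspecs=[(1, 2), (2, 3)] and header=None, without the comment argument.
filtered = buffer.getvalue()
print(repr(filtered))  # 'A1K\n'
```

Because the filtering happens on the raw lines, it does not depend on which columns colspecs selects. Comments that begin partway through a data line are left untouched by this sketch.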

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.6.5-x86_64-linode71
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 3.3
Cython: None
numpy: 1.11.1
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.0.14
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: None
boto: None
pandas_datareader: None

@jreback
Contributor

jreback commented Sep 2, 2016

Kind of surprised that the comment works at all; it isn't tested, and disambiguating the comment vs. the colspecs is certainly not tested. That said, comment determination works just fine, as FWF inherits from the PythonParser.

I'll mark it, pull-requests with tests welcome.

@jreback jreback added this to the Next Major Release milestone Sep 2, 2016
@jbrockmendel jbrockmendel added IO Fixed Width read_fwf and removed IO CSV read_csv, to_csv labels Dec 12, 2019
@mroeschke
Member

This appears to work now on master. Could use a test.

In [30]: import io
    ...: import pandas as pd
    ...:
    ...: # Two input files, first line is comment, second line is data.
    ...: # Second file has a column (with the letter A)
    ...: # that I don't want at start of data.
    ...: string = "#\n1K\n"
    ...: off_string = "#\nA1K\n"

In [31]: pd.read_fwf(io.StringIO(off_string), colspecs = [(1,2), (2,3)], comment = '#', header = None)
Out[31]:
   0  1
0  1  K

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug IO Fixed Width read_fwf labels May 1, 2021
@calvh
Contributor

calvh commented May 14, 2021

take

@calvh
Contributor

calvh commented May 15, 2021

Specs:

INSTALLED VERSIONS
------------------
commit           : 2cb96529396d93b46abab7bbc73a208e708c642e
python           : 3.8.5.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.8.0-53-generic
Version          : #60~20.04.1-Ubuntu SMP Thu May 6 09:52:46 UTC 2021
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_CA.UTF-8
LOCALE           : en_CA.UTF-8

pandas           : 1.2.4
numpy            : 1.20.3
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 44.0.0
Cython           : None
pytest           : 6.2.4
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
None

I think the bug still exists, but at a deeper level. If I understand the problem correctly, read_fwf should recognize commented lines even if colspecs does not include the column where the comment character is found.

from io import StringIO

import pandas as pd
from pandas.io.parsers import read_fwf

data = [
    "#\n123\n456",
    "# \n123\n456",
    "#abc\n123\n456",
    "# abc\n123\n456",
]

colspecs = [(0, 1), (1, 2)]
print(f"colspecs: {colspecs}")
for s in data:
    df = read_fwf(StringIO(s), comment="#", colspecs=colspecs, header=None)
    print(f"{df}\n")

colspecs = [(1, 2), (2, 3)]
print(f"colspecs: {colspecs}")
for s in data:
    df = read_fwf(StringIO(s), comment="#", colspecs=colspecs, header=None)
    print(f"{df}\n")

Output:

colspecs: [(0, 1), (1, 2)]
   0  1
0  1  2
1  4  5

   0  1
0  1  2
1  4  5

   0  1
0  1  2
1  4  5

   0  1
0  1  2
1  4  5

colspecs: [(1, 2), (2, 3)]
   0  1
0  2  3
1  5  6

   0  1
0  2  3
1  5  6

   0  1
0  a  b
1  2  3
2  5  6

     0  1
0  NaN  a
1  2.0  3
2  5.0  6

The last two results are the interesting ones. In short, if you set colspecs to exclude the first column, one of two things can happen:

  1. Your comment line has only "#" and possibly whitespace: no problem
  2. Your comment line has "#" followed by non-whitespace characters: the comment character is ignored and the line is parsed as data

Please let me know if I made a mistake somewhere.
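
One way to picture the two cases is a simplified model in which each line is sliced by colspecs first and the comment check only sees the resulting fields. This ordering is an assumption made for illustration, not something confirmed in this thread, and the function names below are hypothetical rather than pandas internals.

```python
def slice_fields(line, colspecs):
    """Cut one line into stripped fields using (start, end) column specs."""
    return [line[start:end].strip() for start, end in colspecs]


def parse_slice_then_check(text, colspecs, comment="#"):
    """Assumed ordering: slice each line first, then test the fields for a comment."""
    rows = []
    for line in text.splitlines():
        fields = slice_fields(line, colspecs)
        # The check only sees the sliced fields, so a "#" outside colspecs is invisible.
        if fields and fields[0].startswith(comment):
            continue
        rows.append(fields)
    return rows


def parse_check_then_slice(text, colspecs, comment="#"):
    """Alternative ordering: drop comment lines while the raw line is still available."""
    return [
        slice_fields(line, colspecs)
        for line in text.splitlines()
        if not line.startswith(comment)
    ]


data = "#abc\n123\n456"
colspecs = [(1, 2), (2, 3)]
leaky = parse_slice_then_check(data, colspecs)
clean = parse_check_then_slice(data, colspecs)
print(leaky)  # [['a', 'b'], ['2', '3'], ['5', '6']]
print(clean)  # [['2', '3'], ['5', '6']]
```

In this model the "#abc" line leaks through as the row a, b, and "# abc" would leak through as an empty field followed by a, which has the same shape as the last two outputs above. Checking the raw line before slicing avoids both.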

@calvh
Contributor

calvh commented May 15, 2021

Here is the test code I made:

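from io import StringIO

import pytest

from pandas import DataFrame
import pandas._testing as tm
from pandas.io.parsers import read_fwf


# Every comment-line variant should be skipped, whichever columns colspecs selects.
@pytest.mark.parametrize(
    "data",
    [
        "#\n123\n456",
        "# \n123\n456",
        "#abc\n123\n456",
        "# abc\n123\n456",
    ],
)
@pytest.mark.parametrize(
    "colspecs, expected",
    [
        ([(0, 1), (1, 2)], DataFrame([[1, 2], [4, 5]])),
        ([(1, 2), (2, 3)], DataFrame([[2, 3], [5, 6]])),
    ],
)
def test_comment_no_colspecs(data, colspecs, expected):
    result = read_fwf(StringIO(data), comment="#", colspecs=colspecs, header=None)
    tm.assert_frame_equal(result, expected)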

@mroeschke mroeschke mentioned this issue Oct 31, 2021
@jreback jreback added the IO CSV read_csv, to_csv label Oct 31, 2021
@jreback jreback modified the milestones: Contributions Welcome, 1.4 Oct 31, 2021