
Commit a2aff4c

Squashed commit of the following:
commit 5e9e0fa29d727953583a116638a9d0db81f9ed21
Author: Michael Mueller <[email protected]>
Date:   Thu Jun 26 19:53:35 2014 -0400

    Fixed issue with empty lines

commit 57b54918b251ab77f000b575d77bcce3affcb27a
Author: Michael Mueller <[email protected]>
Date:   Thu Jun 26 16:31:27 2014 -0400

    Added reference to new functionality in docs

commit a2371638691584416439d3c6a4dd2ef1829dcbe3
Author: Michael Mueller <[email protected]>
Date:   Thu Jun 26 16:26:06 2014 -0400

    Implemented functionality to ignore comment lines, wrote a test
1 parent f8b101c commit a2aff4c

File tree: 7 files changed, +211 -22 lines changed


doc/source/io.rst

+31 -4

@@ -98,8 +98,10 @@ They can take a number of arguments:
     data. Defaults to 0 if no ``names`` passed, otherwise ``None``. Explicitly
     pass ``header=0`` to be able to replace existing names. The header can be
     a list of integers that specify row locations for a multi-index on the columns
-    E.g. [0,1,3]. Intervening rows that are not specified will be skipped.
-    (E.g. 2 in this example are skipped)
+    E.g. [0,1,3]. Intervening rows that are not specified will be
+    skipped (e.g. 2 in this example are skipped). Note that this parameter
+    ignores commented lines, so header=0 denotes the first line of
+    data rather than the first line of the file.
   - ``skiprows``: A collection of numbers for rows in the file to skip. Can
     also be an integer to skip the first ``n`` rows
   - ``index_col``: column number, column name, or list of column numbers/names,
@@ -145,8 +147,12 @@ They can take a number of arguments:
     Acceptable values are 0, 1, 2, and 3 for QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONE, and QUOTE_NONNUMERIC, respectively.
   - ``skipinitialspace`` : boolean, default ``False``, Skip spaces after delimiter
   - ``escapechar`` : string, to specify how to escape quoted data
-  - ``comment``: denotes the start of a comment and ignores the rest of the line.
-    Currently line commenting is not supported.
+  - ``comment``: Indicates remainder of line should not be parsed. If found at the
+    beginning of a line, the line will be ignored altogether. This parameter
+    must be a single character. Also, fully commented lines
+    are ignored by the parameter `header` but not by `skiprows`. For example,
+    if comment='#', parsing '#empty\n1,2,3\na,b,c' with `header=0` will
+    result in '1,2,3' being treated as the header.
   - ``nrows``: Number of rows to read out of the file. Useful to only read a
     small portion of a large file
   - ``iterator``: If True, return a ``TextFileReader`` to enable reading a file
@@ -252,6 +258,27 @@ after a delimiter:
    data = 'a, b, c\n1, 2, 3\n4, 5, 6'
    print(data)
    pd.read_csv(StringIO(data), skipinitialspace=True)
+
+Moreover, ``read_csv`` ignores any completely commented lines:
+
+.. ipython:: python
+
+   data = 'a,b,c\n# commented line\n1,2,3\n#another comment\n4,5,6'
+   print(data)
+   pd.read_csv(StringIO(data), comment='#')
+
+.. note::
+
+   The presence of ignored lines might create ambiguities involving line numbers;
+   the parameter ``header`` uses row numbers (ignoring commented
+   lines), while ``skiprows`` uses line numbers (including commented lines):
+
+   .. ipython:: python
+
+      data = '#comment\na,b,c\nA,B,C\n1,2,3'
+      pd.read_csv(StringIO(data), comment='#', header=1)
+      data = 'A,B,C\n#comment\na,b,c\n1,2,3'
+      pd.read_csv(StringIO(data), comment='#', skiprows=2)
 
 The parsers make every attempt to "do the right thing" and not be very
 fragile. Type inference is a pretty big deal. So if a column can be coerced to
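
The note above distinguishes ``header`` (which counts data rows after comment stripping) from ``skiprows`` (which counts raw file lines). A minimal standalone sketch of that distinction, not part of this commit and assuming a pandas build that includes it; the expected column lists in the comments follow from the documented behavior:

    # Minimal sketch (not from this commit). Uses io.StringIO, i.e. Python 3 syntax.
    from io import StringIO
    import pandas as pd

    # header=1 selects the second row *after* comment stripping, i.e. 'A,B,C'.
    data = '#comment\na,b,c\nA,B,C\n1,2,3'
    df1 = pd.read_csv(StringIO(data), comment='#', header=1)
    print(list(df1.columns))   # expected: ['A', 'B', 'C']

    # skiprows=2 skips the first two *raw* lines ('A,B,C' and '#comment'),
    # so 'a,b,c' becomes the header of what remains.
    data = 'A,B,C\n#comment\na,b,c\n1,2,3'
    df2 = pd.read_csv(StringIO(data), comment='#', skiprows=2)
    print(list(df2.columns))   # expected: ['a', 'b', 'c']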

doc/source/v0.14.1.txt

+3 -3

@@ -102,9 +102,9 @@ Enhancements
 
 
 
-
-
-
+- The file parsers ``read_csv`` and ``read_table`` now ignore line comments provided by
+  the parameter `comment`, which accepts only a single character for the C reader.
+  In particular, they allow for comments before file data begins (:issue:`2685`)
 - Tests for basic reading of public S3 buckets now exist (:issue:`7281`).
 - ``read_html`` now sports an ``encoding`` argument that is passed to the
   underlying parser library. You can use this to read non-ascii encoded web
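
As a usage sketch of the behavior this release note describes (not part of the diff; the banner text below is invented for illustration), leading comment lines before the header now parse cleanly:

    # Sketch only: comment lines before the data are skipped entirely.
    from io import StringIO
    import pandas as pd

    raw = ('# hypothetical banner comment\n'
           '# a second leading comment\n'
           'time,value\n'
           '0,1.5\n'
           '1,2.5\n')
    # comment must be a single character for the C reader
    df = pd.read_csv(StringIO(raw), comment='#')
    print(df.columns.tolist())   # expected: ['time', 'value']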

pandas/io/parsers.py

+33 -15

@@ -64,9 +64,11 @@ class ParserWarning(Warning):
     pass ``header=0`` to be able to replace existing names. The header can be
     a list of integers that specify row locations for a multi-index on the
     columns E.g. [0,1,3]. Intervening rows that are not specified will be
-    skipped. (E.g. 2 in this example are skipped)
+    skipped (e.g. 2 in this example are skipped). Note that this parameter
+    ignores commented lines, so header=0 denotes the first line of
+    data rather than the first line of the file.
 skiprows : list-like or integer
-    Row numbers to skip (0-indexed) or number of rows to skip (int)
+    Line numbers to skip (0-indexed) or number of lines to skip (int)
     at the start of the file
 index_col : int or sequence or False, default None
     Column to use as the row labels of the DataFrame. If a sequence is given, a
@@ -106,8 +108,12 @@ class ParserWarning(Warning):
 thousands : str, default None
     Thousands separator
 comment : str, default None
-    Indicates remainder of line should not be parsed
-    Does not support line commenting (will return empty line)
+    Indicates remainder of line should not be parsed. If found at the
+    beginning of a line, the line will be ignored altogether. This parameter
+    must be a single character. Also, fully commented lines
+    are ignored by the parameter `header` but not by `skiprows`. For example,
+    if comment='#', parsing '#empty\n1,2,3\na,b,c' with `header=0` will
+    result in '1,2,3' being treated as the header.
 decimal : str, default '.'
     Character to recognize as decimal point. E.g. use ',' for European data
 nrows : int, default None
@@ -1313,6 +1319,7 @@ def __init__(self, f, **kwds):
         self.data = None
         self.buf = []
         self.pos = 0
+        self.line_pos = 0
 
         self.encoding = kwds['encoding']
         self.compression = kwds['compression']
@@ -1459,6 +1466,7 @@ class MyDialect(csv.Dialect):
                 line = self._check_comments([line])[0]
 
                 self.pos += 1
+                self.line_pos += 1
                 sniffed = csv.Sniffer().sniff(line)
                 dia.delimiter = sniffed.delimiter
                 if self.encoding is not None:
@@ -1566,7 +1574,7 @@ def _infer_columns(self):
         if self.header is not None:
             header = self.header
 
-            # we have a mi columns, so read and extra line
+            # we have a mi columns, so read an extra line
             if isinstance(header, (list, tuple, np.ndarray)):
                 have_mi_columns = True
                 header = list(header) + [header[-1] + 1]
@@ -1578,9 +1586,8 @@ def _infer_columns(self):
             for level, hr in enumerate(header):
                 line = self._buffered_line()
 
-                while self.pos <= hr:
+                while self.line_pos <= hr:
                     line = self._next_line()
-
                 unnamed_count = 0
                 this_columns = []
                 for i, c in enumerate(line):
@@ -1705,25 +1712,36 @@ def _buffered_line(self):
         else:
             return self._next_line()
 
+    def _empty(self, line):
+        return not line or all(not x for x in line)
+
     def _next_line(self):
         if isinstance(self.data, list):
             while self.pos in self.skiprows:
                 self.pos += 1
 
-            try:
-                line = self.data[self.pos]
-            except IndexError:
-                raise StopIteration
+            while True:
+                try:
+                    line = self._check_comments([self.data[self.pos]])[0]
+                    self.pos += 1
+                    # either uncommented or blank to begin with
+                    if self._empty(self.data[self.pos - 1]) or line:
+                        break
+                except IndexError:
+                    raise StopIteration
         else:
             while self.pos in self.skiprows:
                 next(self.data)
                 self.pos += 1
 
-            line = next(self.data)
-
-            line = self._check_comments([line])[0]
+            while True:
+                orig_line = next(self.data)
+                line = self._check_comments([orig_line])[0]
+                self.pos += 1
+                if self._empty(orig_line) or line:
+                    break
 
-        self.pos += 1
+        self.line_pos += 1
         self.buf.append(line)
 
         return line
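
The change above gives the pure-Python reader two counters: ``self.pos`` advances over every raw line (so ``skiprows`` still counts comment lines), while ``self.line_pos`` advances only for lines that survive comment stripping (so ``header`` indexes data rows). A rough standalone model of that bookkeeping, not the actual ``PythonParser`` code:

    # Rough model only; the real logic works on tokenized fields, not raw strings.
    def iter_data_lines(raw_lines, comment='#', skiprows=()):
        pos = 0        # raw line counter -- what ``skiprows`` indexes
        line_pos = 0   # surviving data-line counter -- what ``header`` indexes
        for raw in raw_lines:
            if pos in skiprows:
                pos += 1
                continue
            pos += 1
            stripped = raw.split(comment, 1)[0]
            # blank lines pass through; fully commented lines disappear
            if raw.strip() == '' or stripped.strip():
                yield line_pos, stripped
                line_pos += 1

    # e.g. with skiprows={0}, the raw lines ['# banner', 'a,b,c', '# note', '1,2,3']
    # yield (0, 'a,b,c') and (1, '1,2,3')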

pandas/io/tests/test_parsers.py

+118

@@ -1584,6 +1584,65 @@ def test_read_table_buglet_4x_multiindex(self):
         df = self.read_table(StringIO(text), sep='\s+')
         self.assertEqual(df.index.names, ('one', 'two', 'three', 'four'))
 
+    def test_line_comment(self):
+        data = """# empty
+A,B,C
+1,2.,4.#hello world
+#ignore this line
+5.,NaN,10.0
+"""
+        expected = [[1., 2., 4.],
+                    [5., np.nan, 10.]]
+        df = self.read_csv(StringIO(data), comment='#')
+        tm.assert_almost_equal(df.values, expected)
+
+    def test_comment_skiprows(self):
+        data = """# empty
+random line
+# second empty line
+1,2,3
+A,B,C
+1,2.,4.
+5.,NaN,10.0
+"""
+        expected = [[1., 2., 4.],
+                    [5., np.nan, 10.]]
+        # this should ignore the first four lines (including comments)
+        df = self.read_csv(StringIO(data), comment='#', skiprows=4)
+        tm.assert_almost_equal(df.values, expected)
+
+    def test_comment_header(self):
+        data = """# empty
+# second empty line
+1,2,3
+A,B,C
+1,2.,4.
+5.,NaN,10.0
+"""
+        expected = [[1., 2., 4.],
+                    [5., np.nan, 10.]]
+        # header should begin at the second non-comment line
+        df = self.read_csv(StringIO(data), comment='#', header=1)
+        tm.assert_almost_equal(df.values, expected)
+
+    def test_comment_skiprows_header(self):
+        data = """# empty
+# second empty line
+# third empty line
+X,Y,Z
+1,2,3
+A,B,C
+1,2.,4.
+5.,NaN,10.0
+"""
+        expected = [[1., 2., 4.],
+                    [5., np.nan, 10.]]
+        # skiprows should skip the first 4 lines (including comments), while
+        # header should start from the second non-commented line starting
+        # with line 5
+        df = self.read_csv(StringIO(data), comment='#', skiprows=4, header=1)
+        tm.assert_almost_equal(df.values, expected)
+
     def test_read_csv_parse_simple_list(self):
         text = """foo
 bar baz
@@ -2874,6 +2933,65 @@ def test_parse_dates_empty_string(self):
     def test_usecols(self):
         raise nose.SkipTest("Usecols is not supported in C High Memory engine.")
 
+    def test_line_comment(self):
+        data = """# empty
+A,B,C
+1,2.,4.#hello world
+#ignore this line
+5.,NaN,10.0
+"""
+        expected = [[1., 2., 4.],
+                    [5., np.nan, 10.]]
+        df = self.read_csv(StringIO(data), comment='#')
+        tm.assert_almost_equal(df.values, expected)
+
+    def test_comment_skiprows(self):
+        data = """# empty
+random line
+# second empty line
+1,2,3
+A,B,C
+1,2.,4.
+5.,NaN,10.0
+"""
+        expected = [[1., 2., 4.],
+                    [5., np.nan, 10.]]
+        # this should ignore the first four lines (including comments)
+        df = self.read_csv(StringIO(data), comment='#', skiprows=4)
+        tm.assert_almost_equal(df.values, expected)
+
+    def test_comment_header(self):
+        data = """# empty
+# second empty line
+1,2,3
+A,B,C
+1,2.,4.
+5.,NaN,10.0
+"""
+        expected = [[1., 2., 4.],
+                    [5., np.nan, 10.]]
+        # header should begin at the second non-comment line
+        df = self.read_csv(StringIO(data), comment='#', header=1)
+        tm.assert_almost_equal(df.values, expected)
+
+    def test_comment_skiprows_header(self):
+        data = """# empty
+# second empty line
+# third empty line
+X,Y,Z
+1,2,3
+A,B,C
+1,2.,4.
+5.,NaN,10.0
+"""
+        expected = [[1., 2., 4.],
+                    [5., np.nan, 10.]]
+        # skiprows should skip the first 4 lines (including comments), while
+        # header should start from the second non-commented line starting
+        # with line 5
+        df = self.read_csv(StringIO(data), comment='#', skiprows=4, header=1)
+        tm.assert_almost_equal(df.values, expected)
+
     def test_passing_dtype(self):
         # GH 6607
         # This is a copy which should eventually be merged into ParserTests
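
To spell out the counting that the comments in ``test_comment_skiprows_header`` rely on, a worked trace (not part of the test suite):

    # Worked trace for test_comment_skiprows_header:
    #   raw line 0: '# empty'               skipped by skiprows=4
    #   raw line 1: '# second empty line'   skipped by skiprows=4
    #   raw line 2: '# third empty line'    skipped by skiprows=4
    #   raw line 3: 'X,Y,Z'                 skipped by skiprows=4
    #   raw line 4: '1,2,3'                 data row 0
    #   raw line 5: 'A,B,C'                 data row 1 -> header=1 uses this as the header
    #   raw line 6: '1,2.,4.'               first parsed row
    #   raw line 7: '5.,NaN,10.0'           second parsed row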

pandas/parser.pyx

+2

@@ -78,8 +78,10 @@ cdef extern from "parser/tokenizer.h":
         ESCAPE_IN_QUOTED_FIELD
         QUOTE_IN_QUOTED_FIELD
         EAT_CRNL
+        EAT_CRNL_NOP
         EAT_WHITESPACE
         EAT_COMMENT
+        EAT_LINE_COMMENT
         FINISHED
 
     enum: ERROR_OVERFLOW

pandas/src/parser/tokenizer.c

+22

@@ -698,6 +698,9 @@ int tokenize_delimited(parser_t *self, size_t line_limit)
             } else if (c == '\r') {
                 self->state = EAT_CRNL;
                 break;
+            } else if (c == self->commentchar) {
+                self->state = EAT_LINE_COMMENT;
+                break;
             }
 
             /* normal character - handle as START_FIELD */
@@ -752,6 +755,16 @@ int tokenize_delimited(parser_t *self, size_t line_limit)
                 self->state = IN_FIELD;
                 break;
 
+            case EAT_LINE_COMMENT:
+                if (c == '\n') {
+                    self->file_lines++;
+                    self->state = START_RECORD;
+                } else if (c == '\r') {
+                    self->file_lines++;
+                    self->state = EAT_CRNL_NOP;
+                }
+                break;
+
             case IN_FIELD:
                 /* in unquoted field */
                 if (c == '\n') {
@@ -883,6 +896,15 @@ int tokenize_delimited(parser_t *self, size_t line_limit)
                 }
                 break;
 
+            case EAT_CRNL_NOP: /* inside an ignored comment line */
+                self->state = START_RECORD;
+                /* \r line terminator -- parse this character again */
+                if (c != '\n' && c != self->delimiter) {
+                    --i;
+                    --buf;
+                }
+                break;
+
             default:
                 break;
 
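
The two states added to the tokenizer split the work: EAT_LINE_COMMENT swallows characters of a line that begins with the comment character until its line terminator, and EAT_CRNL_NOP decides what to do with the character that follows a bare '\r' terminator. A rough Python model of just those transitions (not the C implementation; mid-line comments via EAT_COMMENT and quoting are deliberately left out):

    # Rough model of the new states only; 'NORMAL' stands in for START_RECORD/IN_FIELD.
    def strip_line_comments(text, comment='#'):
        out = []
        state = 'NORMAL'
        at_line_start = True
        for c in text:
            if state == 'EAT_LINE_COMMENT':
                if c == '\n':
                    state = 'NORMAL'
                    at_line_start = True
                elif c == '\r':
                    state = 'EAT_CRNL_NOP'
            elif state == 'EAT_CRNL_NOP':
                # the '\r' ended an ignored line; examine this character again
                if c == '\n':
                    state = 'NORMAL'
                    at_line_start = True
                elif c == comment:
                    state = 'EAT_LINE_COMMENT'
                else:
                    state = 'NORMAL'
                    out.append(c)
                    at_line_start = False
            else:  # NORMAL
                if at_line_start and c == comment:
                    state = 'EAT_LINE_COMMENT'
                else:
                    out.append(c)
                    at_line_start = (c == '\n')
        return ''.join(out)

    # strip_line_comments('#skip\r#also skip\r\na,b\n1,2\n')  ->  'a,b\n1,2\n'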

pandas/src/parser/tokenizer.h

+2

@@ -121,8 +121,10 @@ typedef enum {
     ESCAPE_IN_QUOTED_FIELD,
     QUOTE_IN_QUOTED_FIELD,
     EAT_CRNL,
+    EAT_CRNL_NOP,
     EAT_WHITESPACE,
     EAT_COMMENT,
+    EAT_LINE_COMMENT,
     FINISHED
 } ParserState;
 
