BUG: Respect usecols even with empty data #12506
@@ -179,6 +179,45 @@ New Behavior:

    # Output is a DataFrame
    df.groupby(pd.TimeGrouper(key='date', freq='M')).apply(lambda x: x[['value']].sum())
.. _whatsnew_0181.read_csv_exceptions:

Change in ``read_csv`` exceptions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to standardize the ``read_csv`` API for both the C and Python engines, both will now raise an
``EmptyDataError``, a subclass of ``ValueError``, in response to empty columns or header (:issue:`12506`)

Previous behaviour:

.. code-block:: python

   In [1]: df = pd.read_csv(StringIO(''), engine='c')
   ...
   ValueError: No columns to parse from file

   In [2]: df = pd.read_csv(StringIO(''), engine='python')
   ...
   StopIteration

New behaviour:

.. code-block:: python

   In [1]: df = pd.read_csv(StringIO(''), engine='c')
   ...
   pandas.io.common.EmptyDataError: No columns to parse from file

   In [2]: df = pd.read_csv(StringIO(''), engine='python')
   ...
   pandas.io.common.EmptyDataError: No columns to parse from file

In addition to this error change, several others have been made as well:
- ``CParserError`` is now a ``ValueError`` instead of just an ``Exception`` (:issue:`12551`)
- A ``CParserError`` is now raised instead of a generic ``Exception`` in ``read_csv`` when the C engine cannot parse a column
- A ``ValueError`` is now raised instead of a generic ``Exception`` in ``read_csv`` when the C engine encounters a ``NaN`` value in an integer column
- A ``ValueError`` is now raised instead of a generic ``Exception`` in ``read_csv`` when ``true_values`` is specified, and the C engine encounters an element in a column containing unencodable bytes
- ``pandas.parser.OverflowError`` exception has been removed and has been replaced with Python's built-in ``OverflowError`` exception

Review discussion on the ``CParserError`` entry:

Reviewer: was this whatsnew just not put in before? (the PR was already merged)
Author: Yes, but I moved it into this section because it's related.
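Since ``EmptyDataError`` subclasses ``ValueError``, existing ``except ValueError`` handlers keep working, while new code can catch the narrower type. A minimal sketch of downstream handling (assuming the ``pandas.io.common`` import location shown in the whatsnew above; the snippet itself is not part of this diff):

```python
from io import StringIO

import pandas as pd
from pandas.io.common import EmptyDataError  # subclass of ValueError as of this change

try:
    df = pd.read_csv(StringIO(''))
except EmptyDataError:
    # Empty input is now a distinct, catchable condition.
    df = pd.DataFrame()

print(df.shape)  # (0, 0)
```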
.. _whatsnew_0181.deprecations:
@@ -13,7 +13,7 @@
from pandas.core.frame import DataFrame
from pandas.io.parsers import TextParser
from pandas.io.common import (_is_url, _urlopen, _validate_header_arg,
-                              get_filepath_or_buffer)
+                              EmptyDataError, get_filepath_or_buffer)
from pandas.tseries.period import Period
from pandas import json
from pandas.compat import (map, zip, reduce, range, lrange, u, add_metaclass,
@@ -468,7 +468,7 @@ def _parse_cell(cell_contents, cell_typ):
                if not squeeze or isinstance(output[asheetname], DataFrame):
                    output[asheetname].columns = output[
                        asheetname].columns.set_names(header_names)
-            except StopIteration:
+            except EmptyDataError:
                # No Data, return an empty DataFrame
                output[asheetname] = DataFrame()

Review discussion on this change:

Reviewer: you need to raise this at a lower level and catch the appropriate ValueError/EmptyDataError here
Author: I don't quite understand what you mean by that.
Reviewer: there are 2 cases here, a …
Author: Isn't that what I'm currently doing?
Reviewer: no
Author: And yes, I agree that testing needs to make sure that they are raised, which is why I created those tests in my PR.
Reviewer: prove it by having tests that raise these separately (this may not exist or partially), that is the point, want to prove that we are only making very deliberate API changes. IOW, pretend you are a user reading the release notes, I want to know what changes I can expect. So the whatsnew should show exactly what is changing.
Author: Did you read my comment above? I said I've already done just that. I wrote the tests initially to surface these specific errors when I was making the changes.
Reviewer: they don't differentiate
Author: It doesn't differentiate because the C engine throws a generic …
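The thread above is about keeping the two failure modes separable: the specific ``EmptyDataError`` should originate in the low-level parser, so callers like this one can catch it without also swallowing other ValueErrors. A small self-contained sketch of that pattern (the helper name and data are hypothetical, not pandas internals):

```python
from pandas import DataFrame
from pandas.io.common import EmptyDataError  # ValueError subclass, location as of this PR

def rows_to_frame(rows):
    # Hypothetical low-level helper: the *specific* error is raised down here.
    if not rows:
        raise EmptyDataError("No columns to parse from file")
    if any(len(r) != len(rows[0]) for r in rows[1:]):
        raise ValueError("rows have inconsistent lengths")
    return DataFrame(rows[1:], columns=rows[0])

# The caller can now tell "no data" apart from "malformed data".
try:
    out = rows_to_frame([])
except EmptyDataError:
    out = DataFrame()  # no data: fall back to an empty frame

print(out.empty)  # True
```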
@@ -12,7 +12,8 @@

import numpy as np

-from pandas.io.common import _is_url, urlopen, parse_url, _validate_header_arg
+from pandas.io.common import (EmptyDataError, _is_url, urlopen,
+                              parse_url, _validate_header_arg)
from pandas.io.parsers import TextParser
from pandas.compat import (lrange, lmap, u, string_types, iteritems,
                           raise_with_traceback, binary_type)
@@ -742,7 +743,7 @@ def _parse(flavor, io, match, header, index_col, skiprows,
                                     parse_dates=parse_dates,
                                     tupleize_cols=tupleize_cols,
                                     thousands=thousands))
-        except StopIteration:  # empty table
+        except EmptyDataError:  # empty table
            continue

    return ret

Reviewer (on the ``except`` block): this is not a good idea at all (same as above)
Author: Telling me it's not a very good idea is not very helpful feedback given that that was what was written before in the first place. First, why is it not a good idea? And second, what might you suggest as alternatives to doing a …
Reviewer (on the ``continue``): same here
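A generic sketch of the skip-empty-tables pattern used here (hypothetical helper and data, not the actual ``read_html`` internals): tables that turn out to be empty raise ``EmptyDataError`` and are skipped, rather than leaking a bare ``StopIteration`` out of the parser.

```python
from pandas import DataFrame
from pandas.io.common import EmptyDataError

def table_to_frame(rows):
    # Hypothetical stand-in for the TextParser-based conversion step.
    if not rows:
        raise EmptyDataError("No columns to parse from file")
    return DataFrame(rows[1:], columns=rows[0])

tables = [[['a', 'b'], [1, 2]], [], [['x'], [9]]]

ret = []
for rows in tables:
    try:
        ret.append(table_to_frame(rows))
    except EmptyDataError:  # empty table
        continue

print(len(ret))  # 2 -- the empty table was skipped, not fatal
```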
@@ -20,7 +20,8 @@
from pandas.io.date_converters import generic_parser
from pandas.io.common import (get_filepath_or_buffer, _validate_header_arg,
                              _get_handle, UnicodeReader, UTF8Recoder,
-                              BaseIterator)
+                              BaseIterator, CParserError, EmptyDataError,
+                              ParserWarning)
from pandas.tseries import tools

from pandas.util.decorators import Appender
@@ -36,10 +37,6 @@
    'N/A', 'NA', '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan', ''
])


-class ParserWarning(Warning):
-    pass


_parser_params = """Also supports optionally iterating or breaking of the file
into chunks.
@@ -936,7 +933,7 @@ def tostr(x):
            # long
            for n in range(len(columns[0])):
                if all(['Unnamed' in tostr(c[n]) for c in columns]):
-                    raise _parser.CParserError(
+                    raise CParserError(
                        "Passed header=[%s] are too many rows for this "
                        "multi_index of columns"
                        % ','.join([str(x) for x in self.header])
@@ -1255,10 +1252,19 @@ def read(self, nrows=None):
        except StopIteration:
            if self._first_chunk:
                self._first_chunk = False
-                return _get_empty_meta(self.orig_names,
-                                       self.index_col,
-                                       self.index_names,
-                                       dtype=self.kwds.get('dtype'))
+
+                index, columns, col_dict = _get_empty_meta(
+                    self.orig_names, self.index_col,
+                    self.index_names, dtype=self.kwds.get('dtype'))
+
+                if self.usecols is not None:
+                    columns = self._filter_usecols(columns)
+
+                col_dict = dict(filter(lambda item: item[0] in columns,
+                                       col_dict.items()))
+
+                return index, columns, col_dict

            else:
                raise
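This branch is the core of the fix for the C engine: when the very first chunk is already exhausted, the empty metadata is now filtered by ``usecols`` before being returned. A simplified, standalone sketch of what that filtering step amounts to for named ``usecols`` (``_get_empty_meta`` and ``_filter_usecols`` are pandas internals and only approximated here):

```python
# Declared column names and a usecols selection, but no data at all.
names = ['Dummy', 'X', 'Dummy_2']
usecols = {'X'}

# Roughly what _get_empty_meta produces: every declared column, with no rows.
columns = list(names)
col_dict = {name: [] for name in names}

# Roughly what the new branch adds: keep only the selected columns.
columns = [c for c in columns if c in usecols]
col_dict = {k: v for k, v in col_dict.items() if k in columns}

print(columns)   # ['X']
print(col_dict)  # {'X': []}
```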
@@ -1750,10 +1756,26 @@ def _infer_columns(self):

            columns = []
            for level, hr in enumerate(header):
-                line = self._buffered_line()
-
-                while self.line_pos <= hr:
-                    line = self._next_line()
+                try:
+                    line = self._buffered_line()
+
+                    while self.line_pos <= hr:
+                        line = self._next_line()
+
+                except StopIteration:
+                    if self.line_pos < hr:
+                        raise ValueError(
+                            'Passed header=%s but only %d lines in file'
+                            % (hr, self.line_pos + 1))
+
+                    # We have an empty file, so check
+                    # if columns are provided. That will
+                    # serve as the 'line' for parsing
+                    if not self.names:
+                        raise EmptyDataError(
+                            "No columns to parse from file")
+
+                    line = self.names[:]

                unnamed_count = 0
                this_columns = []

Reviewer (on the comment block above): a caller can now differentiate between these errors.
Author: Okay, right... so what is the purpose of this comment?
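As the exchange above notes, a caller can now tell the two failure modes apart: a header row requested beyond the end of the file is a plain ``ValueError``, while input with no columns at all raises the more specific ``EmptyDataError``. A small sketch from the caller's side (behaviour as described by this PR; messages paraphrased in the comments):

```python
from io import StringIO

import pandas as pd
from pandas.io.common import EmptyDataError

# Header row requested past the end of the one-line input: plain ValueError.
try:
    pd.read_csv(StringIO(',,'), header=[10], engine='python')
except ValueError as err:
    print(type(err).__name__, '-', err)  # header index is past the end of the file

# No columns at all: the more specific EmptyDataError (itself a ValueError).
try:
    pd.read_csv(StringIO(''), engine='python')
except EmptyDataError as err:
    print(type(err).__name__, '-', err)  # "No columns to parse from file"
```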
|
@@ -1818,10 +1840,19 @@ def _infer_columns(self):
                else:
                    columns = self._handle_usecols(columns, columns[0])
        else:
-            # header is None
-            line = self._buffered_line()
+            try:
+                line = self._buffered_line()
+
+            except StopIteration:
+                if not names:
+                    raise EmptyDataError(
+                        "No columns to parse from file")
+
+                line = names[:]

            ncols = len(line)
            num_original_columns = ncols

            if not names:
                if self.prefix:
                    columns = [['%s%d' % (self.prefix, i)
@@ -16,7 +16,6 @@
import nose
import numpy as np
import pandas.lib as lib
-import pandas.parser
from numpy import nan
from numpy.testing.decorators import slow
from pandas.lib import Timestamp
@@ -32,7 +31,8 @@
)
from pandas.compat import parse_date
from pandas.core.common import AbstractMethodError
-from pandas.io.common import DtypeWarning, URLError
+from pandas.io.common import (CParserError, DtypeWarning,
+                              EmptyDataError, URLError)
from pandas.io.parsers import (read_csv, read_table, read_fwf,
                               TextFileReader, TextParser)
from pandas.tseries.index import date_range
@@ -1209,7 +1209,7 @@ def test_read_table_wrong_num_columns(self):
6,7,8,9,10,11,12
11,12,13,14,15,16
"""
-        self.assertRaises(Exception, self.read_csv, StringIO(data))
+        self.assertRaises(ValueError, self.read_csv, StringIO(data))

    def test_read_table_duplicate_index(self):
        data = """index,A,B,C,D
@@ -1740,7 +1740,7 @@ def test_read_table_buglet_4x_multiindex(self):
        # Temporarily copied to TestPythonParser.
        # Here test that CParserError is raised:

-        with tm.assertRaises(pandas.parser.CParserError):
+        with tm.assertRaises(CParserError):
            text = """ A B C D E
one two three four
a b 10.0032 5 -0.5109 -2.3358 -0.4645 0.05076 0.3640
@@ -1840,7 +1840,7 @@ def test_parse_dates_custom_euroformat(self):
        tm.assert_frame_equal(df, expected)

        parser = lambda d: parse_date(d, day_first=True)
-        self.assertRaises(Exception, self.read_csv,
+        self.assertRaises(TypeError, self.read_csv,
                          StringIO(text), skiprows=[0],
                          names=['time', 'Q', 'NTU'], index_col=0,
                          parse_dates=True, date_parser=parser,
@@ -2014,7 +2014,7 @@ def test_bool_na_values(self):
    def test_nonexistent_path(self):
        # don't segfault pls #2428
        path = '%s.csv' % tm.rands(10)
-        self.assertRaises(Exception, self.read_csv, path)
+        self.assertRaises(IOError, self.read_csv, path)

    def test_missing_trailing_delimiters(self):
        data = """A,B,C,D
@@ -2358,7 +2358,7 @@ def test_catch_too_many_names(self):
4,,6
7,8,9
10,11,12\n"""
-        tm.assertRaises(Exception, read_csv, StringIO(data),
+        tm.assertRaises(ValueError, read_csv, StringIO(data),
                        header=0, names=['a', 'b', 'c', 'd'])

    def test_ignore_leading_whitespace(self):
@@ -2525,9 +2525,8 @@ def test_int64_overflow(self):
        result = self.read_csv(StringIO(data))
        self.assertTrue(result['ID'].dtype == object)

-        self.assertRaises((OverflowError, pandas.parser.OverflowError),
-                          self.read_csv, StringIO(data),
-                          converters={'ID': np.int64})
+        self.assertRaises(OverflowError, self.read_csv,
+                          StringIO(data), converters={'ID': np.int64})

        # Just inside int64 range: parse as integer
        i_max = np.iinfo(np.int64).max

Reviewer: i see (so don't worry about the example above)
@@ -2774,7 +2773,7 @@ def test_mixed_dtype_usecols(self):
        usecols = [0, 'b', 2]

        with tm.assertRaisesRegexp(ValueError, msg):
-            df = self.read_csv(StringIO(data), usecols=usecols)
+            self.read_csv(StringIO(data), usecols=usecols)

    def test_usecols_with_integer_like_header(self):
        data = """2,0,1
@@ -2796,6 +2795,37 @@ def test_usecols_with_integer_like_header(self):
        df = self.read_csv(StringIO(data), usecols=usecols)
        tm.assert_frame_equal(df, expected)

+    def test_read_empty_with_usecols(self):
+        # See gh-12493
+        names = ['Dummy', 'X', 'Dummy_2']
+        usecols = names[1:2]  # ['X']
+
+        # first, check to see that the response of
+        # parser when faced with no provided columns
+        # throws the correct error, with or without usecols
+        errmsg = "No columns to parse from file"
+
+        with tm.assertRaisesRegexp(EmptyDataError, errmsg):
+            self.read_csv(StringIO(''))
+
+        with tm.assertRaisesRegexp(EmptyDataError, errmsg):
+            self.read_csv(StringIO(''), usecols=usecols)
+
+        expected = DataFrame(columns=usecols, index=[0], dtype=np.float64)
+        df = self.read_csv(StringIO(',,'), names=names, usecols=usecols)
+        tm.assert_frame_equal(df, expected)
+
+        expected = DataFrame(columns=usecols)
+        df = self.read_csv(StringIO(''), names=names, usecols=usecols)
+        tm.assert_frame_equal(df, expected)
+
+    def test_read_with_bad_header(self):
+        errmsg = "but only \d+ lines in file"
+
+        with tm.assertRaisesRegexp(ValueError, errmsg):
+            s = StringIO(',,')
+            self.read_csv(s, header=[10])

class CompressionTests(object):
    def test_zip(self):

Reviewer: I suspect there are tests that are testing for Exception but now need to look for the specific exception
Author: You are indeed right. I changed some of those to …
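For reference, the behaviour that ``test_read_empty_with_usecols`` pins down, seen from the user's side (a short sketch; it assumes a pandas build that contains this fix):

```python
from io import StringIO

import pandas as pd

names = ['Dummy', 'X', 'Dummy_2']

# No data at all: usecols is still respected, so only 'X' survives.
empty = pd.read_csv(StringIO(''), names=names, usecols=['X'])
print(empty.columns.tolist())  # ['X']

# A single all-empty row: one row of NaN, again only in the selected column.
one_row = pd.read_csv(StringIO(',,'), names=names, usecols=['X'])
print(one_row)
```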
@@ -4399,7 +4429,7 @@ def test_raise_on_passed_int_dtype_with_nas(self):
2001,106380451,10
2001,,11
2001,106380451,67"""
-        self.assertRaises(Exception, read_csv, StringIO(data), sep=",",
+        self.assertRaises(ValueError, read_csv, StringIO(data), sep=",",
                          skipinitialspace=True,
                          dtype={'DOY': np.int64})
Review discussion on the fully qualified exception name shown in the whatsnew:

Reviewer: Is it actually shown with the full name? (just a question, didn't test it, didn't fetch the PR, but I would just show the same as in an actual console)
Author: Which is what I did. 😄 FYI, you can observe this full out name thing if you trigger any current ``CParserError``.
Reviewer: OK, perfect! (I just wondered if it was the case)
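The full module path in that output is ordinary Python traceback behaviour: exception classes defined outside ``builtins`` and ``__main__`` are printed with their module prefix. A small illustration using a stand-in class (not pandas itself):

```python
import traceback

class EmptyDataError(ValueError):
    pass

# Pretend the class lives where pandas defines it, purely for display purposes.
EmptyDataError.__module__ = "pandas.io.common"

try:
    raise EmptyDataError("No columns to parse from file")
except EmptyDataError:
    # The last line of the formatted traceback carries the qualified name.
    print(traceback.format_exc().splitlines()[-1])
    # pandas.io.common.EmptyDataError: No columns to parse from file
```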