Commit 447df80

gfyoung authored and jorisvandenbossche committed
BUG, DOC: Fix inconsistencies with scalar na_values in read_csv (pandas-dev#14056)
Update documentation to state that scalars are accepted for na_values. In addition, accept scalars for the values when a dictionary is passed in for na_values. Closes gh-12224.
1 parent ae4ffac commit 447df80

5 files changed: +25 -6 lines changed

doc/source/io.rst

+1 -1
@@ -208,7 +208,7 @@ memory_map : boolean, default False
 NA and Missing Data Handling
 ++++++++++++++++++++++++++++

-na_values : str, list-like or dict, default ``None``
+na_values : scalar, str, list-like, or dict, default ``None``
   Additional strings to recognize as NA/NaN. If dict passed, specific per-column
   NA values. By default the following values are interpreted as NaN:
   ``'-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'NA',

doc/source/whatsnew/v0.19.0.txt

+1
@@ -957,6 +957,7 @@ Bug Fixes
 - Bug in ``groupby().shift()``, which could cause a segfault or corruption in rare circumstances when grouping by columns with missing values (:issue:`13813`)
 - Bug in ``groupby().cumsum()`` calculating ``cumprod`` when ``axis=1``. (:issue:`13994`)
 - Bug in ``pd.read_csv()``, which may cause a segfault or corruption when iterating in large chunks over a stream/file under rare circumstances (:issue:`13703`)
+- Bug in ``pd.read_csv()``, which caused errors to be raised when a dictionary containing scalars is passed in for ``na_values`` (:issue:`12224`)
 - Bug in ``pd.read_csv()``, which caused BOM files to be incorrectly parsed by not ignoring the BOM (:issue:`4793`)
 - Bug in ``pd.read_csv()`` with ``engine='python'`` which raised errors when a numpy array was passed in for ``usecols`` (:issue:`12546`)
 - Bug in ``pd.to_timedelta()`` in which the ``errors`` parameter was not being respected (:issue:`13613`)

pandas/io/excel.py

+1 -1
@@ -94,7 +94,7 @@
       column ranges (e.g. "A:E" or "A,C,E:F")
 squeeze : boolean, default False
     If the parsed data only contains one column then return a Series
-na_values : str or list-like or dict, default None
+na_values : scalar, str, list-like, or dict, default None
     Additional strings to recognize as NA/NaN. If dict passed, specific
     per-column NA values. By default the following values are interpreted
     as NaN: '""" + "', '".join(sorted(_NA_VALUES)) + """'.

pandas/io/parsers.py

+6 -4
@@ -129,7 +129,7 @@
     DEPRECATED: use the `skipfooter` parameter instead, as they are identical
 nrows : int, default None
     Number of rows of file to read. Useful for reading pieces of large files
-na_values : str or list-like or dict, default None
+na_values : scalar, str, list-like, or dict, default None
     Additional strings to recognize as NA/NaN. If dict passed, specific
     per-column NA values. By default the following values are interpreted as
     NaN: `'""" + "'`, `'".join(sorted(_NA_VALUES)) + """'`.
@@ -1604,8 +1604,8 @@ def TextParser(*args, **kwds):
     has_index_names: boolean, default False
         True if the cols defined in index_col have an index name and are
         not in the header
-    na_values : iterable, default None
-        Custom NA values
+    na_values : scalar, str, list-like, or dict, default None
+        Additional strings to recognize as NA/NaN.
     keep_default_na : bool, default True
     thousands : str, default None
         Thousands separator
@@ -2687,7 +2687,9 @@ def _clean_na_values(na_values, keep_default_na=True):
     elif isinstance(na_values, dict):
         if keep_default_na:
             for k, v in compat.iteritems(na_values):
-                v = set(list(v)) | _NA_VALUES
+                if not is_list_like(v):
+                    v = [v]
+                v = set(v) | _NA_VALUES
                 na_values[k] = v
         na_fvalues = dict([
             (k, _floatify_na_values(v)) for k, v in na_values.items() # noqa
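
The `_clean_na_values` hunk wraps a dictionary value that is not list-like in a one-element list before building the set. A minimal standalone sketch of that idea follows. The names `normalize_na_dict` and `DEFAULT_NA`, and the simplified `is_list_like`, are hypothetical stand-ins written for illustration; they are not the pandas internals, and `DEFAULT_NA` is only a small subset of the real default markers.

```python
# Illustrative subset only; pandas' real default NA set is larger.
DEFAULT_NA = {"", "NA", "N/A", "NaN", "NULL"}


def is_list_like(obj):
    """Rough stand-in: iterable, but not a string or bytes object."""
    return hasattr(obj, "__iter__") and not isinstance(obj, (str, bytes))


def normalize_na_dict(na_values, keep_default_na=True):
    """Return a new dict mapping each column to a set of NA markers.

    Scalar values (e.g. 2 or "missing") are wrapped in a list first,
    mirroring the `if not is_list_like(v): v = [v]` step in the diff.
    """
    cleaned = {}
    for column, value in na_values.items():
        if not is_list_like(value):
            value = [value]
        markers = set(value)
        if keep_default_na:
            markers |= DEFAULT_NA
        cleaned[column] = markers
    return cleaned


if __name__ == "__main__":
    print(normalize_na_dict({"a": 2, "b": "missing"}, keep_default_na=False))
```

Without the wrapping step, `set(list(2))` raises `TypeError` because an integer is not iterable, and `set(list("NA"))` would silently split the string into single characters.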

pandas/io/tests/parser/na_values.py

+16
@@ -250,3 +250,19 @@ def test_na_trailing_columns(self):
         result = self.read_csv(StringIO(data))
         self.assertEqual(result['Date'][1], '2012-05-12')
         self.assertTrue(result['UnitPrice'].isnull().all())
+
+    def test_na_values_scalar(self):
+        # see gh-12224
+        names = ['a', 'b']
+        data = '1,2\n2,1'
+
+        expected = DataFrame([[np.nan, 2.0], [2.0, np.nan]],
+                             columns=names)
+        out = self.read_csv(StringIO(data), names=names, na_values=1)
+        tm.assert_frame_equal(out, expected)
+
+        expected = DataFrame([[1.0, 2.0], [np.nan, np.nan]],
+                             columns=names)
+        out = self.read_csv(StringIO(data), names=names,
+                            na_values={'a': 2, 'b': 1})
+        tm.assert_frame_equal(out, expected)
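
The behaviour exercised by `test_na_values_scalar` can be approximated outside the pandas test suite with the public API. The sketch below assumes pandas is installed and calls `pd.read_csv` directly instead of the suite's `self.read_csv` helper; the variable names are illustrative.

```python
from io import StringIO

import pandas as pd

data = "1,2\n2,1"
names = ["a", "b"]

# A bare scalar: every cell equal to 1 is read as NaN.
scalar_result = pd.read_csv(StringIO(data), names=names, na_values=1)

# A dict of scalars: per-column NA markers (2 in column "a", 1 in column "b").
dict_result = pd.read_csv(
    StringIO(data), names=names, na_values={"a": 2, "b": 1}
)

print(scalar_result)
print(dict_result)
```

With the scalar, the NaN cells fall on the diagonal where the value 1 appeared. With the dictionary, the first row is left intact and both cells of the second row become NaN.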
