Skip to content

read_excel with dtype=str converts empty cells to np.nan #20429

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed
Show file tree
Hide file tree
Changes from 34 commits
Commits
Show all changes
39 commits
Select commit Hold shift + click to select a range
dd53df8
TST: Test for astype_nansafe. Modified test for astype
nikoskaragiannakis Mar 20, 2018
6f771fb
BUG: np.nan should stay as it is when we cast to str/basestring
nikoskaragiannakis Mar 20, 2018
37f00ad
BUG: revert change in lib.pyx. modify excel functionality directly
nikoskaragiannakis Mar 20, 2018
f194b70
TST: revert changes in dtypes/test_cast. test excel functionality
nikoskaragiannakis Mar 20, 2018
eb8f4c5
DOC: added description
nikoskaragiannakis Mar 20, 2018
ac6a409
TST: correction and pep8
nikoskaragiannakis Mar 20, 2018
6994bb0
BUG: pep8
nikoskaragiannakis Mar 20, 2018
40a563f
TST: remove unused import
nikoskaragiannakis Mar 20, 2018
9858259
DOC: resolved conflict
nikoskaragiannakis Mar 20, 2018
5f71a99
Update v0.23.0.txt
nikoskaragiannakis Mar 20, 2018
0a93b60
conflict again
nikoskaragiannakis Mar 20, 2018
f296f9a
arghh
nikoskaragiannakis Mar 20, 2018
7c0af1f
DOC: add disallowing of Series construction of len-1 list with index …
jorisvandenbossche Mar 19, 2018
f0fd0a7
Bug: Allow np.timedelta64 objects to index TimedeltaIndex (#20408)
mroeschke Mar 19, 2018
61e0519
DOC: Only use ~ in class links to hide prefixes. (#20402)
dukebody Mar 19, 2018
9fdac27
DOC: update the pandas.DataFrame.plot.hist docstring (#20155)
liopic Mar 19, 2018
ddb904f
DOC" update the Pandas core window rolling count docstring" (#20264)
tommy-stone Mar 19, 2018
694849d
BUG: astype_unicode astype_str turn a np.nan to empty string (#20377)
nikoskaragiannakis Mar 24, 2018
5ba95a1
TST: added unitest for read_excel and modified series/test_dtypes for…
nikoskaragiannakis Mar 24, 2018
d3ceec3
TST: added unitest for read_csv (#20377)
nikoskaragiannakis Mar 25, 2018
ea1d73a
BUG: patched TextReader to turn np.nan to empty string if dtype=str (…
nikoskaragiannakis Mar 25, 2018
c1376a5
DOC: updated IO section (#20377)
nikoskaragiannakis Mar 25, 2018
3103811
DOC: updated IO section (#20377)
nikoskaragiannakis Mar 25, 2018
7d5f6b2
pull from master
nikoskaragiannakis Mar 25, 2018
478d08d
DOC: updated IO section (#20377)
nikoskaragiannakis Apr 2, 2018
edb26d7
BUG: np.nan stays as np.nan (#20377)
nikoskaragiannakis Apr 2, 2018
c3ab9cb
TXT: Moved test from series.test_io to io.parser.na_values. Corrected…
nikoskaragiannakis Apr 2, 2018
69f6c95
DOC: updated IO section (#20377)
nikoskaragiannakis Apr 2, 2018
97a345a
TST: pep8 (#20377)
nikoskaragiannakis Apr 2, 2018
8b2fb0b
TXT: Moved test from series.test_io to io.parser.na_values. Corrected…
nikoskaragiannakis Apr 2, 2018
c9f5120
DOC: updated IO section (#20377)
nikoskaragiannakis Apr 2, 2018
fab0b27
resolve conflict
nikoskaragiannakis Apr 2, 2018
571d5c4
pep8 correction
nikoskaragiannakis Apr 2, 2018
0712392
Merge remote-tracking branch 'upstream/master' into nikoskaragiannaki…
TomAugspurger Apr 3, 2018
47bc105
DOC: Better explanation (#20377)
nikoskaragiannakis Apr 5, 2018
3740dfe
BUG: use checknull (#20377)
nikoskaragiannakis Apr 5, 2018
7d453bb
TST: update tests (#20377)
nikoskaragiannakis Apr 8, 2018
bcd739d
BUG: string nans to np.nan in Series for list data (#20377)
nikoskaragiannakis Apr 8, 2018
7341cd1
sync
nikoskaragiannakis Apr 8, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -1098,6 +1098,7 @@ I/O
- Bug in :func:`read_pickle` when unpickling objects with :class:`TimedeltaIndex` or :class:`Float64Index` created with pandas prior to version 0.20 (:issue:`19939`)
- Bug in :meth:`pandas.io.json.json_normalize` where subrecords are not properly normalized if any subrecords values are NoneType (:issue:`20030`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are including some other changes here, pls rebase on master.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not mine. I deleted it by mistake and added it back.
You can check master here https://github.com/pandas-dev/pandas/blob/master/doc/source/whatsnew/v0.23.0.txt#L985
However, even after rebasing, I keep getting this conflict

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you rebased off master and resolved the conflicts in the rebase then it should be ok. Did you fetch the current master before rebasing?

Copy link
Contributor Author

@nikoskaragiannakis nikoskaragiannakis Mar 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i did now

- Bug in ``usecols`` parameter in :func:`pandas.io.read_csv` and :func:`pandas.io.read_table` where error is not raised correctly when passing a string. (:issue:`20529`)
- Bug in :func:`read_excel` and :func:`read_csv` where missing values turned to ``'nan'`` with ``dtype=str`` and ``na_filter=True``. Now, they turn to ``np.nan``. (:issue `20377`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make the last part a bit more clear. These missing values are converted to the string missing indicator, np.nan

- Bug in :func:`HDFStore.keys` when reading a file with a softlink causes exception (:issue:`20523`)

Plotting
Expand Down
12 changes: 10 additions & 2 deletions pandas/_libs/lib.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -465,7 +465,11 @@ cpdef ndarray[object] astype_unicode(ndarray arr):
for i in range(n):
# we can use the unsafe version because we know `result` is mutable
# since it was created from `np.empty`
util.set_value_at_unsafe(result, i, unicode(arr[i]))
arr_i = arr[i]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is arr_i in the cdef?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

d'oh!

util.set_value_at_unsafe(
result,
i,
unicode(arr_i) if arr_i is not np.nan else np.nan)
Copy link
Member

@gfyoung gfyoung Apr 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting spacing...maybe we should do this instead:

uni_arr_i = unicode(arr_i) if arr_i is not np.nan else np.nan
util.set_value_at_unsafe(result, i, uni_arr_i)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use np.isnan here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using np.nan here raises:

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that is not friendly to strings - ok
use checknull (should already be imported
from util )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When i use that, all hell breaks loose. I get errors in tests like this one https://github.com/pandas-dev/pandas/blob/master/pandas/tests/frame/test_dtypes.py#L533

Is it because they use np.NaN? It looks like checknull checks both np.NaN and np.nan, while before the change I used to check only np.nan.
If that's the case, then I have to modify more tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfyoung are you sure this indentation is a big problem? Because if I do what you suggest, then how should I declare uni_arr_i (and str_arr_i) in the cdef?
Would it be ok if I changed it to sth like

util.set_value_at_unsafe(
    ...
)

(moved the close bracket in the next line)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would work as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the nans are the same; iow they point to the same object
go ahead and change tests if need be and i will have a look


return result

Expand All @@ -478,7 +482,11 @@ cpdef ndarray[object] astype_str(ndarray arr):
for i in range(n):
# we can use the unsafe version because we know `result` is mutable
# since it was created from `np.empty`
util.set_value_at_unsafe(result, i, str(arr[i]))
arr_i = arr[i]
util.set_value_at_unsafe(
result,
i,
str(arr_i) if arr_i is not np.nan else np.nan)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same


return result

Expand Down
2 changes: 2 additions & 0 deletions pandas/_libs/parsers.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -1218,6 +1218,7 @@ cdef class TextReader:
# treat as a regular string parsing
return self._string_convert(i, start, end, na_filter,
na_hashset)

elif dtype.kind == 'U':
width = dtype.itemsize
if width > 0:
Expand All @@ -1227,6 +1228,7 @@ cdef class TextReader:
# unicode variable width
return self._string_convert(i, start, end, na_filter,
na_hashset)

elif is_categorical_dtype(dtype):
# TODO: I suspect that _categorical_convert could be
# optimized when dtype is an instance of CategoricalDtype
Expand Down
24 changes: 24 additions & 0 deletions pandas/tests/io/parser/na_values.py
Original file line number Diff line number Diff line change
Expand Up @@ -369,3 +369,27 @@ def test_no_na_filter_on_index(self):
expected = DataFrame({"a": [1, 4], "c": [3, 6]},
index=Index([np.nan, 5.0], name="b"))
tm.assert_frame_equal(out, expected)

def test_na_values_with_dtype_str_and_na_filter_true(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize this on na_filter (you will need to provide the nan_value as well in the parameterize as they are different)

# see gh-20377
data = "a,b,c\n1,,3\n4,5,6"

out = self.read_csv(StringIO(data), na_filter=True, dtype=str)

# missing data turn to np.nan, which stays as it is after dtype=str
expected = DataFrame({"a": ["1", "4"],
"b": [np.nan, "5"],
"c": ["3", "6"]})
tm.assert_frame_equal(out, expected)

def test_na_values_with_dtype_str_and_na_filter_false(self):
# see gh-20377
data = "a,b,c\n1,,3\n4,5,6"

out = self.read_csv(StringIO(data), na_filter=False, dtype=str)

# missing data turn to empty string
expected = DataFrame({"a": ["1", "4"],
"b": ["", "5"],
"c": ["3", "6"]})
tm.assert_frame_equal(out, expected)
27 changes: 27 additions & 0 deletions pandas/tests/io/test_excel.py
Original file line number Diff line number Diff line change
Expand Up @@ -360,6 +360,33 @@ def test_reader_dtype(self, ext):
with pytest.raises(ValueError):
actual = self.get_exceldf(basename, ext, dtype={'d': 'int64'})

def test_reader_dtype_str(self, ext):
# GH 20377
basename = 'testdtype'
actual = self.get_exceldf(basename, ext)

expected = DataFrame({
'a': [1, 2, 3, 4],
'b': [2.5, 3.5, 4.5, 5.5],
'c': [1, 2, 3, 4],
'd': [1.0, 2.0, np.nan, 4.0]}).reindex(
columns=['a', 'b', 'c', 'd'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just specify columns=list('abcd') rather than reindex


tm.assert_frame_equal(actual, expected)

actual = self.get_exceldf(basename, ext,
dtype={'a': 'float64',
'b': 'float32',
'c': str,
'd': str})

expected['a'] = expected['a'].astype('float64')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this higher (by the expected), you can simply construct things directly by using e.g. Series(...., dtype='float32') rather than a list

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this higher (by the expected)

I'm not sure what you mean here.

you can simply construct things directly by using e.g. Series(...., dtype='float32') rather than a list

First of all, this is copy-paste from the previous test, which was added for #8212

Do you mean to do

expected = DataFrame({'a': Series([1,2,3,4], dtype='float64'),
                      'b': Series([2.5,3.5,4.5,5.5], dtype='float32'),
                      ...})

?
If i use, for example, for the 'c' column: Series([1, 2, 3, 4], dtype=str), then it will give me ['1', '2', '3', '4'] instead of the expected ['001', '002', '003', '004'].

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes exactly. If you need things like '001', then just do it that way, e.g. Series(['001', '002'....])

expected['b'] = expected['b'].astype('float32')
expected['c'] = ['001', '002', '003', '004']
expected['d'] = ['1', '2', np.nan, '4']

tm.assert_frame_equal(actual, expected)

def test_reading_all_sheets(self, ext):
# Test reading all sheetnames by setting sheetname to None,
# Ensure a dict is returned.
Expand Down
1 change: 1 addition & 0 deletions pandas/tests/series/test_dtypes.py
Original file line number Diff line number Diff line change
Expand Up @@ -149,6 +149,7 @@ def test_astype_str_map(self, dtype, series):
# see gh-4405
result = series.astype(dtype)
expected = series.map(compat.text_type)
expected.replace('nan', np.nan, inplace=True) # see gh-20377
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use inplace

tm.assert_series_equal(result, expected)

@pytest.mark.parametrize("dtype", [str, compat.text_type])
Expand Down