read_excel with dtype=str converts empty cells to np.nan #20429

Closed
Changes from 12 commits
39 commits
dd53df8
TST: Test for astype_nansafe. Modified test for astype
nikoskaragiannakis Mar 20, 2018
6f771fb
BUG: np.nan should stay as it is when we cast to str/basestring
nikoskaragiannakis Mar 20, 2018
37f00ad
BUG: revert change in lib.pyx. modify excel functionality directly
nikoskaragiannakis Mar 20, 2018
f194b70
TST: revert changes in dtypes/test_cast. test excel functionality
nikoskaragiannakis Mar 20, 2018
eb8f4c5
DOC: added description
nikoskaragiannakis Mar 20, 2018
ac6a409
TST: correction and pep8
nikoskaragiannakis Mar 20, 2018
6994bb0
BUG: pep8
nikoskaragiannakis Mar 20, 2018
40a563f
TST: remove unused import
nikoskaragiannakis Mar 20, 2018
9858259
DOC: resolved conflict
nikoskaragiannakis Mar 20, 2018
5f71a99
Update v0.23.0.txt
nikoskaragiannakis Mar 20, 2018
0a93b60
conflict again
nikoskaragiannakis Mar 20, 2018
f296f9a
arghh
nikoskaragiannakis Mar 20, 2018
7c0af1f
DOC: add disallowing of Series construction of len-1 list with index …
jorisvandenbossche Mar 19, 2018
f0fd0a7
Bug: Allow np.timedelta64 objects to index TimedeltaIndex (#20408)
mroeschke Mar 19, 2018
61e0519
DOC: Only use ~ in class links to hide prefixes. (#20402)
dukebody Mar 19, 2018
9fdac27
DOC: update the pandas.DataFrame.plot.hist docstring (#20155)
liopic Mar 19, 2018
ddb904f
DOC" update the Pandas core window rolling count docstring" (#20264)
tommy-stone Mar 19, 2018
694849d
BUG: astype_unicode astype_str turn a np.nan to empty string (#20377)
nikoskaragiannakis Mar 24, 2018
5ba95a1
TST: added unitest for read_excel and modified series/test_dtypes for…
nikoskaragiannakis Mar 24, 2018
d3ceec3
TST: added unitest for read_csv (#20377)
nikoskaragiannakis Mar 25, 2018
ea1d73a
BUG: patched TextReader to turn np.nan to empty string if dtype=str (…
nikoskaragiannakis Mar 25, 2018
c1376a5
DOC: updated IO section (#20377)
nikoskaragiannakis Mar 25, 2018
3103811
DOC: updated IO section (#20377)
nikoskaragiannakis Mar 25, 2018
7d5f6b2
pull from master
nikoskaragiannakis Mar 25, 2018
478d08d
DOC: updated IO section (#20377)
nikoskaragiannakis Apr 2, 2018
edb26d7
BUG: np.nan stays as np.nan (#20377)
nikoskaragiannakis Apr 2, 2018
c3ab9cb
TXT: Moved test from series.test_io to io.parser.na_values. Corrected…
nikoskaragiannakis Apr 2, 2018
69f6c95
DOC: updated IO section (#20377)
nikoskaragiannakis Apr 2, 2018
97a345a
TST: pep8 (#20377)
nikoskaragiannakis Apr 2, 2018
8b2fb0b
TXT: Moved test from series.test_io to io.parser.na_values. Corrected…
nikoskaragiannakis Apr 2, 2018
c9f5120
DOC: updated IO section (#20377)
nikoskaragiannakis Apr 2, 2018
fab0b27
resolve conflict
nikoskaragiannakis Apr 2, 2018
571d5c4
pep8 correction
nikoskaragiannakis Apr 2, 2018
0712392
Merge remote-tracking branch 'upstream/master' into nikoskaragiannaki…
TomAugspurger Apr 3, 2018
47bc105
DOC: Better explanation (#20377)
nikoskaragiannakis Apr 5, 2018
3740dfe
BUG: use checknull (#20377)
nikoskaragiannakis Apr 5, 2018
7d453bb
TST: update tests (#20377)
nikoskaragiannakis Apr 8, 2018
bcd739d
BUG: string nans to np.nan in Series for list data (#20377)
nikoskaragiannakis Apr 8, 2018
7341cd1
sync
nikoskaragiannakis Apr 8, 2018
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.23.0.txt
@@ -981,6 +981,8 @@ I/O
- :class:`Timedelta` now supported in :func:`DataFrame.to_excel` for all Excel file types (:issue:`19242`, :issue:`9155`, :issue:`19900`)
- Bug in :meth:`pandas.io.stata.StataReader.value_labels` raising an ``AttributeError`` when called on very old files. Now returns an empty dict (:issue:`19417`)
- Bug in :func:`read_pickle` when unpickling objects with :class:`TimedeltaIndex` or :class:`Float64Index` created with pandas prior to version 0.20 (:issue:`19939`)
+- Bug in :meth:`pandas.io.json.json_normalize` where subrecords are not properly normalized if any subrecords values are NoneType (:issue:`20030`)
Contributor

You are including some unrelated changes here; please rebase on master.

Contributor Author

That line is not mine. I deleted it by mistake and added it back.
You can check master here: https://github.com/pandas-dev/pandas/blob/master/doc/source/whatsnew/v0.23.0.txt#L985
However, even after rebasing, I keep getting this conflict.

Contributor

If you rebased off master and resolved the conflicts in the rebase then it should be ok. Did you fetch the current master before rebasing?

Contributor Author (nikoskaragiannakis, Mar 25, 2018)

I have now.

+- Bug in :`read_excel` where it transforms np.nan to 'nan' if dtype=str is chosen. Now keeps np.nan as they are. (:issue:`20377`)
Contributor

Should be :func:`read_excel`

Contributor

Use double backticks around ``dtype=str`` and around ``np.nan``.


Plotting
^^^^^^^^
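
The changelog entry above concerns missing values turning into the string 'nan' once cells are cast to str. The following is a minimal plain-Python sketch of that failure mode and of a null-aware cast. The helpers `is_missing` and `cast_cells_to_str` are hypothetical names used for illustration only; they are not pandas functions and do not reproduce the pandas implementation.

```python
import math


def is_missing(value):
    """Simplified null check: None or a float NaN counts as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def cast_cells_to_str(cells, keep_missing):
    """Cast every cell to str; optionally leave missing cells untouched."""
    return [
        cell if (keep_missing and is_missing(cell)) else str(cell)
        for cell in cells
    ]


# Hypothetical column contents: two strings, an integer and an empty cell.
cells = ["NA", 1, float("nan"), "rabbit"]

# A naive cast stringifies the missing cell as well.
naive = cast_cells_to_str(cells, keep_missing=False)

# A null-aware cast keeps the missing cell as NaN.
nan_safe = cast_cells_to_str(cells, keep_missing=True)

print(naive)
print(nan_safe)
```

Checking for missing values before casting avoids a later pass that has to guess which 'nan' strings were originally empty cells.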
5 changes: 5 additions & 0 deletions pandas/io/excel.py
@@ -679,6 +679,11 @@ def _parse_cell(cell_contents, cell_typ):
**kwds)

output[asheetname] = parser.read(nrows=nrows)
+dtypes = output[asheetname].dtypes
+output[asheetname].replace('nan', np.nan, inplace=True)
+output[asheetname] = output[asheetname].astype(dtypes,
+                                               copy=False)
Member (gfyoung, Mar 23, 2018)

I worry about this patch being a performance hit against read_excel. The Python parser (in io/parsers.py) processes each of the Excel elements before placing it into a DataFrame. I would look there for the fix, since as I mentioned below, this bug impacts other read_* functions.

Contributor

The result from read_csv is not the same as the one from read_excel with dtype=str. In the former case, empties are read in as np.nan, whereas in the latter they are read in as the string 'nan'.

Member

True, though that doesn't change my opinion. The problematic part still likely stems from the engine parsing, which would also affect read_csv (the more popular of the two IMO). Thus, if we can kill two birds with one stone, that would be even better.


if names is not None:
output[asheetname].columns = names
if not squeeze or isinstance(output[asheetname], DataFrame):
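
The read_csv behaviour mentioned in the thread above can be checked with a small in-memory CSV. This is a sketch with made-up sample data, assuming a reasonably recent pandas; it only illustrates how read_csv treats empty fields under `dtype=str` and says nothing about where the read_excel fix belongs.

```python
import io

import pandas as pd

# Two columns; the first data row has an empty field in column "b".
csv_text = "a,b\n1,\nx,2\n"

# Default NA handling: the empty field is read as a missing value,
# not as the string 'nan'.
default_na = pd.read_csv(io.StringIO(csv_text), dtype=str)
empty_is_missing = default_na["b"].isna().tolist()

# With the default NA markers disabled, the empty field stays an empty string.
no_default_na = pd.read_csv(
    io.StringIO(csv_text), dtype=str, keep_default_na=False
)
empty_kept = no_default_na["b"].tolist()

print(empty_is_missing)
print(empty_kept)
```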
19 changes: 18 additions & 1 deletion pandas/tests/io/test_excel.py
@@ -207,12 +207,29 @@ def test_excel_passes_na(self, ext):
columns=['Test'])
tm.assert_frame_equal(parsed, expected)

+# gh-20377 dtype=str (all 'nan' turn to np.nan)
+
+parsed = read_excel(excel, 'Sheet1', dtype=str, keep_default_na=False,
+                    na_values=['apple'])
+expected = DataFrame([['NA'], ['1'], ['NA'], [np.nan], ['rabbit']],
+                     columns=['Test'])
+tm.assert_frame_equal(parsed, expected)
+
+parsed = read_excel(excel, 'Sheet1', dtype=str, keep_default_na=True,
+                    na_values=['apple'])
+expected = DataFrame([[np.nan], ['1'], [np.nan], [np.nan], ['rabbit']],
+                     columns=['Test'])
+tm.assert_frame_equal(parsed, expected)
+
# 13967
excel = self.get_excelfile('test5', ext)

parsed = read_excel(excel, 'Sheet1', keep_default_na=False,
na_values=['apple'])
-expected = DataFrame([['1.#QNAN'], [1], ['nan'], [np.nan], ['rabbit']],
+# gh-20377 'nan' was given in the spreadsheet, but turned
+# to np.nan as well
+expected = DataFrame([['1.#QNAN'], [1], [np.nan], [np.nan],
+                      ['rabbit']],
columns=['Test'])
tm.assert_frame_equal(parsed, expected)
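
The changed expectation in the test above follows from how a blanket `replace('nan', np.nan)` works: it cannot tell a literal 'nan' typed into the spreadsheet from one produced by the string cast. A small sketch of that side effect, using a hand-built frame in place of the `test5` Excel file (the frame contents are an assumption modelled on the expected values in the test):

```python
import numpy as np
import pandas as pd

# Stand-in for the parsed sheet: the third cell is a genuine 'nan' string,
# the fourth is an actually empty cell.
frame = pd.DataFrame(
    {"Test": ["1.#QNAN", 1, "nan", np.nan, "rabbit"]}, dtype=object
)

# Same steps as the patch under review: remember the dtypes, replace the
# string 'nan' with np.nan, then cast back to the remembered dtypes.
original_dtypes = frame.dtypes
patched = frame.replace("nan", np.nan).astype(original_dtypes)

# Both the literal 'nan' and the empty cell are now reported as missing.
missing = patched["Test"].isna().tolist()
print(missing)
```

This also touches on the performance concern raised in review, since the replacement scans every cell of the parsed sheet after parsing has finished.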
