Skip to content

read_excel with dtype=str converts empty cells to np.nan #20429

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
39 commits
Select commit Hold shift + click to select a range
dd53df8
TST: Test for astype_nansafe. Modified test for astype
nikoskaragiannakis Mar 20, 2018
6f771fb
BUG: np.nan should stay as it is when we cast to str/basestring
nikoskaragiannakis Mar 20, 2018
37f00ad
BUG: revert change in lib.pyx. modify excel functionality directly
nikoskaragiannakis Mar 20, 2018
f194b70
TST: revert changes in dtypes/test_cast. test excel functionality
nikoskaragiannakis Mar 20, 2018
eb8f4c5
DOC: added description
nikoskaragiannakis Mar 20, 2018
ac6a409
TST: correction and pep8
nikoskaragiannakis Mar 20, 2018
6994bb0
BUG: pep8
nikoskaragiannakis Mar 20, 2018
40a563f
TST: remove unused import
nikoskaragiannakis Mar 20, 2018
9858259
DOC: resolved conflict
nikoskaragiannakis Mar 20, 2018
5f71a99
Update v0.23.0.txt
nikoskaragiannakis Mar 20, 2018
0a93b60
conflict again
nikoskaragiannakis Mar 20, 2018
f296f9a
arghh
nikoskaragiannakis Mar 20, 2018
7c0af1f
DOC: add disallowing of Series construction of len-1 list with index …
jorisvandenbossche Mar 19, 2018
f0fd0a7
Bug: Allow np.timedelta64 objects to index TimedeltaIndex (#20408)
mroeschke Mar 19, 2018
61e0519
DOC: Only use ~ in class links to hide prefixes. (#20402)
dukebody Mar 19, 2018
9fdac27
DOC: update the pandas.DataFrame.plot.hist docstring (#20155)
liopic Mar 19, 2018
ddb904f
DOC" update the Pandas core window rolling count docstring" (#20264)
tommy-stone Mar 19, 2018
694849d
BUG: astype_unicode astype_str turn a np.nan to empty string (#20377)
nikoskaragiannakis Mar 24, 2018
5ba95a1
TST: added unitest for read_excel and modified series/test_dtypes for…
nikoskaragiannakis Mar 24, 2018
d3ceec3
TST: added unitest for read_csv (#20377)
nikoskaragiannakis Mar 25, 2018
ea1d73a
BUG: patched TextReader to turn np.nan to empty string if dtype=str (…
nikoskaragiannakis Mar 25, 2018
c1376a5
DOC: updated IO section (#20377)
nikoskaragiannakis Mar 25, 2018
3103811
DOC: updated IO section (#20377)
nikoskaragiannakis Mar 25, 2018
7d5f6b2
pull from master
nikoskaragiannakis Mar 25, 2018
478d08d
DOC: updated IO section (#20377)
nikoskaragiannakis Apr 2, 2018
edb26d7
BUG: np.nan stays as np.nan (#20377)
nikoskaragiannakis Apr 2, 2018
c3ab9cb
TXT: Moved test from series.test_io to io.parser.na_values. Corrected…
nikoskaragiannakis Apr 2, 2018
69f6c95
DOC: updated IO section (#20377)
nikoskaragiannakis Apr 2, 2018
97a345a
TST: pep8 (#20377)
nikoskaragiannakis Apr 2, 2018
8b2fb0b
TXT: Moved test from series.test_io to io.parser.na_values. Corrected…
nikoskaragiannakis Apr 2, 2018
c9f5120
DOC: updated IO section (#20377)
nikoskaragiannakis Apr 2, 2018
fab0b27
resolve conflict
nikoskaragiannakis Apr 2, 2018
571d5c4
pep8 correction
nikoskaragiannakis Apr 2, 2018
0712392
Merge remote-tracking branch 'upstream/master' into nikoskaragiannaki…
TomAugspurger Apr 3, 2018
47bc105
DOC: Better explanation (#20377)
nikoskaragiannakis Apr 5, 2018
3740dfe
BUG: use checknull (#20377)
nikoskaragiannakis Apr 5, 2018
7d453bb
TST: update tests (#20377)
nikoskaragiannakis Apr 8, 2018
bcd739d
BUG: string nans to np.nan in Series for list data (#20377)
nikoskaragiannakis Apr 8, 2018
7341cd1
sync
nikoskaragiannakis Apr 8, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -1099,6 +1099,7 @@ I/O
- Bug in :meth:`pandas.io.json.json_normalize` where subrecords are not properly normalized if any subrecords values are NoneType (:issue:`20030`)
- Bug in ``usecols`` parameter in :func:`pandas.io.read_csv` and :func:`pandas.io.read_table` where error is not raised correctly when passing a string. (:issue:`20529`)
- Bug in :func:`HDFStore.keys` when reading a file with a softlink causes exception (:issue:`20523`)
- Bug in :func:`read_excel` and :func:`read_csv` where missing values turned to ``'nan'`` with ``dtype=str`` and ``na_filter=True``. Now, these missing values are converted to the string missing indicator, ``np.nan``. (:issue `20377`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i mean make this a separate sub-section showing the previous and the new


Plotting
^^^^^^^^
Expand Down
16 changes: 14 additions & 2 deletions pandas/_libs/lib.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -461,11 +461,17 @@ cpdef ndarray[object] astype_unicode(ndarray arr):
cdef:
Py_ssize_t i, n = arr.size
ndarray[object] result = np.empty(n, dtype=object)
object arr_i

for i in range(n):
# we can use the unsafe version because we know `result` is mutable
# since it was created from `np.empty`
util.set_value_at_unsafe(result, i, unicode(arr[i]))
arr_i = arr[i]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is arr_i in the cdef?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

d'oh!

util.set_value_at_unsafe(
result,
i,
unicode(arr_i) if not checknull(arr_i) else np.nan
)

return result

Expand All @@ -474,11 +480,17 @@ cpdef ndarray[object] astype_str(ndarray arr):
cdef:
Py_ssize_t i, n = arr.size
ndarray[object] result = np.empty(n, dtype=object)
object arr_i

for i in range(n):
# we can use the unsafe version because we know `result` is mutable
# since it was created from `np.empty`
util.set_value_at_unsafe(result, i, str(arr[i]))
arr_i = arr[i]
util.set_value_at_unsafe(
result,
i,
str(arr_i) if not checknull(arr_i) else np.nan
)

return result

Expand Down
2 changes: 2 additions & 0 deletions pandas/_libs/parsers.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -1218,6 +1218,7 @@ cdef class TextReader:
# treat as a regular string parsing
return self._string_convert(i, start, end, na_filter,
na_hashset)

elif dtype.kind == 'U':
width = dtype.itemsize
if width > 0:
Expand All @@ -1227,6 +1228,7 @@ cdef class TextReader:
# unicode variable width
return self._string_convert(i, start, end, na_filter,
na_hashset)

elif is_categorical_dtype(dtype):
# TODO: I suspect that _categorical_convert could be
# optimized when dtype is an instance of CategoricalDtype
Expand Down
4 changes: 4 additions & 0 deletions pandas/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -4153,4 +4153,8 @@ def _try_cast(arr, take_fast_path):
data = np.array(data, dtype=dtype, copy=False)
subarr = np.array(data, dtype=object, copy=copy)

# GH 20377
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? this should not be necessary (not to mention non-performant)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this fixes an error I got in this test https://github.com/pandas-dev/pandas/blob/master/pandas/tests/frame/test_dtypes.py#L533

This line https://github.com/nikoskaragiannakis/pandas/blob/7341cd17e11461728969afa159250e882b32dee0/pandas/core/series.py#L4154 does the damage by turning a np.nan to 'nan'. But this comes directly from numpy, so I cannot change it.

Am I missing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how's that an error? that test is correct.

Copy link
Contributor Author

@nikoskaragiannakis nikoskaragiannakis May 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that with the changes in the .pyx files, np.nan stays a float, even if you try to cast it to str.
For example

pd.DataFrame([np.NaN]).astype(str)

leaves np.NaN a float, while before the changes you used to get a 'nan'.

So I have to ask: do we want calls like the one above to turn np.NaN into nan or not?
If not, then we would be inconsistent with this https://github.com/pandas-dev/pandas/pull/20429/files#diff-6e435422c67fa1384140f92110fb69a7R379

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# Turn all 'nan' to np.nan
subarr[subarr == 'nan'] = np.nan

return subarr
2 changes: 1 addition & 1 deletion pandas/tests/frame/test_dtypes.py
Original file line number Diff line number Diff line change
Expand Up @@ -529,7 +529,7 @@ def test_astype_str(self):
# consistency in astype(str)
for tt in set([str, compat.text_type]):
result = DataFrame([np.NaN]).astype(tt)
expected = DataFrame(['nan'])
expected = DataFrame([np.NaN], dtype=object)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no leave this alone. turning a float nan into string should still work

assert_frame_equal(result, expected)

result = DataFrame([1.12345678901234567890]).astype(tt)
Expand Down
24 changes: 24 additions & 0 deletions pandas/tests/io/parser/na_values.py
Original file line number Diff line number Diff line change
Expand Up @@ -369,3 +369,27 @@ def test_no_na_filter_on_index(self):
expected = DataFrame({"a": [1, 4], "c": [3, 6]},
index=Index([np.nan, 5.0], name="b"))
tm.assert_frame_equal(out, expected)

def test_na_values_with_dtype_str_and_na_filter_true(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize this on na_filter (you will need to provide the nan_value as well in the parameterize as they are different)

# see gh-20377
data = "a,b,c\n1,,3\n4,5,6"

out = self.read_csv(StringIO(data), na_filter=True, dtype=str)

# missing data turn to np.nan, which stays as it is after dtype=str
expected = DataFrame({"a": ["1", "4"],
"b": [np.nan, "5"],
"c": ["3", "6"]})
tm.assert_frame_equal(out, expected)

def test_na_values_with_dtype_str_and_na_filter_false(self):
# see gh-20377
data = "a,b,c\n1,,3\n4,5,6"

out = self.read_csv(StringIO(data), na_filter=False, dtype=str)

# missing data turn to empty string
expected = DataFrame({"a": ["1", "4"],
"b": ["", "5"],
"c": ["3", "6"]})
tm.assert_frame_equal(out, expected)
27 changes: 27 additions & 0 deletions pandas/tests/io/test_excel.py
Original file line number Diff line number Diff line change
Expand Up @@ -360,6 +360,33 @@ def test_reader_dtype(self, ext):
with pytest.raises(ValueError):
actual = self.get_exceldf(basename, ext, dtype={'d': 'int64'})

def test_reader_dtype_str(self, ext):
# GH 20377
basename = 'testdtype'
actual = self.get_exceldf(basename, ext)

expected = DataFrame({
'a': [1, 2, 3, 4],
'b': [2.5, 3.5, 4.5, 5.5],
'c': [1, 2, 3, 4],
'd': [1.0, 2.0, np.nan, 4.0]}).reindex(
columns=['a', 'b', 'c', 'd'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just specify columns=list('abcd') rather than reindex


tm.assert_frame_equal(actual, expected)

actual = self.get_exceldf(basename, ext,
dtype={'a': 'float64',
'b': 'float32',
'c': str,
'd': str})

expected['a'] = expected['a'].astype('float64')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this higher (by the expected), you can simply construct things directly by using e.g. Series(...., dtype='float32') rather than a list

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this higher (by the expected)

I'm not sure what you mean here.

you can simply construct things directly by using e.g. Series(...., dtype='float32') rather than a list

First of all, this is copy-paste from the previous test, which was added for #8212

Do you mean to do

expected = DataFrame({'a': Series([1,2,3,4], dtype='float64'),
                      'b': Series([2.5,3.5,4.5,5.5], dtype='float32'),
                      ...})

?
If i use, for example, for the 'c' column: Series([1, 2, 3, 4], dtype=str), then it will give me ['1', '2', '3', '4'] instead of the expected ['001', '002', '003', '004'].

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes exactly. If you need things like '001', then just do it that way, e.g. Series(['001', '002'....])

expected['b'] = expected['b'].astype('float32')
expected['c'] = ['001', '002', '003', '004']
expected['d'] = ['1', '2', np.nan, '4']

tm.assert_frame_equal(actual, expected)

def test_reading_all_sheets(self, ext):
# Test reading all sheetnames by setting sheetname to None,
# Ensure a dict is returned.
Expand Down
1 change: 1 addition & 0 deletions pandas/tests/series/test_dtypes.py
Original file line number Diff line number Diff line change
Expand Up @@ -149,6 +149,7 @@ def test_astype_str_map(self, dtype, series):
# see gh-4405
result = series.astype(dtype)
expected = series.map(compat.text_type)
expected = expected.replace('nan', np.nan) # see gh-20377
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this, this is equiv to .astype(str) its mapping to a string and is correct

tm.assert_series_equal(result, expected)

@pytest.mark.parametrize("dtype", [str, compat.text_type])
Expand Down