read_excel with dtype=str converts empty cells to np.nan #20429
Changes from 34 commits
dd53df8
6f771fb
37f00ad
f194b70
eb8f4c5
ac6a409
6994bb0
40a563f
9858259
5f71a99
0a93b60
f296f9a
7c0af1f
f0fd0a7
61e0519
9fdac27
ddb904f
694849d
5ba95a1
d3ceec3
ea1d73a
c1376a5
3103811
7d5f6b2
478d08d
edb26d7
c3ab9cb
69f6c95
97a345a
8b2fb0b
c9f5120
fab0b27
571d5c4
0712392
47bc105
3740dfe
7d453bb
bcd739d
7341cd1
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1098,6 +1098,7 @@ I/O | |
- Bug in :func:`read_pickle` when unpickling objects with :class:`TimedeltaIndex` or :class:`Float64Index` created with pandas prior to version 0.20 (:issue:`19939`) | ||
- Bug in :meth:`pandas.io.json.json_normalize` where subrecords are not properly normalized if any subrecords values are NoneType (:issue:`20030`) | ||
- Bug in ``usecols`` parameter in :func:`pandas.io.read_csv` and :func:`pandas.io.read_table` where error is not raised correctly when passing a string. (:issue:`20529`) | ||
- Bug in :func:`read_excel` and :func:`read_csv` where missing values turned to ``'nan'`` with ``dtype=str`` and ``na_filter=True``. Now, they turn to ``np.nan``. (:issue:`20377`) | ||
Review comment: can you make the last part a bit more clear. These missing values are converted to the string missing indicator,
||
- Bug in :func:`HDFStore.keys` when reading a file with a softlink causes exception (:issue:`20523`) | ||
|
||
Plotting | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -465,7 +465,11 @@ cpdef ndarray[object] astype_unicode(ndarray arr): | |
for i in range(n): | ||
# we can use the unsafe version because we know `result` is mutable | ||
# since it was created from `np.empty` | ||
util.set_value_at_unsafe(result, i, unicode(arr[i])) | ||
arr_i = arr[i] | ||
Review comment: is arr_i in the cdef?
Review comment: d'oh!
||
util.set_value_at_unsafe( | ||
result, | ||
i, | ||
unicode(arr_i) if arr_i is not np.nan else np.nan) | ||
Review comment: Interesting spacing...maybe we should do this instead:
    uni_arr_i = unicode(arr_i) if arr_i is not np.nan else np.nan
    util.set_value_at_unsafe(result, i, uni_arr_i)
Review comment: use
Review comment: Using
Review comment: yeah that is not friendly to strings - ok
Review comment: When i use that, all hell breaks loose. I get errors in tests like this one https://github.com/pandas-dev/pandas/blob/master/pandas/tests/frame/test_dtypes.py#L533 Is it because they use
Review comment: @gfyoung are you sure this indentation is a big problem? Because if I do what you suggest, then how should I declare uni_arr_i (and str_arr_i) in the cdef?
    util.set_value_at_unsafe(
        ...
    )
(moved the close bracket in the next line)?
Review comment: That would work as well.
Review comment: the nans are the same; iow they point to the same object
||
|
||
return result | ||
|
||
|
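The thread above turns on whether `arr_i is not np.nan` is a reliable test. A minimal sketch of what object identity does and does not catch follows; `is_nan` is a hypothetical helper added only for comparison, not something proposed in the PR, and the sketch makes no claim about why the linked tests failed.

```python
import numpy as np

# np.nan is a single module-level float object, so it is identical to itself.
same_object = np.nan is np.nan

# A NaN created elsewhere is a different object, so an identity test misses it.
fresh_nan = float("nan")
fresh_is_np_nan = fresh_nan is np.nan

# An object array stores references, so np.nan placed in it stays the same object.
obj_arr = np.array([np.nan, "x"], dtype=object)
stored_is_np_nan = obj_arr[0] is np.nan


def is_nan(value):
    """Hypothetical value-based check: NaN is the only float unequal to itself."""
    return isinstance(value, float) and value != value
```

The identity test is cheap and never matches the string `"nan"`, but it only recognises values that are literally the `np.nan` object.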
@@ -478,7 +482,11 @@ cpdef ndarray[object] astype_str(ndarray arr): | |
for i in range(n): | ||
# we can use the unsafe version because we know `result` is mutable | ||
# since it was created from `np.empty` | ||
util.set_value_at_unsafe(result, i, str(arr[i])) | ||
arr_i = arr[i] | ||
util.set_value_at_unsafe( | ||
result, | ||
i, | ||
str(arr_i) if arr_i is not np.nan else np.nan) | ||
Review comment: same
||
|
||
return result | ||
|
||
|
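Both hunks above apply the same change: each element is converted to a string unless it is `np.nan`, which is stored unchanged. A pure-Python sketch of that loop, assuming a one-dimensional object array as input; `astype_str_sketch` is a hypothetical name, and the real functions are Cython and use `util.set_value_at_unsafe`.

```python
import numpy as np


def astype_str_sketch(arr):
    """Hypothetical pure-Python analogue of the patched astype_str loop."""
    n = arr.size
    result = np.empty(n, dtype=object)
    for i in range(n):
        arr_i = arr[i]
        # Same identity test as the diff: only the np.nan object is kept as-is,
        # everything else goes through str().
        result[i] = str(arr_i) if arr_i is not np.nan else np.nan
    return result


values = np.array([1, "x", np.nan, 2.5], dtype=object)
converted = astype_str_sketch(values)
```

Before the patch, the third element would have become the string `'nan'`, because `str()` was applied to every element.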
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -369,3 +369,27 @@ def test_no_na_filter_on_index(self): | |
expected = DataFrame({"a": [1, 4], "c": [3, 6]}, | ||
index=Index([np.nan, 5.0], name="b")) | ||
tm.assert_frame_equal(out, expected) | ||
|
||
def test_na_values_with_dtype_str_and_na_filter_true(self): | ||
Review comment: can you parameterize this on na_filter (you will need to provide the nan_value as well in the parameterize as they are different)
||
# see gh-20377 | ||
data = "a,b,c\n1,,3\n4,5,6" | ||
|
||
out = self.read_csv(StringIO(data), na_filter=True, dtype=str) | ||
|
||
# missing data turn to np.nan, which stays as it is after dtype=str | ||
expected = DataFrame({"a": ["1", "4"], | ||
"b": [np.nan, "5"], | ||
"c": ["3", "6"]}) | ||
tm.assert_frame_equal(out, expected) | ||
|
||
def test_na_values_with_dtype_str_and_na_filter_false(self): | ||
# see gh-20377 | ||
data = "a,b,c\n1,,3\n4,5,6" | ||
|
||
out = self.read_csv(StringIO(data), na_filter=False, dtype=str) | ||
|
||
# missing data turn to empty string | ||
expected = DataFrame({"a": ["1", "4"], | ||
"b": ["", "5"], | ||
"c": ["3", "6"]}) | ||
tm.assert_frame_equal(out, expected) |
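The review comment on the first test asks for the two tests to be merged and parametrized on `na_filter`. One way that could look is sketched below. It assumes the public `pd.read_csv` and `pd.testing.assert_frame_equal` in place of the suite's `self.read_csv` and `tm` helpers, and the merged test name is hypothetical.

```python
from io import StringIO

import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize("na_filter,nan_value", [(True, np.nan), (False, "")])
def test_na_values_with_dtype_str_and_na_filter(na_filter, nan_value):
    # see gh-20377
    data = "a,b,c\n1,,3\n4,5,6"

    out = pd.read_csv(StringIO(data), na_filter=na_filter, dtype=str)

    # The empty cell becomes np.nan when na_filter=True
    # and an empty string when na_filter=False.
    expected = pd.DataFrame({"a": ["1", "4"],
                             "b": [nan_value, "5"],
                             "c": ["3", "6"]})
    pd.testing.assert_frame_equal(out, expected)
```

Pairing each `na_filter` value with its expected missing-value marker keeps the body of the test identical for both cases.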
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -360,6 +360,33 @@ def test_reader_dtype(self, ext): | |
with pytest.raises(ValueError): | ||
actual = self.get_exceldf(basename, ext, dtype={'d': 'int64'}) | ||
|
||
def test_reader_dtype_str(self, ext): | ||
# GH 20377 | ||
basename = 'testdtype' | ||
actual = self.get_exceldf(basename, ext) | ||
|
||
expected = DataFrame({ | ||
'a': [1, 2, 3, 4], | ||
'b': [2.5, 3.5, 4.5, 5.5], | ||
'c': [1, 2, 3, 4], | ||
'd': [1.0, 2.0, np.nan, 4.0]}).reindex( | ||
columns=['a', 'b', 'c', 'd']) | ||
Review comment: just specify
||
|
||
tm.assert_frame_equal(actual, expected) | ||
|
||
actual = self.get_exceldf(basename, ext, | ||
dtype={'a': 'float64', | ||
'b': 'float32', | ||
'c': str, | ||
'd': str}) | ||
|
||
expected['a'] = expected['a'].astype('float64') | ||
Review comment: move this higher (by the expected), you can simply construct things directly by using e.g.
Review comment: I'm not sure what you mean here. First of all, this is copy-paste from the previous test, which was added for #8212. Do you mean to do
    expected = DataFrame({'a': Series([1,2,3,4], dtype='float64'),
                          'b': Series([2.5,3.5,4.5,5.5], dtype='float32'),
                          ...})
?
Review comment: yes exactly. If you need things like '001', then just do it that way, e.g.
||
expected['b'] = expected['b'].astype('float32') | ||
expected['c'] = ['001', '002', '003', '004'] | ||
expected['d'] = ['1', '2', np.nan, '4'] | ||
|
||
tm.assert_frame_equal(actual, expected) | ||
|
||
def test_reading_all_sheets(self, ext): | ||
# Test reading all sheetnames by setting sheetname to None, | ||
# Ensure a dict is returned. | ||
|
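The review thread in `test_reader_dtype_str` suggests building the second `expected` frame directly instead of mutating the first one column by column. A sketch of that construction follows, using the column values from the diff. The explicit `columns=` argument is an assumption standing in for the original `.reindex(columns=...)` call.

```python
import numpy as np
import pandas as pd

# Build the expected frame in one step, with each dtype stated where it matters.
expected = pd.DataFrame(
    {
        "a": pd.Series([1, 2, 3, 4], dtype="float64"),
        "b": pd.Series([2.5, 3.5, 4.5, 5.5], dtype="float32"),
        "c": ["001", "002", "003", "004"],
        "d": ["1", "2", np.nan, "4"],
    },
    columns=["a", "b", "c", "d"],
)
```

Constructed this way, the frame shows the intended result of `dtype={'a': 'float64', 'b': 'float32', 'c': str, 'd': str}` in one place, including the missing value in column `d` staying `np.nan` rather than becoming the string `'nan'`.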
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -149,6 +149,7 @@ def test_astype_str_map(self, dtype, series): | |
# see gh-4405 | ||
result = series.astype(dtype) | ||
expected = series.map(compat.text_type) | ||
expected.replace('nan', np.nan, inplace=True) # see gh-20377 | ||
Review comment: don't use inplace
||
tm.assert_series_equal(result, expected) | ||
|
||
@pytest.mark.parametrize("dtype", [str, compat.text_type]) | ||
|
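The "don't use inplace" comment on `test_astype_str_map` can be addressed by assigning the result of `replace` instead of mutating `expected`. A minimal sketch, assuming a small float Series in place of the test's `series` fixture and the built-in `str` in place of `compat.text_type`:

```python
import numpy as np
import pandas as pd

series = pd.Series([1.5, np.nan, 3.0])  # stand-in for the test fixture

# map(str) turns the missing value into the string 'nan';
# replace returns a new Series, so no inplace mutation is needed.
expected = series.map(str).replace("nan", np.nan)  # see gh-20377
```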
Review comment: you are including some other changes here, pls rebase on master.
Review comment: it's not mine. I deleted it by mistake and added it back. You can check master here https://github.com/pandas-dev/pandas/blob/master/doc/source/whatsnew/v0.23.0.txt#L985 However, even after rebasing, I keep getting this conflict
Review comment: If you rebased off master and resolved the conflicts in the rebase then it should be ok. Did you fetch the current master before rebasing?
Review comment: i did now