-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: DataFrame.to_csv formatting parameters for float indexes #11681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
And I've corrected the fact that the |
# if any index contains only NaNs, it has collapsed into an empty | ||
# Float64Index, and when the multiindex has been recomposed | ||
# the NaNs have come back as NaNs, not as strings corresponding to | ||
# na_rep |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is non-performant. What exactly is the issue here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hu, I realize my comment is not only not-clear, but also non-accurate
Let's consider the example in my tests:
df = DataFrame({'a': [0, np.NaN], 'b': [0, 1], 'c': [2, 3]}).set_index(['a', 'b'])
The index is a multi-index, with the following levels:
Float64Index([0.0], dtype='float64', name=u'a')
Int64Index([0, 1], dtype='int64', name=u'b')
After the calls to _format_native_types
, the variable levels
contains:
['0.0']
['0', '1']
and afterwards mi.values
contain:
[('0.0', '0'), (nan, '1')]
Here, nan
is not a string "nan"
, but numpy.NaN
(which is printed as the string nan
)
Then, for that reason, in the tests I've introduced, the following test fails:
Traceback (most recent call last):
File "/home/nicolas/Git/pandas/pandas/tests/test_format.py", line 2965, in test_to_csv_na_rep
self.assertEqual(df.set_index(['a', 'b']).to_csv(na_rep='_'), expected)
AssertionError: 'a,b,c\n0.0,0,2\nnan,1,3\n' != 'a,b,c\n0.0,0,2\n_,1,3\n'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the way to fix this is set both the levels & labels; e.g. you append a new value to the level (which will represent the nans), then change the -1 in the labels to that value (the position of that value, e.g. 1 in this case). This will then reformat the MultiIndex to work correctly.
You can do this in the MultiIndex constructor of _format_native_types
In [20]: df.index.set_levels([0.0,'_'],level=0).set_labels([0,1],level=0).values
Out[20]: array([(0.0, 0), ('_', 1)], dtype=object)
@jreback all set |
@@ -3878,6 +3878,32 @@ def _convert_slice_indexer(self, key, kind=None): | |||
# translate to locations | |||
return self.slice_indexer(key.start, key.stop, key.step) | |||
|
|||
def _format_native_types(self, na_rep='', float_format=None, | |||
decimal='.', quoting=None, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is basically identical to core/internals/FloatBlock/to_native_types
. let's pull both those out and put it in a function in core/format.py/FloatArrayFormatter
and call it .get_formatted_data()
. See how that works out. These are the routines for screen printing (which are necessarily different from to_csv / index formatting).
ok, looks pretty good. I think we can take this opportunity to re-factor a bit as I have noted above. |
@nbonnotte if you can update / refactor as above would be great |
I will, just had not the time yet. If you'd like this to be done quickly, because of the schedule for the 0.18.0 release, let me know. |
no, just pinging :) |
lmk when you can update |
@jreback I just pushed the changes. Let me know if other changes are needed. Happy Holidays! |
expected = "a,b,c\n_,0,2\n_,1,3\n" | ||
self.assertEqual(df.set_index('a').to_csv(na_rep='_'), expected) | ||
self.assertEqual(df.set_index(['a', 'b']).to_csv(na_rep='_'), expected) | ||
# check if na_rep parameter does not break anything when no NaN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line here
lgtm. some very minor formatting changes. in general like to have blank lines between different sub-tests and to format code nicely. ping when pushed and green. |
@@ -2101,6 +2105,32 @@ def _format_strings(self): | |||
|
|||
return fmt_values | |||
|
|||
def get_formatted_data(self): | |||
values = self.values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a doc-string here describing what this is doing. Some comments in the code as well (to describe the purposes of the if clauses)
If you and Travis agree, that should be it. I have added some comments as you suggested, to make the code clearer. But to be honest I'm not quite sure about how some bits of the code fit with the rest, for instance how and where the Also, I suppose |
|
||
# same but for an index | ||
self.assertEqual( | ||
df.set_index('a').to_csv(decimal='^'), expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, are there tests for quoting
? if not can you add a couple. thxs.
@nbonnotte code looks great. just see if we have some tests for the explanations are fine. Just want to have a note instead of a mass of code to explain a bit what is happening. as far as refactoring, you got what I was looking for (integration between ping when pushed / green (or if tests ok, lmk) |
Yeah, the |
API: DataFrame.to_csv formatting parameters for float indexes
thanks! |
Fix issue #11553
Two things:
I've created a
Float64Index._format_native_types
method which is a copy-paste ofFloatBlock.to_native_types
. I would have preferred to call the latter directly, but I'm not sure what theplacement
parameter of theFloatBlock
constructor means. I guess I doesn't really matters, since I could put whatever value and it should work (I think), and my hesitation a bit unfounded, but I don't know if it would be really a clean solution. Maybe someone can think of a more elegant way?Since a
Float64Index
containing only NaNs collapses when part of a multi-index, its NaNs values would not be converted usingna_rep
, so I had to hack a solution. I put a comment in the relevant part. I'm not quite convinced myself of the elegance of the solution, though.What do you think?