-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Fix #25099 set na_rep values before converting to string to prevent data truncation #25103
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #25099 set na_rep values before converting to string to prevent data truncation #25103
Conversation
Codecov Report
@@ Coverage Diff @@
## master #25103 +/- ##
==========================================
+ Coverage 92.37% 92.37% +<.01%
==========================================
Files 166 166
Lines 52415 52417 +2
==========================================
+ Hits 48418 48420 +2
Misses 3997 3997
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #25103 +/- ##
=========================================
Coverage ? 40.71%
=========================================
Files ? 175
Lines ? 52378
Branches ? 0
=========================================
Hits ? 21328
Misses ? 31050
Partials ? 0
Continue to review full report at Codecov.
|
f393d18
to
388cd59
Compare
5245859
to
f6c09b1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls run the asv's for csv and pytables as well and report if any regressions
@jreback just finished running asv with the following command asv continuous -f 1.1 upstream/master HEAD -b ^io.csv Complete output in this gist, but near the bottom of its output, it says
Which doesn't look good I think... do you reckon I should try to find another alternative, perhaps duplicating code and leaving the previous method unaltered? By the way, I didn't find a Thanks |
can you merge master |
04d8752
to
0f93786
Compare
Branch rebased onto |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a whatsnew note in bug fixes, I/O section. ping on green.
0f93786
to
6e0dd9c
Compare
@jreback added a whatsnew entry as requested - happy to amend if text is incorrect or too simple. I resolved the outdated discussions, except for one that is still pertinent. The initial approach was to calculate the maximum length when it was a string, and then just create a type of that length. That way the masked values were not truncated. The pending discussion is around that approach, where we tried instead to change the order of where the mask gets applied. But several other tests failed after that (see my last comment that Travis failed). In one test that I found, it had an empty string, and for some reason with the different order, it was trying to call So I reverted that change, and Travis stopped complaining. Also ran
However, I am still not too confident if that is a valid approach, as I did not find other parts of the code in Pandas where types had a dynamic size 😕 I hope that makes sense. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping on green.
6e0dd9c
to
83a6b14
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you merge master?
83a6b14
to
a56b837
Compare
Thanks for the review @WillAyd
Done. Rebased, fixed conflicts, and tried to address the feedback, except the parameterized test as that was requested before (I think the intention was to add more test scenarios). Cheers |
a56b837
to
3dcb453
Compare
can you merge master |
3dcb453
to
1fdb74a
Compare
@kinow can you merge master. Conflicting files |
Done @simonjayhawkins 👍 |
1fdb74a
to
6330776
Compare
this looked reasonble, can you merge master |
54c3697
to
549363e
Compare
Done @jreback 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this went for a bit without review - @kinow any chance this is still active?
549363e
to
53e0ae9
Compare
Rebased, now will try to find some time this weekend to troubleshoot the Travis errors, and see if I can fix them and them look at |
7fd549d
to
0b0bb76
Compare
ec87b6e
to
391c3b4
Compare
… prevent data truncation
391c3b4
to
7d94d1f
Compare
@WillAyd rebased, fixed conflicts, tried to address your feedback the best I could, and squashed commits too. CI passing, so should be ready for review again 👍 thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just need to move the whatsnew but I think this is reasonable. Back to @jreback
doc/source/whatsnew/v0.25.2.rst
Outdated
@@ -64,7 +64,7 @@ I/O | |||
|
|||
- Fix regression in notebook display where <th> tags not used for :attr:`DataFrame.index` (:issue:`28204`). | |||
- Regression in :meth:`~DataFrame.to_csv` where writing a :class:`Series` or :class:`DataFrame` indexed by an :class:`IntervalIndex` would incorrectly raise a ``TypeError`` (:issue:`28210`) | |||
- | |||
- Bug in :meth:`DataFrame.to_csv` where values were truncated when the length of ``na_rep`` was shorter than the text input data. (:issue:`25099`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this to v1.0.0 at this point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done. Moved to v1.0.0 under IO too. Removed the changes to v0.25.2, and edited last commit. Thanks @WillAyd
7d94d1f
to
10f67ab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. By the way @kinow no need to squash your commits we do so before merge
Oh, thanks for the heads-up, I thought I would be asked to squash them later. And I realized I had made a mess in the last commits, so decided to tidy it up a bit. Lesson learned for the next PR. Thanks! |
Looks like we have a couple +1s. Thanks @kinow! |
…uncation (pandas-dev#25103) * Fix pandas-dev#25099 set na_rep values before converting to string to prevent data truncation
…uncation (pandas-dev#25103) * Fix pandas-dev#25099 set na_rep values before converting to string to prevent data truncation
Hi,
The code in this pull request fixes #25099, and all tests pass. However, I am not sure if this is the best (or even an elegant) solution.
The issue happened because the float
NaN
is converted tostr
(i.e. "nan"), which gets stored as a<U3
. When the user-providedna_rep
is used, it is first truncated down to the data type length (i.e. "myn").I tried moving the
na_rep
to before it is converted to string, but other tests fail. I could not find other code using a similar approach, of concatenating<U
with the desired length. So I assume it's an easy hack, but not a proper solution?Any suggestions? I also tried setting the
na_rep
value only in theif
branch that callsvalues.astype(str)
, but there are still test failures.Cheers
Bruno
git diff upstream/master -u -- "*.py" | flake8 --diff