-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Don't overflow PeriodIndex in to_csv #15984
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Don't overflow PeriodIndex in to_csv #15984
Conversation
@@ -1260,6 +1260,7 @@ I/O | |||
- Bug in ``pd.read_csv()`` in which invalid values for ``nrows`` and ``chunksize`` were allowed (:issue:`15767`) | |||
- Bug in ``pd.read_csv()`` for the Python engine in which unhelpful error messages were being raised when parsing errors occurred (:issue:`15910`) | |||
- Bug in ``pd.read_csv()`` in which the ``skipfooter`` parameter was not being properly validated (:issue:`15925`) | |||
- Bug in ``pd.to_csv()`` in which there was numeric overflow when a timestamp index was being written (:issue:`15982`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timestamp
-> Period
I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used timestamp
because that seemed more clear (to the everyday user) than PeriodIndex
. What do you think?
Codecov Report
@@ Coverage Diff @@
## master #15984 +/- ##
==========================================
+ Coverage 91% 91% +<.01%
==========================================
Files 145 145
Lines 49641 49639 -2
==========================================
Hits 45175 45175
+ Misses 4466 4464 -2
Continue to review full report at Codecov.
|
@@ -1143,3 +1143,23 @@ def test_to_csv_quoting(self): | |||
df = df.set_index(['a', 'b']) | |||
expected = '"a","b","c"\n"1","3","5"\n"2","4","6"\n' | |||
self.assertEqual(df.to_csv(quoting=csv.QUOTE_ALL), expected) | |||
|
|||
def test_period_index_date_overflow(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also test NaT. I think we have very few tests for outputing a PI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, done.
can you also tests |
Slightly outside of the scope of this PR, but I can squeeze that in with another commit. |
Ha, tests are never out of scope! |
Well, I was focusing |
20cdf1e
to
06cf781
Compare
@jreback , @TomAugspurger : All comments have been addressed, and everything is green. |
An indexer into `self` that specifies which values | ||
are used in the formatting process. | ||
kwargs : dict | ||
Options for specifying how the values should be formatted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think would be ok to actually list these options (with there defautls) in the signature itself. I don't recall why I didn't do this originally. followup PR for this though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Derived classes change the signature with different defaults. That's why **kwargs
is provided in the base implementation.
pandas/tests/frame/test_to_csv.py
Outdated
df = pd.DataFrame([4, 5, 6], index=index) | ||
|
||
buf = StringIO() | ||
df.to_csv(buf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you are looking for something to clean :>
there is a bit of a dichotomy between this test file and pandas/formats/test_formats.py
where this tests 'higher' level functionaility and test_formats
somewhat lower, but this is probably not consistent, and certainly not documented (maybe just a comment at the top of the file would be enough).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm...I'm not sure I really see any issue with this right now. What sort of comment (and which file) were you thinking exactly?
expected = np.array(['01-2017-01', '01-2017-02', | ||
'01-2017-03'], dtype=object) | ||
|
||
result = index.to_native_types(date_format='%m-%Y-%d') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, can you move these to test_formats.py
(new file), in this directory, same with the period ones. I don't these are actually tested anywhere :>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, they aren't, but they are now with this PR 😄
index = PeriodIndex(['2017-01-01', '2017-01-02', | ||
'2017-01-03'], freq='D') | ||
|
||
# First, with no arguments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, done.
mostly about putting tests in better places, otherwise lgtm. |
pandas/tests/frame/test_to_csv.py
Outdated
index = pd.PeriodIndex(dates, freq="D") | ||
df = pd.DataFrame([4, 5, 6], index=index) | ||
|
||
buf = StringIO() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wondering outloud whether this particular testing idiom could be improved somehow. a possibilitiy
with tm.assert_format_equal(expected) as buf:
df.to_csv(buf)
we do this a LOT in testing output formatting (suggestions on name welcome), certainly a separate PR (to do changes in the rest of the codebase), though could be introduced here if you'd like
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, now that I look at this, since df.to_csv
DOES return if buf
is None, complicating the testing with writing to a buffer seems unnecessary. In the future, we should encourage any tests of to_csv
to do the following:
expected = "foo"
result = df.to_csv()
assert result == expected
That makes your point somewhat moot, as I think this way is much cleaner.
ac7e911
to
7165fdb
Compare
@jreback : Build machines running into SSL errors when cloning repository. |
7165fdb
to
5939a24
Compare
Addressed all comments, and everything is green. |
thanks @gfyoung ! |
Don't overflow time-stamp objects if being used as index in
to_csv
.Closes #15982.