-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: formating integers datetimes using sql GH17855 #17882
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #17882 +/- ##
==========================================
- Coverage 91.23% 91.21% -0.02%
==========================================
Files 163 163
Lines 50102 50104 +2
==========================================
- Hits 45712 45704 -8
- Misses 4390 4400 +10
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #17882 +/- ##
==========================================
- Coverage 91.36% 91.34% -0.02%
==========================================
Files 164 164
Lines 49730 49732 +2
==========================================
- Hits 45435 45429 -6
- Misses 4295 4303 +8
Continue to review full report at Codecov.
|
@drorata : Will need to add tests for the new behavior. |
pandas/io/sql.py
Outdated
@@ -109,7 +109,11 @@ def _handle_date_column(col, utc=None, format=None): | |||
issubclass(col.dtype.type, np.integer)): | |||
# parse dates as timestamp | |||
format = 's' if format is None else format | |||
return to_datetime(col, errors='coerce', unit=format, utc=utc) | |||
if '%' in format: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment as to why you are doing this logic branching (and reference issue number).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a second thought, I think we can do this a bit cleaner like this:
if format is None and (issubclass(col.dtype.type, np.floating) or
issubclass(col.dtype.type, np.integer)):
format = 's'
if format in ['D', 'd', 'h', 'm' 's', 'ms', 'us', 'ns']:
return to_datetime(col, errors='coerce', unit=format, utc=utc)
elif is_datetime64tz_dtype(col):
...
else:
return to_datetime(col, errors='coerce', format=format, utc=utc)
So first check for the specific case of numeric values and no format -> parse as seconds. Then the format arg is checked for all possible values for unit
. Once this check is passed, we don't need to check if '%' is in format anymore, as it can never be a valid unit (this has already been checked)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche But what about the case where the column consists of integers of the format YYYYMMDD
or something similar? This is not a valid unit and has to be formatted using %
(e.g. %Y%m%d
).
If the format string contains %
it means that the user knows something about the data and this knowledge has to be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you specify format="%Y%m%d"
, the column will be parsed with that format
in the snippet above (only the specific recognized units specifiers are passed to unit
, otherwise format
is used)
can you rebase and update according to comments |
@jreback It's on my agenda and I'll do it as soon as I have time. |
@drorata there went something wrong with updating the branch (see all the commits included here on github).
|
That worked! |
Cleaned the fix and implemented tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
pandas/tests/io/test_sql.py
Outdated
assert issubclass(df.IntDateCol.dtype.type, np.datetime64) | ||
|
||
df = sql.read_sql_query("SELECT * FROM types_test_data", self.conn, | ||
parse_dates={'IntDateOnlyCol': '%Y%m%d'}) | ||
assert issubclass(df.IntDateOnlyCol.dtype.type, np.datetime64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I included a test. I also noticed that the existing tests are not enough. It might be that the parsing returns a NaT
which satisfies the classing condition but the values are wrong. This is the reason I added an explicit test that checks that the resulting values are correct. As a matter of fact while implementing the improvement suggested by @jorisvandenbossche the tests passed but the returned values were NaT
s (the reason was a missing ,
in the code @jorisvandenbossche suggested)
Would you agree there's a problem with the tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, those are indeed not very thorough. Do you want to add such a similar check to the others as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche Within this same PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would be good yes (because as you said, with your changes you could actually silently 'break' them now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche I have included more explicit tests.
I guess before merging I still have to squash all commits, is that correct? What is the canonical way to do so? |
No need to squash in the PR. We will squash when merging (github provides that), so you can keep the history in the PR branch itself as it is. |
Seems like some tests fail on ci/circleci. Why is it? Locally everything passes (or skipped which I assumed is OK) |
Included explicit values comparison
The error on circle ci seems relevant, as something in the setup of sql tests is failing |
The log says there is a sql syntax error:
|
pandas/tests/io/test_sql.py
Outdated
@@ -98,6 +99,7 @@ | |||
`TextCol` TEXT, | |||
`DateCol` DATETIME, | |||
`IntDateCol` INTEGER, | |||
"IntDateOnlyCol" INTEGER, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are using a different quoting here, that's probably the reason for the failure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I'll fix it.
The |
I restarted the build. All builds seemed rather slow. I don't directly see something in the diff here that could cause a slowdown, so maybe it is a travis issue. Let's see when it finishes. |
Who can help me review the fail of |
It was also something unrelated I think (it couldn't reach github), so restarted the build |
@gfyoung are you comfortable with this PR? |
It seems that finally everything is green! (not sure why the CIs were so flaky here) |
@drorata Thanks for the fix! |
Ah, forgot to ask you to add a whatsnew, did that in cf90995 |
(cherry picked from commit cf90995)
(cherry picked from commit bc95629)
(cherry picked from commit bc95629)
git diff upstream/master -u -- "*.py" | flake8 --diff