Skip to content

BUG: formating integers datetimes using sql GH17855 #17882

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Nov 22, 2017

Conversation

drorata
Copy link
Contributor

@drorata drorata commented Oct 15, 2017

@codecov
Copy link

codecov bot commented Oct 15, 2017

Codecov Report

Merging #17882 into master will decrease coverage by 0.01%.
The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17882      +/-   ##
==========================================
- Coverage   91.23%   91.21%   -0.02%     
==========================================
  Files         163      163              
  Lines       50102    50104       +2     
==========================================
- Hits        45712    45704       -8     
- Misses       4390     4400      +10
Flag Coverage Δ
#multiple 89.02% <0%> (-0.01%) ⬇️
#single 40.31% <66.66%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/io/sql.py 94.66% <66.66%> (-0.15%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aed9b92...0498dd1. Read the comment docs.

@codecov
Copy link

codecov bot commented Oct 15, 2017

Codecov Report

Merging #17882 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17882      +/-   ##
==========================================
- Coverage   91.36%   91.34%   -0.02%     
==========================================
  Files         164      164              
  Lines       49730    49732       +2     
==========================================
- Hits        45435    45429       -6     
- Misses       4295     4303       +8
Flag Coverage Δ
#multiple 89.14% <0%> (ø) ⬆️
#single 39.65% <100%> (-0.04%) ⬇️
Impacted Files Coverage Δ
pandas/io/sql.py 94.79% <100%> (-0.01%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️
pandas/core/resample.py 96.34% <0%> (+0.18%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 509e03c...987f588. Read the comment docs.

@gfyoung gfyoung added IO SQL to_sql, read_sql, read_sql_query Datetime Datetime data dtype labels Oct 16, 2017
@gfyoung
Copy link
Member

gfyoung commented Oct 16, 2017

@drorata : Will need to add tests for the new behavior.

pandas/io/sql.py Outdated
@@ -109,7 +109,11 @@ def _handle_date_column(col, utc=None, format=None):
issubclass(col.dtype.type, np.integer)):
# parse dates as timestamp
format = 's' if format is None else format
return to_datetime(col, errors='coerce', unit=format, utc=utc)
if '%' in format:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment as to why you are doing this logic branching (and reference issue number).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a second thought, I think we can do this a bit cleaner like this:

if format is None and (issubclass(col.dtype.type, np.floating) or
        issubclass(col.dtype.type, np.integer)):
    format = 's'

if format in ['D', 'd', 'h', 'm' 's', 'ms', 'us', 'ns']:
 return to_datetime(col, errors='coerce', unit=format, utc=utc)
elif is_datetime64tz_dtype(col):
    ...
else:
    return to_datetime(col, errors='coerce', format=format, utc=utc)

So first check for the specific case of numeric values and no format -> parse as seconds. Then the format arg is checked for all possible values for unit. Once this check is passed, we don't need to check if '%' is in format anymore, as it can never be a valid unit (this has already been checked)

Copy link
Contributor Author

@drorata drorata Nov 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche But what about the case where the column consists of integers of the format YYYYMMDD or something similar? This is not a valid unit and has to be formatted using % (e.g. %Y%m%d).

If the format string contains % it means that the user knows something about the data and this knowledge has to be used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you specify format="%Y%m%d", the column will be parsed with that format in the snippet above (only the specific recognized units specifiers are passed to unit, otherwise format is used)

@jreback
Copy link
Contributor

jreback commented Oct 28, 2017

can you rebase and update according to comments

@drorata
Copy link
Contributor Author

drorata commented Oct 28, 2017

@jreback It's on my agenda and I'll do it as soon as I have time.

@jorisvandenbossche
Copy link
Member

@drorata there went something wrong with updating the branch (see all the commits included here on github).
To fix this, doing exactly the following should normally work (assuming 'usptream' is pandas-dev/pandas, and 'origin' is drorata/pandas):

git pull upstream master
git push origin fix-17882 -f

@jorisvandenbossche
Copy link
Member

That worked!
Can you further add some tests for this? (and update for my comment above? #17882 (comment))

Cleaned the fix and implemented tests
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

assert issubclass(df.IntDateCol.dtype.type, np.datetime64)

df = sql.read_sql_query("SELECT * FROM types_test_data", self.conn,
parse_dates={'IntDateOnlyCol': '%Y%m%d'})
assert issubclass(df.IntDateOnlyCol.dtype.type, np.datetime64)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I included a test. I also noticed that the existing tests are not enough. It might be that the parsing returns a NaT which satisfies the classing condition but the values are wrong. This is the reason I added an explicit test that checks that the resulting values are correct. As a matter of fact while implementing the improvement suggested by @jorisvandenbossche the tests passed but the returned values were NaTs (the reason was a missing , in the code @jorisvandenbossche suggested)

Would you agree there's a problem with the tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, those are indeed not very thorough. Do you want to add such a similar check to the others as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche Within this same PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be good yes (because as you said, with your changes you could actually silently 'break' them now)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche I have included more explicit tests.

@drorata
Copy link
Contributor Author

drorata commented Nov 21, 2017

I guess before merging I still have to squash all commits, is that correct? What is the canonical way to do so?

@jorisvandenbossche
Copy link
Member

I guess before merging I still have to squash all commits, is that correct? What is the canonical way to do so?

No need to squash in the PR. We will squash when merging (github provides that), so you can keep the history in the PR branch itself as it is.

@drorata
Copy link
Contributor Author

drorata commented Nov 21, 2017

Seems like some tests fail on ci/circleci. Why is it? Locally everything passes (or skipped which I assumed is OK)

Included explicit values comparison
@jorisvandenbossche
Copy link
Member

The error on circle ci seems relevant, as something in the setup of sql tests is failing

@jorisvandenbossche
Copy link
Member

The log says there is a sql syntax error:

E       sqlalchemy.exc.ProgrammingError: (pymysql.err.ProgrammingError) (1064, 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near \'"IntDateOnlyCol" INTEGER,\n                    `FloatCol` DOUBLE,\n               \' at line 5') [SQL: 'CREATE TABLE types_test_data (\n                    `TextCol` TEXT,\n                    `DateCol` DATETIME,\n                    `IntDateCol` INTEGER,\n                    "IntDateOnlyCol" INTEGER,\n                    `FloatCol` DOUBLE,\n                    `IntCol` INTEGER,\n                    `BoolCol` BOOLEAN,\n                    `IntColWithNull` INTEGER,\n                    `BoolColWithNull` BOOLEAN\n                )']

../miniconda3/envs/pandas/lib/python3.6/site-packages/pymysql/err.py:107: ProgrammingError

@@ -98,6 +99,7 @@
`TextCol` TEXT,
`DateCol` DATETIME,
`IntDateCol` INTEGER,
"IntDateOnlyCol" INTEGER,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are using a different quoting here, that's probably the reason for the failure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'll fix it.

@drorata
Copy link
Contributor Author

drorata commented Nov 21, 2017

The travis test failed due to timeout (link). Any idea? @jorisvandenbossche @jreback

@jorisvandenbossche
Copy link
Member

I restarted the build. All builds seemed rather slow. I don't directly see something in the diff here that could cause a slowdown, so maybe it is a travis issue. Let's see when it finishes.

@drorata
Copy link
Contributor Author

drorata commented Nov 21, 2017

Who can help me review the fail of AppVeyor's build?

@jorisvandenbossche
Copy link
Member

It was also something unrelated I think (it couldn't reach github), so restarted the build

@drorata
Copy link
Contributor Author

drorata commented Nov 22, 2017

@gfyoung are you comfortable with this PR?

@jorisvandenbossche
Copy link
Member

It seems that finally everything is green! (not sure why the CIs were so flaky here)

@jorisvandenbossche jorisvandenbossche merged commit bc95629 into pandas-dev:master Nov 22, 2017
@jorisvandenbossche
Copy link
Member

@drorata Thanks for the fix!

@jorisvandenbossche jorisvandenbossche added this to the 0.21.1 milestone Nov 22, 2017
jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Nov 22, 2017
@jorisvandenbossche
Copy link
Member

Ah, forgot to ask you to add a whatsnew, did that in cf90995

jorisvandenbossche added a commit that referenced this pull request Nov 22, 2017
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Dec 8, 2017
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Dec 8, 2017
TomAugspurger pushed a commit that referenced this pull request Dec 11, 2017
(cherry picked from commit cf90995)
TomAugspurger pushed a commit that referenced this pull request Dec 11, 2017
@drorata drorata deleted the fix-17855 branch January 19, 2018 19:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype IO SQL to_sql, read_sql, read_sql_query
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failing to parse date given as integers from a (MS)SQL query
4 participants