Fixed incorrect datatype conversion on multi-indexes #8022

artemyk · 2014-08-14T08:32:13Z

When creating a new table from a pandas dataframe with a multi-index, the code would iterate over the index columns backwards but use their types as if counting forwards. This meant that the created table had incorrect indexes for multi-indexes.

This code factors out a function to pull column names and column types for both SQLAlchemy and legacy table objects. This code now iterates over the indices properly.

And additional test has been written. All test pass on my local machine.

Closes #8021 .

jorisvandenbossche · 2014-08-14T09:04:47Z

pandas/io/tests/test_sql.py

@@ -1086,6 +1086,18 @@ def test_bigint_warning(self):
            sql.read_sql_table('test_bigintwarning', self.conn)
            self.assertEqual(len(w), 0, "Warning triggered for other table")

+    def test_PandasSQLTable_multiindex_dtypes(self):


I think you can put this in _TestSQLApi instead of TestSQLiteAlchemy, as this is not really specific (or will not differ) between flavors. And in this way it is also tested for the fallback mode (legacy). There is eg a test_to_sql_index_label_multiindex, you can put it after that one.

Ah, but then you will need to use something else as the _numpy_type for checking, this is of course not available in fallback ..
Possibility is just to read and write it back, and check the types of what is returned.

You are right -- this is a weird place to put the test. However, I decided to go this way because of two reasons:

I first tried the writing then reading back route. To my surprise, sqllite does not enforce type constraints (http://www.sqlite.org/faq.html#q3). So, for example, inserting a float into an int-defined field does not convert it into an int, and when read back, it comes back as float. So we need to check the actually schema definition made by pandas.

This bug also showed up in the legacy (non-SQLAlchemy) table creation code, i.e. PandasSQLTableLegacy.insert_statement and ideally that would be tested also. If we could rely on writing/reading back it could be, but since not we have to parse the SQL string returned by that method -- and this seems too brittle. So I did not test it.
Definitely would be open to suggestion of where to move the test while satisfying these constraints.
Also thanks for the point regarding assertTrue -- fixed now.

Yes, good reasons (indeed, with current pandas (with the bug) you get two float columns as a result of these dynamic types).
But in any case, you can put it in TestSQLApi (also with sqlite and sqlalchemy, so don't need to change something in the test).
And I would also try to test it for the fallback mode in _TestSQLApi. I think splitting up the create statement, and checking "INTEGER" in lines[1]/"REAL" in lines[2] or something like that is indeed a bit brittle, but I think not too bad (should we change the create statement, we just have to adapt the test). Or do it with writing/reading anyway (it is not a full guarantee, but such a test would fail at the moment, so it will test at least something ..)

jorisvandenbossche · 2014-08-14T09:12:39Z

@artemyk Looks good, thanks a lot! Only some comments on the test.

artemyk · 2014-08-15T09:24:42Z

You are totally right --- I didn't realize that while the write/read test would fail currently, I thought it would return both columns as having float dtype both pre- and post-patch. But in fact, post-patch the dtypes are correct, while pre-patch they are both float. So the test is simpler and now applies both to SQLAlchemy and legacy APIs (it is under _TestSQLApi). Let me know if there is anything else to fix.

jorisvandenbossche · 2014-08-15T11:23:40Z

pandas/io/tests/test_sql.py

+                                 self.conn)
+        self.assertTrue(issubclass(df2.A.dtype.type, np.integer), 
+                         "Multi-index types not getting parsed correctly")
+        self.assertTrue(issubclass(df2.B.dtype.type, np.float),


last small comment, I think np.floating is a bit more general than np.float (but I am not certain if this is necessary, @jreback ?)

why are you not checking
assert_frame_equal(df,df2)

is this not round trippable?

hmm, yes, indeed much simpler! (@artemyk, this also checks the dtypes of the columns)

@jorisvandenbossche you should have a general round trip checker in any event

yes, there is, but is specifically with a multi-index with different types in the levels

Yes, that seems like a good idea. However, to my surprise, assert_frame_equal does not catch differences in the datatypes of a multi-index. This is because assert_frame_equal does not check the dtypes of the individual levels but only the overall index. If both dataframes use multi-indexes, then the dtype of the index is simply object for both.
This seems like it should be fixed by comparing the dtype of each level. Should I submit a patch for this as a separate pull-request?

As an example of what I mean, the following does not raise an exception:

import pandas as pd import pandas.util.testing as tm df = pd.DataFrame.from_records({'a':[1,2],'b':[2.1,1.5],'c':['txt1','txt2']}, index=['a','b']) df2 = pd.DataFrame.from_records({'a':[1.0,2.0],'b':[2.1,1.5],'c':['txt1','txt2']}, index=['a','b']) tm.assert_frame_equal(df, df2, check_index_type=True)

yes, that seems to be a bug in the testing. Why don't you submit a separate PR for that (with tests for the test! in tests/test_testing.py IIRC). we can merge before this PR.

Created a PR #8042 .

artemyk · 2014-08-15T20:10:47Z

I think this may be ready to go.

jreback · 2014-08-15T20:12:52Z

@artemyk needs:

rebase on master (so you will pick up the changed test functions)
a release note in v0.15.0 (bug fix?)
a squash

artemyk · 2014-08-15T20:27:32Z

@jreback Sorry, this is actually my first pull request ever so I'm not exactly sure what I need to do now.
I inserted a line in release notes. Regarding rebasing on master, I actually cherry-picked the commit with the changed test functions, so that change is included. Do I still need to rebase & squash?

jorisvandenbossche · 2014-08-15T20:30:54Z

The change in the test function is already merged in master, so if you fetch pandas master (git fetch upstream) it will already be included, and you certainly don't need to cherry-pick it. So try:

git fetch upstream
git rebase upstream/master

And then, with squashing we mean to combine the different update commits all into one for this PR, you can do this with git rebase -i HEAD~6, and then squash the necessary commits.

Just ask if something is not clear!

jreback · 2014-08-15T20:31:13Z

see here: https://github.com/pydata/pandas/wiki/Using-Git

@jorisvandenbossche instructions are good

artemyk · 2014-08-15T20:52:42Z

@jreback Thanks greatly for the instructions. I followed them however, when I did git push I got a "Updates were rejected because the tip of your current branch is behind its remote counterpart" error. So I did a git push -f. Does this commit look correct?
I notice that the changes include those I made to assert_frame_equal -- that probably doesn't belong, right?

jreback · 2014-08-15T20:59:02Z

yes, you need to use -f.

pls squash as well

so

git rebase -i origin/master

then 's' all of the commits (ecept the first)

then repush with the '-f'

Fixed test to use assertTrue Simplifying tests to write and read back Simplified multiindex test to use a roundtrip Release notes update

artemyk · 2014-08-15T21:18:20Z

@jreback Thanks for the help.
This should be squashed now.

jreback · 2014-08-15T21:27:16Z

looks good to me, @jorisvandenbossche ball in your court

Fixed incorrect datatype conversion on multi-indexes

jorisvandenbossche · 2014-08-15T22:41:08Z

Tests passed, so merging.
Thanks a lot @artemyk, excellent first (or second now) PR!
(and more is always welcome if you want :-))

artemyk mentioned this pull request Aug 14, 2014

Bug in generate CREATE TABLE SQL for DataFrame with MultiIndex with different types #8021

Closed

jorisvandenbossche added the SQL label Aug 14, 2014

jorisvandenbossche reviewed Aug 14, 2014
View reviewed changes

jorisvandenbossche reviewed Aug 15, 2014
View reviewed changes

jreback added this to the 0.15.0 milestone Aug 15, 2014

Fixed incorrect datatype conversion on multi-indexes

1d83aa2

Fixed test to use assertTrue Simplifying tests to write and read back Simplified multiindex test to use a roundtrip Release notes update

jorisvandenbossche added a commit that referenced this pull request Aug 15, 2014

Merge pull request #8022 from artemyk/sql-create-table-index-dtype-fix

045577b

Fixed incorrect datatype conversion on multi-indexes

jorisvandenbossche merged commit 045577b into pandas-dev:master Aug 15, 2014

artemyk mentioned this pull request Aug 20, 2014

BUG: When creating table, db indexes should be created from DataFrame indexes #8083

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed incorrect datatype conversion on multi-indexes #8022

Fixed incorrect datatype conversion on multi-indexes #8022

artemyk commented Aug 14, 2014

jorisvandenbossche Aug 14, 2014

jorisvandenbossche Aug 14, 2014

artemyk Aug 14, 2014

jorisvandenbossche Aug 14, 2014

jorisvandenbossche commented Aug 14, 2014

artemyk commented Aug 15, 2014

jorisvandenbossche Aug 15, 2014

jreback Aug 15, 2014

jorisvandenbossche Aug 15, 2014

jreback Aug 15, 2014

jorisvandenbossche Aug 15, 2014

artemyk Aug 15, 2014

jreback Aug 15, 2014

artemyk Aug 15, 2014

artemyk commented Aug 15, 2014

jreback commented Aug 15, 2014

artemyk commented Aug 15, 2014

jorisvandenbossche commented Aug 15, 2014

jreback commented Aug 15, 2014

artemyk commented Aug 15, 2014

jreback commented Aug 15, 2014

artemyk commented Aug 15, 2014

jreback commented Aug 15, 2014

jorisvandenbossche commented Aug 15, 2014

Fixed incorrect datatype conversion on multi-indexes #8022

Fixed incorrect datatype conversion on multi-indexes #8022

Conversation

artemyk commented Aug 14, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche commented Aug 14, 2014

artemyk commented Aug 15, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

artemyk commented Aug 15, 2014

jreback commented Aug 15, 2014

artemyk commented Aug 15, 2014

jorisvandenbossche commented Aug 15, 2014

jreback commented Aug 15, 2014

artemyk commented Aug 15, 2014

jreback commented Aug 15, 2014

artemyk commented Aug 15, 2014

jreback commented Aug 15, 2014

jorisvandenbossche commented Aug 15, 2014