Skip to content

Fixed incorrect datatype conversion on multi-indexes #8022

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

artemyk
Copy link
Contributor

@artemyk artemyk commented Aug 14, 2014

When creating a new table from a pandas dataframe with a multi-index, the code would iterate over the index columns backwards but use their types as if counting forwards. This meant that the created table had incorrect indexes for multi-indexes.

This code factors out a function to pull column names and column types for both SQLAlchemy and legacy table objects. This code now iterates over the indices properly.

And additional test has been written. All test pass on my local machine.

Closes #8021 .

@@ -1086,6 +1086,18 @@ def test_bigint_warning(self):
sql.read_sql_table('test_bigintwarning', self.conn)
self.assertEqual(len(w), 0, "Warning triggered for other table")

def test_PandasSQLTable_multiindex_dtypes(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can put this in _TestSQLApi instead of TestSQLiteAlchemy, as this is not really specific (or will not differ) between flavors. And in this way it is also tested for the fallback mode (legacy). There is eg a test_to_sql_index_label_multiindex, you can put it after that one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, but then you will need to use something else as the _numpy_type for checking, this is of course not available in fallback ..
Possibility is just to read and write it back, and check the types of what is returned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right -- this is a weird place to put the test. However, I decided to go this way because of two reasons:

  1. I first tried the writing then reading back route. To my surprise, sqllite does not enforce type constraints (http://www.sqlite.org/faq.html#q3). So, for example, inserting a float into an int-defined field does not convert it into an int, and when read back, it comes back as float. So we need to check the actually schema definition made by pandas.
  2. This bug also showed up in the legacy (non-SQLAlchemy) table creation code, i.e. PandasSQLTableLegacy.insert_statement and ideally that would be tested also. If we could rely on writing/reading back it could be, but since not we have to parse the SQL string returned by that method -- and this seems too brittle. So I did not test it.
    Definitely would be open to suggestion of where to move the test while satisfying these constraints.
    Also thanks for the point regarding assertTrue -- fixed now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good reasons (indeed, with current pandas (with the bug) you get two float columns as a result of these dynamic types).
But in any case, you can put it in TestSQLApi (also with sqlite and sqlalchemy, so don't need to change something in the test).
And I would also try to test it for the fallback mode in _TestSQLApi. I think splitting up the create statement, and checking "INTEGER" in lines[1]/"REAL" in lines[2] or something like that is indeed a bit brittle, but I think not too bad (should we change the create statement, we just have to adapt the test). Or do it with writing/reading anyway (it is not a full guarantee, but such a test would fail at the moment, so it will test at least something ..)

@jorisvandenbossche
Copy link
Member

@artemyk Looks good, thanks a lot! Only some comments on the test.

@artemyk
Copy link
Contributor Author

artemyk commented Aug 15, 2014

You are totally right --- I didn't realize that while the write/read test would fail currently, I thought it would return both columns as having float dtype both pre- and post-patch. But in fact, post-patch the dtypes are correct, while pre-patch they are both float. So the test is simpler and now applies both to SQLAlchemy and legacy APIs (it is under _TestSQLApi). Let me know if there is anything else to fix.

self.conn)
self.assertTrue(issubclass(df2.A.dtype.type, np.integer),
"Multi-index types not getting parsed correctly")
self.assertTrue(issubclass(df2.B.dtype.type, np.float),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last small comment, I think np.floating is a bit more general than np.float (but I am not certain if this is necessary, @jreback ?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you not checking
assert_frame_equal(df,df2)

is this not round trippable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, yes, indeed much simpler! (@artemyk, this also checks the dtypes of the columns)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche you should have a general round trip checker in any event

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, there is, but is specifically with a multi-index with different types in the levels

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that seems like a good idea. However, to my surprise, assert_frame_equal does not catch differences in the datatypes of a multi-index. This is because assert_frame_equal does not check the dtypes of the individual levels but only the overall index. If both dataframes use multi-indexes, then the dtype of the index is simply object for both.
This seems like it should be fixed by comparing the dtype of each level. Should I submit a patch for this as a separate pull-request?

As an example of what I mean, the following does not raise an exception:

import pandas as pd
import pandas.util.testing as tm

df = pd.DataFrame.from_records({'a':[1,2],'b':[2.1,1.5],'c':['txt1','txt2']}, index=['a','b'])
df2 = pd.DataFrame.from_records({'a':[1.0,2.0],'b':[2.1,1.5],'c':['txt1','txt2']}, index=['a','b'])

tm.assert_frame_equal(df, df2, check_index_type=True)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that seems to be a bug in the testing. Why don't you submit a separate PR for that (with tests for the test! in tests/test_testing.py IIRC). we can merge before this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created a PR #8042 .

@artemyk
Copy link
Contributor Author

artemyk commented Aug 15, 2014

I think this may be ready to go.

@jreback
Copy link
Contributor

jreback commented Aug 15, 2014

@artemyk needs:

  • rebase on master (so you will pick up the changed test functions)
  • a release note in v0.15.0 (bug fix?)
  • a squash

@artemyk
Copy link
Contributor Author

artemyk commented Aug 15, 2014

@jreback Sorry, this is actually my first pull request ever so I'm not exactly sure what I need to do now.
I inserted a line in release notes. Regarding rebasing on master, I actually cherry-picked the commit with the changed test functions, so that change is included. Do I still need to rebase & squash?

@jorisvandenbossche
Copy link
Member

The change in the test function is already merged in master, so if you fetch pandas master (git fetch upstream) it will already be included, and you certainly don't need to cherry-pick it. So try:

git fetch upstream
git rebase upstream/master

And then, with squashing we mean to combine the different update commits all into one for this PR, you can do this with git rebase -i HEAD~6, and then squash the necessary commits.

Just ask if something is not clear!

@jreback
Copy link
Contributor

jreback commented Aug 15, 2014

@jreback jreback added this to the 0.15.0 milestone Aug 15, 2014
@artemyk
Copy link
Contributor Author

artemyk commented Aug 15, 2014

@jreback Thanks greatly for the instructions. I followed them however, when I did git push I got a "Updates were rejected because the tip of your current branch is behind its remote counterpart" error. So I did a git push -f. Does this commit look correct?
I notice that the changes include those I made to assert_frame_equal -- that probably doesn't belong, right?

@jreback
Copy link
Contributor

jreback commented Aug 15, 2014

yes, you need to use -f.

pls squash as well

so

git rebase -i origin/master

then 's' all of the commits (ecept the first)

then repush with the '-f'

Fixed test to use assertTrue

Simplifying tests to write and read back

Simplified multiindex test to use a roundtrip

Release notes update
@artemyk
Copy link
Contributor Author

artemyk commented Aug 15, 2014

@jreback Thanks for the help.
This should be squashed now.

@jreback
Copy link
Contributor

jreback commented Aug 15, 2014

looks good to me, @jorisvandenbossche ball in your court

jorisvandenbossche added a commit that referenced this pull request Aug 15, 2014
Fixed incorrect datatype conversion on multi-indexes
@jorisvandenbossche jorisvandenbossche merged commit 045577b into pandas-dev:master Aug 15, 2014
@jorisvandenbossche
Copy link
Member

Tests passed, so merging.
Thanks a lot @artemyk, excellent first (or second now) PR!
(and more is always welcome if you want :-))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SQL to_sql, read_sql, read_sql_query
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug in generate CREATE TABLE SQL for DataFrame with MultiIndex with different types
3 participants