-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Fixed incorrect datatype conversion on multi-indexes #8022
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed incorrect datatype conversion on multi-indexes #8022
Conversation
@@ -1086,6 +1086,18 @@ def test_bigint_warning(self): | |||
sql.read_sql_table('test_bigintwarning', self.conn) | |||
self.assertEqual(len(w), 0, "Warning triggered for other table") | |||
|
|||
def test_PandasSQLTable_multiindex_dtypes(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can put this in _TestSQLApi
instead of TestSQLiteAlchemy
, as this is not really specific (or will not differ) between flavors. And in this way it is also tested for the fallback mode (legacy). There is eg a test_to_sql_index_label_multiindex
, you can put it after that one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, but then you will need to use something else as the _numpy_type
for checking, this is of course not available in fallback ..
Possibility is just to read and write it back, and check the types of what is returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right -- this is a weird place to put the test. However, I decided to go this way because of two reasons:
- I first tried the writing then reading back route. To my surprise, sqllite does not enforce type constraints (http://www.sqlite.org/faq.html#q3). So, for example, inserting a float into an int-defined field does not convert it into an int, and when read back, it comes back as float. So we need to check the actually schema definition made by pandas.
- This bug also showed up in the legacy (non-SQLAlchemy) table creation code, i.e.
PandasSQLTableLegacy.insert_statement
and ideally that would be tested also. If we could rely on writing/reading back it could be, but since not we have to parse the SQL string returned by that method -- and this seems too brittle. So I did not test it.
Definitely would be open to suggestion of where to move the test while satisfying these constraints.
Also thanks for the point regardingassertTrue
-- fixed now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good reasons (indeed, with current pandas (with the bug) you get two float columns as a result of these dynamic types).
But in any case, you can put it in TestSQLApi
(also with sqlite and sqlalchemy, so don't need to change something in the test).
And I would also try to test it for the fallback mode in _TestSQLApi
. I think splitting up the create statement, and checking "INTEGER" in lines[1]
/"REAL" in lines[2]
or something like that is indeed a bit brittle, but I think not too bad (should we change the create statement, we just have to adapt the test). Or do it with writing/reading anyway (it is not a full guarantee, but such a test would fail at the moment, so it will test at least something ..)
@artemyk Looks good, thanks a lot! Only some comments on the test. |
You are totally right --- I didn't realize that while the write/read test would fail currently, I thought it would return both columns as having |
self.conn) | ||
self.assertTrue(issubclass(df2.A.dtype.type, np.integer), | ||
"Multi-index types not getting parsed correctly") | ||
self.assertTrue(issubclass(df2.B.dtype.type, np.float), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
last small comment, I think np.floating
is a bit more general than np.float
(but I am not certain if this is necessary, @jreback ?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you not checking
assert_frame_equal(df,df2)
is this not round trippable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, yes, indeed much simpler! (@artemyk, this also checks the dtypes of the columns)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche you should have a general round trip checker in any event
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, there is, but is specifically with a multi-index with different types in the levels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that seems like a good idea. However, to my surprise, assert_frame_equal
does not catch differences in the datatypes of a multi-index. This is because assert_frame_equal
does not check the dtypes of the individual levels but only the overall index. If both dataframes use multi-indexes, then the dtype of the index is simply object
for both.
This seems like it should be fixed by comparing the dtype of each level. Should I submit a patch for this as a separate pull-request?
As an example of what I mean, the following does not raise an exception:
import pandas as pd
import pandas.util.testing as tm
df = pd.DataFrame.from_records({'a':[1,2],'b':[2.1,1.5],'c':['txt1','txt2']}, index=['a','b'])
df2 = pd.DataFrame.from_records({'a':[1.0,2.0],'b':[2.1,1.5],'c':['txt1','txt2']}, index=['a','b'])
tm.assert_frame_equal(df, df2, check_index_type=True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that seems to be a bug in the testing. Why don't you submit a separate PR for that (with tests for the test! in tests/test_testing.py IIRC). we can merge before this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created a PR #8042 .
I think this may be ready to go. |
@artemyk needs:
|
@jreback Sorry, this is actually my first pull request ever so I'm not exactly sure what I need to do now. |
The change in the test function is already merged in master, so if you fetch pandas master (
And then, with squashing we mean to combine the different update commits all into one for this PR, you can do this with Just ask if something is not clear! |
see here: https://github.com/pydata/pandas/wiki/Using-Git @jorisvandenbossche instructions are good |
@jreback Thanks greatly for the instructions. I followed them however, when I did |
yes, you need to use pls squash as well so
then 's' all of the commits (ecept the first) then repush with the '-f' |
Fixed test to use assertTrue Simplifying tests to write and read back Simplified multiindex test to use a roundtrip Release notes update
@jreback Thanks for the help. |
looks good to me, @jorisvandenbossche ball in your court |
Fixed incorrect datatype conversion on multi-indexes
Tests passed, so merging. |
When creating a new table from a
pandas
dataframe with a multi-index, the code would iterate over the index columns backwards but use their types as if counting forwards. This meant that the created table had incorrect indexes for multi-indexes.This code factors out a function to pull column names and column types for both SQLAlchemy and legacy table objects. This code now iterates over the indices properly.
And additional test has been written. All test pass on my local machine.
Closes #8021 .