ENH: Store in SQL using double precision #9041


Merged 1 commit into pandas-dev:master on Jan 25, 2015

Conversation

artemyk
Contributor

@artemyk artemyk commented Dec 8, 2014

Closes #9009. @jorisvandenbossche -- do you think this should test/use single precision? We could test for float32 perhaps? My own feeling is that double precision is the safer default.

@jorisvandenbossche jorisvandenbossche added the IO SQL to_sql, read_sql, read_sql_query label Dec 12, 2014
@jorisvandenbossche jorisvandenbossche added this to the 0.16.0 milestone Dec 12, 2014
@jorisvandenbossche
Member

We should indeed be safe by default, so I'd opt for what you did now.
But maybe we can add a refinement right away? For example, distinguish the float32/float64 dtypes and give each an appropriate sqlalchemy type. The same could be done for int32/int64.

@artemyk
Contributor Author

artemyk commented Dec 12, 2014

Unless I'm missing something, it's actually surprisingly tricky to differentiate between float32 and float64 in a robust manner. We could check the dtype, but that wouldn't catch object-typed series with NAs. inference.pyx, which we use, doesn't return this information (it just returns 'floating'), nor does util.pxd (which it calls), nor does numpy_helper.h (which that calls). It's surely doable to add those functions, but that almost seems like a separate PR.

@jorisvandenbossche
Member

I wouldn't worry about the object dtype. We are already going the extra mile by inferring floating-point data in object columns (which I regard as a nice extra, but not an essential feature, as users should ensure their data have the correct dtype, and there are functions to do this, like DataFrame.convert_objects).

So I would just check the dtype when there are 'floating' data, and use the general case for the object dtype.

@artemyk
Contributor Author

artemyk commented Dec 22, 2014

@jorisvandenbossche OK, added checks for float32 and int32
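The checks described could look roughly like the following. This is a hypothetical helper in the spirit of the discussion, not the PR's actual code; the function name and the precision values (23 bits for single, 53 for double) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sqlalchemy.types import BigInteger, Float, Integer

def sqlalchemy_type(col):
    """Hypothetical dtype -> SQLAlchemy type mapping, as discussed above."""
    if col.dtype == np.float32:
        return Float(precision=23)    # single precision
    if np.issubdtype(col.dtype, np.floating):
        return Float(precision=53)    # double precision, the safe default
    if col.dtype == np.int32:
        return Integer()
    if np.issubdtype(col.dtype, np.integer):
        return BigInteger()
    return Float(precision=53)        # general case, e.g. inferred object floats

print(sqlalchemy_type(pd.Series([1.0], dtype=np.float32)))
print(sqlalchemy_type(pd.Series([1])))
```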

@artemyk artemyk force-pushed the doubleprecisionsql branch 2 times, most recently from 0503295 to a8ba2f2 Compare December 22, 2014 20:46
@artemyk
Contributor Author

artemyk commented Dec 22, 2014

Also made a couple of small changes that I thought were logical while playing with this --- passing the dtype argument through in to_sql and get_schema, and changing the default floating-point MySQL type to DOUBLE (not single precision).
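As a rough usage sketch of that dtype pass-through (modern pandas signature assumed; with a plain sqlite3 connection, type names are given as strings):

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"x": [0.1, 0.2, 0.3]})
con = sqlite3.connect(":memory:")

# dtype overrides the inferred column type in the created table
df.to_sql("tbl", con, index=False, dtype={"x": "REAL"})

schema = con.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'tbl'"
).fetchone()[0]
print(schema)
```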

is_type_class = isinstance(my_type, type) and \
    issubclass(my_type, types.TypeEngine)
is_type_obj = isinstance(my_type, types.TypeEngine)
if not is_type_class and not is_type_obj:
Member


Can you maybe leave this out?
In any case, this is to fix #9083, so it should also have a test associated with it. I also already have a patch for this (a bit cleaner, I think; it's what I mentioned in #9083), but I haven't pushed it as a PR yet. Sorry for not doing that directly -- I will do that now.

@artemyk
Contributor Author

artemyk commented Dec 23, 2014

I made your suggested changes.

I copied and pasted your fix from #9138 for now --- I can remove it and rebase once you have merged.

Also, I can get the precision to go up to 14 digits. However, to get this I needed to insert some special code into get_table. Specifically, by default SQLAlchemy reads double-precision numbers into a 'Decimal' type with 10 digits of precision. See http://stackoverflow.com/questions/19387882/how-to-make-sqlalchemy-return-float-instead-of-decimal-when-reflecting-tables .
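The behavior in question can be demonstrated in a self-contained way (a sketch assuming SQLAlchemy 1.4+ and SQLite; the thread itself concerns reflected DOUBLE PRECISION columns, but the `asdecimal` mechanics are the same): a `Numeric` column with `asdecimal=True` comes back as `decimal.Decimal`, while `Float` (which defaults to `asdecimal=False`) comes back as a plain `float`.

```python
from decimal import Decimal
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")
meta = sa.MetaData()
tbl = sa.Table(
    "t", meta,
    sa.Column("as_dec", sa.Numeric(asdecimal=True)),   # the Numeric default
    sa.Column("as_float", sa.Float()),                 # asdecimal=False
)
meta.create_all(engine)

with engine.begin() as conn:
    conn.execute(tbl.insert(), [{"as_dec": 1.5, "as_float": 1.5}])
    row = conn.execute(sa.select(tbl)).one()

print(type(row.as_dec), type(row.as_float))
```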

@artemyk artemyk force-pushed the doubleprecisionsql branch 3 times, most recently from 1727541 to 98cb55d Compare December 24, 2014 01:22
dtype = sqlalchemy.Float if self.mode == 'sqlalchemy' else 'REAL'
create_sql = sql.get_schema(self.test_frame1, 'test', 'sqlite',
                            con=self.conn, dtype={'D': dtype})
self.assertTrue('CREATE' in create_sql)
Member


maybe just add a check here that the FLOAT/REAL is in the schema

type = 'FLOAT' if self.mode == 'sqlalchemy' else 'REAL'
self.assertTrue(type in create_sql)

but maybe make a custom df with only ints, so we are sure the FLOAT is there because of the dtype argument and not because it was already there (as it stands, the schema will be the same with or without the dtype argument)
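A sketch of that suggestion, assuming pandas' public `pd.io.sql.get_schema` with the `dtype` argument this PR adds: with an all-integer frame, any REAL in the schema can only come from the override.

```python
import pandas as pd

int_frame = pd.DataFrame({"D": [1, 2, 3]})   # integers only

without_override = pd.io.sql.get_schema(int_frame, "test")
with_override = pd.io.sql.get_schema(int_frame, "test", dtype={"D": "REAL"})

print("REAL" in without_override, "REAL" in with_override)
```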

@jorisvandenbossche
Member

@artemyk regarding the precision issue (because of the decimal use). I think that looks OK to do (as we should ensure the full precision is available), but I was wondering if this has performance implications?

@jorisvandenbossche
Member

@artemyk I merged the other PR, so you can rebase this one

@artemyk
Contributor Author

artemyk commented Dec 24, 2014

@jorisvandenbossche OK, rebased / upgraded test.
Re performance implications --- I don't think so; it just makes sqlalchemy skip a conversion to Decimal. It's more in line with what happens if we do sqlalchemyengine.execute('SELECT * FROM tbl') directly (in that case, FLOATs don't get converted to Decimal objects).

@jorisvandenbossche
Member

OK, do you know the reason sqlalchemy does this by default? (just wondering)


# Avoid casting double-precision floats into decimals
from sqlalchemy import Numeric
for column in tbl.columns.values():
    if isinstance(column.type, Numeric):
        column.type.asdecimal = False
Member


small 'style' comment: the .values() is not needed here; you can just iterate over tbl.columns

@artemyk
Contributor Author

artemyk commented Dec 24, 2014

Good point about .values().
Re asdecimal, here's what I found at http://docs.sqlalchemy.org/en/rel_0_9/core/types.html :

When using the Numeric type, care should be taken to ensure that the asdecimal setting is appropriate for the DBAPI in use - when Numeric applies a conversion from Decimal->float or float->Decimal, this conversion incurs an additional performance overhead for all result columns received.
DBAPIs that return Decimal natively (e.g. psycopg2) will have better accuracy and higher performance with a setting of True, as the native translation to Decimal reduces the amount of floating-point issues at play, and the Numeric type itself doesn't need to apply any further conversions. However, another DBAPI which returns floats natively will incur an additional conversion overhead, and is still subject to floating point data loss - in which case asdecimal=False will at least remove the extra conversion overhead.

@jorisvandenbossche
Member

@artemyk I just tested it with postgresql/psycopg2, and it seems that there asdecimal is already False for me when using DOUBLE PRECISION(53) (which contrasts a bit with your quote). So of course there is no performance difference in that case.
But in any case, I suppose the check can do no harm?

@artemyk
Contributor Author

artemyk commented Dec 25, 2014

@jorisvandenbossche I'm now wondering --- do you think it's possible there is a DECIMAL-type datatype in, say, postgres that can store arbitrary-precision decimals? And that the code we have here would force conversion to Float, thereby losing precision?
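The concern can be illustrated with the stdlib alone: forcing an arbitrary-precision decimal through a double discards everything beyond roughly 17 significant digits.

```python
from decimal import Decimal

# A NUMERIC-style value with more significant digits than a double can hold
d = Decimal("0.12345678901234567890123")

f = float(d)                    # forced through double precision
round_trip = Decimal(repr(f))   # what survives the conversion

print(d)
print(round_trip)
print(round_trip == d)          # False: the trailing digits are gone
```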

@artemyk
Contributor Author

artemyk commented Jan 20, 2015

@jorisvandenbossche Can we merge?

@jorisvandenbossche
Member

@artemyk Yes! (sorry for dropping the ball here)

Can you just rebase?

@artemyk artemyk force-pushed the doubleprecisionsql branch 2 times, most recently from 65f0253 to e5b7cb7 Compare January 24, 2015 18:50
Better test

Style fix

Whats new fix
@artemyk artemyk force-pushed the doubleprecisionsql branch from e5b7cb7 to bb0c2e8 Compare January 24, 2015 18:51
@artemyk
Contributor Author

artemyk commented Jan 24, 2015

@jorisvandenbossche Rebased...

jorisvandenbossche added a commit that referenced this pull request Jan 25, 2015
ENH: Store in SQL using double precision
@jorisvandenbossche jorisvandenbossche merged commit 3fbc1ee into pandas-dev:master Jan 25, 2015
@jorisvandenbossche
Member

@artemyk Thanks for the work on this, and sorry for letting this take so long!

Successfully merging this pull request may close these issues.

No way to specify higher precision (e.g. Double) when saving DataFrame with floating number to MySQL