BUG/FEATURE: to_sql data types not reflecting types accurately #35347
Comments
I have had similar issues with this, and would like Pandas to at least retain the original SQLAlchemy datatypes as metadata somehow. For instance, DB2 distinguishes between dates and timestamps, but Pandas will store them both as timestamps.
I have had similar issues with this too. My issue is that Pandas converts the int data type to real in SQLite.
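For anyone who wants to check this locally, here is a minimal sketch (the table and column names are mine, not the commenter's actual data) that writes a frame to an in-memory SQLite database and prints the column types that to_sql declared:

```python
# Minimal check of the declared SQLite column types created by to_sql.
import sqlite3
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "value": [1.5, 2.5, None]})

con = sqlite3.connect(":memory:")
df.to_sql("demo", con, index=False)

# PRAGMA table_info reports each column's name and declared type.
for _, name, declared_type, *_ in con.execute("PRAGMA table_info(demo)"):
    print(name, declared_type)
con.close()
```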
As far as I can see, the Pandas docs for df.to_sql() are silent on how pandas/numpy data types are automatically mapped to DB types such as Postgres. I guess I need to read the SQLAlchemy docs/code. Does anyone have a link to where the mapping is specified?
df.to_sql(name=table, con=engine, if_exists=if_exists, index=index, chunksize=chunk, dtype=dict_var) where dict_var is a dict: dict_var = {'column1': sqlalchemy.types.BigInteger(), …
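A self-contained sketch of that dtype-override pattern, assuming a hypothetical connection URL, table name, and columns (none of these come from the thread):

```python
# Hedged sketch of passing explicit SQLAlchemy types to to_sql via the dtype
# argument. Engine URL, table name and columns are illustrative placeholders.
import pandas as pd
import sqlalchemy
from sqlalchemy import types

engine = sqlalchemy.create_engine("postgresql://user:pass@localhost:5432/mydb")

df = pd.DataFrame({"year": [1987, 1992], "salary": [62500.0, 81250.0]})

dtype_overrides = {
    "year": types.SmallInteger(),    # instead of the default BIGINT
    "salary": types.Numeric(12, 2),  # instead of DOUBLE PRECISION
}

df.to_sql(
    "salaries",
    con=engine,
    if_exists="replace",
    index=False,
    chunksize=10_000,
    dtype=dtype_overrides,
)
```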
Use INT instead of BIGINT to store integers representing years, months, and days for information about baseball people pandas-dev/pandas#35347
I'm not sure whether this is a feature request, a bug, or perhaps an "unfulfilled/unrealistic assumption". Further, my experience here might be the tip of the iceberg of a bigger underlying issue with to_sql() (or it could be a red herring that is only a problem for me 😁).
Conditions
DataFrame with multiple dtypes - my example has 200+ columns with dtypes of:
{dtype('O'), dtype('int16'), dtype('uint8'), dtype('uint16'), dtype('uint32'), dtype('float64')}
My test case is using Postgres 12 as the destination DB.
Current Behavior
DataFrame.to_sql(...) does work to Postgres. However, from the various dtypes listed above, only 3 Postgres column types are created. See: Postgres Numeric Data Types for reference.
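To make that concrete, here is a hedged reproduction sketch (the connection URL and table name are placeholders, and the frame is a small stand-in for my 200+ columns) that writes a frame with the dtypes above and then asks Postgres what column types it actually created:

```python
# Write a frame covering the dtypes listed above, then query Postgres's
# information_schema to see which column types to_sql actually produced.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://user:pass@localhost:5432/testdb")

df = pd.DataFrame({
    "label":  pd.Series(["a", "b"], dtype="object"),
    "small":  pd.Series([1, 2], dtype="int16"),
    "tiny_u": pd.Series([1, 2], dtype="uint8"),
    "mid_u":  pd.Series([1, 2], dtype="uint16"),
    "wide_u": pd.Series([1, 2], dtype="uint32"),
    "ratio":  pd.Series([0.1, 0.2], dtype="float64"),
})

df.to_sql("dtype_check", engine, if_exists="replace", index=False)

created = pd.read_sql(
    "SELECT column_name, data_type FROM information_schema.columns "
    "WHERE table_name = 'dtype_check'",
    engine,
)
print(created)
```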
The Issue/Effects
The ultimate issue is that the created table, although it appears to work, does not really conform to the data types in the DataFrame, specifically with regard to integer types. Postgres (unlike MySQL) doesn't have unsigned integer types but does have (generally) 3 integer types: SMALLINT (2 bytes), INTEGER (4 bytes), and BIGINT (8 bytes).
I'm still reading through the to_sql() code but if the docstring (quoted below) from SQLTable() is accurate, it would appear that Pandas is relying on SQLAlchemy to handle the conversion? If so, the assumption that SQLAlchemy is doing the conversion correctly/well looks to be unfounded.
pandas/pandas/io/sql.py, lines 663 to 669 at bfac136
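One way to see the mapping pandas ends up with, without writing any rows, is the semi-public get_schema helper in pandas.io.sql; a small sketch, assuming a Postgres engine (the URL is a placeholder):

```python
# Print the CREATE TABLE statement pandas would emit for a given frame,
# which exposes the numpy-dtype -> SQL-type mapping being discussed here.
import pandas as pd
import sqlalchemy
from pandas.io.sql import get_schema

engine = sqlalchemy.create_engine("postgresql://user:pass@localhost:5432/testdb")

df = pd.DataFrame({
    "small":  pd.Series([1, 2], dtype="int16"),
    "tiny_u": pd.Series([1, 2], dtype="uint8"),
    "ratio":  pd.Series([0.1, 0.2], dtype="float64"),
})

print(get_schema(df, "dtype_check", con=engine))
```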
Some of the Impacts
Bigger tables. Presumably longer insert times. Potential conversion issues (I haven't tried feeding to_sql() a uint64 ...).
Thoughts
If my understanding is correct that Pandas is relying on SQLAlchemy to do the type conversions, I guess this could be seen as either a Pandas issue or an SQLAlchemy issue:
A. As an SQLAlchemy Issue
B. As a Pandas Issue
Potential Further Impacts
I'm really not well-versed enough in the Pandas/Numpy/SQLAlchemy code base to understand the potential impacts. However, I'm especially curious about the handling of float values and also about the interaction with other database engines. (For example, something that seems absolutely crazy to me about SQLite3: column data types aren't strict; they are more of a suggestion ...)
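For what it's worth, that SQLite behaviour is easy to demonstrate with a tiny sketch:

```python
# SQLite declared types are affinities, not constraints: a text value is
# accepted into an INTEGER column and simply stored as text.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (n INTEGER)")
con.execute("INSERT INTO t VALUES ('not a number')")  # no error raised
print(con.execute("SELECT n, typeof(n) FROM t").fetchall())
# -> [('not a number', 'text')]
con.close()
```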
Summary/The Bottom Line/TL;DR
I'd like to see to_sql() create a table with the best (e.g. smallest/most appropriate) data type - preferably a close match to the DataFrame.dtype. Currently, at least in the case of Postgres, it does not.
Finally, perhaps someone with more knowledge than I could double-check whether the assumption that SQLAlchemy reliably converts all data types is actually correct?
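In the meantime, a rough user-side workaround is to build the dtype mapping yourself before calling to_sql(). The helper name and the size thresholds below are my own assumptions, not anything pandas provides:

```python
# Workaround sketch: derive a tighter SQLAlchemy integer type per column from
# the numpy dtype, then pass the mapping via to_sql's dtype argument.
import numpy as np
import pandas as pd
from sqlalchemy import types


def sized_int_overrides(df: pd.DataFrame) -> dict:
    """Map integer columns to the smallest SQLAlchemy integer type that fits."""
    overrides = {}
    for name, dtype in df.dtypes.items():
        if np.issubdtype(dtype, np.integer):
            info = np.iinfo(dtype)
            if info.min >= -2**15 and info.max < 2**15:
                overrides[name] = types.SmallInteger()
            elif info.min >= -2**31 and info.max < 2**31:
                overrides[name] = types.Integer()
            else:
                overrides[name] = types.BigInteger()
    return overrides


df = pd.DataFrame({
    "tiny":   pd.Series([1, 2], dtype="uint8"),    # fits SMALLINT
    "medium": pd.Series([1, 2], dtype="int32"),    # fits INTEGER
    "big":    pd.Series([1, 2], dtype="uint32"),   # range exceeds int32 -> BIGINT
})
print(sized_int_overrides(df))
# then: df.to_sql("some_table", engine, dtype=sized_int_overrides(df), ...)
```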