BUG/FEATURE: to_sql data types not reflecting types accurately #35347


Open
danieldjewell opened this issue Jul 19, 2020 · 4 comments
Labels
Enhancement IO SQL to_sql, read_sql, read_sql_query

Comments

@danieldjewell

I'm not sure whether this is a bug, a feature request, or perhaps an "unfulfilled/unrealistic assumption". Further, my experience here might be the tip of the iceberg of a bigger underlying issue with to_sql() (or it could be a red herring that only affects me 😁)

Conditions

DataFrame with multiple dtypes - my example has 200+ columns with dtypes of: {dtype('O'), dtype('int16'), dtype('uint8'), dtype('uint16'), dtype('uint32'), dtype('float64')}. My test case is using Postgres 12 as the destination DB.

Current Behavior

DataFrame.to_sql(...) does work against Postgres. However, from the various dtypes listed above, only 3 Postgres column types are created:

  1. bigint (for all integers)
  • equivalent to int64 (64-bit / 8-byte integer)
  2. double precision (for all floats)
  • technically not a direct equivalent (I'm not having an issue, but others might)
  3. text (for all objects)

See: Postgres Numeric Data Types for reference.
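The same collapse can be reproduced without a Postgres server at all. A minimal sketch using the stdlib sqlite3 DBAPI connection (pandas' non-SQLAlchemy fallback path), where pandas applies an analogous coarse mapping; column names here are just illustrative:

```python
import sqlite3

import numpy as np
import pandas as pd

# A DataFrame mixing several integer widths, a float, and an object column
df = pd.DataFrame({
    "small": np.array([1, 2], dtype="int16"),
    "tiny": np.array([1, 2], dtype="uint8"),
    "big": np.array([1, 2], dtype="int64"),
    "val": np.array([1.5, 2.5], dtype="float64"),
    "name": ["a", "b"],  # dtype('O')
})

con = sqlite3.connect(":memory:")
df.to_sql("demo", con, index=False)

# Inspect the declared column types: every integer dtype collapses to
# one integer type, floats to one float type, objects to TEXT.
cols = {row[1]: row[2] for row in con.execute("PRAGMA table_info(demo)")}
print(cols)
```

On this path all three integer dtypes come out as a single SQLite INTEGER declaration, mirroring the bigint/double precision/text triple seen on Postgres.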

The Issue/Effects

The ultimate issue is that the created table, although it appears to work, does not really conform to the data types in the DataFrame. Specifically, with regard to integer types. Postgres (unlike MySQL) doesn't have unsigned integer types but does have (generally) 3 integer types:

  • smallint (2-bytes / 16-bit integer)
  • integer (4-bytes / 32-bit integer)
  • bigint (8-bytes / 64-bit integer)

I'm still reading through the to_sql() code, but if the docstring (quoted below) from SQLTable() is accurate, it would appear that Pandas is relying on SQLAlchemy to handle the conversion. If so, the assumption that SQLAlchemy does the conversion correctly/well looks to be unfounded.

pandas/pandas/io/sql.py

Lines 663 to 669 in bfac136

```python
class SQLTable(PandasObject):
    """
    For mapping Pandas tables to SQL tables.
    Uses fact that table is reflected by SQLAlchemy to
    do better type conversions.
    Also holds various flags needed to avoid having to
    pass them between functions all the time.
    """
```

Some of the Impacts

Bigger tables, presumably longer insert times, and potential conversion issues (I haven't tried feeding to_sql() a uint64...).

Thoughts

If my understanding is correct that Pandas is relying on SQLAlchemy to do the type conversions, I guess this could be seen as either a Pandas issue or an SQLAlchemy issue:

A. As an SQLAlchemy Issue

  • Pandas is feeding the data to SQLAlchemy with the assumption that it will properly convert the datatypes into various dialects (e.g. PGSQL, MySQL, SQLITE, etc.) correctly
  • Therefore, SQLAlchemy needs to do a better job of converting the input data

B. As a Pandas issue

  • AFAIK, SQLAlchemy was never designed around the plethora of "fully functional" data types provided by Numpy and used by Pandas; it was designed around Python's much simpler built-in types. So SQLAlchemy can't be expected to make intelligent/informed decisions based on Numpy data types.
  • Therefore, Pandas needs to either provide SQLAlchemy with more metadata (I'm not sure this is possible) and/or not assume that SQLAlchemy is going to do the conversion properly

Potential Further Impacts

I'm really not well versed enough in the Pandas/Numpy/SQLAlchemy code base to understand the potential impacts. However, I'm especially curious about the handling of float values and also about the interaction with other database engines. (For example, something that seems absolutely crazy to me about SQLite3: column data types aren't strict, they are more of a suggestion ... )
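The SQLite aside is easy to demonstrate with nothing but the stdlib: a declared column type is only an affinity, so a value of the "wrong" type is stored as-is rather than rejected.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")

# SQLite's INTEGER affinity tries to coerce the text, fails, and then
# simply stores it as text instead of raising an error.
con.execute("INSERT INTO t VALUES ('not a number')")
row = con.execute("SELECT x, typeof(x) FROM t").fetchone()
print(row)  # the stored value keeps type 'text' despite the INTEGER column
```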

Summary/The Bottom Line/TL;DR

I'd like to see to_sql() create a table with the best (e.g. smallest/most appropriate) data type - preferably a close match to the DataFrame.dtype. Currently, at least in the case of Postgres, it does not.

Finally, perhaps someone with more knowledge than I could double check to see if the assumption that SQLAlchemy is actually reliably converting all datatypes is correct?

@danieldjewell danieldjewell added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 19, 2020
@Netbrian

I have had similar issues with this, and would like Pandas to at least retain the original SQLAlchemy datatypes as metadata somehow. For instance, DB2 distinguishes between dates and timestamps, but Pandas will store them both as timestamps.

@jbrockmendel jbrockmendel added Enhancement IO SQL to_sql, read_sql, read_sql_query and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 3, 2020
@Prussian1870

I have had similar issues with this too. My issue is that Pandas converts the int data type to real in SQLite.

@MattGurney

As far as I can see, the Pandas docs for df.to_sql() are silent on how pandas/numpy data types are automatically mapped to DB types like Postgres's. I guess I need to read the SQLAlchemy docs/code. Does anyone have a link to where the mapping is specified?

@takikorabi

```python
df.to_sql(
    name=table,
    con=engine,
    if_exists=if_exists,
    index=index,
    chunksize=chunk,
    dtype=dict_var,
)
```

where dict_var is a dict:

```python
dict_var = {
    "column1": sqlalchemy.types.BigInteger(),
    "column2": sqlalchemy.types.NVARCHAR(length=30),
    # ...
}
```
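If writing that dict by hand is tedious for 200+ columns, it can be derived from the DataFrame itself. A hedged sketch, assuming the goal from the issue (smallest signed Postgres integer that holds each dtype's range); the function name and thresholds are mine, not pandas logic:

```python
import numpy as np
import pandas as pd
import sqlalchemy.types as st


def sql_dtype_map(df: pd.DataFrame) -> dict:
    """Map each numeric column to a narrower SQLAlchemy type."""
    mapping = {}
    for col, dtype in df.dtypes.items():
        if np.issubdtype(dtype, np.integer):
            info = np.iinfo(dtype)
            # Smallest signed SQL integer covering the dtype's range.
            # Note: uint64 does not fit a signed bigint and would need
            # NUMERIC instead; not handled here.
            if info.min >= -2**15 and info.max < 2**15:
                mapping[col] = st.SmallInteger()
            elif info.min >= -2**31 and info.max < 2**31:
                mapping[col] = st.Integer()
            else:
                mapping[col] = st.BigInteger()
        elif np.issubdtype(dtype, np.floating):
            mapping[col] = st.Float(precision=53)
    return mapping


df = pd.DataFrame({
    "a": np.array([1], dtype="int16"),   # fits smallint
    "b": np.array([1], dtype="uint16"),  # needs a 32-bit signed integer
    "c": np.array([1], dtype="int64"),   # needs bigint
})
m = sql_dtype_map(df)
print(m)
```

The result can then be passed straight through, e.g. `df.to_sql("demo", engine, dtype=sql_dtype_map(df))`.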

Harmon758 added a commit to Harmon758/Harmonbot that referenced this issue Aug 19, 2023
Use INT instead of BIGINT to store integers representing years, months, and days for information about baseball people

pandas-dev/pandas#35347