BUG/FEATURE: to_sql data types not reflecting types accurately #35347


Open
danieldjewell opened this issue Jul 19, 2020 · 4 comments
Labels
Enhancement IO SQL to_sql, read_sql, read_sql_query

Comments

@danieldjewell

I'm not sure whether this is a bug, a feature request, or perhaps an "unfulfilled/unrealistic assumption". Further, my experience here might be the tip of the iceberg of a bigger underlying issue with to_sql() (or it could be a red herring that only affects me 😁)

Conditions

DataFrame with multiple dtypes - my example has 200+ columns with dtypes of: {dtype('O'), dtype('int16'), dtype('uint8'), dtype('uint16'), dtype('uint32'), dtype('float64')}. My test case is using Postgres 12 as the destination DB.

Current Behavior

DataFrame.to_sql(...) does work against Postgres. However, from the various dtypes listed above, only 3 Postgres column types are created:

  1. bigint (for all integers)
  • equivalent to int64 (64-bit / 8-byte integer)
  2. double precision (for all floats)
  • technically not a direct equivalent (I'm not having an issue, but others might)
  3. text (for all objects)

See: Postgres Numeric Data Types for reference.
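The same collapse can be reproduced without a Postgres server at all. A minimal sketch using the stdlib sqlite3 DBAPI connection (pandas' non-SQLAlchemy fallback path), where pandas applies an analogous coarse mapping; column names here are just illustrative:

```python
import sqlite3

import numpy as np
import pandas as pd

# A DataFrame mixing several integer widths, a float, and an object column
df = pd.DataFrame({
    "small": np.array([1, 2], dtype="int16"),
    "tiny": np.array([1, 2], dtype="uint8"),
    "big": np.array([1, 2], dtype="int64"),
    "val": np.array([1.5, 2.5], dtype="float64"),
    "name": ["a", "b"],  # dtype('O')
})

con = sqlite3.connect(":memory:")
df.to_sql("demo", con, index=False)

# Inspect the declared column types: every integer dtype collapses to
# one integer type, floats to one float type, objects to TEXT.
cols = {row[1]: row[2] for row in con.execute("PRAGMA table_info(demo)")}
print(cols)
```

On this path all three integer dtypes come out as a single SQLite INTEGER declaration, mirroring the bigint/double precision/text triple seen on Postgres.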

The Issue/Effects

The ultimate issue is that the created table, although it appears to work, does not really conform to the data types in the DataFrame. Specifically, with regard to integer types. Postgres (unlike MySQL) doesn't have unsigned integer types but does have (generally) 3 integer types:

  • smallint (2-bytes / 16-bit integer)
  • integer (4-bytes / 32-bit integer)
  • bigint (8-bytes / 64-bit integer)

I'm still reading through the to_sql() code, but if the docstring (quoted below) from SQLTable() is accurate, it would appear that Pandas is relying on SQLAlchemy to handle the conversion. If so, the assumption that SQLAlchemy does the conversion correctly/well looks to be unfounded.

pandas/pandas/io/sql.py

Lines 663 to 669 in bfac136

```python
class SQLTable(PandasObject):
    """
    For mapping Pandas tables to SQL tables.
    Uses fact that table is reflected by SQLAlchemy to
    do better type conversions.
    Also holds various flags needed to avoid having to
    pass them between functions all the time.
    """
```

Some of the Impacts

Bigger tables, presumably longer insert times, and potential conversion issues (I haven't tried feeding to_sql() a uint64...).

Thoughts

If my understanding is correct that Pandas is relying on SQLAlchemy to do the type conversions, I guess this could be seen as either a Pandas issue or an SQLAlchemy issue:

A. As an SQLAlchemy Issue

  • Pandas is feeding the data to SQLAlchemy with the assumption that it will properly convert the datatypes into various dialects (e.g. PGSQL, MySQL, SQLITE, etc.) correctly
  • Therefore, SQLAlchemy needs to do a better job of converting the input data

B. As a Pandas issue

  • AFAIK, SQLAlchemy was never designed around the plethora of "fully functional" data types provided by Numpy and used by Pandas; it was designed around Python's much simpler built-in types. So SQLAlchemy can't be expected to make intelligent/informed decisions based on Numpy data types.
  • Therefore, Pandas needs to either provide SQLAlchemy with more metadata (I'm not sure this is possible) and/or not assume that SQLAlchemy is going to do the conversion properly

Potential Further Impacts

I'm really not well versed enough in the Pandas/Numpy/SQLAlchemy code base to understand the potential impacts. However, I'm especially curious about the handling of float values and also about the interaction with other database engines. (For example, something that seems absolutely crazy to me about SQLite3: column data types aren't strict, they are more of a suggestion ... )
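The SQLite aside is easy to demonstrate with nothing but the stdlib: a declared column type is only an affinity, so a value of the "wrong" type is stored as-is rather than rejected.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")

# SQLite's INTEGER affinity tries to coerce the text, fails, and then
# simply stores it as text instead of raising an error.
con.execute("INSERT INTO t VALUES ('not a number')")
row = con.execute("SELECT x, typeof(x) FROM t").fetchone()
print(row)  # the stored value keeps type 'text' despite the INTEGER column
```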

Summary/The Bottom Line/TL;DR

I'd like to see to_sql() create a table with the best (e.g. smallest/most appropriate) data type - preferably a close match to the DataFrame.dtype. Currently, at least in the case of Postgres, it does not.

Finally, perhaps someone with more knowledge than I could double check to see if the assumption that SQLAlchemy is actually reliably converting all datatypes is correct?

@danieldjewell danieldjewell added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 19, 2020
@Netbrian

I have had similar issues with this, and would like Pandas to at least retain the original SQLAlchemy datatypes as metadata somehow. For instance, DB2 distinguishes between dates and timestamps, but Pandas will store them both as timestamps.

@jbrockmendel jbrockmendel added Enhancement IO SQL to_sql, read_sql, read_sql_query and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 3, 2020
@Prussian1870

I have had similar issues with this too. My issue is that Pandas converts the int data type to real in SQLite.

@MattGurney

As far as I can see, the Pandas docs for df.to_sql() are silent on how pandas/numpy data types are automatically mapped to DB types like Postgres's. I guess I need to read the SQLAlchemy docs/code. Does anyone have a link to where the mapping is specified?

@takikorabi

```python
df.to_sql(
    name=table,
    con=engine,
    if_exists=if_exists,
    index=index,
    chunksize=chunk,
    dtype=dict_var,
)
```

where dict_var is a dict:

```python
dict_var = {
    "column1": sqlalchemy.types.BigInteger(),
    "column2": sqlalchemy.types.NVARCHAR(length=30),
    # ...
}
```
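If writing that dict by hand is tedious for 200+ columns, it can be derived from the DataFrame itself. A hedged sketch, assuming the goal from the issue (smallest signed Postgres integer that holds each dtype's range); the function name and thresholds are mine, not pandas logic:

```python
import numpy as np
import pandas as pd
import sqlalchemy.types as st


def sql_dtype_map(df: pd.DataFrame) -> dict:
    """Map each numeric column to a narrower SQLAlchemy type."""
    mapping = {}
    for col, dtype in df.dtypes.items():
        if np.issubdtype(dtype, np.integer):
            info = np.iinfo(dtype)
            # Smallest signed SQL integer covering the dtype's range.
            # Note: uint64 does not fit a signed bigint and would need
            # NUMERIC instead; not handled here.
            if info.min >= -2**15 and info.max < 2**15:
                mapping[col] = st.SmallInteger()
            elif info.min >= -2**31 and info.max < 2**31:
                mapping[col] = st.Integer()
            else:
                mapping[col] = st.BigInteger()
        elif np.issubdtype(dtype, np.floating):
            mapping[col] = st.Float(precision=53)
    return mapping


df = pd.DataFrame({
    "a": np.array([1], dtype="int16"),   # fits smallint
    "b": np.array([1], dtype="uint16"),  # needs a 32-bit signed integer
    "c": np.array([1], dtype="int64"),   # needs bigint
})
m = sql_dtype_map(df)
print(m)
```

The result can then be passed straight through, e.g. `df.to_sql("demo", engine, dtype=sql_dtype_map(df))`.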

Harmon758 added a commit to Harmon758/Harmonbot that referenced this issue Aug 19, 2023
Use INT instead of BIGINT to store integers representing years, months, and days for information about baseball people

pandas-dev/pandas#35347