BUG: custom callable in to_sql stopped working with pandas 1.4.0 #46596

Closed
2 of 3 tasks
frantakalab opened this issue Apr 1, 2022 · 7 comments
Labels
Bug · IO SQL (to_sql, read_sql, read_sql_query) · Needs Info (clarification about behavior needed to assess issue) · Regression (functionality that used to work in a prior pandas version)

Comments

@frantakalab

Pandas version checks

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import csv

import pandas as pd

from io import StringIO
from sqlalchemy import create_engine

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)

# define custom insert method for to_sql as written in documentation https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html?highlight=synchronize#insertion-method

def psql_insert_copy(table, conn, keys, data_iter):
    """
    Execute SQL statement inserting data

    Parameters
    ----------
    table : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    # gets a DBAPI connection that can provide a cursor
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        s_buf = StringIO()
        writer = csv.writer(s_buf)
        writer.writerows(data_iter)
        s_buf.seek(0)

        columns = ', '.join(['"{}"'.format(k) for k in keys])
        if table.schema:
            table_name = '{}.{}'.format(table.schema, table.name)
        else:
            table_name = table.name

        sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
            table_name, columns)
        cur.copy_expert(sql=sql, file=s_buf)

# you need to have PostgreSQL running and give correct credentials here to reproduce the issue
engine = create_engine('postgresql://user:password@host/db')

with engine.connect() as conn:
     df.to_sql('target_table', 
               con=conn,
               schema='public',
               if_exists='append',
               index=False,
               method=psql_insert_copy)

Issue Description

When I use the custom insertion method for PostgreSQL in DataFrame.to_sql according to the documentation, the table is created with the correct column names and numeric types (if it doesn't exist), but no values are imported (the table is empty), and no warning or error is raised in pandas or in the database.

It works in pandas~=1.3.5 and breaks when 1.4.0 or 1.4.1 is installed (keeping sqlalchemy and psycopg2 on the same versions). I suspect something related to the change that to_sql can now return an int instead of None, but I couldn't find how that is related or how it can be fixed.
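
One thing worth ruling out (not confirmed as the cause in this thread): when to_sql is handed an already-open connection, committing the transaction is up to the caller, and depending on the SQLAlchemy version and engine settings, a connection from engine.connect() may roll back its implicit transaction when it closes. A minimal sketch that commits explicitly via engine.begin(), reusing engine, df, and psql_insert_copy from the example above:

# A minimal sketch, assuming the missing rows come from a transaction that
# is never committed: engine.begin() opens a transaction that is committed
# on successful exit from the with-block.
with engine.begin() as conn:
    df.to_sql('target_table',
              con=conn,
              schema='public',
              if_exists='append',
              index=False,
              method=psql_insert_copy)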

Expected Behavior

The full DataFrame is written to the table, including its values.

Installed Versions

INSTALLED VERSIONS

commit : 06d2301
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.0-39-generic
Version : #44~20.04.1-Ubuntu SMP Thu Mar 24 16:43:35 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.1
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : 20.0.2
setuptools : 45.2.0
Cython : None
pytest : 7.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.4.3
numba : None
numexpr : 2.7.2
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : 1.0.8
s3fs : None
scipy : 1.6.3
sqlalchemy : 1.4.34
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
zstandard : None

@frantakalab frantakalab added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 1, 2022
@simonjayhawkins simonjayhawkins added this to the 1.4.3 milestone Apr 3, 2022
@simonjayhawkins simonjayhawkins added Regression Functionality that used to work in a prior pandas version IO SQL to_sql, read_sql, read_sql_query labels Apr 3, 2022
@simonjayhawkins
Member

moving to 1.4.4

@simonjayhawkins simonjayhawkins modified the milestones: 1.4.3, 1.4.4 Jun 22, 2022
@mroeschke
Member

Sorry for the slow reply, but could you provide more details regarding your testing setup with Postgres?

Even with the introduction of to_sql returning row counts in 1.4, this specific example is being tested in our CI using a Postgres container in GitHub Actions:

def test_copy_from_callable_insertion_method(conn, request):
    # GH 8953
    # Example in io.rst found under _io.sql.method
    # not available in sqlite, mysql
    def psql_insert_copy(table, conn, keys, data_iter):
        # gets a DBAPI connection that can provide a cursor
        dbapi_conn = conn.connection
        with dbapi_conn.cursor() as cur:
            s_buf = StringIO()
            writer = csv.writer(s_buf)
            writer.writerows(data_iter)
            s_buf.seek(0)
            columns = ", ".join([f'"{k}"' for k in keys])
            if table.schema:
                table_name = f"{table.schema}.{table.name}"
            else:
                table_name = table.name
            sql_query = f"COPY {table_name} ({columns}) FROM STDIN WITH CSV"
            cur.copy_expert(sql=sql_query, file=s_buf)

    conn = request.getfixturevalue(conn)
    expected = DataFrame({"col1": [1, 2], "col2": [0.1, 0.2], "col3": ["a", "n"]})
    expected.to_sql("test_frame", conn, index=False, method=psql_insert_copy)
    result = sql.read_sql_table("test_frame", conn)
    tm.assert_frame_equal(result, expected)

@mroeschke mroeschke added Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 22, 2022
@leyan2015

leyan2015 commented Jun 24, 2022

I had the same issue after I upgraded to the latest Anaconda distribution (Python 3.9, Linux 64-bit (x86)).
DataFrame.to_sql() and pd.read_sql() would both fail.

I did some digging and discovered that the special character "@" in my password breaks the login. It looks like the URL parser in the library has a bug, possibly on the SQLAlchemy side.

The same username and password work fine using sqlcmd:
./sqlcmd -S serverName -U yanni2006 -P 'XXX@ABC' -d dbName -Q "select * from db.schema.test"

Yet it fails in python:

import sqlalchemy
import pandas as pd

uid = 'yanni2006'
password = 'XXX@ABC'
server = 'serverName'
db = 'dbName'
conn_str = 'mssql+pyodbc://{}:{}@{}/{}?driver=ODBC+Driver+17+for+SQL+Server'.format(uid, password, server, db)
sqlEngine = sqlalchemy.create_engine(conn_str)
conn = sqlEngine.connect()

The connect call fails here:
sqlalchemy.exc.OperationalError: (pyodbc.OperationalError) ('HYT00', '[HYT00] [Microsoft][ODBC Driver 17 for SQL Server]Login timeout expired (0) (SQLDriverConnect)')

Obviously, the following fails too:

with sqlEngine.begin() as trans:
    df = pd.read_sql(con=trans, sql="select * from db.schema.test")

After removing the "@" from my password, everything worked fine. Until a fix comes out, you can work around this by changing your password.
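
For reference, SQLAlchemy's documented way to handle special characters in connection credentials is to percent-encode them before building the URL. A minimal sketch using the placeholder credentials above:

from urllib.parse import quote_plus

import sqlalchemy

# percent-encode the password so the '@' inside it is not mistaken for
# the user/host separator of the URL ('XXX@ABC' becomes 'XXX%40ABC')
password = quote_plus('XXX@ABC')
conn_str = 'mssql+pyodbc://yanni2006:{}@serverName/dbName?driver=ODBC+Driver+17+for+SQL+Server'.format(password)
sqlEngine = sqlalchemy.create_engine(conn_str)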

Also, the to_sql() function returns -1 even though the operation is successful; the number of rows affected is obviously not -1.

@mroeschke
Member

mroeschke commented Jun 24, 2022

Also, the to_sql() function returns -1 even though the operation is successful; the number of rows affected is obviously not -1.

The returned number of rows is the sum of the rows reported as affected by whatever underlying ORM library is used.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html

The number of returned rows affected is the sum of the rowcount attribute of sqlite3.Cursor or SQLAlchemy connectable which may not reflect the exact number of written rows as stipulated in the sqlite3 or SQLAlchemy.
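
As a concrete illustration (whether a driver reports -1 for a given statement such as COPY is a property of that driver, not something pandas specifies), the return value can be inspected but should not be treated as an exact row count:

# Reusing df, engine, and psql_insert_copy from the report above.
rows = df.to_sql('test_frame', con=engine, index=False, method=psql_insert_copy)
# `rows` sums the reported rowcounts across chunks; a driver may report -1,
# so -1 here does not by itself mean the write failed.
print(rows)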

@mroeschke
Copy link
Member

Thanks for the report, but we haven't received more context around the failure scenario, and this code is being tested in our CI as mentioned in #46596 (comment). Closing, but we can reopen if we get more information.

@mroeschke mroeschke removed this from the 1.4.4 milestone Aug 19, 2022
@mroeschke mroeschke added Regression Functionality that used to work in a prior pandas version and removed Regression Functionality that used to work in a prior pandas version labels Aug 19, 2022
@utsavjha

utsavjha commented Nov 7, 2022

I can confirm that I am experiencing the same issue, using Postgres.
Here is the debugging performed on my end.

Environment: Docker 4.12, Python v3.10-bullseye source image
Database: Postgres (v15)
My venv packages: pandas, psycopg2==2.9.5 and SQLAlchemy==1.4.43

Affected pandas versions: >=v1.3.6; all distributions after v1.3.5 are unable to write. I tested each version from v1.3.5 to v1.5.1, with the same behavior.

My invocation looks like this, where the callable is taken from the official pandas docs:

some_pandas_dataframe.to_sql(
                name=pg_table_name,
                schema=pg_schema_name,
                if_exists="append",
                index=False,
                chunksize=pg_export_chunk_size,
                con=pg_conn,
                method=PostgresConnector.psql_insert_copy,
            )

Odd behaviour I experienced during my testing (see the sketch after this list):

  • There is no error thrown in the code; tests succeed too.
  • Since Postgres v15, one can track the last run COPY queries using pg_stat_progress_copy. This view was empty and did not contain the insertion query like it should have. pandas v1.3.5 successfully logs the query there; later versions do NOT.
  • I was tracking the DB file stats: there was an uptick in the temp DB files during the export, and I saw my WAL expanding viciously only DURING the export. Yet after the exports completed in code, the expected tables were still empty; no data could be recovered from the database.
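
The pattern in the last two bullets (WAL growth during the export, empty tables afterwards) is consistent with a transaction that is rolled back instead of committed. A hedged way to check, reusing the names from the invocation above (they are placeholders, not a confirmed diagnosis):

from sqlalchemy import create_engine, text

pg_engine = create_engine('postgresql://user:password@host/db')  # placeholder DSN

with pg_engine.connect() as conn:
    some_pandas_dataframe.to_sql(
        name=pg_table_name,
        schema=pg_schema_name,
        if_exists='append',
        index=False,
        con=conn,
        method=PostgresConnector.psql_insert_copy,
    )
    # rows are visible inside the still-open transaction...
    inside = conn.execute(
        text(f'SELECT count(*) FROM {pg_schema_name}.{pg_table_name}')
    ).scalar()

# ...but are gone afterwards if the transaction was rolled back on close
with pg_engine.connect() as conn:
    outside = conn.execute(
        text(f'SELECT count(*) FROM {pg_schema_name}.{pg_table_name}')
    ).scalar()

print(inside, outside)  # inside > 0 with outside == 0 points at a missing commit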

@vroomzel

I'm having the same issue with pandas==1.5.0 and sqlalchemy==1.4.36. Our service accounts that use pandas 1.3.4 and sqlalchemy 1.4.22 work just fine.

Projects
None yet
Development

No branches or pull requests

6 participants