Skip to content

ENH: Integration with DuckDB #45678

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
pdet opened this issue Jan 28, 2022 · 1 comment
Open

ENH: Integration with DuckDB #45678

pdet opened this issue Jan 28, 2022 · 1 comment
Labels
Enhancement IO SQL to_sql, read_sql, read_sql_query

Comments

@pdet
Copy link

pdet commented Jan 28, 2022

Is your feature request related to a problem?

DuckDB supports the consumption and production of data frames. In this issue, I discuss how we could leverage these on two different ends.

  1. Loading data frames as DuckDB Tables:
    The to_sql() function can be extended to leverage the consumption to efficiently load data frames as tables in DuckDB (See Adding DuckDB as a SQL Database for the to_sql() function #45675).

  2. Run SQL on data frames:

Another possibility that I would like to propose, is to use the DuckDB engine to directly run queries on a data frame. Since DuckDB is able to consume and produce those, a user could get a data frame from running SQL directly on that data frame.

e.g.,

d = {'col1': [1, 2], 'col2': [3, 4]}
df_1 = pd.DataFrame(data=d)

d = {'col1': [2, 3]}
df_2 = pd.DataFrame(data=d)

df_result = pandas.sql("select df_1.col2 from df_1 inner join df_2 on (df_1.col1 = df_2.col2)")

This should be possible since it would run Duckdb's default connection under the hood. With the advantages of consuming/producing df, directly running SQL, and parallel query execution.

If this is something that the core-dev team/ pandas community would be interested in. I'm happy to give a go to a PR :-)

Additional context

DuckDB's blogpost of SQL on Pandas, including benchmarks. https://duckdb.org/2021/05/14/sql-on-pandas.html

@pdet pdet added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 28, 2022
@mroeschke
Copy link
Member

I think before integrating DuckDB into pandas, it would be wise to address #41728 and #36893 regarding pluggable SQL engines.

It would help scale the maintainability of different db engines in pandas without leaking engine specific details in the SQL code. Would you be interested in investigating a SQL plugin interface?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO SQL to_sql, read_sql, read_sql_query
Projects
None yet
Development

No branches or pull requests

2 participants