
Adding SQL functionality to pandas similar to Spark SQL #30480


Closed
zbrookle opened this issue Dec 26, 2019 · 25 comments · Fixed by #36370
Labels
API Design, Enhancement, IO SQL (to_sql, read_sql, read_sql_query), Needs Discussion (requires discussion from core team before further action)

Comments

@zbrookle
Contributor

zbrookle commented Dec 26, 2019

I would like to add functionality to pandas so that DataFrames can be queried like database tables, similar to the way they can be in Spark SQL.

I think it should work in a similar fashion.

A table can be registered using register_temp_table(dataframe, table_name).

Then, using pandas.query("select * from table_name"), you can query that DataFrame, or any other registered frames, with standard SQL syntax.
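
A minimal sketch of the proposed usage, assuming the two entry points described above (pandas itself does not ship these; the frame contents are made up):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Register the frame under a table name, then query it (and any other
# registered frames) with ordinary SQL.
pd.register_temp_table(df, "table_name")
result = pd.query("select a, b from table_name where a > 1")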

I've already implemented the entire thing, but I was told to open an issue for it.

Also, I'm aware that there is a package called pandasql, but that package actually just puts a DataFrame into a SQLite database, as opposed to querying the DataFrame directly and translating the SQL into pandas methods that are then applied to the frame.

Motivation:
The motivation for this enhancement is to make pandas more accessible to users who may not be as technical, and to ease the transition of legacy code in systems like SAS that already have SQL embedded in their programs. I'll supply a context-free grammar in my documentation to show exactly what this system can handle, but it can handle essentially any traditional SQL select statement, including subqueries, joins, where clauses, group by clauses, any aggregate function already supported by pandas, and limit and order by clauses. It also supports the rank and dense_rank window functions. It can't do things that SQL wouldn't normally do, like crosstab, and you can't use a user-defined function in it, although I think that could be a good add-on.

Datatypes:
The interface supports all pandas datatypes, so casts target pandas types directly; the syntax would currently be cast(some_number as int64) or cast(some_int as object). I've played around with the idea of varchar, char, bigint, and smallint, but I think those would be misleading, since they aren't datatypes that pandas currently supports.
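
Hypothetical queries illustrating the cast syntax described above (table and column names are made up):

# Casts target pandas dtypes directly rather than traditional SQL types.
pd.query("select cast(some_number as int64) as n from table_name")
pd.query("select cast(some_int as object) as s from table_name")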

Errors:
Currently, the exceptions thrown by this API arise solely from selecting from an unregistered table or from submitting an improperly written SQL query, neither of which you would want to silence, so there's only one error mode.

API Choices:
The reason I made register_temp_table a top-level part of the API was to avoid attaching a method to DataFrame, although if others think it would be better as a method, I would change it accordingly (DataFrame.register_temp_table(table_name)). The reason pandas.query is a top-level function is that it's relational in structure: you can select from multiple tables, join them, and so on, so it wouldn't make sense at the DataFrame level. Its only similarity to the DataFrame.query method is the name; DataFrame.query is just an alternate way of expressing things like DataFrame[some_condition], whereas my query encompasses a large portion of the pandas API.

Built In:
I have two reasons why I think this would be better built in. The first is that the target audience for this is less technical pandas users. Part of making this API easier to use is lessening the burden of researching code and learning how Python works, so expecting them to go looking for an external package may be a hard ask to begin with, and they would also need to know to look for one.
My second reason is that, from using what I've built, I've found pandas a lot easier to use even as a developer.
Suppose we have a DataFrame with one column called a. The code goes from this:

dataframe['name_1'] = dataframe['a'] - 1
dataframe['name_2'] = dataframe['a'] + 1
dataframe = dataframe[dataframe['name_1'] == dataframe['name_2']]
dataframe.drop(columns=['a'], inplace=True)

To this code:
pd.query("select a - 1 as name_1, a + 1 as name_2 from some_table where name_1 = name_2")

Also, although I did implement register_temp_table as an API-level function, it would serve best as a method on DataFrame, so that's another thing to consider.

I can't really say much in defense of the Lark dependency, other than that it seemed like the best tool for what I was making.

I apologize for the style and such; I'll be fixing all of that before I'm done. I implemented this outside of pandas first, so that's why there are so many style and documentation discrepancies.

@jreback
Contributor

jreback commented Dec 26, 2019

why do you think this would be good to include in main pandas itself? this would then present an sql interface to the user. i am not sure this actually adds any value. this might be ok as an external package.

@zbrookle
Contributor Author

You could say the same thing about Spark, and yet it includes SQL, because SQL is such a common language and tool for people to use when working with data.

@zbrookle zbrookle mentioned this issue Dec 26, 2019
@jreback jreback added API - Consistency Internal Consistency of API/Behavior Enhancement IO SQL to_sql, read_sql, read_sql_query Needs Discussion Requires discussion from core team before further action API Design and removed API - Consistency Internal Consistency of API/Behavior labels Dec 26, 2019
@jreback
Contributor

jreback commented Dec 26, 2019

This needs some discussion from the core team. Please address the following points by updating the top post and/or with in-line responses:

  1. What is the motivation for this? Provide links / documentation & examples of usage. What can this API handle? What can it not handle? What datatypes are supported / not supported? As a user, what kind of coverage do I get of the pandas API? What are the error modes?
  2. What is the rationale for this particular API (top-level functions), as opposed to DataFrame methods? This looks a lot like query and eval. Why is this not simply an engine='sql'?
  3. Why is this better as a built-in pandas package, compared to an externally maintained library that has a hook into the main pandas API? The latter would present substantially less burden on the maintainers.
  4. This needs lark as a dependency, which adds a burden to the system (CI, user facing) and would make this experimental / optional for a while.

I have quite a number of questions about the implementation, testing and code style. Please review the contributing guide to conform.

cc @pandas-dev/pandas-core

I actually find the idea of this interesting, as it provides a low-friction path into Python / pandas proper for users who know SQL but not Python.

@datapythonista
Member

+1 on this, seems like an interesting feature, but as said in the PR, I think this should be implemented outside pandas, at least for now.

Worth noting that Modin is also experimenting with this: https://modin.readthedocs.io/en/latest/UsingSQLonRay/index.html

@devin-petersohn thoughts on this based in your experience?

@zbrookle
Contributor Author

@jreback I updated my description at the top to answer most of your questions, and I'm also in the process of writing some more comprehensive documentation.

@mrocklin
Contributor

Hi All,

A few comments from the Dask side

  1. Processing SQL queries is one of our more common feature requests. I agree with @zbrookle that there are a ton of people who would use this.
  2. Has this been done before? Are there other projects that have tried this in the past? If so, what happened to them?
  3. It might make sense to release this outside of Pandas first, but to have strong connections between such a library and the Pandas library with things like references to docs, and maybe review from Pandas maintainers.
  4. I'm curious how extensive this parser is. SQL can be intense. It might be interesting to try it against some of the TPC benchmark suites and see what it can and can not handle.
  5. I personally would love it if this could handle both Pandas dataframes, and other projects that followed the Pandas API, like Vaex, RAPIDS, Modin, or Dask Dataframes.

Thank you for your effort @zbrookle . I'm excited to see where this goes!

@zbrookle
Contributor Author

zbrookle commented Dec 26, 2019

@mrocklin I'm almost positive this hasn't been done before, and the need is definitely there. From a quick Google search of "pandas sql", I think the need is very apparent, since there are hundreds of guides explaining how to use pandas if you only know SQL, and when/how to use pandas vs. SQL (although I think they mean databases).

In terms of the projects following the pandas API, a few lines of code would make my SQL parser compatible with all of them; i.e., DataFrame would just need to be imported from a different package (see the sketch below).
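
A hypothetical illustration of that point: if the execution layer only ever touches the DataFrame API, targeting another backend amounts to changing an import (the import choices below are assumptions, not actual code from the parser):

# Any library that follows the pandas API could be substituted here;
# the SQL layer itself would not need to change.
try:
    from modin.pandas import DataFrame  # or dask.dataframe, cudf, vaex, ...
except ImportError:
    from pandas import DataFrame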

@devin-petersohn
Contributor

Thanks for the ping @datapythonista I will give some thoughts here.

Dataframes are not relational tables, which means that the SQL equivalents of pandas API calls will be slightly different semantically. Spark does not have a "true" dataframe in the pure sense like pandas or R, but that is a different discussion. It is worth mentioning here because there was a comparison to Spark in the thread.

First there is the question of optimization. If you are simply translating SQL into pandas, there will be no optimization, because pandas does not have a query optimizer. The purpose of writing SQL is to provide the system with the entire query upfront to get the optimal execution runtime. You will either have to (1) accept lower performance for suboptimally written SQL queries, or (2) write your own query optimizer. There are some simple query-rewriting steps that can be implemented with relatively little engineering overhead, as sketched below.
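
One example of such a rewrite is predicate pushdown, i.e. filtering before a join instead of after it. A minimal pandas sketch with made-up data:

import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "v": [10, 20, 30]})
right = pd.DataFrame({"key": [1, 2, 3], "w": [7, 8, 9]})

# Naive translation of the SQL: join first, then filter.
naive = left.merge(right, on="key")
naive = naive[naive["v"] > 10]

# Rewritten plan: push the filter below the join so less data is joined.
pushed = left[left["v"] > 10].merge(right, on="key")

assert naive.reset_index(drop=True).equals(pushed.reset_index(drop=True))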

Second, SQL has different semantics than dataframes. For example, T1 JOIN T2 != T2 JOIN T1. This is because a dataframe has an implicit order that the relational data model does not have.
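
A concrete pandas illustration of the ordering point (made-up data): an inner merge follows the left operand's row order, so swapping the operands changes the resulting frame even though the relational content is identical.

import pandas as pd

t1 = pd.DataFrame({"k": [2, 1], "a": ["x", "y"]})
t2 = pd.DataFrame({"k": [1, 2], "b": ["p", "q"]})

print(t1.merge(t2, on="k"))  # rows come back in k=2,1 order; columns k, a, b
print(t2.merge(t1, on="k"))  # rows come back in k=1,2 order; columns k, b, a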

Modin is an academic project, so we are taking a more principled approach to solving some of these problems. We started by treating each API as a separate layer that queries an underlying, drilled-down version of the dataframe API, and putting the optimizations at that layer. I'm familiar with the differences between relational tables and dataframes because we are formalizing them.

I think the biggest reason @datapythonista and others would like it not to be in pandas is because there is the question of who will maintain it. If they pull it in, they are agreeing to maintain it themselves, which is a big ask. This is why it should start as an outside project and potentially be brought in later. That said, a lot of people would like this so it would probably generate a large amount of interest if you are willing to maintain it longer term, especially if you made it possible to use with the other projects @mrocklin mentioned.

@wesm
Member

wesm commented Dec 26, 2019

Ideally I would like to see a pandas-independent SQL parser that can generate a reasonable logical query plan that can be mapped to data frame operations (similar to what we've done in https://github.com/ibis-project/ibis, where we never tried parsing SQL, instead opting for the inverse: modeling SQL semantics with a pandas-like DSL). To save you some time, you could consider creating a Cython binding for Google's ZetaSQL project (https://github.com/google/zetasql), which powers BigQuery (I haven't investigated how easy this is to use, though). Then you wouldn't have to maintain your own parser and SQL query plan implementation yourself.

Apache Calcite is another project that provides this functionality, but it's written in Java, and while some native-code systems (Apache Impala and OmniSciDB being notable examples) have implemented bindings to use it, a JNI dependency might be unpalatable.

It would be sad IMHO if there were a SQL->Result implementation that is tightly coupled to a single dataframe-like project.

cc @emkornfield who might know more about ZetaSQL and can comment

@zbrookle
Contributor Author

@devin-petersohn I'm aware that DataFrames aren't really relational tables; it's just that, for SQL to work normally, they need to be treated that way logically in the SQL, though not in the actual execution. That's one issue I had to solve when making the SQL API. What I found, though, was that as long as you collect the correct metadata when you register a table, you have all the logical parts needed for SQL.

I also know that spark dataframes are different on the backend in the sense that they’re mostly metadata, but I brought them up because of the similar API.

I agree it would be helpful to have some optimization of queries, so that's something I'll have to add to my current implementation.

In terms of the semantics, I took care of that by collecting information about execution from the SQL and then executing it using dataframe operations, although there are some exceptions to this that I would need to change.

If it’s a question of who will maintain this addition, I fully intend to maintain it whether here or in another project.

@zbrookle
Contributor Author

@wesm I agree, and that's really the intention here. My goal is the inverse of what Ibis does, but specifically for pandas.

Also, to clarify for everyone: I already implemented this with the Lark parser and my own execution plan.

@emkornfield

Regarding ZetaSQL, I haven't looked closely, but since it follows the Google style guide it probably wouldn't be too hard to make Cython bindings for it (based on the experience Arrow has had with similar code guidelines).

A few other points that might or might not make it a good choice:

  1. ZetaSQL isn't just the underlying SQL syntax for BigQuery, but also Cloud Spanner, and I believe there are also aims for it to be supported in Apache Beam.
  2. As far as I know, ZetaSQL isn't currently taking outside contributions, so Cython bindings would have to live externally.
  3. As far as I know, ZetaSQL isn't completely ANSI SQL compliant.

I believe there is a lot of complexity in things like nested-table SQL queries, etc., that can be difficult to get right when building a parser from scratch, but I'm not an expert in this area. If there is interest in pursuing ZetaSQL, I can try to make an introduction to the main developer working on it at Google.

@zbrookle
Contributor Author

So my one concern with using an external parser like ZetaSQL is that pandas doesn't support the traditional SQL datatypes like bigint, smallint, date, and varchar. Although these could be mapped to corresponding datatypes on the backend (see the sketch below), I think it would create a lot of confusion, since casting to varchar, for example, would end up as object, and casting to date would end up as datetime.
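
A sketch of the kind of backend mapping being described, to make the confusion concrete (this mapping is an assumption for illustration, not part of any package discussed here):

# Round-tripping is lossy: a user who writes cast(x as varchar) gets back
# an object column, and cast(x as date) yields datetime64, not a date.
SQL_TO_PANDAS_DTYPE = {
    "varchar": "object",
    "char": "object",
    "bigint": "int64",
    "smallint": "int16",
    "date": "datetime64[ns]",
}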

@zbrookle
Contributor Author

I would agree, from much of this conversation, that this may be best as an external library that works with all DataFrame-conforming frameworks like pandas, Dask, and RAPIDS. @jreback what do you think about me making this into something like that?

@jreback
Contributor

jreback commented Jan 18, 2020

@zbrookle so i think an external package that targets different backends (dask / pandas) would be a good idea

certainly could incorporate it as an interface in pandas

as far as dtypes: you will for sure want to translate dtypes as close to the native types as you can

@matthieubulte

Hi! Interesting discussion here, and cool work @zbrookle !

I saw, @zbrookle, that you mentioned not finding any existing work, but while searching for a project implementing the discussed feature myself, I came across this (apparently unmaintained) repo, which might be helpful for what you are trying to do: https://github.com/chmp/framequery . Maybe its creator has some useful feedback for your new take on the problem.

Good luck with this project.

@zbrookle
Contributor Author

Just wanted to update this with a link to the new package I'm making. It's basically a package that can take a DataFrame and run a SQL query against it:

https://github.com/zbrookle/dataframe_sql
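
Going by the description in this thread, usage would look roughly like the following; the entry-point names mirror the proposal at the top of the issue and should be checked against the repo's README:

import pandas as pd
from dataframe_sql import register_temp_table, query

df = pd.DataFrame({"a": [1, 2, 3]})
register_temp_table(df, "my_table")
result = query("select a from my_table where a > 1")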

@wesm
Member

wesm commented Feb 27, 2020

Looking at this, it would seem very interesting for this package to turn its parsed SQL into an Ibis expression. This could then be coupled with Ibis's pandas backend.
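
A rough sketch of the composition being suggested: Ibis's pandas backend already executes Ibis expressions against DataFrames, so a parser that emits Ibis expressions would slot in front of it (API shown as of the time of this thread; details may differ across Ibis versions):

import ibis
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# A SQL parser targeting Ibis would produce an expression like expr below;
# the pandas backend then takes care of execution.
con = ibis.pandas.connect({"t": df})
t = con.table("t")
expr = t[t.a > 1]
print(expr.execute())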

@zbrookle
Contributor Author

@wesm I've actually been thinking about it a lot and I think that it could pair well with Ibis. I would need to do a lot of refactoring, but I'll let you know when it's ready.

@zbrookle
Contributor Author

zbrookle commented Jun 6, 2020

@wesm @jreback Hey, I just wanted to mention that I've made a new package, sql_to_ibis, that does target Ibis. You can find it here: https://github.com/zbrookle/sql_to_ibis
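
Hypothetical usage, extrapolating from the dataframe_sql pattern above (the exact entry points are an assumption; see the repo's README): register an Ibis table, then get an Ibis expression back from a SQL string.

import ibis
from sql_to_ibis import register_temp_table, query

t = ibis.table([("a", "int64")], name="t")
register_temp_table(t, "t")
expr = query("select a from t where a > 1")  # an Ibis expression, not a result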

@wesm
Member

wesm commented Jun 6, 2020

Cool, you might let the Ibis developers know in case there is interest in merging the work into the project

@jreback
Contributor

jreback commented Jun 6, 2020

yeah why don’t u open an issue in the ibis tracker

@zbrookle
Contributor Author

@jreback So the package that was the original topic of this issue, dataframe_sql, seems to be gaining a bit of traction as a tool for using SQL with pandas DataFrames specifically. Would it be possible for it to be added to the ecosystem page, to make it easier for others to find? After that I would probably close this issue, since I think that's the last relevant thing that could be discussed here.

@jreback
Contributor

jreback commented Sep 14, 2020

would take a doc PR to update the ecosystem page

@zbrookle
Contributor Author

@jreback Ok sounds great!
