Adding sql functionality to pandas similar to spark sql #30480
Comments
Why do you think this would be good to include in main pandas itself? This would then present an SQL interface to the user. I am not sure this actually adds any value. This might be OK as an external package.
You could say the same thing about Spark, and yet it does include it, because SQL is such a common language and tool for people to use when working with data.
This needs some discussion from the core team. Please address the following points by updating the top post and/or with in-line responses.
I have quite a number of questions about the implementation, testing and code style. Please review the contributing guide to conform. cc @pandas-dev/pandas-core. I actually find the idea of this interesting, as it provides a low-friction path for SQL users who are not Python users into Python / pandas proper.
+1 on this, seems like an interesting feature, but as said in the PR, I think this should be implemented outside pandas, at least for now. Worth noting that Modin is also experimenting with this: https://modin.readthedocs.io/en/latest/UsingSQLonRay/index.html @devin-petersohn thoughts on this based on your experience?
@jreback I updated my description at the top to answer most of your questions, and I'm also in the process of writing some more comprehensive documentation.
Hi All, A few comments from the Dask side
Thank you for your effort @zbrookle. I'm excited to see where this goes!
@mrocklin I'm almost positive this hasn't been done before, and the need is definitely there. From a quick Google search of "pandas sql" I think the need is very apparent, since there are hundreds of guides explaining how to use pandas if you only know SQL, and when/how to use pandas vs. SQL (although I think they mean databases). In terms of the projects following the pandas API, a few lines of code would make my SQL parser compatible with all of them, i.e. DataFrame would just need to be imported from a different package.
Thanks for the ping @datapythonista, I will give some thoughts here.

Dataframes are not relational tables, which means that the SQL equivalents of pandas API calls will be slightly different semantically. Spark does not have a "true" dataframe in the pure sense like pandas or R, but that is a different discussion. It is worth mentioning here because there was a comparison to Spark in the thread.

First, there is the question of optimization. If you are simply translating SQL into pandas, there will be no optimization, because pandas does not have a query optimizer. The purpose of writing SQL is to provide the system with the entire query upfront to get the optimal execution runtime. You will either have to: (1) accept lower performance for suboptimally written SQL queries, or (2) write your own query optimizer. There are some simple query-rewriting steps that can be implemented with relatively little engineering overhead.

Second, SQL has different semantics than dataframes.

Modin is an academic project, so we are taking a more principled approach to solving some of these problems. We started with treating all APIs as a separate layer and querying the underlying, drilled-down version of the dataframe API, and putting the optimizations at that layer. I'm familiar with the differences between relational tables and dataframes because we are formalizing them.

I think the biggest reason @datapythonista and others would like it not to be in pandas is the question of who will maintain it. If they pull it in, they are agreeing to maintain it themselves, which is a big ask. This is why it should start as an outside project and potentially be brought in later. That said, a lot of people would like this, so it would probably generate a large amount of interest if you are willing to maintain it longer term, especially if you made it possible to use with the other projects @mrocklin mentioned.
Ideally I would like to see a pandas-independent SQL parser that can generate a reasonable logical query plan (similar to what we've done in https://github.com/ibis-project/ibis -- we have never tried parsing SQL, instead the inverse: opting to model SQL semantics with a pandas-like DSL) that can be mapped to data frame operations.

To save you some time, you could consider creating a Cython binding for Google's ZetaSQL project (https://github.com/google/zetasql), which powers BigQuery (I haven't investigated how easy this is to use though). Then you aren't having to maintain your own parser and SQL query plan implementation yourself. Apache Calcite is another project which provides this functionality, but it's written in Java, and while some native-code systems (Apache Impala and OmniSciDB being notable examples) have implemented bindings to use it, a JNI dependency might be unpalatable.

It would be sad IMHO if there were a SQL->Result implementation that is tightly coupled to a single dataframe-like project.

cc @emkornfield who might know more about ZetaSQL and can comment
@devin-petersohn I'm aware that DataFrames aren't really relational tables; it's just that for SQL to work normally, they need to be treated logically that way in the SQL but not in the actual execution. That's one issue I had to solve when making the SQL API. What I found, though, was that as long as you collect the correct metadata when you register a table, you have all the logical parts needed for SQL. I also know that Spark dataframes are different on the backend in the sense that they're mostly metadata, but I brought them up because of the similar API.

I agree it would be helpful to have some optimization of queries, so that's something I'll have to add to my current implementation. In terms of the semantics, I took care of that by collecting information about execution from the SQL and then executing it using dataframe operations, although there are some exceptions to this that I would need to change. If it's a question of who will maintain this addition, I fully intend to maintain it, whether here or in another project.
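As a rough illustration of the registration-time metadata idea described above, a minimal sketch might look like the following; the names and structure here are assumptions for illustration only, not the actual implementation behind this proposal.

```python
from dataclasses import dataclass, field
import pandas as pd

@dataclass
class RegisteredTable:
    # Hypothetical metadata captured when a DataFrame is registered so it
    # can be treated logically as a relational table; illustrative only.
    name: str
    frame: pd.DataFrame
    columns: dict = field(default_factory=dict)  # column name -> dtype string

    @classmethod
    def register(cls, frame: pd.DataFrame, name: str) -> "RegisteredTable":
        return cls(
            name=name,
            frame=frame,
            columns={col: str(dtype) for col, dtype in frame.dtypes.items()},
        )
```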
@wesm I agree, and that's really the intention with this. My goal is really the inverse of what Ibis does, but specifically for pandas. Also, to clarify for everyone, I already implemented this with the Lark parser and my own execution plan.
Regarding ZetaSQL, I haven't looked closely, but since it follows the Google style guide it probably wouldn't be too hard to make Cython bindings for it (based on the experience Arrow has had with similar code guidelines). A few other points that might or might not make it a good choice:
I believe there is a lot of complexity with things like nested-table SQL queries, etc. that can be difficult to get right when building a parser from scratch, but I'm not an expert in this area. If there is interest in pursuing ZetaSQL I can try to make an introduction to the main developer working on it at Google.
So my one concern with using an external parser like ZetaSQL is that the datatypes in pandas aren't really the traditional SQL datatypes, in the sense that types like bigint, smallint, date, and varchar aren't supported by pandas. Although these could be mapped to their corresponding datatypes on the backend, I think it would create a lot of confusion, since casting as varchar, for example, would end up as object, and casting as date would end up as datetime.
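To make that concern concrete, a mapping along these lines is what such a translation layer would have to hide from users; the mapping itself is purely illustrative, though the pandas dtype names on the right are real.

```python
# Illustrative only: how traditional SQL type names might have to be mapped
# onto pandas dtypes, which is exactly the mismatch described above.
SQL_TO_PANDAS_DTYPE = {
    "varchar": "object",         # strings come back as generic objects
    "char": "object",
    "date": "datetime64[ns]",    # a date-only type becomes a full datetime
    "bigint": "int64",
    "smallint": "int16",
}
```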
I would agree, from a lot of the conversation, that this may be best as an external library that works for all dataframe-conforming frameworks like pandas, Dask and RAPIDS. @jreback what do you think about me making this into something like that?
@zbrookle so I think an external package that targets different backends (dask / pandas) would be a good idea; we certainly could incorporate it as an interface in pandas. As far as dtypes -
Hi! Interesting discussion here, and cool work @zbrookle! I saw @zbrookle you mentioned not finding any existing work, but while searching myself for a project implementing the discussed feature, I came across this (apparently unmaintained) repo which might be helpful for you in what you are trying to do: https://github.com/chmp/framequery. Maybe its creator has some useful feedback for your new take on the problem. Good luck with this project.
Just wanted to update this with a link to the new package that I'm making. This will basically be a package that can take a dataframe and run a SQL query against it.
Looking at this, it would seem very interesting for this package to turn its parsed SQL into an Ibis expression. This could then be coupled with Ibis's pandas backend.
@wesm I've actually been thinking about it a lot and I think that it could pair well with Ibis. I would need to do a lot of refactoring, but I'll let you know when it's ready.
@wesm @jreback Hey, I just wanted to mention that I've made a new package, sql_to_ibis, that does target Ibis. You can find it here: https://github.com/zbrookle/sql_to_ibis
Cool, you might let the Ibis developers know in case there is interest in merging the work into the project.
yeah why don't u open an issue in the ibis tracker
@jreback So the package that was the original topic of this, dataframe_sql, seems to be gaining a bit of traction as a tool for using SQL with pandas dataframes specifically. Would it be possible for that to be added to the ecosystem page to make it easier for others to find? After that I would probably close this issue, because I think that's the last relevant thing that could be discussed here.
would take a doc PR to update the ecosystem page
@jreback Ok sounds great!
I would like to add functionality to pandas so that data frames can be queried like database tables, similar to the way that they can be in spark-sql.
I think it should work in a similar fashion.
A table can be registered using register_temp_table(dataframe, table_name).
Then, using
pandas.query("select * from table_name")
you can query that data frame, or any other registered one, using standard SQL syntax. I've already implemented the entire thing, but I was told to open an issue for it.
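For illustration, here is a minimal sketch of how the proposed API might be used. register_temp_table and pandas.query are the names proposed in this issue, not part of released pandas, and the example data is made up.

```python
import pandas as pd

# Example data; the table and column names here are made up.
sales = pd.DataFrame(
    {"region": ["east", "west", "east"], "revenue": [100, 250, 175]}
)

# Register the DataFrame under a table name (proposed API, not released pandas).
pd.register_temp_table(sales, "sales")

# Query it, and any other registered frames, with ordinary SQL.
totals = pd.query(
    "select region, sum(revenue) as total_revenue from sales group by region"
)
```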
Also, I'm aware that there is a package called pandasql, but that package actually just puts a data frame into a SQLite database, as opposed to querying a data frame directly and transforming the SQL into pandas methods that are then applied to the data frame.
Motivation:
The motivation for this enhancement is to make pandas more accessible to a crowd of users that may not be as technical, and also to provide an easier transition for legacy code in systems like SAS that have SQL already embedded in their programs. I'll supply a context-free grammar in my documentation to show exactly what this system can handle, but it can basically handle any traditional SQL select statement, including subqueries, joins, where clauses, group by clauses, any aggregate function already supported by pandas, limit, and order by clauses. It also has support for the rank and dense_rank window functions. It can't do things that SQL wouldn't normally do, like crosstab, and you can't use a user-defined function in it, although I think that could be a good add-on.
Datatypes:
The interface supports all pandas datatypes, so the cast syntax currently uses pandas dtype names, for example cast(some_number as int64) or cast(some_int as object). I've played around with the idea of varchar, char, bigint and smallint, but I think those would be misleading, as those aren't datatypes that pandas currently supports.
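As an illustration of that choice, here is a hedged sketch of what a cast might look like under the proposal; pandas.query is the proposed entry point from this issue, and the table and column names are invented.

```python
import pandas as pd

# Assumes a frame has already been registered as "users" via the proposed
# register_temp_table, as in the earlier sketch. Dtype names in the cast
# mirror pandas dtypes (int64, object, ...) rather than SQL types
# (bigint, varchar, ...).
casted = pd.query(
    """
    select cast(user_id as int64) as user_id,
           cast(signup_ts as object) as signup_ts_text
    from users
    """
)
```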
Errors:
Currently, the exceptions thrown by this API arise solely from trying to select from an unregistered table or from submitting an improperly written SQL query, both of which you wouldn't want to silence, so there's only one error mode.
API Choices:
The reason I made the register_temp_table part of the API top level was to avoid attaching a method to DataFrame, although if others think it might be better as a method, I would change it in that manner (DataFrame.register_temp_table(table_name)). The reason pandas.query is a top-level method is that it's relational in structure: you can select from multiple tables, join them, and so on, so it wouldn't make sense for it to live on a single DataFrame. Its only similarity to the DataFrame.query method is the name. DataFrame.query is just an alternate way of expressing things like DataFrame[some_condition], whereas my .query encompasses a large amount of the pandas API.
Built In:
I have two reasons that I think this would be better built in. The first is that the target audience for this is less technical pandas users. Part of making this API easier to use is lessening the burden of researching code and learning how Python works, so going looking for an external package may be hard for them to begin with, and they would also need to know to look for one.
My second reason is that, from using what I've built, I've found pandas a lot easier to use just as a developer.
Suppose we have a DataFrame with one column called A. The code goes from a chain of plain pandas operations to this:
pd.query("select a - 1 as name_1, a + 1 as name_2 from some_table where name_1 = name_2")
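For comparison, here is a rough sketch of what the plain-pandas version of that query might look like. This is a reconstruction rather than the snippet from the original write-up; the frame name some_table and the lowercase column a are assumed from the SQL above.

```python
import pandas as pd

# Reconstruction of the "before" code implied by the SQL above; assumes a
# frame named some_table with a numeric column "a".
some_table = pd.DataFrame({"a": [1, 2, 3]})

derived = some_table.assign(
    name_1=some_table["a"] - 1,
    name_2=some_table["a"] + 1,
)
result = derived.loc[derived["name_1"] == derived["name_2"], ["name_1", "name_2"]]
```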
Also, although I did implement register_temp_table as an API-level function, it would serve best as a method on DataFrame, so that's another thing to consider.
I can't really offer much justification for choosing Lark, other than that it seemed like the best tool for what I was making.
I apologize for the style and such; I'll be fixing all of that before I'm done. I implemented this outside of pandas first, which is why there are so many style and documentation discrepancies.