pandas.io.gbq.read_gbq() returns incorrect results #5840
cc @jacobschaer
@markdregan Can you possibly share the type of query? For instance, was it something like "SELECT *", or what kind of filters were you using? Also, what do you mean by duplicates? It would also be useful if you could send your bq.py version. This can be retrieved from the command line (e.g. with bq version).
This is BigQuery CLI v2.0.17. The query was similar to the one described below (a simple SELECT of several fields).
Running the exact same query in the BQ Web UI produces different output. By duplicates, I mean there are rows duplicated. In the BQ UI, I see 1 row. In the pandas DataFrame, there are some duplicates. Quantifying this, there are ~100k rows that have duplicate key values. The total DataFrame is ~404k rows. Let me know if there are other tests I can do.
On the off chance this is a factor: I am unable to access any gbq functions when I just import pandas; I am only able to access read_gbq() after importing the gbq module explicitly. Seemed unusual. Even IPython tab completion couldn't find gbq in pandas.io.
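The exact import statement wasn't captured above; as a hedged guess, the explicit import needed in pandas 0.13 would look something like this:

```python
import pandas as pd

# gbq is not pulled in by "import pandas" alone in 0.13; import the module explicitly.
# (Assumed import path -- the commenter's exact statement is not shown in the thread.)
from pandas.io import gbq

# gbq.read_gbq(...) is now accessible.
```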
How are you comparing the two? Do you export from the Web UI as CSV and then import into pandas using read_csv()?
Or perhaps, when you say 'In BQ UI, I see 1 row', does this mean you are getting 403.999k more rows from pandas.io.gbq? Please try running the query with the command line tool so we can isolate the problem to either our code or Google's code. As far as the imports go, I had thought we were in the top level of pandas, but I normally just import the gbq module directly.
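For illustration, a rough sketch of that kind of side-by-side comparison (the query, file name, key column, and project ID are placeholders, not values from this thread):

```python
import pandas as pd
from pandas.io import gbq

# Placeholder query and identifiers -- the real ones are not shared in the thread.
query = "SELECT domain_name, renewal_date_str FROM [my_dataset.my_table]"

df_gbq = gbq.read_gbq(query, project_id="my-project-id")
df_csv = pd.read_csv("bq_web_ui_export.csv")  # same query exported from the BQ Web UI

print(len(df_gbq), len(df_csv))              # row counts should agree
print(df_gbq["domain_name"].nunique(),       # unique key counts should agree too
      df_csv["domain_name"].nunique())
```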
Correct, I execute the query in the web console and get 404k rows. I save the result as a table so I can query it later. In pandas, I import using read_gbq with the same query as above and also get 404k rows. I note that a check query in the web console (counting rows per key) produces no sum greater than 1, while the equivalent pandas operation does return many results where the sum is greater than 1.
I then did some spot checks between pandas and the BQ web console by filtering on specific domain_names. This showed the pandas DataFrame to have duplicates and the BQ web console to have only one row per domain. The latter is expected, as it is a unique key for the data set.
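A hedged sketch of what that "equivalent pandas operation" might look like (column, table, and project names are placeholders; the original check query isn't shown):

```python
from pandas.io import gbq

# Placeholder query/project -- the real ones are not shared in the thread.
df = gbq.read_gbq("SELECT domain_name, renewal_date_str FROM [my_dataset.my_table]",
                  project_id="my-project-id")

# Count occurrences per key; for a unique key every count should be exactly 1.
counts = df.groupby("domain_name").size()
print((counts > 1).sum(), "keys appear more than once")
print(counts[counts > 1].head())
```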
I can confirm that running the query via the command line returns the same results as the web console, and that they conflict with the results from data loaded into pandas using read_gbq().
I limited the query down to 3 fields. One of the fields seems to contain both INTEGERS and FLOATS (e.g. 78 vs 34.12). The BQ web console lists the field as a float. When I remove this field (only 2 fields remaining), no duplicates are generated in the pandas DataFrame. I then add the field back, but cast it to an integer using BQ SQL. I haven't been able to exhaustively test this with other variables. I'm also aware of some fields containing Infinity values, so I wouldn't rule them out as possible culprits either.
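For illustration, a cast like the one described might look roughly like this (column and table names are placeholders, and BigQuery's legacy-SQL INTEGER() cast is assumed):

```python
from pandas.io import gbq

# Placeholder query: cast the mixed INTEGER/FLOAT column explicitly in legacy BigQuery SQL.
query = """
SELECT domain_name,
       renewal_date_str,
       INTEGER(some_numeric_field) AS some_numeric_field
FROM [my_dataset.my_table]
"""

df = gbq.read_gbq(query, project_id="my-project-id")
print(df.dtypes)
```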
@markdregan Interesting - that must be the problem. While you were doing that, I was trying to reproduce this with the public shakespeare dataset from BigQuery and haven't found any duplicate problems yet. I ran the same query both ways (a CSV export case and a gbq case) and compared the results.
From your last comment, it makes me think there's a logic issue in our casting. I'm not looking at the code right now, but I do remember there was some discussion on how to handle INTEGERS/FLOATS. While BigQuery does support integer types, they can be NULL, which means that raw ints would have problems (an issue inherited from numpy, if I recall). So we use an internal pandas method to downcast numeric columns, ultimately resulting in mixed types (which is not such a bad thing). I'll take a peek at the code and see if anything pops out to me. One thing that would be very appreciated is if you could try to replicate this using BigQuery's public sample datasets. If you can give me an example using those public datasets (I'm sure one of them must have FLOATS and INTEGERS), we can make a unit test out of this and ensure it doesn't happen again.
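For background, a small illustration of why nullable INTEGER columns are awkward for numpy-backed frames (generic pandas/numpy behaviour, not the gbq module's actual casting code):

```python
import pandas as pd

# A numpy int64 array cannot represent NaN, so any integer column containing a
# missing value has to come back as float (or object) once it reaches pandas.
s = pd.Series([1, 2, None, 4])
print(s.dtype)  # float64 -- the integers were upcast so the missing value fits
```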
@jacobschaer When we talk to Google next, we need to include this in our discussions of test datasets. Ideally, we can not only solve the problem of testing writes, but also get a public dataset that covers some of these cases. Do you recall Felipe's GitHub handle? Maybe we can copy him on this issue...
@markdregan Can you post the dtypes of the dataframe from gbq, along with the column types reported by the BigQuery UI?
@jacobschaer Here is a comparison between BQ and Pandas dtypes when I import all fields in the dataset. All of the fields in BQ are also NULLABLE.
It seems strange that the STRING fields from BQ appear as object dtypes in pandas. When I import just domain_name and renewal_date_str (recall that importing only these did not cause duplication in the pandas DataFrame), the dtypes are still "object" for both.
@markdregan String dtypes are always object dtype in pandas.
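A tiny illustration of that point (generic pandas behaviour):

```python
import pandas as pd

# pandas stores strings (and missing strings) as Python objects, so the dtype is "object".
s = pd.Series(["a", "b", None])
print(s.dtype)  # object
```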
I did some spot checking, and I was able to replicate the issue with a similarly sized dataset. This doesn't seem to occur for smaller datasets, so I'm guessing that, once again, we are having result paging issues. I'm going to keep looking at it. The gap between duplicates is at least 100k rows, which is why it wasn't an issue with our previous test suite.
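One rough way to see that gap, assuming the key column from earlier in the thread (all names are placeholders):

```python
from pandas.io import gbq

# Placeholder query/project -- the real ones are not shared in the thread.
df = gbq.read_gbq("SELECT domain_name FROM [my_dataset.my_table]",
                  project_id="my-project-id")

# For a few duplicated keys, print the row positions where they occur; positions
# roughly 100k rows apart suggest the duplication happens at page boundaries.
dup_keys = df["domain_name"][df["domain_name"].duplicated()].unique()
for key in dup_keys[:5]:
    print(key, df[df["domain_name"] == key].index.tolist())
```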
@jacobschaer As a future enhancement, you might want to offer an iterator/chunksize option for read_gbq when returning results (that may or may not involve different page sizes for results from gbq); see io.pytables for an example.
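A purely hypothetical sketch of what such an interface could look like, loosely modelled on the chunksize/iterator pattern used elsewhere in pandas IO; read_gbq had no such option at the time, and the function name below is invented:

```python
from pandas.io import gbq

def read_gbq_chunked(query, project_id, chunksize=50000):
    """Yield DataFrames of at most `chunksize` rows for a BigQuery query.

    A real implementation would walk the BigQuery API's page tokens; this
    wrapper only illustrates the calling convention by slicing one full result.
    """
    df = gbq.read_gbq(query, project_id=project_id)
    for start in range(0, len(df), chunksize):
        yield df.iloc[start:start + chunksize]

# Usage (placeholders):
# for chunk in read_gbq_chunked("SELECT ... FROM [my_dataset.my_table]", "my-project-id"):
#     process(chunk)
```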
@jreback: That is definitely a feature on the backlog. We had spoken with Google about some API thoughts, so we were waiting on that. @fhoffa: I'm having trouble seeing what has gone wrong in our codebase; I looked through the gbq module code, but nothing has jumped out yet.
I tried another comparison between the CSV downloaded from the UI and the result from read_gbq, this time plotting a unique ID column as a histogram with 250 bins. Given 500,000 unique values and 250 bins, each bin should have a count of about 2,000 if the distribution were uniform. Looking at the plots, the CSV histogram is roughly uniform, while the read_gbq histogram has gaps. So, it seems like the paging is somehow off with these larger datasets. Apparently, the total size of the data returned (rows x columns) is leading to this problem, as I didn't see it with similar numbers of rows but fewer columns.
Just to confirm: there are whole ranges of IDs that have no results at all in the read_gbq output.
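A hedged sketch of that kind of check (the column name, file name, query, and project ID are placeholders; matplotlib is assumed):

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.io import gbq

df_csv = pd.read_csv("bq_web_ui_export.csv")
df_gbq = gbq.read_gbq("SELECT unique_id FROM [my_dataset.my_table]",
                      project_id="my-project-id")

# With ~500k distinct values and 250 bins, each bin should hold roughly 2,000 rows;
# empty or inflated bins in one plot but not the other point at missing/duplicated pages.
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.hist(df_csv["unique_id"].values, bins=250)
ax1.set_title("CSV export from the BQ Web UI")
ax2.hist(df_gbq["unique_id"].values, bins=250)
ax2.set_title("pandas.io.gbq.read_gbq")
plt.show()
```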
We created a StackOverflow question for this: http://stackoverflow.com/questions/20984592/bigquery-results-not-including-page-token
@jacobschaer had me run the native Google BQ client to see if it is having the same issues with page tokens. It appears to have the same problem, but the results are different: there are many more duplicates, but no gaps. That is most likely due to the differences in our error handling when the Google API returns JSON with no token. The first subplot is downloading from the Google Cloud Storage UI, the second is the Google BQ client to CSV, and the third is our pandas gbq method.
@markdregan: Sorry this hasn't been updated in a while. As @azbones said, we tested and confirmed that the bq client was having the same issue, and most of the heavy lifting for our module is handled by their code. Google knows about it and, from what we heard today, they are still looking into it.
Thanks @jacobschaer. Let me know if I can connect you with folks on the BQ side. I work here at Google.
@markdregan - We've been in touch with Google for a while - in fact, we'll be dropping by the Mountain View campus next week for a visit! :-) It looks like this bug should be resolved soon per: http://stackoverflow.com/questions/20984592/bigquery-results-not-including-page-token/21009144?noredirect=1#comment32090677_21009144
how's this coming along? |
Surprisingly, no changes have been pushed to the client code. Shall I put this in place of the previous bug documentation in the docstrings? Per: #6096 (comment) |
@jacobschaer certainly, if there is a change you want to make in the docs (for this bug) or other minor changes...go ahead and do a PR..... going to finish up 0.13.1 shortly...next few days....did you want the unicode fixes to go in? (is there a PR for this), or still on your branch?
@jacobschaer PR on this, or shall we move to 0.14? |
Move to 0.14 - we might not have anything to do in the client code, but it will still need validation. No release deadline was given, but we chatted with them and it's "very soon". |
famous last words! ok then |
@jreback @azbones @sean-schaefer @markdregan |
@jreback @markdregan Google informed me they pushed the fix into production, and I used my same test with 500K rows returned to validate that the API is working correctly with our existing code. @markdregan, if you could also confirm this with your dataset, that would be great! As we understand it, this particular bug was an issue with how their API returned page tokens for larger result sets (100k+ rows). Just FYI, as @jacobschaer mentioned, we are working to refactor our current code to reduce some of the dependencies on Google's bq.py client code.
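A hedged sketch of that kind of validation check, reusing the placeholder query and key column from earlier comments:

```python
from pandas.io import gbq

# Re-run a query that returns well over 100k rows and confirm paging is now correct:
# the row count matches the Web UI and no key value is duplicated.
df = gbq.read_gbq("SELECT domain_name FROM [my_dataset.my_table]",
                  project_id="my-project-id")

print(len(df))                               # should match the Web UI row count
print(df["domain_name"].duplicated().sum())  # should be 0 for a unique key
```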
@jacobschaer still open? |
Original issue description (@markdregan):
When using the read_gbq() function on a BigQuery table, incorrect results are returned. I compare the output from read_gbq() to that of a CSV export from BigQuery directly. Interestingly, there are the same number of rows in each output; however, there are many duplicates in the read_gbq() output.
I'm using pandas '0.13.0rc1-125-g4952858' on a Mac running OS X 10.9, with Python 2.7 and numpy '1.8.0'.
The code I execute to load the data in pandas:
churn_data = gbq.read_gbq(train_query, project_id=projectid)
I can't share the underlying data. What additional data/info would be useful for root causing? The output data is ~400k lines.