pandas.io.gbq.read_gbq() returns incorrect results #5840

Closed
markdregan opened this issue Jan 3, 2014 · 34 comments · Fixed by #6937

@markdregan

When using the read_gbq() function on a BigQuery table, incorrect results are returned.

I compare the output from read_gbq() to that of a CSV export from BigQuery directly. Interestingly, there are the same number of rows in each output - however, there are many duplicates in the read_gbq() output.

I'm using pandas '0.13.0rc1-125-g4952858' on Mac OS 10.9 with Python 2.7 and NumPy '1.8.0'.

The code I execute to load the data in pandas:
churn_data = gbq.read_gbq(train_query, project_id = projectid)

I can't share the underlying data. What additional data/info would be useful for root-causing this?

The output data is ~400k lines.
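
For reference, here is a minimal sketch of the comparison being described (the query, project id, and CSV filename are placeholders, not the real ones):

import pandas as pd
from pandas.io import gbq

train_query = "SELECT domain_name, renewal_date_str FROM dataset.table_name"  # placeholder query
projectid = "my-project-id"  # placeholder project id

churn_data = gbq.read_gbq(train_query, project_id=projectid)
csv_data = pd.read_csv("bq_export.csv")  # the same query exported to CSV from the BQ web UI

# same number of rows in both outputs...
print(len(churn_data), len(csv_data))
# ...but only the read_gbq result contains duplicated rows
print(churn_data.duplicated().sum(), csv_data.duplicated().sum())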

@jreback
Contributor

jreback commented Jan 3, 2014

cc @jacobschaer

@jacobschaer
Contributor

@markdregan Can you possibly share the type of query? For instance, was it something like "SELECT *", or what kind of filters were you using? Also, what do you mean by duplicates?

It would also be useful if you could send your bq.py version. This can be retrieved from the command line using:
bq version

@markdregan
Author

This is BigQuery CLI v2.0.17

The query was similar to below:

SELECT variable_a, var_b, etc
FROM dataset.table_name
WHERE var_x IN ("String A", "String B")
  AND exception_flag = "5. No Flag"
  AND (renewal_date_str BETWEEN "2014-03-31" AND "2014-07-01"
       OR renewal_date_str BETWEEN "2013-01-01" AND "2013-12-31")

Running the exact same query in the BQ Web UI produces different output. By duplicates, I mean there are rows duplicated. In BQ UI, I see 1 row. In the Pandas dataframe, there are some duplicates. Quantifying this, there are ~100k rows that have duplicate key values. Total dataframe is ~404k.

Let me know if there are other tests I can do.

@markdregan markdregan reopened this Jan 3, 2014
@markdregan
Author

On the off chance this is a factor, I am unable to access any gbq functions when I import pandas as
import pandas as pd

I am only able to access read_gbq() when I import the following:
from pandas.io import gbq

Seemed unusual. Even the IPython tab completion couldn't find gbq in pandas.io
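
For anyone hitting the same thing, a quick illustration of the two import paths (the project id is a placeholder):

import pandas as pd
pd.io.gbq  # AttributeError - the gbq module is not pulled in by a plain "import pandas"

from pandas.io import gbq  # the explicit module import works
df = gbq.read_gbq('SELECT word FROM [publicdata:samples.shakespeare] LIMIT 10',
                  project_id='my-project-id')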

@jacobschaer
Contributor

How are you comparing the two? Do you export from the Web UI as CSV and then import into pandas using read_csv()? To clarify, you:

  1. Run your query in the web console: [bigquery.cloud.google.com]
  2. You get 404k rows as a result
  3. You run your query using read_gbq() and also get 404k rows as a result
  4. You somehow compare these 404k row result sets and identify that 100k rows from read_gbq() are duplicates. However, there are no duplicates in the results from the Web Console?

Or perhaps, when you say 'In BQ UI, I see 1 row', does this mean you are getting 403,999 more rows from pandas.io.gbq?

Please try using the command line tool to isolate the problem to our code or Google's code:
bq query --format=csv 'select {YOUR QUERY} from {YOUR DATASET}.{YOUR TABLE}'

As far as the imports go, I had thought we were exposed in the top level of pandas, but I normally just do:
from pandas.io import gbq

@markdregan
Author

Correct, I execute the query in the web console. I get 404k rows. I save the result as a table so I can query it later.

In pandas, I import using read_gbq and the same query as above. I get 404k rows.

I note that the following query in the web console produces no sum greater than 1:

SELECT
  domain_name,
  SUM(1)
FROM dataset.table
GROUP BY 1

I then note that the equivalent pandas function does return many results where sum > 1

churn_data.groupby(by='domain_name').size().order(ascending=False)

I then do some spot checks between pandas and BQ web console. I do this by filtering on specific domain_names. This showed pandas dataframe to have duplicates and the BQ web console to have only one row per domain. The latter is expected as it is a unique key for the data set.
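
For reference, a hedged pandas equivalent of that spot check, counting how many key values occur more than once (churn_data is the dataframe loaded above; 'example.com' is a placeholder domain):

counts = churn_data.groupby('domain_name').size()
print((counts > 1).sum(), 'domain_names appear more than once in the dataframe')

# spot-check a single duplicated key against the BQ web console
print(churn_data[churn_data['domain_name'] == 'example.com'])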

@jreback
Contributor

jreback commented Jan 4, 2014

read_gbq is not in the top-level namespace as it's experimental (though it probably should be, as we have other experimental modules there)... will add an issue to do this, see #5843

@markdregan
Author

I can confirm that running the query via the command line returns the same results as the web console, and that they conflict with the results loaded into pandas using read_gbq().

@markdregan
Author

I limited the query down to 3 fields. One of the fields seems to contain both INTEGERS and FLOATS (e.g. 78 vs 34.12). The BQ web console lists the field as a FLOAT. When I remove this field (only 2 fields remaining), no duplicates are generated in the pandas dataframe.

I then add the field back but I cast it using BQ SQL INTEGER() function. When I do so, pandas imports the data with no duplicates.

I haven't been able to exhaustively test this with other variables. I'm also aware of some fields containing Infinity values, so I wouldn't rule them out as possible culprits either.
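
For anyone else working around this, a hedged sketch of the cast described above (the field names are illustrative placeholders):

query = """
SELECT domain_name,
       renewal_date_str,
       INTEGER(avg_producta_num_30da_users) AS avg_producta_num_30da_users
FROM dataset.table_name
"""
churn_data = gbq.read_gbq(query, project_id=projectid)
# with the cast applied, the key column no longer shows duplicates
print(churn_data['domain_name'].duplicated().sum())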

@azbones

azbones commented Jan 4, 2014

@markdregan Interesting- that must be the problem. While you were doing that, I was trying to duplicate this with the public shakespeare dataset from BigQuery and haven't found any dup problems yet.

The CSV case was SELECT * FROM [publicdata:samples.shakespeare]: run the query in the Google BigQuery UI, save the result as a temp table, export it to Google Cloud Storage, then download and load into pandas via from_csv.

The gbq case was:

from pandas.io import gbq

query='SELECT * FROM [publicdata:samples.shakespeare];'
project_id=xxx
df_gbq=gbq.read_gbq(query,project_id)

The results were:

df_csv.describe()
Out[45]: 
          word_count    corpus_date
count  164656.000000  164656.000000
mean        5.744370    1551.364572
std        25.706592     275.347840
min         1.000000       0.000000
25%         1.000000    1595.000000
50%         1.000000    1599.000000
75%         3.000000    1606.000000
max       995.000000    1612.000000

[8 rows x 2 columns]

df_gbq.describe()
Out[46]: 
          word_count    corpus_date
count  164656.000000  164656.000000
mean        5.744370    1551.364572
std        25.706592     275.347840
min         1.000000       0.000000
25%         1.000000    1595.000000
50%         1.000000    1599.000000
75%         3.000000    1606.000000
max       995.000000    1612.000000

[8 rows x 2 columns]

df_csv['word'].nunique()
Out[47]: 32786

df_gbq['word'].nunique()
Out[48]: 32786
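
One more check that might be worth adding (a hedged suggestion, not something run above): counting fully duplicated rows, which would flag the paging problem even when describe() and nunique() agree:

print(df_csv.duplicated().sum(), df_gbq.duplicated().sum())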

@jacobschaer
Contributor

From your last comment, it makes me think there's a logic issue in our casting. I'm not looking at the code right now, but I do remember there was some discussion on how to handle INTEGERS/FLOATS. While BigQuery does support integer types, they can be NULL, which means that raw ints would have problems (an issue inherited from numpy, if I recall). So, we use an internal pandas method to downcast numeric columns, ultimately resulting in mixed types (which is not such a bad thing).
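
A small illustration of the constraint being described (my own sketch, not taken from the gbq code):

import pandas as pd

# numpy integer arrays cannot represent NULL/NaN, so a nullable INTEGER column
# gets upcast as soon as a NULL appears
print(pd.Series([1, 2, None]).dtype)  # float64
print(pd.Series([1, 2, 3]).dtype)     # int64 when there are no NULLs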

I'll take a peek at the code and see if anything pops out to me. One thing that would be very appreciated is if you can try to replicate this using BigQuery's public sample datasets. If you can give me an example using those public datasets (I'm sure one of them must have FLOATS and INTEGERS), we can make a Unit test out of this and ensure it doesn't happen again.

@azbones

azbones commented Jan 4, 2014

@jacobschaer When we talk to Google next, we need to include this in our discussions of test datasets. Ideally, we can not only solve the problem of testing writes, but also get a public dataset that covers some of these cases. Do you recall Felipe's github handle? Maybe we can copy him on this issue...

@jacobschaer
Contributor

@markdregan Can you post the dtypes of the dataframe from gbq, along with the column types reported by BigQuery UI?
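
(On the pandas side something like the following should capture it, assuming churn_data is the read_gbq result; the BigQuery column types can be read off the table schema in the web UI or via bq show dataset.table_name:)

print(churn_data.dtypes)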

@markdregan
Author

@jacobschaer Here is a comparison between BQ and Pandas dtypes when I import all fields in the dataset. All of the fields in BQ are also NULLABLE.

Field BQ dtype Pandas dtype
avg_br_norm_productc_num_30da_users FLOAT float64
avg_br_norm_productb_num_30da_users FLOAT float64
avg_br_norm_producta_num_30da_users FLOAT float64
avg_br_norm_num_30da_users FLOAT float64
avg_productc_num_30da_users FLOAT int64
avg_d_max_norm_productc_num_30da_users FLOAT float64
avg_d_max_norm_productb_num_30da_users FLOAT float64
avg_d_max_norm_producta_num_30da_users FLOAT float64
avg_d_max_norm_num_30da_users FLOAT float64
avg_d_range_norm_productc_num_30da_users FLOAT float64
avg_d_range_norm_productb_num_30da_users FLOAT float64
avg_d_range_norm_producta_num_30da_users FLOAT float64
avg_d_range_norm_num_30da_users FLOAT float64
avg_productb_num_30da_users FLOAT int64
avg_producta_num_30da_users FLOAT int64
avg_num_30da_users FLOAT int64
br_norm_productc_num_30da_users FLOAT float64
br_norm_productb_num_30da_users FLOAT float64
br_norm_producta_num_30da_users FLOAT float64
br_norm_num_30da_users FLOAT float64
business_unit STRING object
productc_num_30da_users INTEGER int64
country STRING object
customer_type STRING object
d_max_norm_productc_num_30da_users FLOAT float64
d_max_norm_productb_num_30da_users FLOAT float64
d_max_norm_producta_num_30da_users FLOAT float64
d_max_norm_num_30da_users FLOAT float64
d_range_norm_productc_num_30da_users FLOAT int64
d_range_norm_productb_num_30da_users FLOAT int64
d_range_norm_producta_num_30da_users FLOAT int64
d_range_norm_num_30da_users FLOAT int64
days_to_renewal_event INTEGER int64
productb_num_30da_users INTEGER int64
domain_class STRING object
domain_name STRING object
domain_usd_for_renewal FLOAT float64
exception_flag STRING object
producta_num_30da_users INTEGER int64
has_renewed BOOLEAN bool
max_num_allowed_users INTEGER int64
min_max_range_num_allowed_users INTEGER int64
min_num_allowed_users INTEGER int64
num_30da_users INTEGER int64
order_usd_for_renewal FLOAT float64
region STRING object
renewal_classification STRING object
renewal_date_str STRING object
renewal_grace_period_expired BOOLEAN bool
sales_channel_type STRING object
sales_segment STRING object
sales_unit STRING object
snapshot_date_str STRING object
stddev_br_norm_productc_num_30da_users FLOAT int64
stddev_br_norm_productb_num_30da_users FLOAT int64
stddev_br_norm_producta_num_30da_users FLOAT int64
stddev_br_norm_num_30da_users FLOAT int64
stddev_productc_num_30da_users FLOAT int64
stddev_d_max_norm_productc_num_30da_users FLOAT int64
stddev_d_max_norm_productb_num_30da_users FLOAT int64
stddev_d_max_norm_producta_num_30da_users FLOAT int64
stddev_d_max_norm_num_30da_users FLOAT int64
stddev_d_range_norm_productc_num_30da_users FLOAT int64
stddev_d_range_norm_productb_num_30da_users FLOAT int64
stddev_d_range_norm_producta_num_30da_users FLOAT int64
stddev_d_range_norm_num_30da_users FLOAT int64
stddev_productb_num_30da_users FLOAT int64
stddev_producta_num_30da_users FLOAT int64
stddev_num_30da_users FLOAT int64
term_in_days INTEGER int64
users_for_renewal FLOAT int64
zscore_br_norm_productc_num_30da_users FLOAT int64
zscore_br_norm_productb_num_30da_users FLOAT int64
zscore_br_norm_producta_num_30da_users FLOAT int64
zscore_br_norm_num_30da_users FLOAT int64
zscore_productc_num_30da_users FLOAT int64
zscore_d_max_norm_productc_num_30da_users FLOAT int64
zscore_d_max_norm_productb_num_30da_users FLOAT int64
zscore_d_max_norm_producta_num_30da_users FLOAT int64
zscore_d_max_norm_num_30da_users FLOAT int64
zscore_d_range_norm_productc_num_30da_users FLOAT int64
zscore_d_range_norm_productb_num_30da_users FLOAT int64
zscore_d_range_norm_producta_num_30da_users FLOAT int64
zscore_d_range_norm_num_30da_users FLOAT int64
zscore_productb_num_30da_users FLOAT int64
zscore_producta_num_30da_users FLOAT int64
zscore_num_30da_users FLOAT int64

It seems strange that the STRING fields from BQ appeared as object dtypes in pandas. When I import just domain_name and renewal_date_str (recall that importing only these did not cause duplication in the pandas dataframe), the dtypes are still "object" for both.

@jreback
Contributor

jreback commented Jan 4, 2014

@markdregan string dtypes are always object in pandas
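
For example:

import pandas as pd
print(pd.Series(['a', 'b']).dtype)  # object - pandas stores strings as Python objects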

@jacobschaer
Contributor

I did some spot checking, and I was able to replicate the issue with a similarly sized dataset. This doesn't seem to occur for smaller datasets, so I'm guessing that, once again, we are having result paging issues. I'm going to keep looking at it. The gap between duplicates is at least 100k rows, which is why it wasn't an issue with our previous test suite.

@jreback
Contributor

jreback commented Jan 6, 2014

@jacobschaer

As a future enhancement, you might want to offer an iterator/chunksize option for read_gbq when returning results (that may or may not involve different page sizes for results from gbq).

See io.pytables for an example.
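
For illustration, the analogous pattern that already exists for read_csv (a hedged sketch - read_gbq has no such option today, and process() is a placeholder):

import pandas as pd

# iterate over fixed-size chunks instead of materialising the whole result at once
for chunk in pd.read_csv('large_file.csv', chunksize=50000):
    process(chunk)  # placeholder per-chunk work

# a chunked read_gbq could look similar, e.g.
# for chunk in gbq.read_gbq(query, project_id=projectid, chunksize=50000): ...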

@jacobschaer
Contributor

@jreback : That is definitely a feature on the backlog. We had spoken with Google about some API thoughts, so we were waiting on that.

@fhoffa : I'm having trouble seeing what has gone wrong with our codebase. I looked through bq.py and nothing has changed. We are basically just doing what BigqueryClient.ReadSchemaAndRows() does, except we manipulate the data as we go through the pages. However, the bq command line client still works perfectly. The spacing between duplicates is fairly consistent, and it looks like we might be getting the same page repeatedly.

@azbones

azbones commented Jan 7, 2014

I tried another comparison between a CSV downloaded from the UI (loaded with from_csv) and read_gbq. This time I used a table with more columns from the public natality_testing dataset, to which we added a column of unique row numbers (row_number) and from which we selected 500K rows.

Then I plotted the result using a histogram with 250 bins:

# assumes a pylab-style session (e.g. ipython --pylab); gbq_df is the read_gbq
# result and csv_df is the CSV download loaded with from_csv
from matplotlib.pyplot import *

uniques_df = len(gbq_df['row_number'].drop_duplicates())
uniques_csv = len(csv_df['row_number'].drop_duplicates())

subplots_adjust(hspace=.5)
subplot(2, 1, 1)  # subplots are 1-indexed
title('Histogram of Unique Row Value Count From GBQ- {:,} Unique Values'.format(uniques_df))
hist(gbq_df['row_number'], bins=250)
xticks(range(0, 500000, 10000), rotation=90)
xlabel('Unique Row Number')
ylabel('Count')

subplot(2, 1, 2)
title('Histogram of Unique Row Value Count From CSV Download to DF- {:,} Unique Values'.format(uniques_csv))
hist(csv_df['row_number'], bins=250)
xticks(range(0, 500000, 10000), rotation=90)
xlabel('Unique Row Number')
ylabel('Count')

Given 500,000 unique values and 250 bins, each bin should have a count of 2,000 if the distribution is uniform. Looking at the plots below, the read_gbq dataframe shows that some ranges have the correct number, some have none, and some have multiples. The from_csv dataframe correctly has 500,000 unique row numbers, while the read_gbq dataframe has only 336,382, which is incorrect.

So, it seems like the paging is somehow off with these larger datasets. Apparently, the total size of the data returned (rows x columns) is leading to this problem as I didn't see it with similar numbers of rows with fewer columns.

[figure: df_test_hist_v_csv - histograms of row_number counts for the read_gbq and from_csv dataframes]
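
A hedged way to quantify the same thing without plotting (assuming row_number runs from 1 to 500,000):

counts = gbq_df['row_number'].value_counts()
print((counts > 1).sum(), 'row numbers returned more than once')
missing = set(range(1, 500001)) - set(gbq_df['row_number'])
print(len(missing), 'row numbers never returned at all')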

@azbones

azbones commented Jan 7, 2014

Just to confirm, there are whole ranges that have no results at all in the read_gbq dataframe; for example, row_number values between 40,000 and 80,000:

gbq_df[(gbq_df['row_number']>40000)&(gbq_df['row_number']<80000)]
Out[47]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 32 columns):
source_year               0  non-null values
year                      0  non-null values
month                     0  non-null values
day                       0  non-null values
wday                      0  non-null values
state                     0  non-null values
is_male                   0  non-null values
child_race                0  non-null values
weight_pounds             0  non-null values
plurality                 0  non-null values
apgar_1min                0  non-null values
apgar_5min                0  non-null values
mother_residence_state    0  non-null values
mother_race               0  non-null values
mother_age                0  non-null values
gestation_weeks           0  non-null values
lmp                       0  non-null values
mother_married            0  non-null values
mother_birth_state        0  non-null values
cigarette_use             0  non-null values
cigarettes_per_day        0  non-null values
alcohol_use               0  non-null values
drinks_per_week           0  non-null values
weight_gain_pounds        0  non-null values
born_alive_alive          0  non-null values
born_alive_dead           0  non-null values
born_dead                 0  non-null values
ever_born                 0  non-null values
father_race               0  non-null values
father_age                0  non-null values
record_weight             0  non-null values
row_number                0  non-null values
dtypes: bool(2), float64(12), int64(12), object(6)

@jacobschaer
Contributor

We created a StackOverflow question for this:
http://stackoverflow.com/questions/20984592/bigquery-results-not-including-page-token

@azbones

azbones commented Jan 8, 2014

@jacobschaer had me run the native Google BQ client to see if it is having the same issues with page tokens. It appears to have the same problem, but the results are different - there are many more duplicates, but no gaps. That is most likely due to differences in our error handling when the Google API returns JSON with no token.

The first subplot is downloading from the Google Cloud Storage UI, the second is the Google BQ client to csv, and the third is our Pandas gbq method.

[figure: gbq_bquicsv_bq_csv - histograms for the Cloud Storage CSV download, the bq CLI CSV, and the pandas gbq dataframe]

@jacobschaer
Contributor

@markdregan : Sorry this hasn't been updated in a while. As @azbones said, we tested and confirmed that the bq client was having the same issue, and most of the heavy lifting for our module is handled by their code. Google knows about it, and from what we heard today they are still looking into it.

@markdregan
Author

Thanks @jacobschaer. Let me know if I can connect you with folks on the BQ side. I work here in Google.

@jacobschaer
Contributor

@markdregan - We've been in touch with Google for a while - in fact, we'll be dropping by the Mountain View campus next week for a visit! :-) It looks like this bug should be resolved soon per: http://stackoverflow.com/questions/20984592/bigquery-results-not-including-page-token/21009144?noredirect=1#comment32090677_21009144

@jreback
Contributor

jreback commented Jan 26, 2014

@jacobschaer

how's this coming along?

@jacobschaer
Contributor

Surprisingly, no changes have been pushed to the client code. Shall I put this in place of the previous bug documentation in the docstrings? Per: #6096 (comment)

@jreback
Contributor

jreback commented Jan 27, 2014

@jacobschaer certainly - if there is a change you want to make in the docs (for this bug) or other minor changes, go ahead and do a PR.

Going to finish up 0.13.1 shortly - next few days. Did you want the unicode fixes to go in? (Is there a PR for this, or is it still on your branch?)

@jreback
Contributor

jreback commented Jan 29, 2014

@jacobschaer PR on this, or shall we move to 0.14?

@jacobschaer
Contributor

Move to 0.14 - we might not have anything to do in the client code, but it will still need validation. No release deadline was given, but we chatted with them and it's "very soon".

@jreback
Contributor

jreback commented Jan 29, 2014

famous last words!

ok then

@jacobschaer
Contributor

@jreback @azbones @sean-schaefer @markdregan
Famous last words indeed. They said that it's fixed but not yet committed to their public repository. In the meantime, we are considering a rewrite for gbq v2 which will no longer be dependent on bq.py.

@azbones

azbones commented Feb 21, 2014

@jreback @markdregan Google informed me they pushed the fix into production, and I used my same test with 500K rows returned to validate that the API is working correctly with our existing code. @markdregan, if you could also confirm this with your dataset, that would be great!
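
For anyone repeating the validation, a hedged sketch of this kind of check (dataset and table names are placeholders for the 500K-row test table with its unique row_number column):

df = gbq.read_gbq('SELECT row_number FROM test_dataset.natality_500k', project_id=projectid)
assert len(df) == 500000
assert df['row_number'].nunique() == 500000  # no duplicates and no gaps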

As we understand it, this particular bug was an issue with how their API returned page tokens in larger returned datasets (100k+ rows).

Just FYI, as @jacobschaer mentioned, we are working to refactor our current code to reduce some of the dependencies on bq.py for a future release by using their API v2, which seems to be much improved (in documentation and function) from when we were working on this in the summer.

@jreback
Contributor

jreback commented Apr 9, 2014

@jacobschaer still open?
