Faster dataframe construction #128

Merged (11 commits, Aug 22, 2018)

Changes from 4 commits
2 changes: 2 additions & 0 deletions .gitignore
```diff
@@ -64,6 +64,8 @@ dist
 **/wheelhouse/*
 # coverage
 .coverage
+.testmondata
+.pytest_cache

 # OS generated files #
 ######################
```
24 changes: 10 additions & 14 deletions pandas_gbq/gbq.py
```diff
@@ -10,7 +10,6 @@

 from distutils.version import StrictVersion
 from pandas import compat, DataFrame
-from pandas.compat import lzip
```

```diff
@@ -716,19 +715,16 @@ def _parse_data(schema, rows):
         'TIMESTAMP': 'M8[ns]'}

     fields = schema['fields']
-    col_types = [field['type'] for field in fields]
-    col_names = [str(field['name']) for field in fields]
-    col_dtypes = [
-        dtype_map.get(field['type'].upper(), object)
-        for field in fields
-    ]
-    page_array = np.zeros((len(rows),), dtype=lzip(col_names, col_dtypes))
-    for row_num, entries in enumerate(rows):
-        for col_num in range(len(col_types)):
-            field_value = entries[col_num]
-            page_array[row_num][col_num] = field_value
-
-    return DataFrame(page_array, columns=col_names)
+    column_dtypes = {
+        str(field['name']):
+            dtype_map.get(field['type'].upper(), object) for field in fields
```
Comment (Collaborator): I wonder if our `dtype_map` needs some updating, as google-cloud-bigquery already does some type conversions. Travis seems to think something is up with this:

```
pandas_gbq/gbq.py:853: in read_gbq
    final_df[field['name']].astype(type_map[field['type'].upper()])
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/generic.py:3054: in astype
    raise_on_error=raise_on_error, **kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:3189: in astype
    return self.apply('astype', dtype=dtype, **kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:3056: in apply
    applied = getattr(b, f)(**kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:461: in astype
    values=values, **kwargs)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/core/internals.py:504: in _astype
    values = _astype_nansafe(values.ravel(), dtype, copy=True)
../../../miniconda/envs/test-environment/lib/python2.7/site-packages/pandas/types/cast.py:534: in _astype_nansafe
    return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
pandas/lib.pyx:980: in pandas.lib.astype_intsafe (pandas/lib.c:17409)
E   ValueError: invalid literal for long() with base 10: '2017-12-13 17:40:39'
```

Sidenote (not relevant for this PR): I think this line may be important for #123. We'd want to fall back to `object` if the mode is `REPEATED`.
```diff
+    }
+
+    df = DataFrame(data=(iter(r) for r in rows), columns=column_dtypes.keys())
+    for column in df:
+        df[column] = df[column].astype(column_dtypes[column])
```
Comment (Collaborator): Do we even need the `astype`? I recall that when I dropped it last time I encountered #174. A behavior change like that would require a version bump, though, so I'm happy with keeping this as a 0.6.1 release if you'd prefer to get this out as-is.

Comment (Contributor, author): That's a good point re int with nulls. I'll add a test and see what happens.

I think we do need this for dates, though; otherwise they'll stay `str`.
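The point about dates can be checked directly. This is a minimal sketch (the timestamp value is illustrative, not taken from the PR's test suite): without the cast, timestamp values stay as plain strings, while `.astype('M8[ns]')` parses them into `datetime64[ns]`.

```python
from pandas import Series

s = Series(['2017-12-13 17:40:39'])   # raw values arrive as str (object dtype)
parsed = s.astype('M8[ns]')           # the cast parses them to datetime64[ns]
```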

Comment (Contributor, author): Actually, there's already a test: https://github.com/max-sixty/pandas-gbq/blob/no-python-loop/tests/system/test_gbq.py#L165

What's happening: pandas ignores casting a column containing NaN to int, and keeps it as object.
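The "int with nulls" concern can be sketched as follows (hypothetical values, not the PR's actual test): a column containing NULLs has no valid NumPy integer representation, so a strict cast to `int64` fails and the column has to be left as float/object instead.

```python
from pandas import Series

s = Series([1, None])    # the NULL forces the inferred dtype to float64 (NaN)
try:
    s.astype('int64')    # NaN has no int64 representation
    casted = True
except (ValueError, TypeError):
    casted = False       # so the column is left uncast instead
```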

```diff
+    return df


 def read_gbq(query, project_id=None, index_col=None, col_order=None,
```
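The strategy in the new `_parse_data` above can be sketched end to end: build the frame in one shot from the raw rows (instead of filling a pre-allocated structured array cell by cell), then cast each column according to a name-to-dtype mapping. The `schema` and `rows` below are hypothetical stand-ins for a BigQuery result, and `dtype_map` is a small subset of the real one.

```python
from pandas import DataFrame

# Subset of the type map for illustration only.
dtype_map = {'FLOAT': 'float64', 'TIMESTAMP': 'M8[ns]'}

schema = {'fields': [{'name': 'x', 'type': 'FLOAT'},
                     {'name': 'label', 'type': 'STRING'}]}
rows = [(1.5, 'a'), (2.5, 'b')]

# Unknown types fall back to object, mirroring dtype_map.get(..., object).
column_dtypes = {
    str(field['name']): dtype_map.get(field['type'].upper(), object)
    for field in schema['fields']
}

# One-shot construction from the rows, then a per-column cast.
df = DataFrame(rows, columns=list(column_dtypes))
for column in df:
    df[column] = df[column].astype(column_dtypes[column])
```

This avoids the O(rows x columns) Python-level assignment loop of the old structured-array approach; pandas does the bulk conversion internally.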
2 changes: 1 addition & 1 deletion setup.cfg
```diff
@@ -12,4 +12,4 @@ tag_prefix =
 parentdir_prefix = pandas_gbq-

 [flake8]
-ignore = E731
+ignore = E731, I002
```