Faster dataframe construction #128
@@ -10,7 +10,6 @@
 
 from distutils.version import StrictVersion
 from pandas import compat, DataFrame
-from pandas.compat import lzip
 
 
 def _check_google_client_version():

@@ -716,19 +715,16 @@ def _parse_data(schema, rows):
                  'TIMESTAMP': 'M8[ns]'}
 
     fields = schema['fields']
-    col_types = [field['type'] for field in fields]
-    col_names = [str(field['name']) for field in fields]
-    col_dtypes = [
-        dtype_map.get(field['type'].upper(), object)
-        for field in fields
-    ]
-    page_array = np.zeros((len(rows),), dtype=lzip(col_names, col_dtypes))
-    for row_num, entries in enumerate(rows):
-        for col_num in range(len(col_types)):
-            field_value = entries[col_num]
-            page_array[row_num][col_num] = field_value
-
-    return DataFrame(page_array, columns=col_names)
+
+    column_dtypes = {
+        str(field['name']):
+        dtype_map.get(field['type'].upper(), object) for field in fields
+    }
+
+    df = DataFrame(data=(iter(r) for r in rows), columns=column_dtypes.keys())
+    for column in df:
+        df[column] = df[column].astype(column_dtypes[column])
Do we even need the

That's a good point re I think we do need this for dates; otherwise they'll be

Actually there's already a test: https://github.com/max-sixty/pandas-gbq/blob/no-python-loop/tests/system/test_gbq.py#L165
What's happening: pandas is ignoring setting a column with
+    return df
 
 
 def read_gbq(query, project_id=None, index_col=None, col_order=None,
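The construction pattern in the added lines can be tried outside the library with a small standalone sketch. This is an illustration under stated assumptions, not the PR's code: `parse_rows` and `DTYPE_MAP` are hypothetical names, the `'FLOAT'` entry is assumed (only the `'TIMESTAMP'` entry is visible in the diff), and the rows are materialized as a list of tuples rather than passed as a generator.

```python
import datetime

import pandas as pd

# Assumed mapping; only the 'TIMESTAMP' entry appears in the diff above.
DTYPE_MAP = {'FLOAT': 'float64', 'TIMESTAMP': 'M8[ns]'}


def parse_rows(schema, rows):
    """Build a DataFrame in one call, then cast each column to its mapped dtype."""
    fields = schema['fields']
    # Unknown BigQuery types fall back to object, as in the diff.
    column_dtypes = {
        str(field['name']): DTYPE_MAP.get(field['type'].upper(), object)
        for field in fields
    }
    # No per-cell Python loop: pandas consumes whole rows at once.
    df = pd.DataFrame(data=[tuple(r) for r in rows],
                      columns=list(column_dtypes))
    for column in df:
        df[column] = df[column].astype(column_dtypes[column])
    return df


schema = {'fields': [
    {'name': 'name', 'type': 'STRING'},
    {'name': 'score', 'type': 'FLOAT'},
    {'name': 'seen_at', 'type': 'TIMESTAMP'},
]}
rows = [
    ('a', 1, datetime.datetime(2018, 1, 1)),
    ('b', 2.5, datetime.datetime(2018, 1, 2)),
]
df = parse_rows(schema, rows)
```

The explicit `astype` pass is what pins the dtype of each column to the schema instead of leaving it to pandas' inference.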
@@ -12,4 +12,4 @@ tag_prefix =
 parentdir_prefix = pandas_gbq-
 
 [flake8]
-ignore = E731
+ignore = E731, I002
I wonder if our `dtype_map` needs some updating, as `google-cloud-bigquery` already does some type conversions. Travis seems to think something is up with this.

Sidenote (not relevant for this PR): I think this line may be important for #123. We'd want to fall back to object if the mode is `REPEATED`.
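One way the `REPEATED` fallback from the sidenote could look is sketched below. It is a hypothetical helper, not code from this PR or from #123: `resolve_dtype` and `DTYPE_MAP` are made-up names, the `'FLOAT'` entry is assumed, and the schema fields are assumed to carry BigQuery's `mode` key (`NULLABLE`, `REQUIRED`, or `REPEATED`).

```python
# Assumed mapping; only the 'TIMESTAMP' entry appears in the diff.
DTYPE_MAP = {'FLOAT': 'float64', 'TIMESTAMP': 'M8[ns]'}


def resolve_dtype(field):
    """Pick a column dtype for one schema field (hypothetical helper)."""
    # A REPEATED field holds a list per row, so no scalar dtype fits.
    if field.get('mode', 'NULLABLE').upper() == 'REPEATED':
        return object
    # Otherwise use the mapped dtype, defaulting to object for unmapped types.
    return DTYPE_MAP.get(field['type'].upper(), object)


scalar_float = resolve_dtype({'name': 'x', 'type': 'FLOAT', 'mode': 'NULLABLE'})
repeated_float = resolve_dtype({'name': 'xs', 'type': 'FLOAT', 'mode': 'REPEATED'})
plain_string = resolve_dtype({'name': 's', 'type': 'STRING'})
```

Checking the mode before the type lookup keeps the existing `dtype_map.get(..., object)` behaviour unchanged for non-repeated fields.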