Commit 70d7256

Merge branch 'master' into fix-melt
2 parents 455a310 + c23b1a4 commit 70d7256

43 files changed

+5682
-10579
lines changed

ci/requirements-2.7.pip

-1
@@ -1,4 +1,3 @@
-statsmodels
 blosc
 httplib2
 google-api-python-client==1.2

ci/requirements-2.7_COMPAT.run

-1
@@ -4,7 +4,6 @@ pytz=2013b
 scipy=0.11.0
 xlwt=0.7.5
 xlrd=0.9.2
-statsmodels=0.4.3
 bottleneck=0.8.0
 numexpr=2.2.2
 pytables=3.0.0

ci/requirements-2.7_LOCALE.run

-1
@@ -13,5 +13,4 @@ html5lib=1.0b2
 lxml=3.2.1
 scipy=0.11.0
 beautiful-soup=4.2.1
-statsmodels=0.4.3
 bigquery=2.0.17

ci/requirements-2.7_SLOW.run

-1
@@ -4,7 +4,6 @@ numpy=1.8.2
 matplotlib=1.3.1
 scipy
 patsy
-statsmodels
 xlwt
 openpyxl
 xlsxwriter

ci/requirements-3.4_SLOW.run

-1
@@ -17,5 +17,4 @@ sqlalchemy
 bottleneck
 pymysql
 psycopg2
-statsmodels
 jinja2=2.8

doc/source/install.rst

+5-8
@@ -250,9 +250,9 @@ Optional Dependencies
 * `Feather Format <https://github.com/wesm/feather>`__: necessary for feather-based storage, version 0.3.1 or higher.
 * `SQLAlchemy <http://www.sqlalchemy.org>`__: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy, you also need a database specific driver. You can find an overview of supported drivers for each SQL dialect in the `SQLAlchemy docs <http://docs.sqlalchemy.org/en/latest/dialects/index.html>`__. Some common drivers are:

-- `psycopg2 <http://initd.org/psycopg/>`__: for PostgreSQL
-- `pymysql <https://github.com/PyMySQL/PyMySQL>`__: for MySQL.
-- `SQLite <https://docs.python.org/3.5/library/sqlite3.html>`__: for SQLite, this is included in Python's standard library by default.
+* `psycopg2 <http://initd.org/psycopg/>`__: for PostgreSQL
+* `pymysql <https://github.com/PyMySQL/PyMySQL>`__: for MySQL.
+* `SQLite <https://docs.python.org/3.5/library/sqlite3.html>`__: for SQLite, this is included in Python's standard library by default.

 * `matplotlib <http://matplotlib.org/>`__: for plotting
 * For Excel I/O:
@@ -272,11 +272,8 @@ Optional Dependencies
 <http://www.vergenet.net/~conrad/software/xsel/>`__, or `xclip
 <https://github.com/astrand/xclip/>`__: necessary to use
 :func:`~pandas.read_clipboard`. Most package managers on Linux distributions will have ``xclip`` and/or ``xsel`` immediately available for installation.
-* Google's `python-gflags <<https://github.com/google/python-gflags/>`__ ,
-`oauth2client <https://github.com/google/oauth2client>`__ ,
-`httplib2 <http://pypi.python.org/pypi/httplib2>`__
-and `google-api-python-client <http://github.com/google/google-api-python-client>`__
-: Needed for :mod:`~pandas.io.gbq`
+* For Google BigQuery I/O - see :ref:`here <io.bigquery_deps>`.
+
 * `Backports.lzma <https://pypi.python.org/pypi/backports.lzma/>`__: Only for Python 2, for writing to and/or reading from an xz compressed DataFrame in CSV; Python 3 support is built into the standard library.
 * One of the following combinations of libraries is needed to use the
 top-level :func:`~pandas.read_html` function:

doc/source/io.rst

+47-14
@@ -39,7 +39,7 @@ object.
 * :ref:`read_json<io.json_reader>`
 * :ref:`read_msgpack<io.msgpack>`
 * :ref:`read_html<io.read_html>`
-* :ref:`read_gbq<io.bigquery_reader>`
+* :ref:`read_gbq<io.bigquery>`
 * :ref:`read_stata<io.stata_reader>`
 * :ref:`read_sas<io.sas_reader>`
 * :ref:`read_clipboard<io.clipboard>`
@@ -55,7 +55,7 @@ The corresponding ``writer`` functions are object methods that are accessed like
 * :ref:`to_json<io.json_writer>`
 * :ref:`to_msgpack<io.msgpack>`
 * :ref:`to_html<io.html>`
-* :ref:`to_gbq<io.bigquery_writer>`
+* :ref:`to_gbq<io.bigquery>`
 * :ref:`to_stata<io.stata_writer>`
 * :ref:`to_clipboard<io.clipboard>`
 * :ref:`to_pickle<io.pickle>`
@@ -4648,16 +4648,11 @@ DataFrame with a shape and data types derived from the source table.
 Additionally, DataFrames can be inserted into new BigQuery tables or appended
 to existing tables.

-You will need to install some additional dependencies:
-
-- Google's `python-gflags <https://github.com/google/python-gflags/>`__
-- `httplib2 <http://pypi.python.org/pypi/httplib2>`__
-- `google-api-python-client <http://github.com/google/google-api-python-client>`__
-
 .. warning::

    To use this module, you will need a valid BigQuery account. Refer to the
-   `BigQuery Documentation <https://cloud.google.com/bigquery/what-is-bigquery>`__ for details on the service itself.
+   `BigQuery Documentation <https://cloud.google.com/bigquery/what-is-bigquery>`__
+   for details on the service itself.

 The key functions are:

@@ -4671,7 +4666,44 @@ The key functions are:

 .. currentmodule:: pandas

-.. _io.bigquery_reader:
+
+Supported Data Types
+++++++++++++++++++++
+
+Pandas supports all these `BigQuery data types <https://cloud.google.com/bigquery/data-types>`__:
+``STRING``, ``INTEGER`` (64-bit), ``FLOAT`` (64-bit), ``BOOLEAN`` and
+``TIMESTAMP`` (microsecond precision). Data types ``BYTES`` and ``RECORD``
+are not supported.
+
+Integer and boolean ``NA`` handling
++++++++++++++++++++++++++++++++++++
+
+.. versionadded:: 0.20
+
+Since all columns in BigQuery queries are nullable, and NumPy lacks ``NA``
+support for integer and boolean types, this module will store ``INTEGER`` or
+``BOOLEAN`` columns with at least one ``NULL`` value as ``dtype=object``.
+Otherwise those columns will be stored as ``dtype=int64`` or ``dtype=bool``
+respectively.
+
+This is the opposite of the default pandas behaviour, which promotes integer
+type to float in order to store NAs. See the :ref:`gotchas<gotchas.intna>`
+for a detailed explanation.
+
+While this trade-off works well in most cases, it breaks down when storing
+values greater than 2**53. Such values in BigQuery can represent identifiers,
+and unnoticed precision loss for identifiers is what we want to avoid.
+
+.. _io.bigquery_deps:
+
+Dependencies
+++++++++++++
+
+This module requires the following additional dependencies:
+
+- `httplib2 <https://github.com/httplib2/httplib2>`__: HTTP client
+- `google-api-python-client <http://github.com/google/google-api-python-client>`__: Google's API client
+- `oauth2client <https://github.com/google/oauth2client>`__: authentication and authorization for Google's API

 .. _io.bigquery_authentication:

@@ -4686,7 +4718,7 @@ Is possible to authenticate with either user account credentials or service acco
 Authenticating with user account credentials is as simple as following the prompts in a browser window
 which will be automatically opened for you. You will be authenticated to the specified
 ``BigQuery`` account using the product name ``pandas GBQ``. It is only possible on local host.
-The remote authentication using user account credentials is not currently supported in Pandas.
+The remote authentication using user account credentials is not currently supported in pandas.
 Additional information on the authentication mechanism can be found
 `here <https://developers.google.com/identity/protocols/OAuth2#clientside/>`__.

@@ -4695,8 +4727,6 @@ is particularly useful when working on remote servers (eg. jupyter iPython noteb
 Additional information on service accounts can be found
 `here <https://developers.google.com/identity/protocols/OAuth2#serviceaccount>`__.

-You will need to install an additional dependency: `oauth2client <https://github.com/google/oauth2client>`__.
-
 Authentication via ``application default credentials`` is also possible. This is only valid
 if the parameter ``private_key`` is not provided. This method also requires that
 the credentials can be fetched from the environment the code is running in.
@@ -4716,6 +4746,7 @@ Additional information on
 A private key can be obtained from the Google developers console by clicking
 `here <https://console.developers.google.com/permissions/serviceaccounts>`__. Use JSON key type.

+.. _io.bigquery_reader:

 Querying
 ''''''''
@@ -4775,7 +4806,6 @@ For more information about query configuration parameters see

 .. _io.bigquery_writer:

-
 Writing DataFrames
 ''''''''''''''''''

@@ -4865,6 +4895,8 @@ For example:
 often as the service seems to be changing and evolving. BigQuery is best for analyzing large
 sets of data quickly, but it is not a direct replacement for a transactional database.

+.. _io.bigquery_create_tables:
+
 Creating BigQuery Tables
 ''''''''''''''''''''''''

@@ -4894,6 +4926,7 @@ produce the dictionary representation schema of the specified pandas DataFrame.
 the new table with a different name. Refer to
 `Google BigQuery issue 191 <https://code.google.com/p/google-bigquery/issues/detail?id=191>`__.

+
 .. _io.stata:

 Stata Format
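
Reviewer note on the ``2**53`` threshold in the new "Integer and boolean ``NA`` handling" text: the sketch below is not pandas code. It uses only built-in Python floats (IEEE 754 doubles, the same representation as ``float64``) to show why promoting a large integer identifier to float can silently change it, while keeping Python integers does not. The identifier value is made up for illustration.

```python
# Illustrative only: the identifier below is a made-up value.
LIMIT = 2 ** 53                 # every integer with magnitude up to 2**53 fits exactly in a float64

big_id = LIMIT + 1              # hypothetical BigQuery INTEGER identifier

# Promotion to float, as happens when an integer column must also hold NaN.
as_float = float(big_id)
round_tripped = int(as_float)

# Keeping Python ints next to None, as an object-dtype column would.
as_object = [big_id, None]

print(round_tripped == big_id)  # False: the float could not hold the exact value
print(as_object[0] == big_id)   # True: the integer is preserved unchanged
```

The float round trip lands on a neighbouring representable value, which is exactly the kind of unnoticed change to an identifier that the documentation text warns about.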

doc/source/whatsnew/v0.20.0.txt

+6-2
@@ -369,7 +369,9 @@ Other API Changes
 - ``pd.read_csv()`` will now raise a ``ValueError`` for the C engine if the quote character is larger than one byte (:issue:`11592`)
 - ``inplace`` arguments now require a boolean value, else a ``ValueError`` is thrown (:issue:`14189`)
 - ``pandas.api.types.is_datetime64_ns_dtype`` will now report ``True`` on a tz-aware dtype, similar to ``pandas.api.types.is_datetime64_any_dtype``
-- ``DataFrame.asof()`` will return a null filled ``Series`` instead the scalar ``NaN`` if a match is not found (:issue:`15118`)
+- ``DataFrame.asof()`` will return a null filled ``Series`` instead the scalar ``NaN`` if a match is not found (:issue:`15118`)
+- The :func:`pd.read_gbq` method now stores ``INTEGER`` columns as ``dtype=object`` if they contain ``NULL`` values. Otherwise they are stored as ``int64``. This prevents precision loss for integers greater than 2**53. Furthermore, ``FLOAT`` columns with values above 10**4 are no longer cast to ``int64``, which also caused precision loss (:issue:`14064`, :issue:`14305`).
+
 .. _whatsnew_0200.deprecations:

 Deprecations
@@ -396,7 +398,7 @@ Removal of prior version deprecations/changes
 - The ``pandas.io.ga`` module with a ``google-analytics`` interface is removed (:issue:`11308`).
   Similar functionality can be found in the `Google2Pandas <https://github.com/panalysis/Google2Pandas>`__ package.
 - ``pd.to_datetime`` and ``pd.to_timedelta`` have dropped the ``coerce`` parameter in favor of ``errors`` (:issue:`13602`)
-
+- ``pandas.stats.fama_macbeth``, ``pandas.stats.ols``, ``pandas.stats.plm`` and ``pandas.stats.var``, as well as the top-level ``pandas.fama_macbeth`` and ``pandas.ols`` routines are removed. Similar functionality can be found in the `statsmodels <http://www.statsmodels.org/dev/>`__ package. (:issue:`11898`)



@@ -439,6 +441,7 @@ Bug Fixes

 - Bug in ``DataFrame.loc`` with indexing a ``MultiIndex`` with a ``Series`` indexer (:issue:`14730`)

+
 - Bug in ``pd.read_msgpack()`` in which ``Series`` categoricals were being improperly processed (:issue:`14901`)
 - Bug in ``Series.ffill()`` with mixed dtypes containing tz-aware datetimes. (:issue:`14956`)

@@ -518,3 +521,4 @@ Bug Fixes
 - Bug in ``DataFrame.boxplot`` where ``fontsize`` was not applied to the tick labels on both axes (:issue:`15108`)
 - Bug in ``Series.replace`` and ``DataFrame.replace`` which failed on empty replacement dicts (:issue:`15289`)
 - Bug in ``pd.melt`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`)
+- Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)

pandas/api/tests/test_api.py

+2-2
@@ -42,7 +42,7 @@ class TestPDApi(Base, tm.TestCase):
            'json', 'lib', 'index', 'parser']

     # these are already deprecated; awaiting removal
-    deprecated_modules = ['ols', 'stats', 'datetools']
+    deprecated_modules = ['stats', 'datetools']

     # misc
     misc = ['IndexSlice', 'NaT']
@@ -109,7 +109,7 @@ class TestPDApi(Base, tm.TestCase):
                         'expanding_max', 'expanding_mean', 'expanding_median',
                         'expanding_min', 'expanding_quantile',
                         'expanding_skew', 'expanding_std', 'expanding_sum',
-                        'expanding_var', 'fama_macbeth', 'rolling_apply',
+                        'expanding_var', 'rolling_apply',
                         'rolling_corr', 'rolling_count', 'rolling_cov',
                         'rolling_kurt', 'rolling_max', 'rolling_mean',
                         'rolling_median', 'rolling_min', 'rolling_quantile',

pandas/computation/eval.py

+2-3
@@ -236,7 +236,7 @@ def eval(expr, parser='pandas', engine=None, truediv=True,
     first_expr = True
     if isinstance(expr, string_types):
         _check_expression(expr)
-        exprs = [e for e in expr.splitlines() if e != '']
+        exprs = [e.strip() for e in expr.splitlines() if e.strip() != '']
     else:
         exprs = [expr]
     multi_line = len(exprs) > 1
@@ -254,8 +254,7 @@ def eval(expr, parser='pandas', engine=None, truediv=True,
         _check_for_locals(expr, level, parser)

         # get our (possibly passed-in) scope
-        level += 1
-        env = _ensure_scope(level, global_dict=global_dict,
+        env = _ensure_scope(level + 1, global_dict=global_dict,
                             local_dict=local_dict, resolvers=resolvers,
                             target=target)
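
Reviewer note on the first hunk above: the two list comprehensions can be compared directly on a plain string, without pandas. The sketch below copies both comprehensions from the diff; the sample expression is hypothetical and is written the way a triple-quoted string inside an indented test method would look (leading indentation plus whitespace-only lines).

```python
# Hypothetical multi-line expression with indentation and whitespace-only lines.
expr = "\n    c = a * @local_var\n    \n    d = c + @local_var\n    "

# Removed line: only exactly-empty strings are filtered out, so indented
# and whitespace-only entries are passed on as if they were expressions.
old_exprs = [e for e in expr.splitlines() if e != '']

# Added line: each entry is stripped first, then empty results are dropped.
new_exprs = [e.strip() for e in expr.splitlines() if e.strip() != '']

print(old_exprs)
print(new_exprs)
```

With the new comprehension only the two real assignments remain, each without leading whitespace.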

pandas/computation/tests/test_eval.py

+15-4
@@ -1274,7 +1274,6 @@ def test_assignment_fails(self):
                           local_dict={'df': df, 'df2': df2})

     def test_assignment_column(self):
-        tm.skip_if_no_ne('numexpr')
         df = DataFrame(np.random.randn(5, 2), columns=list('ab'))
         orig_df = df.copy()

@@ -1346,7 +1345,6 @@ def test_column_in(self):

     def assignment_not_inplace(self):
         # GH 9297
-        tm.skip_if_no_ne('numexpr')
         df = DataFrame(np.random.randn(5, 2), columns=list('ab'))

         actual = df.eval('c = a + b', inplace=False)
@@ -1365,7 +1363,6 @@ def assignment_not_inplace(self):

     def test_multi_line_expression(self):
         # GH 11149
-        tm.skip_if_no_ne('numexpr')
         df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
         expected = df.copy()

@@ -1393,7 +1390,6 @@ def test_multi_line_expression(self):

     def test_multi_line_expression_not_inplace(self):
         # GH 11149
-        tm.skip_if_no_ne('numexpr')
         df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
         expected = df.copy()

@@ -1411,6 +1407,21 @@ def test_multi_line_expression_not_inplace(self):
         e = a + 2""", inplace=False)
         assert_frame_equal(expected, df)

+    def test_multi_line_expression_local_variable(self):
+        # GH 15342
+        df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
+        expected = df.copy()
+
+        local_var = 7
+        expected['c'] = expected['a'] * local_var
+        expected['d'] = expected['c'] + local_var
+        ans = df.eval("""
+        c = a * @local_var
+        d = c + @local_var
+        """, inplace=True)
+        assert_frame_equal(expected, df)
+        self.assertIsNone(ans)
+
     def test_assignment_in_query(self):
         # GH 8664
         df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
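
Reviewer note on why the new test needs ``@local_var`` on the second line: a minimal sketch of the failure mode, assuming the scope lookup in ``eval`` runs once per expression inside a loop (the loop itself is not visible in the ``eval.py`` hunk). The helper names ``resolve_buggy``, ``resolve_fixed`` and ``caller`` are hypothetical, and ``sys._getframe`` is a CPython-specific way to reach a caller's locals; none of this is the pandas implementation.

```python
import sys


def resolve_buggy(names, level=0):
    """Look up each name in a calling frame, mutating ``level`` inside the loop."""
    found = []
    for name in names:
        level += 1                        # drifts one frame further on every pass
        frame = sys._getframe(level)
        found.append(dict(frame.f_locals).get(name))
    return found


def resolve_fixed(names, level=0):
    """Same lookup, but ``level`` itself is never modified."""
    found = []
    for name in names:
        frame = sys._getframe(level + 1)  # always the immediate caller
        found.append(dict(frame.f_locals).get(name))
    return found


def caller():
    local_var = 7
    names = ['local_var', 'local_var']    # one lookup per "line" of the expression
    return resolve_buggy(names), resolve_fixed(names)


buggy, fixed = caller()
print(buggy)   # the second lookup lands in the wrong frame
print(fixed)   # both lookups see the caller's local_var
```

In this sketch the first lookup succeeds either way, which matches the whatsnew wording that only local variables "not on the first line" were affected.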

pandas/io/gbq.py

+13-11
@@ -603,18 +603,14 @@ def _parse_data(schema, rows):
     # see:
     # http://pandas.pydata.org/pandas-docs/dev/missing_data.html
     # #missing-data-casting-rules-and-indexing
-    dtype_map = {'INTEGER': np.dtype(float),
-                 'FLOAT': np.dtype(float),
-                 # This seems to be buggy without nanosecond indicator
+    dtype_map = {'FLOAT': np.dtype(float),
                  'TIMESTAMP': 'M8[ns]'}

     fields = schema['fields']
     col_types = [field['type'] for field in fields]
     col_names = [str(field['name']) for field in fields]
     col_dtypes = [dtype_map.get(field['type'], object) for field in fields]
-    page_array = np.zeros((len(rows),),
-                          dtype=lzip(col_names, col_dtypes))
-
+    page_array = np.zeros((len(rows),), dtype=lzip(col_names, col_dtypes))
     for row_num, raw_row in enumerate(rows):
         entries = raw_row.get('f', [])
         for col_num, field_type in enumerate(col_types):
@@ -628,7 +624,9 @@ def _parse_data(schema, rows):
 def _parse_entry(field_value, field_type):
     if field_value is None or field_value == 'null':
         return None
-    if field_type == 'INTEGER' or field_type == 'FLOAT':
+    if field_type == 'INTEGER':
+        return int(field_value)
+    elif field_type == 'FLOAT':
         return float(field_value)
     elif field_type == 'TIMESTAMP':
         timestamp = datetime.utcfromtimestamp(float(field_value))
@@ -757,10 +755,14 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None,
             'Column order does not match this DataFrame.'
         )

-    # Downcast floats to integers and objects to booleans
-    # if there are no NaN's. This is presently due to a
-    # limitation of numpy in handling missing data.
-    final_df._data = final_df._data.downcast(dtypes='infer')
+    # cast BOOLEAN and INTEGER columns from object to bool/int
+    # if they don't have any nulls
+    type_map = {'BOOLEAN': bool, 'INTEGER': int}
+    for field in schema['fields']:
+        if field['type'] in type_map and \
+                final_df[field['name']].notnull().all():
+            final_df[field['name']] = \
+                final_df[field['name']].astype(type_map[field['type']])

     connector.print_elapsed_seconds(
         'Total time taken',
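
Reviewer note on the two ``gbq.py`` changes taken together: the sketch below restates the parse-then-cast logic in plain Python so it can be run without pandas or a BigQuery account. ``parse_entry`` mirrors only the ``INTEGER`` and ``FLOAT`` branches shown in the diff; ``storage_dtype`` and ``NATIVE`` are hypothetical stand-ins for the per-column ``astype`` step, and the raw values are made up.

```python
NATIVE = {'BOOLEAN': 'bool', 'INTEGER': 'int64'}   # hypothetical lookup table


def parse_entry(field_value, field_type):
    """Convert one raw cell (a string or None), following the new branch order."""
    if field_value is None or field_value == 'null':
        return None
    if field_type == 'INTEGER':
        return int(field_value)        # no detour through float any more
    elif field_type == 'FLOAT':
        return float(field_value)
    return field_value


def storage_dtype(values, field_type):
    """Report how an INTEGER/BOOLEAN column would be stored after the cast step."""
    if field_type not in NATIVE:
        raise ValueError('only INTEGER and BOOLEAN are covered by this sketch')
    if all(v is not None for v in values):
        return NATIVE[field_type]      # no nulls: cast to the native dtype
    return 'object'                    # at least one null: keep Python objects


raw_ids = ['9007199254740993', '12']            # made-up identifiers
raw_with_null = ['9007199254740993', None]

ids = [parse_entry(v, 'INTEGER') for v in raw_ids]
ids_with_null = [parse_entry(v, 'INTEGER') for v in raw_with_null]

print(ids, storage_dtype(ids, 'INTEGER'))
print(ids_with_null, storage_dtype(ids_with_null, 'INTEGER'))
```

In both cases the large identifier survives parsing unchanged; only the reported storage type differs depending on whether the column contains a null.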
