
Commit 4300489

Merge commit 'v0.14.0-345-g8cd3dd6' into debian
* commit 'v0.14.0-345-g8cd3dd6': (73 commits)
  PERF: allow slice indexers to be computed faster
  PERF: allow dst transition computations to be handled much faster if the end-points are ok (GH7633)
  Revert "Merge pull request pandas-dev#7591 from mcwitt/parse-index-cols-c"
  TST: fixes for 2.6 comparisons
  BUG: Error in rolling_var if window is larger than array, fixes pandas-dev#7297
  REGR: Add back #N/A N/A as a default NA value (regresion from 0.12) (GH5521)
  BUG: xlim on plots with shared axes (GH2960, GH3490)
  BUG: Bug in Series.get with a boolean accessor (GH7407)
  DOC: add v0.15.0.txt template
  DOC: small doc build fixes
  DOC: v0.14.1 edits
  BUG: doc example in groupby.rst (GH7559 / GH7628)
  PERF: optimize MultiIndex.from_product for large iterables
  ENH: change BlockManager pickle format to work with dup items
  BUG: {expanding,rolling}_{cov,corr} don't handle arguments with different index sets properly
  CLN/DEPR: Fix instances of 'U'/'rU' in open(...)
  CLN: Fix typo
  TST: fix groupby test on windows (related GH7580)
  COMPAT: make numpy NaT comparison use a view to avoid implicit conversions
  BUG: Bug in to_timedelta that accepted invalid units and misinterpreted m/h (GH7611, GH6423)
  ...
2 parents 95aed53 + 8cd3dd6 commit 4300489

Some content is hidden: large commits have some content hidden by default, so only a subset of the 78 changed files is shown below.

78 files changed: +3394 -1986 lines changed

ci/requirements-2.6.txt (-1)

@@ -4,7 +4,6 @@ python-dateutil==1.5
 pytz==2013b
 http://www.crummy.com/software/BeautifulSoup/bs4/download/4.2/beautifulsoup4-4.2.0.tar.gz
 html5lib==1.0b2
-bigquery==2.0.17
 numexpr==1.4.2
 sqlalchemy==0.7.1
 pymysql==0.6.0

ci/requirements-2.7.txt (+3 -1)

@@ -19,5 +19,7 @@ lxml==3.2.1
 scipy==0.13.3
 beautifulsoup4==4.2.1
 statsmodels==0.5.0
-bigquery==2.0.17
 boto==2.26.1
+httplib2==0.8
+python-gflags==2.0
+google-api-python-client==1.2

ci/requirements-3.4.txt (+1 -1)

@@ -5,7 +5,7 @@ xlsxwriter
 xlrd
 html5lib
 numpy==1.8.0
-cython==0.20.0
+cython==0.20.2
 scipy==0.13.3
 numexpr==2.4
 tables==3.1.0

doc/source/cookbook.rst (+22)

@@ -663,3 +663,25 @@ To globally provide aliases for axis names, one can define these 2 functions:
    df2 = DataFrame(randn(3,2),columns=['c1','c2'],index=['i1','i2','i3'])
    df2.sum(axis='myaxis2')
    clear_axis_alias(DataFrame,'columns', 'myaxis2')
+
+Creating Example Data
+---------------------
+
+To create a dataframe from every combination of some given values, like R's ``expand.grid()``
+function, we can create a dict where the keys are column names and the values are lists
+of the data values:
+
+.. ipython:: python
+
+   import itertools
+
+   def expand_grid(data_dict):
+      rows = itertools.product(*data_dict.values())
+      return pd.DataFrame.from_records(rows, columns=data_dict.keys())
+
+   df = expand_grid(
+      {'height': [60, 70],
+       'weight': [100, 140, 180],
+       'sex': ['Male', 'Female']}
+   )
+   df
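The ``expand_grid`` recipe added above is easy to try outside the docs build. Here is a self-contained sketch of the same helper (a minimal version, assuming only pandas is installed; the ``print`` call is illustrative):

.. code-block:: python

   import itertools
   import pandas as pd

   def expand_grid(data_dict):
       # Cartesian product of the value lists, one tuple per output row
       rows = itertools.product(*data_dict.values())
       return pd.DataFrame.from_records(rows, columns=list(data_dict.keys()))

   df = expand_grid({'height': [60, 70],
                     'weight': [100, 140, 180],
                     'sex': ['Male', 'Female']})
   print(df.shape)  # (12, 3): 2 * 3 * 2 combinations, like R's expand.grid()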

doc/source/install.rst (+3 -1)

@@ -112,7 +112,9 @@ Optional Dependencies
   :func:`~pandas.io.clipboard.read_clipboard`. Most package managers on Linux
   distributions will have xclip and/or xsel immediately available for
   installation.
-* `Google bq Command Line Tool <https://developers.google.com/bigquery/bq-command-line-tool/>`__
+* Google's `python-gflags` and `google-api-python-client`
+   * Needed for :mod:`~pandas.io.gbq`
+* `httplib2`
    * Needed for :mod:`~pandas.io.gbq`
 * One of the following combinations of libraries is needed to use the
   top-level :func:`~pandas.io.html.read_html` function:
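Since the optional dependency list changed here, a quick import check can confirm the new :mod:`~pandas.io.gbq` requirements are present. A minimal sketch, assuming the three packages from the diff above are installed (the module names below are, to the best of my knowledge, the import names those distributions expose):

.. code-block:: python

   # Import names for the three new io.gbq dependencies
   import httplib2                          # httplib2
   import gflags                            # python-gflags
   from apiclient.discovery import build    # google-api-python-client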

doc/source/io.rst (+78 -54)

@@ -98,8 +98,10 @@ They can take a number of arguments:
   data. Defaults to 0 if no ``names`` passed, otherwise ``None``. Explicitly
   pass ``header=0`` to be able to replace existing names. The header can be
   a list of integers that specify row locations for a multi-index on the columns
-  E.g. [0,1,3]. Intervening rows that are not specified will be skipped.
-  (E.g. 2 in this example are skipped)
+  E.g. [0,1,3]. Intervening rows that are not specified will be
+  skipped (e.g. 2 in this example are skipped). Note that this parameter
+  ignores commented lines, so header=0 denotes the first line of
+  data rather than the first line of the file.
 - ``skiprows``: A collection of numbers for rows in the file to skip. Can
   also be an integer to skip the first ``n`` rows
 - ``index_col``: column number, column name, or list of column numbers/names,
@@ -145,8 +147,12 @@ They can take a number of arguments:
   Acceptable values are 0, 1, 2, and 3 for QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONE, and QUOTE_NONNUMERIC, respectively.
 - ``skipinitialspace`` : boolean, default ``False``, Skip spaces after delimiter
 - ``escapechar`` : string, to specify how to escape quoted data
-- ``comment``: denotes the start of a comment and ignores the rest of the line.
-  Currently line commenting is not supported.
+- ``comment``: Indicates remainder of line should not be parsed. If found at the
+  beginning of a line, the line will be ignored altogether. This parameter
+  must be a single character. Also, fully commented lines
+  are ignored by the parameter `header` but not by `skiprows`. For example,
+  if comment='#', parsing '#empty\n1,2,3\na,b,c' with `header=0` will
+  result in '1,2,3' being treated as the header.
 - ``nrows``: Number of rows to read out of the file. Useful to only read a
   small portion of a large file
 - ``iterator``: If True, return a ``TextFileReader`` to enable reading a file
@@ -252,6 +258,27 @@ after a delimiter:
    data = 'a, b, c\n1, 2, 3\n4, 5, 6'
    print(data)
    pd.read_csv(StringIO(data), skipinitialspace=True)
+
+Moreover, ``read_csv`` ignores any completely commented lines:
+
+.. ipython:: python
+
+   data = 'a,b,c\n# commented line\n1,2,3\n#another comment\n4,5,6'
+   print(data)
+   pd.read_csv(StringIO(data), comment='#')
+
+.. note::
+
+   The presence of ignored lines might create ambiguities involving line numbers;
+   the parameter ``header`` uses row numbers (ignoring commented
+   lines), while ``skiprows`` uses line numbers (including commented lines):
+
+   .. ipython:: python
+
+      data = '#comment\na,b,c\nA,B,C\n1,2,3'
+      pd.read_csv(StringIO(data), comment='#', header=1)
+      data = 'A,B,C\n#comment\na,b,c\n1,2,3'
+      pd.read_csv(StringIO(data), comment='#', skiprows=2)

 The parsers make every attempt to "do the right thing" and not be very
 fragile. Type inference is a pretty big deal. So if a column can be coerced to
@@ -3373,83 +3400,80 @@ Google BigQuery (Experimental)
 The :mod:`pandas.io.gbq` module provides a wrapper for Google's BigQuery
 analytics web service to simplify retrieving results from BigQuery tables
 using SQL-like queries. Result sets are parsed into a pandas
-DataFrame with a shape derived from the source table. Additionally,
-DataFrames can be uploaded into BigQuery datasets as tables
-if the source datatypes are compatible with BigQuery ones.
+DataFrame with a shape and data types derived from the source table.
+Additionally, DataFrames can be appended to existing BigQuery tables if
+the destination table is the same shape as the DataFrame.

 For specifics on the service itself, see `here <https://developers.google.com/bigquery/>`__

-As an example, suppose you want to load all data from an existing table
-: `test_dataset.test_table`
-into BigQuery and pull it into a DataFrame.
+As an example, suppose you want to load all data from an existing BigQuery
+table : `test_dataset.test_table` into a DataFrame using the :func:`~pandas.io.read_gbq`
+function.

 .. code-block:: python

-   from pandas.io import gbq
-
    # Insert your BigQuery Project ID Here
-   # Can be found in the web console, or
-   # using the command line tool `bq ls`
+   # Can be found in the Google web console
    projectid = "xxxxxxxx"

-   data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table', project_id = projectid)
+   data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', project_id = projectid)

-The user will then be authenticated by the `bq` command line client -
-this usually involves the default browser opening to a login page,
-though the process can be done entirely from command line if necessary.
-Datasets and additional parameters can be either configured with `bq`,
-passed in as options to `read_gbq`, or set using Google's gflags (this
-is not officially supported by this module, though care was taken
-to ensure that they should be followed regardless of how you call the
-method).
+You will then be authenticated to the specified BigQuery account
+via Google's OAuth2 mechanism. In general, this is as simple as following the
+prompts in a browser window which will be opened for you. Should the browser not
+be available, or fail to launch, a code will be provided to complete the process
+manually. Additional information on the authentication mechanism can be found
+`here <https://developers.google.com/accounts/docs/OAuth2#clientside/>`__

-Additionally, you can define which column to use as an index as well as a preferred column order as follows:
+You can define which column from BigQuery to use as an index in the
+destination DataFrame as well as a preferred column order as follows:

 .. code-block:: python

-   data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table',
+   data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                              index_col='index_column_name',
-                             col_order='[col1, col2, col3,...]', project_id = projectid)
-
-Finally, if you would like to create a BigQuery table, `my_dataset.my_table`, from the rows of DataFrame, `df`:
+                             col_order=['col1', 'col2', 'col3'], project_id = projectid)
+
+Finally, you can append data to a BigQuery table from a pandas DataFrame
+using the :func:`~pandas.io.to_gbq` function. This function uses the
+Google streaming API which requires that your destination table exists in
+BigQuery. Given the BigQuery table already exists, your DataFrame should
+match the destination table in column order, structure, and data types.
+DataFrame indexes are not supported. By default, rows are streamed to
+BigQuery in chunks of 10,000 rows, but you can pass other chunk values
+via the ``chunksize`` argument. You can also see the progress of your
+post via the ``verbose`` flag which defaults to ``True``. The http
+response code of Google BigQuery can be successful (200) even if the
+append failed. For this reason, if there is a failure to append to the
+table, the complete error response from BigQuery is returned which
+can be quite long given it provides a status for each row. You may want
+to start with smaller chunks to test that the size and types of your
+dataframe match your destination table to make debugging simpler.

 .. code-block:: python

    df = pandas.DataFrame({'string_col_name' : ['hello'],
                           'integer_col_name' : [1],
                           'boolean_col_name' : [True]})
-   schema = ['STRING', 'INTEGER', 'BOOLEAN']
-   data_frame = gbq.to_gbq(df, 'my_dataset.my_table',
-                           if_exists='fail', schema = schema, project_id = projectid)
+   df.to_gbq('my_dataset.my_table', project_id = projectid)

-To add more rows to this, simply:
+The BigQuery SQL query language has some oddities, see `here <https://developers.google.com/bigquery/query-reference>`__

-.. code-block:: python
+While BigQuery uses SQL-like syntax, it has some important differences
+from traditional databases in functionality, API limitations (size and
+quantity of queries or uploads), and how Google charges for use of the service.
+You should refer to Google documentation often as the service seems to
+be changing and evolving. BigQuery is best for analyzing large sets of
+data quickly, but it is not a direct replacement for a transactional database.

-   df2 = pandas.DataFrame({'string_col_name' : ['hello2'],
-                           'integer_col_name' : [2],
-                           'boolean_col_name' : [False]})
-   data_frame = gbq.to_gbq(df2, 'my_dataset.my_table', if_exists='append', project_id = projectid)
-
-.. note::
-
-   A default project id can be set using the command line:
-   `bq init`.
-
-   There is a hard cap on BigQuery result sets, at 128MB compressed. Also, the BigQuery SQL query language has some oddities,
-   see `here <https://developers.google.com/bigquery/query-reference>`__
-
-   You can access the management console to determine project id's by:
-   <https://code.google.com/apis/console/b/0/?noredirect>
+You can access the management console to determine project id's by:
+<https://code.google.com/apis/console/b/0/?noredirect>

 .. warning::

-   To use this module, you will need a BigQuery account. See
-   <https://cloud.google.com/products/big-query> for details.
-
-   As of 1/28/14, a known bug is present that could possibly cause data duplication in the resultant dataframe. A fix is imminent,
-   but any client changes will not make it into 0.13.1. See:
-   http://stackoverflow.com/questions/20984592/bigquery-results-not-including-page-token/21009144?noredirect=1#comment32090677_21009144
+   To use this module, you will need a valid BigQuery account. See
+   <https://cloud.google.com/products/big-query> for details on the
+   service.

 .. _io.stata:

doc/source/remote_data.rst (+1 -1)

@@ -52,7 +52,7 @@ Yahoo! Finance
    f=web.DataReader("F", 'yahoo', start, end)
    f.ix['2010-01-04']

-.. _remote_data.yahoo_Options:
+.. _remote_data.yahoo_options:

 Yahoo! Finance Options
 ----------------------

doc/source/timeseries.rst (+4 -3)

@@ -1280,9 +1280,10 @@ To supply the time zone, you can use the ``tz`` keyword to ``date_range`` and
 other functions. Dateutil time zone strings are distinguished from ``pytz``
 time zones by starting with ``dateutil/``.

-- In ``pytz`` you can find a list of common (and less common) time zones using ``from pytz import common_timezones, all_timezones``.
+- In ``pytz`` you can find a list of common (and less common) time zones using
+  ``from pytz import common_timezones, all_timezones``.
 - ``dateutil`` uses the OS timezones so there isn't a fixed list available. For
-   common zones, the names are the same as ``pytz``.
+  common zones, the names are the same as ``pytz``.

 .. ipython:: python

@@ -1448,7 +1449,7 @@ Elements can be set to ``NaT`` using ``np.nan`` analagously to datetimes
    y[1] = np.nan
    y

-Operands can also appear in a reversed order (a singluar object operated with a Series)
+Operands can also appear in a reversed order (a singular object operated with a Series)

 .. ipython:: python

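The two timezone bullets rewrapped in the first hunk can be exercised directly. A small sketch, assuming ``pytz`` and ``dateutil`` are both installed (the sample dates and zone are illustrative):

.. code-block:: python

   import pandas as pd
   from pytz import common_timezones, all_timezones

   # pytz ships explicit zone lists; dateutil reads the OS timezone database,
   # so it has no fixed list, but common zone names match pytz.
   print(len(common_timezones))
   print(len(all_timezones))

   # The same zone, named pytz-style and with the ``dateutil/`` prefix
   rng_pytz = pd.date_range('2014-07-01', periods=3, tz='Europe/London')
   rng_dateutil = pd.date_range('2014-07-01', periods=3, tz='dateutil/Europe/London')
   print(rng_pytz.tz)
   print(rng_dateutil.tz)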