@@ -98,8 +98,10 @@ They can take a number of arguments:
data. Defaults to 0 if no ``names`` passed, otherwise ``None``. Explicitly
pass ``header=0`` to be able to replace existing names. The header can be
a list of integers that specify row locations for a multi-index on the columns
- E.g. [0,1,3]. Intervening rows that are not specified will be skipped.
- (E.g. 2 in this example are skipped)
+ E.g. [0,1,3]. Intervening rows that are not specified will be
+ skipped (e.g. 2 in this example is skipped). Note that this parameter
+ ignores commented lines, so ``header=0`` denotes the first line of
+ data rather than the first line of the file.
- ``skiprows``: A collection of numbers for rows in the file to skip. Can
also be an integer to skip the first ``n`` rows
- ``index_col``: column number, column name, or list of column numbers/names,
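A minimal sketch of the ``header``-as-a-list-of-integers behaviour described above; the ``data`` string is invented purely for illustration, and ``StringIO`` is assumed to be imported as in the ``read_csv`` examples later in this document:

.. code-block:: python

   data = 'a,b,c\nd,e,f\nignored,ignored,ignored\ng,h,i\n1,2,3\n4,5,6'
   # rows 0, 1 and 3 form a three-level MultiIndex on the columns;
   # the unlisted row 2 is skipped entirely
   pd.read_csv(StringIO(data), header=[0, 1, 3])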
@@ -145,8 +147,12 @@ They can take a number of arguments:
Acceptable values are 0, 1, 2, and 3 for QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONE, and QUOTE_NONNUMERIC, respectively.
- ``skipinitialspace`` : boolean, default ``False``, Skip spaces after delimiter
- ``escapechar`` : string, to specify how to escape quoted data
- - ``comment``: denotes the start of a comment and ignores the rest of the line.
-   Currently line commenting is not supported.
+ - ``comment``: Indicates remainder of line should not be parsed. If found at the
+   beginning of a line, the line will be ignored altogether. This parameter
+   must be a single character. Also, fully commented lines
+   are ignored by the parameter ``header`` but not by ``skiprows``. For example,
+   if comment='#', parsing '#empty\n1,2,3\na,b,c' with ``header=0`` will
+   result in '1,2,3' being treated as the header.
- ``nrows``: Number of rows to read out of the file. Useful to only read a
small portion of a large file
- ``iterator``: If True, return a ``TextFileReader`` to enable reading a file
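As a rough sketch of the ``nrows`` and ``iterator`` options listed above (again assuming ``StringIO`` is imported as in the examples further down, and using an invented ``data`` string):

.. code-block:: python

   data = 'a,b,c\n1,2,3\n4,5,6\n7,8,9'
   # read only the first two rows of data
   pd.read_csv(StringIO(data), nrows=2)
   # or get a TextFileReader and pull rows out of it piece by piece
   reader = pd.read_csv(StringIO(data), iterator=True)
   reader.get_chunk(2)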
@@ -252,6 +258,27 @@ after a delimiter:

   data = 'a, b, c\n1, 2, 3\n4, 5, 6'
   print(data)
   pd.read_csv(StringIO(data), skipinitialspace=True)
+
+ Moreover, ``read_csv`` ignores any completely commented lines:
+
+ .. ipython:: python
+
+    data = 'a,b,c\n# commented line\n1,2,3\n#another comment\n4,5,6'
+    print(data)
+    pd.read_csv(StringIO(data), comment='#')
+
+ .. note::
+
+    The presence of ignored lines might create ambiguities involving line numbers;
+    the parameter ``header`` uses row numbers (ignoring commented
+    lines), while ``skiprows`` uses line numbers (including commented lines):
+
+    .. ipython:: python
+
+       data = '#comment\na,b,c\nA,B,C\n1,2,3'
+       pd.read_csv(StringIO(data), comment='#', header=1)
+       data = 'A,B,C\n#comment\na,b,c\n1,2,3'
+       pd.read_csv(StringIO(data), comment='#', skiprows=2)

The parsers make every attempt to "do the right thing" and not be very
fragile. Type inference is a pretty big deal. So if a column can be coerced to
@@ -3373,83 +3400,80 @@ Google BigQuery (Experimental)
The :mod:`pandas.io.gbq` module provides a wrapper for Google's BigQuery
analytics web service to simplify retrieving results from BigQuery tables
using SQL-like queries. Result sets are parsed into a pandas
- DataFrame with a shape derived from the source table. Additionally,
- DataFrames can be uploaded into BigQuery datasets as tables
- if the source datatypes are compatible with BigQuery ones.
+ DataFrame with a shape and data types derived from the source table.
+ Additionally, DataFrames can be appended to existing BigQuery tables if
+ the destination table is the same shape as the DataFrame.

For specifics on the service itself, see `here <https://developers.google.com/bigquery/>`__

- As an example, suppose you want to load all data from an existing table
- :`test_dataset.test_table`
- into BigQuery and pull it into a DataFrame.
+ As an example, suppose you want to load all data from an existing BigQuery
+ table :`test_dataset.test_table` into a DataFrame using the :func:`~pandas.io.read_gbq`
+ function.

.. code-block:: python

- from pandas.io import gbq
-
   # Insert your BigQuery Project ID Here
- # Can be found in the web console, or
- # using the command line tool `bq ls`
+ # Can be found in the Google web console
   projectid = "xxxxxxxx"

- data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table', project_id=projectid)
+ data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table', project_id=projectid)

- The user will then be authenticated by the `bq` command line client -
- this usually involves the default browser opening to a login page,
- though the process can be done entirely from command line if necessary.
- Datasets and additional parameters can be either configured with `bq`,
- passed in as options to `read_gbq`, or set using Google's gflags (this
- is not officially supported by this module, though care was taken
- to ensure that they should be followed regardless of how you call the
- method).
+ You will then be authenticated to the specified BigQuery account
+ via Google's OAuth2 mechanism. In general, this is as simple as following the
+ prompts in a browser window which will be opened for you. Should the browser not
+ be available, or fail to launch, a code will be provided to complete the process
+ manually. Additional information on the authentication mechanism can be found
+ `here <https://developers.google.com/accounts/docs/OAuth2#clientside/>`__.

- Additionally, you can define which column to use as an index as well as a preferred column order as follows:
+ You can define which column from BigQuery to use as an index in the
+ destination DataFrame as well as a preferred column order as follows:

.. code-block:: python

- data_frame = gbq.read_gbq('SELECT * FROM test_dataset.test_table',
+ data_frame = pd.read_gbq('SELECT * FROM test_dataset.test_table',
                            index_col='index_column_name',
-                           col_order='[col1, col2, col3,...]', project_id=projectid)
-
- Finally, if you would like to create a BigQuery table, `my_dataset.my_table`, from the rows of DataFrame, `df`:
+                           col_order=['col1', 'col2', 'col3'], project_id=projectid)
+
+ Finally, you can append data to a BigQuery table from a pandas DataFrame
+ using the :func:`~pandas.io.to_gbq` function. This function uses the
+ Google streaming API which requires that your destination table exists in
+ BigQuery. Given the BigQuery table already exists, your DataFrame should
+ match the destination table in column order, structure, and data types.
+ DataFrame indexes are not supported. By default, rows are streamed to
+ BigQuery in chunks of 10,000 rows, but you can pass other chunk values
+ via the ``chunksize`` argument. You can also see the progress of your
+ post via the ``verbose`` flag which defaults to ``True``. The HTTP
+ response code of Google BigQuery can be successful (200) even if the
+ append failed. For this reason, if there is a failure to append to the
+ table, the complete error response from BigQuery is returned which
+ can be quite long given it provides a status for each row. You may want
+ to start with smaller chunks to test that the size and types of your
+ DataFrame match your destination table to make debugging simpler.

.. code-block:: python

   df = pandas.DataFrame({'string_col_name': ['hello'],
                          'integer_col_name': [1],
                          'boolean_col_name': [True]})
- schema = ['STRING', 'INTEGER', 'BOOLEAN']
- data_frame = gbq.to_gbq(df, 'my_dataset.my_table',
-                         if_exists='fail', schema=schema, project_id=projectid)
+ df.to_gbq('my_dataset.my_table', project_id=projectid)
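As a minimal sketch of the ``chunksize`` and ``verbose`` options discussed above (the chunk size of 500 is an arbitrary example value, and ``my_dataset.my_table`` must already exist in BigQuery):

.. code-block:: python

   # stream the rows in chunks of 500 and report progress for each post
   df.to_gbq('my_dataset.my_table', project_id=projectid,
             chunksize=500, verbose=True)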

- To add more rows to this, simply:
+ The BigQuery SQL query language has some oddities, see `here <https://developers.google.com/bigquery/query-reference>`__.

- .. code-block:: python
+ While BigQuery uses SQL-like syntax, it has some important differences
+ from traditional databases in functionality, API limitations (size and
+ quantity of queries or uploads), and how Google charges for use of the service.
+ You should refer to Google documentation often as the service seems to
+ be changing and evolving. BigQuery is best for analyzing large sets of
+ data quickly, but it is not a direct replacement for a transactional database.

- df2 = pandas.DataFrame({'string_col_name': ['hello2'],
-                         'integer_col_name': [2],
-                         'boolean_col_name': [False]})
- data_frame = gbq.to_gbq(df2, 'my_dataset.my_table', if_exists='append', project_id=projectid)
-
- .. note::
-
- A default project id can be set using the command line:
- `bq init`.
-
- There is a hard cap on BigQuery result sets, at 128MB compressed. Also, the BigQuery SQL query language has some oddities,
- see `here <https://developers.google.com/bigquery/query-reference>`__
-
- You can access the management console to determine project id's by:
- <https://code.google.com/apis/console/b/0/?noredirect>
+ You can access the management console to determine project id's by:
+ <https://code.google.com/apis/console/b/0/?noredirect>

.. warning::

- To use this module, you will need a BigQuery account. See
- <https://cloud.google.com/products/big-query> for details.
-
- As of 1/28/14, a known bug is present that could possibly cause data duplication in the resultant dataframe. A fix is imminent,
- but any client changes will not make it into 0.13.1. See:
- http://stackoverflow.com/questions/20984592/bigquery-results-not-including-page-token/21009144?noredirect=1#comment32090677_21009144
+ To use this module, you will need a valid BigQuery account. See
+ <https://cloud.google.com/products/big-query> for details on the
+ service.

.. _io.stata: