@@ -380,6 +380,102 @@ location for the data set to be saved there.
From here you can train a model using this data set and then perform
inference.
+ .. rubric:: Using the Offline Store SDK: Getting Started
+    :name: bCe9CA61b79
+
+ The Feature Store Offline SDK provides the ability to quickly and easily
+ build ML-ready datasets for use in ML model training or pre-processing.
+ The SDK makes it easy to build datasets from SQL joins, point-in-time accurate
+ joins, and event-range time frames, all without writing any SQL code.
+ This functionality is accessed via the ``DatasetBuilder`` class, which is the
+ primary entry point for the SDK functionality.
+
+ .. code:: python
+
+    from sagemaker.feature_store.feature_store import FeatureStore
+
+    feature_store = FeatureStore(sagemaker_session=feature_store_session)
+
+ .. code:: python
+
+    base_feature_group = identity_feature_group
+    target_feature_group = transaction_feature_group
+
+ You can create a dataset using the ``create_dataset`` method of the
+ Feature Store API. ``base`` can be either a feature group or a pandas
+ DataFrame.
+
+ .. code:: python
+
+    result_df, query = feature_store.create_dataset(
+        base=base_feature_group,
+        output_path=f"s3://{s3_bucket_name}"
+    ).to_dataframe()
+
+ If you want to join another feature group, you can specify it using the
+ ``with_feature_group`` method.
+
+ .. code:: python
+
+    dataset_builder = feature_store.create_dataset(
+        base=base_feature_group,
+        output_path=f"s3://{s3_bucket_name}"
+    ).with_feature_group(target_feature_group, record_identifier_name)
+
+    result_df, query = dataset_builder.to_dataframe()
+
+ .. rubric:: Using the Offline Store SDK: Configuring the DatasetBuilder
+    :name: bCe9CA61b80
+
+ How the ``DatasetBuilder`` produces the resulting dataframe can be
+ configured in various ways.
+
+ By default, the Python SDK excludes all deleted and duplicate records.
+ However, if you need either of them in the returned dataset, you can call
+ ``include_duplicated_records`` or ``include_deleted_records`` when creating
+ the dataset builder.
+
+ .. code:: python
+
+    dataset_builder.include_duplicated_records()
+    dataset_builder.include_deleted_records()
+
+ The ``DatasetBuilder`` provides the ``with_number_of_records_from_query_results``
+ and ``with_number_of_recent_records_by_record_identifier`` methods to limit
+ the number of records returned for the offline snapshot.
+
+ ``with_number_of_records_from_query_results`` limits the number of records
+ in the output. For example, when N = 100, only 100 records are returned in
+ either the CSV file or the DataFrame.
+
+ .. code:: python
+
+    dataset_builder.with_number_of_records_from_query_results(number_of_records=N)
+
+ On the other hand, ``with_number_of_recent_records_by_record_identifier``
+ is used to deal with records that share the same record identifier: they
+ are sorted by ``event_time``, and at most the N most recent records per
+ identifier are returned in the output.
+
+ .. code:: python
+
+    dataset_builder.with_number_of_recent_records_by_record_identifier(number_of_recent_records=N)
+
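To make the "N most recent per identifier" semantics concrete, here is a minimal, self-contained pandas sketch (not the SDK itself; the column names and toy values are hypothetical) of what this limit amounts to:

```python
import pandas as pd

# Toy records standing in for an offline store query result (hypothetical data).
records = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "event_time": pd.to_datetime(
        ["2022-01-01", "2022-01-03", "2022-01-02", "2022-01-05", "2022-01-04"]
    ),
    "amount": [10.0, 30.0, 20.0, 50.0, 40.0],
})

N = 2
# Sort by event time, then keep the last N (i.e. most recent) rows per identifier.
recent = records.sort_values("event_time").groupby("customer_id").tail(N)
```

Here customer 1 has three records, so the oldest one (amount 10.0) is dropped, while both of customer 2's records are kept.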
+ Since these methods return the dataset builder, they can be chained.
+
+ .. code:: python
+
+    result_df, query = (
+        dataset_builder
+        .with_number_of_records_from_query_results(number_of_records=N)
+        .include_duplicated_records()
+        .with_number_of_recent_records_by_record_identifier(number_of_recent_records=N)
+        .to_dataframe()
+    )
+
+ There are additional configurations that can be made for various use cases,
+ such as time travel and point-in-time join. These are outlined in the
+ Feature Store `DatasetBuilder API Reference
+ <https://sagemaker.readthedocs.io/en/stable/api/prep_data/feature_store.html#dataset-builder>`__.
+
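As an illustration of what a point-in-time accurate join means, here is a minimal, self-contained pandas sketch (not the SDK itself; the column names and toy values are hypothetical): for each base record, only the most recent target feature values whose event time is at or before the base record's event time are joined, which prevents leaking future information into training data.

```python
import pandas as pd

# Toy data standing in for two feature groups (hypothetical values).
identity = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2022-01-05", "2022-01-10", "2022-01-08"]),
    "label": [0, 1, 0],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2022-01-03", "2022-01-09", "2022-01-09"]),
    "amount": [25.0, 40.0, 10.0],
})

# merge_asof needs both frames sorted on the join key; direction="backward"
# picks the latest transaction at or before each identity event time, per customer.
result = pd.merge_asof(
    identity.sort_values("event_time"),
    transactions.sort_values("event_time"),
    on="event_time",
    by="customer_id",
    direction="backward",
)
```

Customer 2's identity record at 2022-01-08 gets no transaction features (its only transaction happens later, on 2022-01-09), while customer 1's record at 2022-01-10 joins the 2022-01-09 transaction rather than the older one.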
.. rubric:: Delete a feature group
   :name: bCe9CA61b78
@@ -395,3 +491,4 @@ The following code example is from the fraud detection example.
identity_feature_group.delete()
transaction_feature_group.delete()
+