From here you can train a model using this data set and then perform
inference.

.. rubric:: Using the Offline Store SDK: Getting Started
   :name: bCe9CA61b79

The Feature Store Offline SDK provides the ability to quickly and easily
build ML-ready datasets for use in ML model training or pre-processing.
The SDK makes it easy to build datasets from SQL joins, point-in-time
accurate joins, and event-range time frames, all without writing any SQL
code. This functionality is accessed via the `DatasetBuilder` class,
which is the primary entry point for the SDK functionality.


.. code:: python

   from sagemaker.feature_store.feature_store import FeatureStore

   feature_store = FeatureStore(sagemaker_session=feature_store_session)


.. code:: python

   base_feature_group = identity_feature_group
   target_feature_group = transaction_feature_group


You can create a dataset using the `create_dataset` method of the
feature store API. `base` can be either a feature group or a pandas
dataframe.


.. code:: python

   result_df, query = feature_store.create_dataset(
       base=base_feature_group,
       output_path=f"s3://{s3_bucket_name}"
   ).to_dataframe()


If you want to join another feature group, you can specify it using the
`with_feature_group` method.


.. code:: python

   dataset_builder = feature_store.create_dataset(
       base=base_feature_group,
       output_path=f"s3://{s3_bucket_name}"
   ).with_feature_group(target_feature_group, record_identifier_name)

   result_df, query = dataset_builder.to_dataframe()

.. rubric:: Using the Offline Store SDK: Configuring the DatasetBuilder
   :name: bCe9CA61b80

How the DatasetBuilder produces the resulting dataframe can be
configured in various ways.

By default, the Python SDK excludes all deleted and duplicate records.
However, if you need either of them in the returned dataset, you can
call `include_duplicated_records` or `include_deleted_records` when
creating the dataset builder.


.. code:: python

   dataset_builder.include_duplicated_records().to_dataframe()

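To make the default deduplication behavior concrete, here is a small
plain-Python sketch of the semantics. This is illustrative only, not the
SDK's implementation, and the record fields (`record_id`, `event_time`,
`write_time`, `is_deleted`) are simplified, hypothetical names:

```python
def default_filter(records):
    """Sketch of the default offline-store filtering: drop deleted
    records, and among duplicates (same record identifier and event
    time) keep only the most recently written copy."""
    latest = {}
    for rec in records:
        key = (rec["record_id"], rec["event_time"])
        # Within a duplicate group, the later write_time wins.
        if key not in latest or rec["write_time"] > latest[key]["write_time"]:
            latest[key] = rec
    # Deleted records are excluded from the returned dataset.
    return [rec for rec in latest.values() if not rec["is_deleted"]]
```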
The DatasetBuilder provides the `with_number_of_records_from_query_results`
and `with_number_of_recent_records_by_record_identifier` methods to limit
the number of records returned for the offline snapshot.


.. code:: python

   dataset_builder.with_number_of_recent_records_by_record_identifier(number_of_recent_records=1).to_dataframe()

`with_number_of_records_from_query_results` limits the total number of
records in the output. For example, when N = 100, only 100 records are
returned in the CSV file or dataframe.


.. code:: python

   dataset_builder.with_number_of_records_from_query_results(number_of_records=100).to_dataframe()

On the other hand, `with_number_of_recent_records_by_record_identifier`
deals with records that share the same record identifier: they are
sorted by `event_time`, and at most the N most recent records per
identifier are returned in the output.

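The per-identifier semantics can be sketched in plain Python. Again,
this is an illustration of the behavior, not the SDK's implementation,
and the `record_id` and `event_time` field names are hypothetical:

```python
from collections import defaultdict

def recent_records_by_identifier(records, n):
    """Sketch: keep at most the n most recent records (by event time)
    for each record identifier."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["record_id"]].append(rec)
    result = []
    for recs in groups.values():
        # Most recent events first, then truncate to n per identifier.
        recs.sort(key=lambda r: r["event_time"], reverse=True)
        result.extend(recs[:n])
    return result
```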
Since these functions return the dataset builder, they can be chained.


.. code:: python

   result_df, query = (
       dataset_builder
       .with_number_of_records_from_query_results(number_of_records=100)
       .include_duplicated_records()
       .to_dataframe()
   )

Additional configurations are available for other use cases, such as
time travel and point-in-time joins. These are outlined in the Feature
Store DatasetBuilder API Reference.

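As an illustration of what a point-in-time accurate join means, here is
a plain-Python sketch (not the SDK's implementation; the field names are
hypothetical): each base record is paired with the most recent feature
record for the same identifier whose event time does not exceed the base
record's, so no future information leaks into the training set.

```python
def point_in_time_join(base_records, feature_records):
    """Sketch of a point-in-time accurate join: for each base record,
    attach the latest feature record with the same identifier whose
    event time is at or before the base record's event time."""
    joined = []
    for base in base_records:
        candidates = [
            f for f in feature_records
            if f["record_id"] == base["record_id"]
            and f["event_time"] <= base["event_time"]
        ]
        # None if no feature record existed yet at the base event time.
        best = max(candidates, key=lambda f: f["event_time"], default=None)
        joined.append((base, best))
    return joined
```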
.. rubric:: Delete a feature group
   :name: bCe9CA61b78


The following code example is from the fraud detection example.

.. code:: python

   identity_feature_group.delete()
   transaction_feature_group.delete()