
Commit 9589b92

patrickmcarlos authored and JoseJuan98 committed
documentation: update feature store dataset builder docs (aws#3683)
* updates docstrings for datasetbuilder members
* feature store user guide updates
* update to chain code snippet, updates api ref link
* update description to be before code
* removes trailing whitespace
* remove whitespace from user guide
1 parent a682321 commit 9589b92

2 files changed (+124, -18 lines)

doc/amazon_sagemaker_featurestore.rst

+97
@@ -380,6 +380,102 @@ location for the data set to be saved there.
From here you can train a model using this data set and then perform
inference.

.. rubric:: Using the Offline Store SDK: Getting Started
   :name: bCe9CA61b79

The Feature Store Offline SDK provides the ability to quickly and easily
build ML-ready datasets for use by ML model training or pre-processing.
The SDK makes it easy to build datasets from SQL join, point-in-time accurate
join, and event range time frames, all without the need to write any SQL code.
This functionality is accessed via the DatasetBuilder class, which is the
primary entry point for the SDK functionality.

.. code:: python

   from sagemaker.feature_store.feature_store import FeatureStore

   feature_store = FeatureStore(sagemaker_session=feature_store_session)
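The snippet above assumes an existing `feature_store_session`. As a minimal
sketch (not part of this commit), one common way to construct such a session,
assuming default AWS credentials and region, is:

.. code:: python

   import boto3
   from sagemaker.session import Session

   region = boto3.Session().region_name
   boto_session = boto3.Session(region_name=region)

   # Clients for the SageMaker control plane and the Feature Store runtime
   sagemaker_client = boto_session.client("sagemaker")
   featurestore_runtime = boto_session.client("sagemaker-featurestore-runtime")

   feature_store_session = Session(
       boto_session=boto_session,
       sagemaker_client=sagemaker_client,
       sagemaker_featurestore_runtime_client=featurestore_runtime,
   )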
.. code:: python

   # Feature groups from the fraud detection example used throughout this guide
   base_feature_group = identity_feature_group
   target_feature_group = transaction_feature_group

You can create a dataset using the `create_dataset` method of the Feature
Store API. `base` can either be a feature group or a pandas dataframe.

.. code:: python

   result_df, query = feature_store.create_dataset(
       base=base_feature_group,
       output_path=f"s3://{s3_bucket_name}"
   ).to_dataframe()
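If `base` is a pandas dataframe instead, the builder also needs to know which
columns hold the record identifier and the event time. A minimal sketch,
assuming the `record_identifier_feature_name` and
`event_time_identifier_feature_name` parameters documented in the API
reference (the dataframe and its column names are hypothetical):

.. code:: python

   import pandas as pd

   # Hypothetical entity dataframe with record identifiers and event times
   base_df = pd.DataFrame(
       {
           "customer_id": ["c1", "c2"],
           "event_time": ["2022-08-01T10:00:00Z", "2022-08-02T16:30:00Z"],
       }
   )

   result_df, query = feature_store.create_dataset(
       base=base_df,
       record_identifier_feature_name="customer_id",
       event_time_identifier_feature_name="event_time",
       output_path=f"s3://{s3_bucket_name}",
   ).to_dataframe()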
If you want to join another feature group, you can specify it using the
`with_feature_group` method.

.. code:: python

   dataset_builder = feature_store.create_dataset(
       base=base_feature_group,
       output_path=f"s3://{s3_bucket_name}"
   ).with_feature_group(target_feature_group, record_identifier_name)

   result_df, query = dataset_builder.to_dataframe()
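Instead of a dataframe, the builder can also materialize the dataset as a CSV
file in S3. A short sketch, assuming the `to_csv_file` method documented in
the API reference, which returns the S3 path of the file along with the query
that produced it:

.. code:: python

   csv_path, query = dataset_builder.to_csv_file()
   print(csv_path)  # S3 location of the generated .csv file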
.. rubric:: Using the Offline Store SDK: Configuring the DatasetBuilder
   :name: bCe9CA61b80

How the DatasetBuilder produces the resulting dataframe can be configured
in various ways.

By default, the Python SDK will exclude all deleted and duplicate records.
However, if you need either of them in the returned dataset, you can call
`include_duplicated_records` or `include_deleted_records` on the dataset
builder.

.. code:: python

   dataset_builder.include_duplicated_records()
   dataset_builder.include_deleted_records()
The DatasetBuilder provides the `with_number_of_records_from_query_results`
and `with_number_of_recent_records_by_record_identifier` methods to limit the
number of records returned for the offline snapshot.

`with_number_of_records_from_query_results` limits the total number of records
in the output. For example, when N = 100, only 100 records are returned in
either the CSV file or the dataframe.

.. code:: python

   dataset_builder.with_number_of_records_from_query_results(number_of_records=N)

On the other hand, `with_number_of_recent_records_by_record_identifier` is
used to deal with records that share the same record identifier. Those records
are sorted by `event_time`, and at most the N most recent records per record
identifier are returned in the output.

.. code:: python

   dataset_builder.with_number_of_recent_records_by_record_identifier(number_of_recent_records=N)
Since these methods return the dataset builder, they can be chained (note the
surrounding parentheses, which make the multi-line chain valid Python).

.. code:: python

   result_df, query = (
       dataset_builder
       .with_number_of_records_from_query_results(number_of_records=N)
       .include_duplicated_records()
       .with_number_of_recent_records_by_record_identifier(number_of_recent_records=N)
       .to_dataframe()
   )
There are additional configurations that can be made for various use cases,
such as time travel and point-in-time join. These are outlined in the
Feature Store `DatasetBuilder API Reference
<https://sagemaker.readthedocs.io/en/stable/api/prep_data/feature_store.html#dataset-builder>`__.
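For example, a time-travel style query can be sketched as follows, assuming
the `as_of` method documented in the API reference (the timestamp is
illustrative):

.. code:: python

   import datetime

   # "Time travel": only include records written on or before this timestamp,
   # reproducing the dataset as it existed at that point in time.
   dataset_builder = feature_store.create_dataset(
       base=base_feature_group,
       output_path=f"s3://{s3_bucket_name}",
   ).as_of(datetime.datetime(2022, 8, 15, 12, 0, 0))

   result_df, query = dataset_builder.to_dataframe()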
.. rubric:: Delete a feature group
   :name: bCe9CA61b78

@@ -395,3 +491,4 @@ The following code example is from the fraud detection example.

   identity_feature_group.delete()
   transaction_feature_group.delete()

src/sagemaker/feature_store/dataset_builder.py

+27 -18
@@ -171,24 +171,33 @@ class DatasetBuilder:
         _event_time_identifier_feature_name (str): A string representing the event time identifier
             feature if base is a DataFrame (default: None).
         _included_feature_names (List[str]): A list of strings representing features to be
-            included in the output (default: None).
-        _kms_key_id (str): An KMS key id. If set, will be used to encrypt the result file
+            included in the output. If not set, all features will be included in the output.
             (default: None).
-        _point_in_time_accurate_join (bool): A boolean representing whether using point in time join
-            or not (default: False).
-        _include_duplicated_records (bool): A boolean representing whether including duplicated
-            records or not (default: False).
-        _include_deleted_records (bool): A boolean representing whether including deleted records or
-            not (default: False).
-        _number_of_recent_records (int): An int that how many records will be returned for each
-            record identifier (default: 1).
-        _number_of_records (int): An int that how many records will be returned (default: None).
-        _write_time_ending_timestamp (datetime.datetime): A datetime that all records' write time in
-            dataset will be before it (default: None).
-        _event_time_starting_timestamp (datetime.datetime): A datetime that all records' event time
-            in dataset will be after it (default: None).
-        _event_time_ending_timestamp (datetime.datetime): A datetime that all records' event time in
-            dataset will be before it (default: None).
+        _kms_key_id (str): A KMS key id. If set, will be used to encrypt the result file
+            (default: None).
+        _point_in_time_accurate_join (bool): A boolean representing if point-in-time join
+            is applied to the resulting dataframe when calling "to_dataframe".
+            When set to True, users can retrieve data using “row-level time travel”
+            according to the event times provided to the DatasetBuilder. This requires that the
+            entity dataframe with event times is submitted as the base in the constructor
+            (default: False).
+        _include_duplicated_records (bool): A boolean representing whether the resulting dataframe
+            when calling "to_dataframe" should include duplicated records (default: False).
+        _include_deleted_records (bool): A boolean representing whether the resulting
+            dataframe when calling "to_dataframe" should include deleted records (default: False).
+        _number_of_recent_records (int): An integer representing how many records will be
+            returned for each record identifier (default: 1).
+        _number_of_records (int): An integer representing the number of records that should be
+            returned in the resulting dataframe when calling "to_dataframe" (default: None).
+        _write_time_ending_timestamp (datetime.datetime): A datetime that represents the latest
+            write time for a record to be included in the resulting dataset. Records with a
+            newer write time will be omitted from the resulting dataset. (default: None).
+        _event_time_starting_timestamp (datetime.datetime): A datetime that represents the earliest
+            event time for a record to be included in the resulting dataset. Records
+            with an older event time will be omitted from the resulting dataset. (default: None).
+        _event_time_ending_timestamp (datetime.datetime): A datetime that represents the latest
+            event time for a record to be included in the resulting dataset. Records
+            with a newer event time will be omitted from the resulting dataset. (default: None).
         _feature_groups_to_be_merged (List[FeatureGroupToBeMerged]): A list of
             FeatureGroupToBeMerged which will be joined to base (default: []).
         _event_time_identifier_feature_type (FeatureTypeEnum): A FeatureTypeEnum representing the
@@ -247,7 +256,7 @@ def with_feature_group(
         return self

     def point_in_time_accurate_join(self):
-        """Set join type as point in time accurate join.
+        """Enable point-in-time accurate join.

         Returns:
             This DatasetBuilder object.
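The reworded docstring clarifies the contract for point-in-time joins: the
entity dataframe with event times must be submitted as the `base` in the
constructor. A minimal usage sketch of that contract (the dataframe, column
names, and output location are hypothetical; the methods are those shown in
this diff and in the user guide above):

.. code:: python

   import pandas as pd

   # Hypothetical entity dataframe: one row per record identifier, each with
   # the event time at which feature values should be observed
   # ("row-level time travel").
   entity_df = pd.DataFrame(
       {
           "customer_id": ["c1", "c2"],
           "event_time": ["2022-08-01T10:00:00Z", "2022-08-02T16:30:00Z"],
       }
   )

   result_df, query = (
       feature_store.create_dataset(
           base=entity_df,
           record_identifier_feature_name="customer_id",
           event_time_identifier_feature_name="event_time",
           output_path="s3://my-bucket/dataset",  # hypothetical location
       )
       .with_feature_group(transaction_feature_group, "customer_id")
       .point_in_time_accurate_join()
       .to_dataframe()
   )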
