docs: update feature store dataset builder docs #3683

Merged
97 changes: 97 additions & 0 deletions doc/amazon_sagemaker_featurestore.rst
@@ -380,6 +380,102 @@ location for the data set to be saved there.
From here you can train a model using this data set and then perform
inference.

.. rubric:: Using the Offline Store SDK: Getting Started
:name: bCe9CA61b79

The Feature Store Offline SDK lets you quickly and easily build ML-ready
datasets for model training or pre-processing. The SDK makes it easy to
build datasets from SQL joins, point-in-time accurate joins, and event-time
ranges, all without writing any SQL code. This functionality is accessed
through the `DatasetBuilder` class, the primary entry point of the SDK.

.. code:: python

    from sagemaker.feature_store.feature_store import FeatureStore

    # feature_store_session is the SageMaker session configured earlier in this guide
    feature_store = FeatureStore(sagemaker_session=feature_store_session)

.. code:: python

    # Feature groups created earlier in the fraud detection example
    base_feature_group = identity_feature_group
    target_feature_group = transaction_feature_group

You can create a dataset using the `create_dataset` method of the
FeatureStore API. `base` can be either a feature group or a pandas
DataFrame; the feature group case is shown next, and a hedged sketch of
the DataFrame case follows it.

.. code:: python

result_df, query = feature_store.create_dataset(
base=base_feature_group,
output_path=f"s3://{s3_bucket_name}"
).to_dataframe()
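
As noted above, `base` can also be a pandas DataFrame. A minimal sketch of
that case follows; the keyword names `record_identifier_feature_name` and
`event_time_identifier_feature_name` mirror the attribute names in this
PR's docstring changes but are an assumption here, and `base_df` is a
hypothetical entity DataFrame.

.. code:: python

    import pandas as pd

    # Hypothetical entity DataFrame; column names are illustrative.
    base_df = pd.DataFrame(
        {
            "record_id": ["r1", "r2", "r3"],
            "event_time": ["2022-07-01T00:00:00Z"] * 3,
        }
    )

    result_df, query = feature_store.create_dataset(
        base=base_df,
        record_identifier_feature_name="record_id",  # assumed keyword name
        event_time_identifier_feature_name="event_time",  # assumed keyword name
        output_path=f"s3://{s3_bucket_name}",
    ).to_dataframe()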

If you want to join another feature group, you can specify it with the
`with_feature_group` method, passing the feature group and the name of the
record identifier to join on.

.. code:: python

dataset_builder = feature_store.create_dataset(
base=base_feature_group,
output_path=f"s3://{s3_bucket_name}"
).with_feature_group(target_feature_group, record_identifier_name)

result_df, query = dataset_builder.to_dataframe()

.. rubric:: Using the Offline Store SDK: Configuring the DatasetBuilder
:name: bCe9CA61b80

The DatasetBuilder can be configured in several ways to control how it
produces the resulting dataframe.

By default, the Python SDK excludes all deleted and duplicate records.
However, if you need either of them in the returned dataset, call
`include_duplicated_records` or `include_deleted_records` on the dataset
builder.

.. code:: python

dataset_builder.include_duplicated_records()
dataset_builder.include_deleted_records()

The DatasetBuilder provides `with_number_of_records_from_query_results` and
`with_number_of_recent_records_by_record_identifier` methods to limit the
number of records returned for the offline snapshot.

`with_number_of_records_from_query_results` limits the total number of
records in the output. For example, when N = 100, only 100 records are
returned in either the CSV file or the dataframe.

.. code:: python

dataset_builder.with_number_of_records_from_query_results(number_of_records=N)
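
The limited result can also be written out as a file rather than a
dataframe. A minimal sketch, assuming the builder exposes a `to_csv_file`
method that returns the S3 path of the generated CSV along with the query
it ran:

.. code:: python

    # Assumed method name; returns the CSV's S3 path and the executed query.
    csv_path, query = dataset_builder.to_csv_file()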

On the other hand, `with_number_of_recent_records_by_record_identifier`
deals with records that share the same record identifier: they are sorted
by `event_time`, and at most the N most recent records per identifier are
returned in the output.

.. code:: python

dataset_builder.with_number_of_recent_records_by_record_identifier(number_of_recent_records=N)

Since these methods return the dataset builder, they can be chained, as
shown below.

.. code:: python

    # Wrap the chain in parentheses so it parses as a single expression.
    result_df, query = (
        dataset_builder
        .with_number_of_records_from_query_results(number_of_records=N)
        .include_duplicated_records()
        .with_number_of_recent_records_by_record_identifier(number_of_recent_records=N)
        .to_dataframe()
    )

There are additional configurations for various use cases, such as time
travel and point-in-time join; a sketch of the latter follows. These are
outlined in the Feature Store `DatasetBuilder API Reference
<https://sagemaker.readthedocs.io/en/stable/api/prep_data/feature_store.html#dataset-builder>`__.
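
As a brief illustration of point-in-time join, the sketch below submits an
entity dataframe with event times as `base` and enables
`point_in_time_accurate_join` (that method appears in this PR's diff
further down); `events_df` and the `create_dataset` keyword names are
assumptions.

.. code:: python

    import pandas as pd

    # Hypothetical entity DataFrame of record identifiers and query event times.
    events_df = pd.DataFrame(
        {
            "record_id": ["r1", "r2"],
            "event_time": ["2022-07-01T00:00:00Z", "2022-07-02T00:00:00Z"],
        }
    )

    result_df, query = (
        feature_store.create_dataset(
            base=events_df,
            record_identifier_feature_name="record_id",  # assumed keyword name
            event_time_identifier_feature_name="event_time",  # assumed keyword name
            output_path=f"s3://{s3_bucket_name}",
        )
        .with_feature_group(target_feature_group, record_identifier_name)
        .point_in_time_accurate_join()
        .to_dataframe()
    )

Each row of the result then reflects the joined feature group's values as
of that row's event time, rather than the latest values.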

.. rubric:: Delete a feature group
:name: bCe9CA61b78

@@ -395,3 +491,4 @@ The following code example is from the fraud detection example.

identity_feature_group.delete()
transaction_feature_group.delete()

45 changes: 27 additions & 18 deletions src/sagemaker/feature_store/dataset_builder.py
@@ -171,24 +171,33 @@ class DatasetBuilder:
         _event_time_identifier_feature_name (str): A string representing the event time identifier
             feature if base is a DataFrame (default: None).
         _included_feature_names (List[str]): A list of strings representing features to be
-            included in the output (default: None).
-        _kms_key_id (str): An KMS key id. If set, will be used to encrypt the result file
+            included in the output. If not set, all features will be included in the output.
             (default: None).
-        _point_in_time_accurate_join (bool): A boolean representing whether using point in time join
-            or not (default: False).
-        _include_duplicated_records (bool): A boolean representing whether including duplicated
-            records or not (default: False).
-        _include_deleted_records (bool): A boolean representing whether including deleted records or
-            not (default: False).
-        _number_of_recent_records (int): An int that how many records will be returned for each
-            record identifier (default: 1).
-        _number_of_records (int): An int that how many records will be returned (default: None).
-        _write_time_ending_timestamp (datetime.datetime): A datetime that all records' write time in
-            dataset will be before it (default: None).
-        _event_time_starting_timestamp (datetime.datetime): A datetime that all records' event time
-            in dataset will be after it (default: None).
-        _event_time_ending_timestamp (datetime.datetime): A datetime that all records' event time in
-            dataset will be before it (default: None).
+        _kms_key_id (str): A KMS key id. If set, will be used to encrypt the result file
+            (default: None).
+        _point_in_time_accurate_join (bool): A boolean representing if point-in-time join
+            is applied to the resulting dataframe when calling "to_dataframe".
+            When set to True, users can retrieve data using “row-level time travel”
+            according to the event times provided to the DatasetBuilder. This requires that the
+            entity dataframe with event times is submitted as the base in the constructor
+            (default: False).
+        _include_duplicated_records (bool): A boolean representing whether the resulting dataframe
+            when calling "to_dataframe" should include duplicated records (default: False).
+        _include_deleted_records (bool): A boolean representing whether the resulting
+            dataframe when calling "to_dataframe" should include deleted records (default: False).
+        _number_of_recent_records (int): An integer representing how many records will be
+            returned for each record identifier (default: 1).
+        _number_of_records (int): An integer representing the number of records that should be
+            returned in the resulting dataframe when calling "to_dataframe" (default: None).
+        _write_time_ending_timestamp (datetime.datetime): A datetime that represents the latest
+            write time for a record to be included in the resulting dataset. Records with a
+            newer write time will be omitted from the resulting dataset. (default: None).
+        _event_time_starting_timestamp (datetime.datetime): A datetime that represents the earliest
+            event time for a record to be included in the resulting dataset. Records
+            with an older event time will be omitted from the resulting dataset. (default: None).
+        _event_time_ending_timestamp (datetime.datetime): A datetime that represents the latest
+            event time for a record to be included in the resulting dataset. Records
+            with a newer event time will be omitted from the resulting dataset. (default: None).
         _feature_groups_to_be_merged (List[FeatureGroupToBeMerged]): A list of
             FeatureGroupToBeMerged which will be joined to base (default: []).
         _event_time_identifier_feature_type (FeatureTypeEnum): A FeatureTypeEnum representing the
@@ -247,7 +256,7 @@ def with_feature_group(
         return self

     def point_in_time_accurate_join(self):
-        """Set join type as point in time accurate join.
+        """Enable point-in-time accurate join.

         Returns:
             This DatasetBuilder object.