
Commit 9589b92

patrickmcarlos authored and JoseJuan98 committed
documentation: update feature store dataset builder docs (aws#3683)
* updates docstrings for datasetbuilder members
* feature store user guide updates
* update to chain code snippet, updates api ref link
* update description to be before code
* removes trailing whitespace
* remove whitespace from user guide
1 parent a682321 commit 9589b92

2 files changed (+124, -18 lines)

doc/amazon_sagemaker_featurestore.rst

+97
@@ -380,6 +380,102 @@ location for the data set to be saved there.
From here you can train a model using this data set and then perform
inference.

.. rubric:: Using the Offline Store SDK: Getting Started
   :name: bCe9CA61b79

The Feature Store Offline SDK provides the ability to quickly and easily
build ML-ready datasets for use by ML model training or pre-processing.
The SDK makes it easy to build datasets from SQL join, point-in-time accurate
join, and event range time frames, all without the need to write any SQL code.
This functionality is accessed via the DatasetBuilder class, which is the
primary entry point for the SDK functionality.

.. code:: python

   from sagemaker.feature_store.feature_store import FeatureStore

   feature_store = FeatureStore(sagemaker_session=feature_store_session)
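The snippet above assumes an existing `feature_store_session`. As a minimal
sketch (not part of this commit), one common way to construct such a session,
assuming default AWS credentials and region, is:

.. code:: python

   import boto3
   from sagemaker.session import Session

   region = boto3.Session().region_name
   boto_session = boto3.Session(region_name=region)

   # Clients for the SageMaker control plane and the Feature Store runtime
   sagemaker_client = boto_session.client("sagemaker")
   featurestore_runtime = boto_session.client("sagemaker-featurestore-runtime")

   feature_store_session = Session(
       boto_session=boto_session,
       sagemaker_client=sagemaker_client,
       sagemaker_featurestore_runtime_client=featurestore_runtime,
   )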
.. code:: python

   # Feature groups from the fraud detection example used throughout this guide
   base_feature_group = identity_feature_group
   target_feature_group = transaction_feature_group

You can create a dataset using the `create_dataset` method of the Feature
Store API. `base` can either be a feature group or a pandas dataframe.

.. code:: python

   result_df, query = feature_store.create_dataset(
       base=base_feature_group,
       output_path=f"s3://{s3_bucket_name}"
   ).to_dataframe()
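If `base` is a pandas dataframe instead, the builder also needs to know which
columns hold the record identifier and the event time. A minimal sketch,
assuming the `record_identifier_feature_name` and
`event_time_identifier_feature_name` parameters documented in the API
reference (the dataframe and its column names are hypothetical):

.. code:: python

   import pandas as pd

   # Hypothetical entity dataframe with record identifiers and event times
   base_df = pd.DataFrame(
       {
           "customer_id": ["c1", "c2"],
           "event_time": ["2022-08-01T10:00:00Z", "2022-08-02T16:30:00Z"],
       }
   )

   result_df, query = feature_store.create_dataset(
       base=base_df,
       record_identifier_feature_name="customer_id",
       event_time_identifier_feature_name="event_time",
       output_path=f"s3://{s3_bucket_name}",
   ).to_dataframe()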
If you want to join another feature group, you can specify it using the
`with_feature_group` method.

.. code:: python

   dataset_builder = feature_store.create_dataset(
       base=base_feature_group,
       output_path=f"s3://{s3_bucket_name}"
   ).with_feature_group(target_feature_group, record_identifier_name)

   result_df, query = dataset_builder.to_dataframe()
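Instead of a dataframe, the builder can also materialize the dataset as a CSV
file in S3. A short sketch, assuming the `to_csv_file` method documented in
the API reference, which returns the S3 path of the file along with the query
that produced it:

.. code:: python

   csv_path, query = dataset_builder.to_csv_file()
   print(csv_path)  # S3 location of the generated .csv file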
.. rubric:: Using the Offline Store SDK: Configuring the DatasetBuilder
   :name: bCe9CA61b80

How the DatasetBuilder produces the resulting dataframe can be configured
in various ways.

By default, the Python SDK will exclude all deleted and duplicate records.
However, if you need either of them in the returned dataset, you can call
`include_duplicated_records` or `include_deleted_records` on the dataset
builder.

.. code:: python

   dataset_builder.include_duplicated_records()
   dataset_builder.include_deleted_records()
The DatasetBuilder provides the `with_number_of_records_from_query_results`
and `with_number_of_recent_records_by_record_identifier` methods to limit the
number of records returned for the offline snapshot.

`with_number_of_records_from_query_results` limits the total number of records
in the output. For example, when N = 100, only 100 records are returned in
either the CSV file or the dataframe.

.. code:: python

   dataset_builder.with_number_of_records_from_query_results(number_of_records=N)

On the other hand, `with_number_of_recent_records_by_record_identifier` is
used to deal with records that share the same record identifier. Those records
are sorted by `event_time`, and at most the N most recent records per record
identifier are returned in the output.

.. code:: python

   dataset_builder.with_number_of_recent_records_by_record_identifier(number_of_recent_records=N)
Since these methods return the dataset builder, they can be chained (note the
surrounding parentheses, which make the multi-line chain valid Python).

.. code:: python

   result_df, query = (
       dataset_builder
       .with_number_of_records_from_query_results(number_of_records=N)
       .include_duplicated_records()
       .with_number_of_recent_records_by_record_identifier(number_of_recent_records=N)
       .to_dataframe()
   )
There are additional configurations that can be made for various use cases,
such as time travel and point-in-time join. These are outlined in the
Feature Store `DatasetBuilder API Reference
<https://sagemaker.readthedocs.io/en/stable/api/prep_data/feature_store.html#dataset-builder>`__.
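For example, a time-travel style query can be sketched as follows, assuming
the `as_of` method documented in the API reference (the timestamp is
illustrative):

.. code:: python

   import datetime

   # "Time travel": only include records written on or before this timestamp,
   # reproducing the dataset as it existed at that point in time.
   dataset_builder = feature_store.create_dataset(
       base=base_feature_group,
       output_path=f"s3://{s3_bucket_name}",
   ).as_of(datetime.datetime(2022, 8, 15, 12, 0, 0))

   result_df, query = dataset_builder.to_dataframe()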
.. rubric:: Delete a feature group
   :name: bCe9CA61b78

@@ -395,3 +491,4 @@ The following code example is from the fraud detection example.

   identity_feature_group.delete()
   transaction_feature_group.delete()

src/sagemaker/feature_store/dataset_builder.py

+27 -18
@@ -171,24 +171,33 @@ class DatasetBuilder:
         _event_time_identifier_feature_name (str): A string representing the event time identifier
             feature if base is a DataFrame (default: None).
         _included_feature_names (List[str]): A list of strings representing features to be
-            included in the output (default: None).
-        _kms_key_id (str): An KMS key id. If set, will be used to encrypt the result file
+            included in the output. If not set, all features will be included in the output.
             (default: None).
-        _point_in_time_accurate_join (bool): A boolean representing whether using point in time join
-            or not (default: False).
-        _include_duplicated_records (bool): A boolean representing whether including duplicated
-            records or not (default: False).
-        _include_deleted_records (bool): A boolean representing whether including deleted records or
-            not (default: False).
-        _number_of_recent_records (int): An int that how many records will be returned for each
-            record identifier (default: 1).
-        _number_of_records (int): An int that how many records will be returned (default: None).
-        _write_time_ending_timestamp (datetime.datetime): A datetime that all records' write time in
-            dataset will be before it (default: None).
-        _event_time_starting_timestamp (datetime.datetime): A datetime that all records' event time
-            in dataset will be after it (default: None).
-        _event_time_ending_timestamp (datetime.datetime): A datetime that all records' event time in
-            dataset will be before it (default: None).
+        _kms_key_id (str): A KMS key id. If set, will be used to encrypt the result file
+            (default: None).
+        _point_in_time_accurate_join (bool): A boolean representing if point-in-time join
+            is applied to the resulting dataframe when calling "to_dataframe".
+            When set to True, users can retrieve data using “row-level time travel”
+            according to the event times provided to the DatasetBuilder. This requires that the
+            entity dataframe with event times is submitted as the base in the constructor
+            (default: False).
+        _include_duplicated_records (bool): A boolean representing whether the resulting dataframe
+            when calling "to_dataframe" should include duplicated records (default: False).
+        _include_deleted_records (bool): A boolean representing whether the resulting
+            dataframe when calling "to_dataframe" should include deleted records (default: False).
+        _number_of_recent_records (int): An integer representing how many records will be
+            returned for each record identifier (default: 1).
+        _number_of_records (int): An integer representing the number of records that should be
+            returned in the resulting dataframe when calling "to_dataframe" (default: None).
+        _write_time_ending_timestamp (datetime.datetime): A datetime that represents the latest
+            write time for a record to be included in the resulting dataset. Records with a
+            newer write time will be omitted from the resulting dataset. (default: None).
+        _event_time_starting_timestamp (datetime.datetime): A datetime that represents the earliest
+            event time for a record to be included in the resulting dataset. Records
+            with an older event time will be omitted from the resulting dataset. (default: None).
+        _event_time_ending_timestamp (datetime.datetime): A datetime that represents the latest
+            event time for a record to be included in the resulting dataset. Records
+            with a newer event time will be omitted from the resulting dataset. (default: None).
         _feature_groups_to_be_merged (List[FeatureGroupToBeMerged]): A list of
             FeatureGroupToBeMerged which will be joined to base (default: []).
         _event_time_identifier_feature_type (FeatureTypeEnum): A FeatureTypeEnum representing the
@@ -247,7 +256,7 @@ def with_feature_group(
         return self

     def point_in_time_accurate_join(self):
-        """Set join type as point in time accurate join.
+        """Enable point-in-time accurate join.

         Returns:
             This DatasetBuilder object.
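The reworded docstring clarifies the contract for point-in-time joins: the
entity dataframe with event times must be submitted as the `base` in the
constructor. A minimal usage sketch of that contract (the dataframe, column
names, and output location are hypothetical; the methods are those shown in
this diff and in the user guide above):

.. code:: python

   import pandas as pd

   # Hypothetical entity dataframe: one row per record identifier, each with
   # the event time at which feature values should be observed
   # ("row-level time travel").
   entity_df = pd.DataFrame(
       {
           "customer_id": ["c1", "c2"],
           "event_time": ["2022-08-01T10:00:00Z", "2022-08-02T16:30:00Z"],
       }
   )

   result_df, query = (
       feature_store.create_dataset(
           base=entity_df,
           record_identifier_feature_name="customer_id",
           event_time_identifier_feature_name="event_time",
           output_path="s3://my-bucket/dataset",  # hypothetical location
       )
       .with_feature_group(transaction_feature_group, "customer_id")
       .point_in_time_accurate_join()
       .to_dataframe()
   )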
