From here you can train a model using this data set and then perform
inference.

.. rubric:: Using the Offline Store SDK: Getting Started
   :name: bCe9CA61b79

The Feature Store Offline SDK provides the ability to quickly and easily
build ML-ready datasets for use in ML model training or pre-processing.
The SDK makes it easy to build datasets from SQL joins, point-in-time
accurate joins, and event-range time frames, all without writing any SQL
code. This functionality is accessed via the `DatasetBuilder` class,
which is the primary entry point for the SDK functionality.


.. code:: python

   from sagemaker.feature_store.feature_store import FeatureStore

   feature_store = FeatureStore(sagemaker_session=feature_store_session)


.. code:: python

   base_feature_group = identity_feature_group
   target_feature_group = transaction_feature_group


You can create a dataset using the `create_dataset` method of the
feature store API. `base` can be either a feature group or a pandas
dataframe.


.. code:: python

   result_df, query = feature_store.create_dataset(
       base=base_feature_group,
       output_path=f"s3://{s3_bucket_name}"
   ).to_dataframe()


If you want to join another feature group, you can specify it using the
`with_feature_group` method.


.. code:: python

   dataset_builder = feature_store.create_dataset(
       base=base_feature_group,
       output_path=f"s3://{s3_bucket_name}"
   ).with_feature_group(target_feature_group, record_identifier_name)

   result_df, query = dataset_builder.to_dataframe()

.. rubric:: Using the Offline Store SDK: Configuring the DatasetBuilder
   :name: bCe9CA61b80

How the DatasetBuilder produces the resulting dataframe can be
configured in various ways.

By default, the Python SDK excludes all deleted and duplicate records.
However, if you need either of them in the returned dataset, you can
call `include_duplicated_records` or `include_deleted_records` when
creating the dataset builder.


.. code:: python

   dataset_builder.include_duplicated_records().to_dataframe()

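To make the default deduplication behavior concrete, here is a small
plain-Python sketch of the semantics. This is illustrative only, not the
SDK's implementation, and the record fields (`record_id`, `event_time`,
`write_time`, `is_deleted`) are simplified, hypothetical names:

```python
def default_filter(records):
    """Sketch of the default offline-store filtering: drop deleted
    records, and among duplicates (same record identifier and event
    time) keep only the most recently written copy."""
    latest = {}
    for rec in records:
        key = (rec["record_id"], rec["event_time"])
        # Within a duplicate group, the later write_time wins.
        if key not in latest or rec["write_time"] > latest[key]["write_time"]:
            latest[key] = rec
    # Deleted records are excluded from the returned dataset.
    return [rec for rec in latest.values() if not rec["is_deleted"]]
```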
The DatasetBuilder provides the `with_number_of_records_from_query_results`
and `with_number_of_recent_records_by_record_identifier` methods to limit
the number of records returned for the offline snapshot.


.. code:: python

   dataset_builder.with_number_of_recent_records_by_record_identifier(number_of_recent_records=1).to_dataframe()

`with_number_of_records_from_query_results` limits the total number of
records in the output. For example, when N = 100, only 100 records are
returned in the CSV file or dataframe.


.. code:: python

   dataset_builder.with_number_of_records_from_query_results(number_of_records=100).to_dataframe()

On the other hand, `with_number_of_recent_records_by_record_identifier`
deals with records that share the same record identifier: they are
sorted by `event_time`, and at most the N most recent records per
identifier are returned in the output.

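The per-identifier semantics can be sketched in plain Python. Again,
this is an illustration of the behavior, not the SDK's implementation,
and the `record_id` and `event_time` field names are hypothetical:

```python
from collections import defaultdict

def recent_records_by_identifier(records, n):
    """Sketch: keep at most the n most recent records (by event time)
    for each record identifier."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["record_id"]].append(rec)
    result = []
    for recs in groups.values():
        # Most recent events first, then truncate to n per identifier.
        recs.sort(key=lambda r: r["event_time"], reverse=True)
        result.extend(recs[:n])
    return result
```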
Since these functions return the dataset builder, they can be chained.


.. code:: python

   result_df, query = (
       dataset_builder
       .with_number_of_records_from_query_results(number_of_records=100)
       .include_duplicated_records()
       .to_dataframe()
   )

Additional configurations are available for other use cases, such as
time travel and point-in-time joins. These are outlined in the Feature
Store DatasetBuilder API Reference.

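As an illustration of what a point-in-time accurate join means, here is
a plain-Python sketch (not the SDK's implementation; the field names are
hypothetical): each base record is paired with the most recent feature
record for the same identifier whose event time does not exceed the base
record's, so no future information leaks into the training set.

```python
def point_in_time_join(base_records, feature_records):
    """Sketch of a point-in-time accurate join: for each base record,
    attach the latest feature record with the same identifier whose
    event time is at or before the base record's event time."""
    joined = []
    for base in base_records:
        candidates = [
            f for f in feature_records
            if f["record_id"] == base["record_id"]
            and f["event_time"] <= base["event_time"]
        ]
        # None if no feature record existed yet at the base event time.
        best = max(candidates, key=lambda f: f["event_time"], default=None)
        joined.append((base, best))
    return joined
```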
.. rubric:: Delete a feature group
   :name: bCe9CA61b78


The following code example is from the fraud detection example.

.. code:: python

   identity_feature_group.delete()
   transaction_feature_group.delete()