
Commit 3bd2209

Update and reorganize RTD docs (#795)
* Move intro text to pipeline's TOC section
* Reorganize RTD pages related to artifact storage
* Add a new section on using existing artifacts
* Update index.rst minor change
* Update index.rst minor tweaks
* Update s3.rst minor tweak
* Update s3.rst @yoonspark let me know if this modification makes sense
* Fix typos; remove duplicate doc
* For Postgres and S3, make clearer distinction between storing artifact values vs. metadata
* Add phrasing suggested by MMA

Co-authored-by: Moustafa AbdelBaky <[email protected]>
1 parent d4d9c9e commit 3bd2209

10 files changed (+213, -65 lines)

docs/source/fundamentals/concepts.rst (+6)

@@ -3,6 +3,8 @@
 Concepts
 ========
 
+.. _artifact_concept:
+
 Artifact
 --------
 
@@ -15,6 +17,8 @@ LineaPy not only records the state (i.e. value) of the variable but also traces
 leading to this state --- as code. Such a complete development history or *lineage* then allows LineaPy to fully reproduce
 the given artifact. Furthermore, it provides the ground to automate data engineering work to bring data science from development to production.
 
+.. _artifact_store_concept:
+
 Artifact Store
 --------------
 
@@ -25,6 +29,8 @@ This unified global storage is designed to accelerate the overall development pr
 Moreover, it can facilitate collaboration between different teams
 as it provides a single source of truth for all prior relevant work.
 
+.. _pipeline_concept:
+
 Pipeline
 --------

docs/source/guide/build_pipelines/index.rst (+8)

@@ -1,6 +1,14 @@
 Pipelines
 =========
 
+Data science workflows revolve around building and refining pipelines, i.e., a series of processes that transform data into useful information/product
+(read more about pipelines :ref:`here <pipeline_concept>`).
+
+Traditionally, this is often manual and time-consuming work as data scientists (or production engineers) need to transform messy development code
+into deployable scripts for the target system (e.g., Airflow).
+
+Having the complete development process stored in artifacts, LineaPy can automate such code transformation, accelerating transition from development to production.
+
 .. toctree::
    :maxdepth: 1

docs/source/guide/build_pipelines/pipeline_basics.rst (+3 -11)

@@ -5,24 +5,16 @@ Basics
 
 .. include:: ../../snippets/slack_support.rstinc
 
-Data science workflows revolve around building and refining pipelines, i.e., a series of processes that transform data into useful information/product
-(read more about pipelines :ref:`here <concepts>`).
-
-Traditionally, this is often manual and time-consuming work as data scientists (or production engineers) need to transform messy development code
-into deployable scripts for the target system (e.g., Airflow).
-
-Having the complete development process stored in artifacts, LineaPy can automate such code transformation, accelerating transition from development to production.
+Pipeline Creation
+-----------------
 
-For example, consider a simple pipeline that 1) pre-processes raw data and 2) trains a model with the pre-processed data.
+Consider a simple pipeline that 1) pre-processes raw data and 2) trains a model with the pre-processed data.
 
 .. image:: pipeline.png
    :width: 600
    :align: center
    :alt: Pipeline Example
 
-Pipeline Creation
------------------
-
 Once we have the pre-processed data and the trained model stored as LineaPy artifacts (which can be done during development sessions),
 building a pipeline reduces to “stitching” these artifacts, like so:
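
The hunk ends right before the code it introduces. As a hedged sketch of what such "stitching" can look like (not shown in this diff; the artifact names are borrowed from the artifact-store listing below, and the exact ``lineapy.to_pipeline`` parameters are illustrative assumptions):

.. code:: python

    # Hedged sketch: stitch saved artifacts into an Airflow pipeline.
    # Artifact names and parameter values are illustrative assumptions.
    import lineapy

    lineapy.to_pipeline(
        artifacts=["iris_preprocessed", "iris_model"],       # artifacts to stitch together
        framework="AIRFLOW",                                 # target orchestration system
        pipeline_name="iris_pipeline",                       # name for the generated pipeline
        dependencies={"iris_model": {"iris_preprocessed"}},  # model step runs after preprocessing
        output_dir="~/airflow/dags/",                        # where to write the generated files
    )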

New file (+95):

@@ -0,0 +1,95 @@
+Using Existing Artifacts
+========================
+
+Once connected to an artifact store (it can be an individual or shared one), we can query existing artifacts, like so:
+
+.. code:: python
+
+    lineapy.artifact_store()
+
+which would print a list like the following:
+
+.. code:: none
+
+    iris_preprocessed:0 created on 2022-09-29 01:22:39.612871
+    iris_preprocessed:1 created on 2022-09-29 01:22:41.336159
+    iris_preprocessed:2 created on 2022-09-29 01:22:43.511112
+    iris_model:0 created on 2022-09-29 01:22:45.381132
+    iris_model:1 created on 2022-09-29 01:22:46.786414
+    iris_model:2 created on 2022-09-29 01:22:47.990517
+    iris_model:3 created on 2022-09-29 01:22:49.366484
+    toy_artifact:0 created on 2022-09-29 01:22:50.189060
+    toy_artifact:1 created on 2022-09-29 01:22:50.676276
+    toy_artifact:2 created on 2022-09-29 01:22:51.084704
+
+Each line contains three pieces of information about an existing artifact: its name, version, and time of creation.
+Hence, for an artifact named ``iris_model``, we have four versions created at different times.
+
+Now, say we are interested in reusing the first version of this artifact. We can retrieve the desired artifact as follows:
+
+.. code:: python
+
+    model_artifact = lineapy.get("iris_model", version=0)
+
+Note that what has been retrieved and saved in ``model_artifact`` is not the model itself; it is the model *artifact*,
+which contains more than the model itself, e.g., the code that was used to generate the model. Hence, to reuse the model,
+we need to extract the artifact's value:
+
+.. code:: python
+
+    model = model_artifact.get_value()
+
+However, we actually do not fully know how to reuse this model as we are missing the memory (or knowledge, if the artifact
+was created by someone else) of its context, such as input details. Thankfully, the artifact also stores the code that was
+used to generate its value, so we can check it out:
+
+.. code:: python
+
+    print(model_artifact.get_code())
+
+which prints:
+
+.. code:: none
+
+    import lineapy
+    from sklearn.linear_model import LinearRegression
+
+    art_df_processed = lineapy.get("iris_preprocessed", version=2)
+    df_processed = art_df_processed.get_value()
+    mod = LinearRegression()
+    mod.fit(
+        X=df_processed[["petal.width", "d_versicolor", "d_virginica"]],
+        y=df_processed["sepal.width"],
+    )
+
+With this, we now know the source and shape of the data that was used to train this model,
+which enables us to adapt and reuse the model in our context. Specifically, we can check out the
+training data by loading the corresponding artifact, like so:
+
+.. code:: python
+
+    art_df_processed = lineapy.get("iris_preprocessed", version=2)
+    df_processed = art_df_processed.get_value()
+    print(df_processed)
+
+Based on the values in the data, we would have a more concrete understanding of the model and its job,
+which would enable us to make new predictions, like so:
+
+.. code:: python
+
+    import pandas as pd
+
+    # Create data to make predictions on
+    df = pd.DataFrame({
+        "petal.width": [1.3, 5.2, 0.3, 1.5, 4.9],
+        "d_versicolor": [1, 0, 0, 1, 0],
+        "d_virginica": [0, 1, 0, 0, 1],
+    })
+
+    # Make new predictions
+    df["sepal.width.pred"] = model.predict(df)
+
+This example illustrates the benefit of LineaPy's unified storage framework:
+encapsulating both value and code as well as other metadata, LineaPy's artifact store
+enables the user to explore the history and relations among different works,
+hence rendering them more reusable.
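
A natural follow-up, sketched here as a hedged illustration rather than something from the page above, is to write adapted results back to the artifact store with ``lineapy.save``, so that they too carry their values, code, and versions (the artifact name ``iris_predictions`` is an illustrative assumption):

.. code:: python

    # Hedged sketch: persist the predictions as a new artifact so that the
    # values and the code that produced them are retrievable later.
    # The artifact name is an illustrative assumption.
    lineapy.save(df, "iris_predictions")

    # Saving under an existing name (e.g., "iris_model") would instead
    # create a new version of that artifact.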

docs/source/guide/manage_artifacts/database/index.rst (-7)
This file was deleted.

(+42 -3)

@@ -1,7 +1,46 @@
-Managing Artifacts
-==================
+Artifact Storage
+================
+
+In LineaPy, an artifact store is a centralized repository for artifacts
+(check :ref:`here <artifact_store_concept>` for a conceptual explanation).
+Under the hood, it is a collection of two data structures:
+
+- Serialized artifact values (i.e., pickle files)
+- Database that stores artifact metadata (e.g., timestamp, version, code, pointer to the serialized value)
+
+Encapsulating both value and code, as well as other metadata such as creation time and version,
+LineaPy's artifact store provides a more unified and streamlined experience to save, manage, and reuse
+works from different people over time. Contrast this with a typical setup where the team stores their
+outputs in one place (e.g., a key-value store) and the code in another (e.g., GitHub repo) --- we can
+imagine how difficult it would be to maintain correlations between the two. LineaPy simplifies lineage tracking by storing all correlations in one framework: the artifact store.
+
+.. note::
+
+   By default, the serialized values and the metadata are stored in ``.lineapy/linea_pickles/``
+   and ``.lineapy/db.sqlite``, respectively, where ``.lineapy/`` is created under
+   the system's home directory.
+
+   These default locations can be overridden by modifying the configuration file:
+
+   .. code:: json
+
+      {
+         "artifact_storage_dir": [NEW-PATH-TO-STORE-SERIALIZED-VALUES],
+         "database_url": [NEW-DATABASE-URL-FOR-STORING-METADATA],
+         ...
+      }
+
+   or by making these updates in each interactive session:
+
+   .. code:: python
+
+      lineapy.options.set('artifact_storage_dir', [NEW-PATH-TO-STORE-SERIALIZED-VALUES])
+      lineapy.options.set('database_url', [NEW-DATABASE-URL-FOR-STORING-METADATA])
+
+   Read more about configuration :ref:`here <configurations>`.
 
 .. toctree::
    :maxdepth: 1
 
-   database/index
+   artifact_reuse
+   storage_location/index
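
As a brief, hedged illustration of the value-plus-code coupling described in this page (the variable and artifact names are assumptions, not from the diff), a single save call populates both halves of the store, and both can be read back from the same artifact:

.. code:: python

    # Hedged sketch: one save call serializes the value and records its
    # metadata (code, version, timestamp) in the database.
    art = lineapy.save(model, "iris_model")

    art.get_value()   # the deserialized object itself
    art.get_code()    # the cleaned-up code that produced it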
New file (+15):

@@ -0,0 +1,15 @@
+Changing Storage Location
+=========================
+
+Out of the box, LineaPy comes with a local SQLite database.
+As the user gets into more serious applications, however, this lightweight database
+poses limitations (e.g., single write access at a time).
+Accordingly, LineaPy supports other storage options such as PostgreSQL.
+This support is essential for team collaboration as it enables the artifact store
+to be hosted in a shared environment that can be accessed by different team members.
+
+.. toctree::
+   :maxdepth: 1
+
+   postgres
+   s3

docs/source/guide/manage_artifacts/database/postgres.rst renamed to docs/source/guide/manage_artifacts/storage_location/postgres.rst (+3 -3)

@@ -1,11 +1,11 @@
 .. _postgres:
 
-PostgreSQL
-==========
+Storing Artifact Metadata in PostgreSQL
+=======================================
 
 .. include:: ../../../snippets/slack_support.rstinc
 
-By default, LineaPy uses SQLite for artifact store, which keeps the package light and simple.
+By default, LineaPy uses SQLite to store artifact metadata (e.g., name, version, code), which keeps the package light and simple.
 Given the limitations of SQLite (e.g., single write access to a database at a time), however,
 we may want to use a more advanced database such as PostgreSQL.
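
The hunk ends before the setup steps, but as a hedged sketch consistent with the ``database_url`` option shown earlier in this commit (the host, credentials, and database name below are placeholders), pointing LineaPy's metadata database at a shared PostgreSQL instance would look like:

.. code:: python

    # Hedged sketch: store artifact metadata in a shared PostgreSQL database.
    # The connection details are placeholders, not values from the docs.
    lineapy.options.set(
        'database_url',
        'postgresql://lineapy_user:lineapy_password@your-db-host:5432/lineapy_db',
    )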

New file (+33):

@@ -0,0 +1,33 @@
+.. _s3:
+
+Storing Artifact Values in Amazon S3
+------------------------------------
+
+.. include:: ../../../snippets/slack_support.rstinc
+
+To use S3 as LineaPy's serialized value location, you can run the following command in your notebook to change your storage backend:
+
+.. code:: python
+
+    lineapy.options.set('artifact_storage_dir', 's3://your-bucket/your-artifact-folder')
+
+You should configure your AWS account just like you would for `AWS CLI <https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html>`_ or `boto3 <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html>`_,
+and LineaPy will use the default AWS credentials to access the S3 bucket.
+
+If you want to use other profiles available in your AWS configuration, you can set the profile name with:
+
+.. code:: python
+
+    lineapy.options.set('storage_options', {'profile': 'ANOTHER_AWS_PROFILE'})
+
+which is equivalent to setting your environment variable ``AWS_PROFILE`` to the profile name.
+
+If you really need to set your AWS credentials directly in the running session (strongly discouraged as it may result in accidentally saving these credentials in plain text), you can set them with:
+
+.. code:: python
+
+    lineapy.options.set('storage_options', {'key': 'AWS KEY', 'secret': 'AWS SECRET'})
+
+which is equivalent to setting the environment variables ``AWS_ACCESS_KEY_ID`` and ``AWS_SECRET_ACCESS_KEY``.
+
+To learn more about which S3 configuration items you can set in ``storage_options``, see the parameters of `s3fs.S3FileSystem <https://s3fs.readthedocs.io/en/latest/api.html>`_, since ``fsspec`` passes ``storage_options`` items to ``s3fs.S3FileSystem`` to access S3 under the hood.
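
Because ``storage_options`` is forwarded to ``s3fs.S3FileSystem``, other S3 client settings can be passed the same way. As a hedged sketch (the endpoint URL is a placeholder), an S3-compatible service could in principle be targeted through ``client_kwargs``:

.. code:: python

    # Hedged sketch: point the S3 client at a custom, S3-compatible endpoint
    # (e.g., a MinIO server). The endpoint URL is a placeholder.
    lineapy.options.set('storage_options', {
        'client_kwargs': {'endpoint_url': 'http://localhost:9000'},
    })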

docs/source/references/configurations.rst (+8 -41)

@@ -59,22 +59,21 @@ The configuration file shall look like this:
 
 
 
-Interactive Mode
-----------------
+.. note::
 
-During an interactive session, you can see current configuration items by typing ``lineapy.options``.
+   During an interactive session, you can see the current configuration items by typing ``lineapy.options``.
 
-You can also change the lineapy configuration items listed above with ``lineapy.options.set(key, value)``.
-However, it only makes sense to reset the session when the backend database is changed since you cannot retrieve previous information from the new database.
-Thus, the only place to change the LineaPy database is at the beginning of the notebook.
+   You can also change the LineaPy configuration items listed above with ``lineapy.options.set(key, value)``.
+   However, if you change the backend database, you will need to reset the session, since you cannot retrieve previous information from the new database.
+   Thus, the only place to change the LineaPy database is at the beginning of the notebook.
 
-Note that, you need to make sure whenever you are setting `LINEAPY_DATABASE_URL`, you point to the `LINEAPY_ARTIFACT_STORAGE_DIR`.
-If not, ``Artifact.get_value`` might return an error that is related cannot find underlying pickle object.
+   Note that whenever you set `LINEAPY_DATABASE_URL`, you need to make sure it is paired with the matching `LINEAPY_ARTIFACT_STORAGE_DIR`.
+   If not, ``Artifact.get_value`` might return an error about not being able to find the underlying pickle object.
 
 
 
 Artifact Storage Location
-=========================
+-------------------------
 
 You can change the artifact storage location by setting the `LINEAPY_ARTIFACT_STORAGE_DIR` environmental variable,
 or other ways mentioned in the above section.
@@ -108,35 +107,3 @@ Instead, if you want ot use environmental variables, you should configure it thr
 
 Note that which ``storage_options`` items you can set depends on the filesystem you are using.
 In the following section, we will discuss how to set the storage options for S3.
-
-Using S3 as an artifact storage location
-----------------------------------------
-
-To use S3 as LineaPy artifact storage location, you can run the following command in your notebook to change your storage backend(both artifact locations and LineaPy database)
-
-.. code:: python
-
-    lineapy.options.set('artifact_storage_dir','s3://your-bucket/your-artifact-folder')
-    lineapy.options.set('database_url','corresponding-database-url')
-
-You should configure your AWS account just like `AWS CLI <https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html>`_ or `boto3 <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html>`_,
-and LineaPy will use the default AWS credentials to access the S3 bucket.
-
-If you want to use other profiles available in your AWS configuration, you can set the profile name with
-
-.. code:: python
-
-    lineapy.options.set('storage_options',{'profile':'ANOTHER_AWS_PROFILE'})
-
-which is equivalent to setting your environment variable ``AWS_PROFILE`` to the profile name.
-
-If you really need to use your AWS key and secret directly(strongly not recommended), you can set them with
-
-.. code:: python
-
-    lineapy.options.set('storage_options',{'key':'AWS KEY','secret':'AWS SECRET'})
-
-which is equivalent to setting your environment variables ``AWS_ACCESS_KEY_ID`` and ``AWS_SECRET_ACCESS_KEY```.
-
-To learn more about which S3 configuration items that you can set in ``storage_options``, you can see the parameters of `s3fs.S3FileSystem <https://s3fs.readthedocs.io/en/latest/api.html>`_ since ``fsspec`` is passing ``storage_options`` items to ``s3fs.S3FileSystem`` to access S3 under the hood.
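
To make the pairing note in the hunk above concrete, a hedged sketch (the database URL and bucket path are placeholders) of setting a matching metadata database and artifact storage location at the top of a notebook:

.. code:: python

    # Hedged sketch: keep the metadata database and the serialized-value
    # location in sync at the start of the session. Values are placeholders.
    import lineapy

    lineapy.options.set('database_url', 'postgresql://user:password@your-db-host:5432/lineapy_db')
    lineapy.options.set('artifact_storage_dir', 's3://your-bucket/your-artifact-folder')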
