
Commit 3bd2209

Update and reorganize RTD docs (#795)
* Move intro text to pipeline's TOC section
* Reorganize RTD pages related to artifact storage
* Add a new section on using existing artifacts
* Update index.rst minor change
* Update index.rst minor tweaks
* Update s3.rst minor tweak
* Update s3.rst @yoonspark let me know if this modification makes sense
* Fix typos; remove duplicate doc
* For Postgres and S3, make clearer distinction between storing artifact values vs. metadata
* Add phrasing suggested by MMA

Co-authored-by: Moustafa AbdelBaky <[email protected]>
1 parent d4d9c9e commit 3bd2209

10 files changed (+213, -65 lines)

docs/source/fundamentals/concepts.rst (+6)

@@ -3,6 +3,8 @@
 Concepts
 ========
 
+.. _artifact_concept:
+
 Artifact
 --------
 
@@ -15,6 +17,8 @@ LineaPy not only records the state (i.e. value) of the variable but also traces
 leading to this state --- as code. Such a complete development history or *lineage* then allows LineaPy to fully reproduce
 the given artifact. Furthermore, it provides the ground to automate data engineering work to bring data science from development to production.
 
+.. _artifact_store_concept:
+
 Artifact Store
 --------------
 
@@ -25,6 +29,8 @@ This unified global storage is designed to accelerate the overall development pr
 Moreover, it can facilitate collaboration between different teams
 as it provides a single source of truth for all prior relevant work.
 
+.. _pipeline_concept:
+
 Pipeline
 --------

docs/source/guide/build_pipelines/index.rst (+8)

@@ -1,6 +1,14 @@
 Pipelines
 =========
 
+Data science workflows revolve around building and refining pipelines, i.e., a series of processes that transform data into useful information/product
+(read more about pipelines :ref:`here <pipeline_concept>`).
+
+Traditionally, this is often manual and time-consuming work as data scientists (or production engineers) need to transform messy development code
+into deployable scripts for the target system (e.g., Airflow).
+
+Having the complete development process stored in artifacts, LineaPy can automate such code transformation, accelerating transition from development to production.
+
 .. toctree::
    :maxdepth: 1

docs/source/guide/build_pipelines/pipeline_basics.rst (+3 -11)

@@ -5,24 +5,16 @@ Basics
 
 .. include:: ../../snippets/slack_support.rstinc
 
-Data science workflows revolve around building and refining pipelines, i.e., a series of processes that transform data into useful information/product
-(read more about pipelines :ref:`here <concepts>`).
-
-Traditionally, this is often manual and time-consuming work as data scientists (or production engineers) need to transform messy development code
-into deployable scripts for the target system (e.g., Airflow).
-
-Having the complete development process stored in artifacts, LineaPy can automate such code transformation, accelerating transition from development to production.
+Pipeline Creation
+-----------------
 
-For example, consider a simple pipeline that 1) pre-processes raw data and 2) trains a model with the pre-processed data.
+Consider a simple pipeline that 1) pre-processes raw data and 2) trains a model with the pre-processed data.
 
 .. image:: pipeline.png
    :width: 600
    :align: center
    :alt: Pipeline Example
 
-Pipeline Creation
------------------
-
 Once we have the pre-processed data and the trained model stored as LineaPy artifacts (which can be done during development sessions),
 building a pipeline reduces to “stitching” these artifacts, like so:
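
The hunk ends right before the code it introduces. As a hedged sketch of what such "stitching" can look like (not shown in this diff; the artifact names are borrowed from the artifact-store listing below, and the exact ``lineapy.to_pipeline`` parameters are illustrative assumptions):

.. code:: python

    # Hedged sketch: stitch saved artifacts into an Airflow pipeline.
    # Artifact names and parameter values are illustrative assumptions.
    import lineapy

    lineapy.to_pipeline(
        artifacts=["iris_preprocessed", "iris_model"],       # artifacts to stitch together
        framework="AIRFLOW",                                 # target orchestration system
        pipeline_name="iris_pipeline",                       # name for the generated pipeline
        dependencies={"iris_model": {"iris_preprocessed"}},  # model step runs after preprocessing
        output_dir="~/airflow/dags/",                        # where to write the generated files
    )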

New file (+95):

@@ -0,0 +1,95 @@
+Using Existing Artifacts
+========================
+
+Once connected to an artifact store (it can be an individual or shared one), we can query existing artifacts, like so:
+
+.. code:: python
+
+    lineapy.artifact_store()
+
+which would print a list like the following:
+
+.. code:: none
+
+    iris_preprocessed:0 created on 2022-09-29 01:22:39.612871
+    iris_preprocessed:1 created on 2022-09-29 01:22:41.336159
+    iris_preprocessed:2 created on 2022-09-29 01:22:43.511112
+    iris_model:0 created on 2022-09-29 01:22:45.381132
+    iris_model:1 created on 2022-09-29 01:22:46.786414
+    iris_model:2 created on 2022-09-29 01:22:47.990517
+    iris_model:3 created on 2022-09-29 01:22:49.366484
+    toy_artifact:0 created on 2022-09-29 01:22:50.189060
+    toy_artifact:1 created on 2022-09-29 01:22:50.676276
+    toy_artifact:2 created on 2022-09-29 01:22:51.084704
+
+Each line contains three pieces of information about an existing artifact: its name, version, and time of creation.
+Hence, for an artifact named ``iris_model``, we have four versions created at different times.
+
+Now, say we are interested in reusing the first version of this artifact. We can retrieve the desired artifact as follows:
+
+.. code:: python
+
+    model_artifact = lineapy.get("iris_model", version=0)
+
+Note that what has been retrieved and saved in ``model_artifact`` is not the model itself; it is the model *artifact*,
+which contains more than the model itself, e.g., the code that was used to generate the model. Hence, to reuse the model,
+we need to extract the artifact's value:
+
+.. code:: python
+
+    model = model_artifact.get_value()
+
+However, we actually do not fully know how to reuse this model as we are missing the memory (or knowledge, if the artifact
+was created by someone else) of its context, such as input details. Thankfully, the artifact also stores the code that was
+used to generate its value, so we can check it out:
+
+.. code:: python
+
+    print(model_artifact.get_code())
+
+which prints:
+
+.. code:: none
+
+    import lineapy
+    from sklearn.linear_model import LinearRegression
+
+    art_df_processed = lineapy.get("iris_preprocessed", version=2)
+    df_processed = art_df_processed.get_value()
+    mod = LinearRegression()
+    mod.fit(
+        X=df_processed[["petal.width", "d_versicolor", "d_virginica"]],
+        y=df_processed["sepal.width"],
+    )
+
+With this, we now know the source and shape of the data that was used to train this model,
+which enables us to adapt and reuse the model in our context. Specifically, we can check out the
+training data by loading the corresponding artifact, like so:
+
+.. code:: python
+
+    art_df_processed = lineapy.get("iris_preprocessed", version=2)
+    df_processed = art_df_processed.get_value()
+    print(df_processed)
+
+Based on the values in the data, we would have a more concrete understanding of the model and its job,
+which would enable us to make new predictions, like so:
+
+.. code:: python
+
+    import pandas as pd
+
+    # Create data to make predictions on
+    df = pd.DataFrame({
+        "petal.width": [1.3, 5.2, 0.3, 1.5, 4.9],
+        "d_versicolor": [1, 0, 0, 1, 0],
+        "d_virginica": [0, 1, 0, 0, 1],
+    })
+
+    # Make new predictions
+    df["sepal.width.pred"] = model.predict(df)
+
+This example illustrates the benefit of LineaPy's unified storage framework:
+encapsulating both value and code as well as other metadata, LineaPy's artifact store
+enables the user to explore the history and relations among different works,
+hence rendering them more reusable.
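
A natural follow-up, sketched here as a hedged illustration rather than something from the page above, is to write adapted results back to the artifact store with ``lineapy.save``, so that they too carry their values, code, and versions (the artifact name ``iris_predictions`` is an illustrative assumption):

.. code:: python

    # Hedged sketch: persist the predictions as a new artifact so that the
    # values and the code that produced them are retrievable later.
    # The artifact name is an illustrative assumption.
    lineapy.save(df, "iris_predictions")

    # Saving under an existing name (e.g., "iris_model") would instead
    # create a new version of that artifact.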

docs/source/guide/manage_artifacts/database/index.rst (-7)
This file was deleted.

(+42 -3)

@@ -1,7 +1,46 @@
-Managing Artifacts
-==================
+Artifact Storage
+================
+
+In LineaPy, an artifact store is a centralized repository for artifacts
+(check :ref:`here <artifact_store_concept>` for a conceptual explanation).
+Under the hood, it is a collection of two data structures:
+
+- Serialized artifact values (i.e., pickle files)
+- Database that stores artifact metadata (e.g., timestamp, version, code, pointer to the serialized value)
+
+Encapsulating both value and code, as well as other metadata such as creation time and version,
+LineaPy's artifact store provides a more unified and streamlined experience to save, manage, and reuse
+works from different people over time. Contrast this with a typical setup where the team stores their
+outputs in one place (e.g., a key-value store) and the code in another (e.g., GitHub repo) --- we can
+imagine how difficult it would be to maintain correlations between the two. LineaPy simplifies lineage tracking by storing all correlations in one framework: the artifact store.
+
+.. note::
+
+   By default, the serialized values and the metadata are stored in ``.lineapy/linea_pickles/``
+   and ``.lineapy/db.sqlite``, respectively, where ``.lineapy/`` is created under
+   the system's home directory.
+
+   These default locations can be overridden by modifying the configuration file:
+
+   .. code:: json
+
+      {
+         "artifact_storage_dir": [NEW-PATH-TO-STORE-SERIALIZED-VALUES],
+         "database_url": [NEW-DATABASE-URL-FOR-STORING-METADATA],
+         ...
+      }
+
+   or by making these updates in each interactive session:
+
+   .. code:: python
+
+      lineapy.options.set('artifact_storage_dir', [NEW-PATH-TO-STORE-SERIALIZED-VALUES])
+      lineapy.options.set('database_url', [NEW-DATABASE-URL-FOR-STORING-METADATA])
+
+   Read more about configuration :ref:`here <configurations>`.
 
 .. toctree::
    :maxdepth: 1
 
-   database/index
+   artifact_reuse
+   storage_location/index
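
As a brief, hedged illustration of the value-plus-code coupling described in this page (the variable and artifact names are assumptions, not from the diff), a single save call populates both halves of the store, and both can be read back from the same artifact:

.. code:: python

    # Hedged sketch: one save call serializes the value and records its
    # metadata (code, version, timestamp) in the database.
    art = lineapy.save(model, "iris_model")

    art.get_value()   # the deserialized object itself
    art.get_code()    # the cleaned-up code that produced it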
New file (+15):

@@ -0,0 +1,15 @@
+Changing Storage Location
+=========================
+
+Out of the box, LineaPy comes with a local SQLite database.
+As the user gets into more serious applications, however, this lightweight database
+poses limitations (e.g., single write access at a time).
+Accordingly, LineaPy supports other storage options such as PostgreSQL.
+This support is essential for team collaboration as it enables the artifact store
+to be hosted in a shared environment that can be accessed by different team members.
+
+.. toctree::
+   :maxdepth: 1
+
+   postgres
+   s3

docs/source/guide/manage_artifacts/database/postgres.rst renamed to docs/source/guide/manage_artifacts/storage_location/postgres.rst (+3 -3)

@@ -1,11 +1,11 @@
 .. _postgres:
 
-PostgreSQL
-==========
+Storing Artifact Metadata in PostgreSQL
+=======================================
 
 .. include:: ../../../snippets/slack_support.rstinc
 
-By default, LineaPy uses SQLite for artifact store, which keeps the package light and simple.
+By default, LineaPy uses SQLite to store artifact metadata (e.g., name, version, code), which keeps the package light and simple.
 Given the limitations of SQLite (e.g., single write access to a database at a time), however,
 we may want to use a more advanced database such as PostgreSQL.
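
The hunk ends before the setup steps, but as a hedged sketch consistent with the ``database_url`` option shown earlier in this commit (the host, credentials, and database name below are placeholders), pointing LineaPy's metadata database at a shared PostgreSQL instance would look like:

.. code:: python

    # Hedged sketch: store artifact metadata in a shared PostgreSQL database.
    # The connection details are placeholders, not values from the docs.
    lineapy.options.set(
        'database_url',
        'postgresql://lineapy_user:lineapy_password@your-db-host:5432/lineapy_db',
    )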

New file (+33):

@@ -0,0 +1,33 @@
+.. _s3:
+
+Storing Artifact Values in Amazon S3
+------------------------------------
+
+.. include:: ../../../snippets/slack_support.rstinc
+
+To use S3 as LineaPy's serialized value location, you can run the following command in your notebook to change your storage backend:
+
+.. code:: python
+
+    lineapy.options.set('artifact_storage_dir', 's3://your-bucket/your-artifact-folder')
+
+You should configure your AWS account just like you would for `AWS CLI <https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html>`_ or `boto3 <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html>`_,
+and LineaPy will use the default AWS credentials to access the S3 bucket.
+
+If you want to use other profiles available in your AWS configuration, you can set the profile name with:
+
+.. code:: python
+
+    lineapy.options.set('storage_options', {'profile': 'ANOTHER_AWS_PROFILE'})
+
+which is equivalent to setting your environment variable ``AWS_PROFILE`` to the profile name.
+
+If you really need to set your AWS credentials directly in the running session (strongly discouraged as it may result in accidentally saving these credentials in plain text), you can set them with:
+
+.. code:: python
+
+    lineapy.options.set('storage_options', {'key': 'AWS KEY', 'secret': 'AWS SECRET'})
+
+which is equivalent to setting the environment variables ``AWS_ACCESS_KEY_ID`` and ``AWS_SECRET_ACCESS_KEY``.
+
+To learn more about which S3 configuration items you can set in ``storage_options``, see the parameters of `s3fs.S3FileSystem <https://s3fs.readthedocs.io/en/latest/api.html>`_, since ``fsspec`` passes ``storage_options`` items to ``s3fs.S3FileSystem`` to access S3 under the hood.
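
Because ``storage_options`` is forwarded to ``s3fs.S3FileSystem``, other S3 client settings can be passed the same way. As a hedged sketch (the endpoint URL is a placeholder), an S3-compatible service could in principle be targeted through ``client_kwargs``:

.. code:: python

    # Hedged sketch: point the S3 client at a custom, S3-compatible endpoint
    # (e.g., a MinIO server). The endpoint URL is a placeholder.
    lineapy.options.set('storage_options', {
        'client_kwargs': {'endpoint_url': 'http://localhost:9000'},
    })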

docs/source/references/configurations.rst (+8 -41)

@@ -59,22 +59,21 @@ The configuration file shall look like this:
 
 
 
-Interactive Mode
-----------------
+.. note::
 
-During an interactive session, you can see current configuration items by typing ``lineapy.options``.
+   During an interactive session, you can see the current configuration items by typing ``lineapy.options``.
 
-You can also change the lineapy configuration items listed above with ``lineapy.options.set(key, value)``.
-However, it only makes sense to reset the session when the backend database is changed since you cannot retrieve previous information from the new database.
-Thus, the only place to change the LineaPy database is at the beginning of the notebook.
+   You can also change the LineaPy configuration items listed above with ``lineapy.options.set(key, value)``.
+   However, if you change the backend database, you will need to reset the session, since you cannot retrieve previous information from the new database.
+   Thus, the only place to change the LineaPy database is at the beginning of the notebook.
 
-Note that, you need to make sure whenever you are setting `LINEAPY_DATABASE_URL`, you point to the `LINEAPY_ARTIFACT_STORAGE_DIR`.
-If not, ``Artifact.get_value`` might return an error that is related cannot find underlying pickle object.
+   Note that whenever you set `LINEAPY_DATABASE_URL`, you need to make sure it is paired with the matching `LINEAPY_ARTIFACT_STORAGE_DIR`.
+   If not, ``Artifact.get_value`` might return an error about not being able to find the underlying pickle object.
 
 
 
 Artifact Storage Location
-=========================
+-------------------------
 
 You can change the artifact storage location by setting the `LINEAPY_ARTIFACT_STORAGE_DIR` environmental variable,
 or other ways mentioned in the above section.
@@ -108,35 +107,3 @@ Instead, if you want ot use environmental variables, you should configure it thr
 
 Note that which ``storage_options`` items you can set depends on the filesystem you are using.
 In the following section, we will discuss how to set the storage options for S3.
-
-Using S3 as an artifact storage location
-----------------------------------------
-
-To use S3 as LineaPy artifact storage location, you can run the following command in your notebook to change your storage backend(both artifact locations and LineaPy database)
-
-.. code:: python
-
-    lineapy.options.set('artifact_storage_dir','s3://your-bucket/your-artifact-folder')
-    lineapy.options.set('database_url','corresponding-database-url')
-
-You should configure your AWS account just like `AWS CLI <https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html>`_ or `boto3 <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html>`_,
-and LineaPy will use the default AWS credentials to access the S3 bucket.
-
-If you want to use other profiles available in your AWS configuration, you can set the profile name with
-
-.. code:: python
-
-    lineapy.options.set('storage_options',{'profile':'ANOTHER_AWS_PROFILE'})
-
-which is equivalent to setting your environment variable ``AWS_PROFILE`` to the profile name.
-
-If you really need to use your AWS key and secret directly(strongly not recommended), you can set them with
-
-.. code:: python
-
-    lineapy.options.set('storage_options',{'key':'AWS KEY','secret':'AWS SECRET'})
-
-which is equivalent to setting your environment variables ``AWS_ACCESS_KEY_ID`` and ``AWS_SECRET_ACCESS_KEY```.
-
-To learn more about which S3 configuration items that you can set in ``storage_options``, you can see the parameters of `s3fs.S3FileSystem <https://s3fs.readthedocs.io/en/latest/api.html>`_ since ``fsspec`` is passing ``storage_options`` items to ``s3fs.S3FileSystem`` to access S3 under the hood.
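
To make the pairing note in the hunk above concrete, a hedged sketch (the database URL and bucket path are placeholders) of setting a matching metadata database and artifact storage location at the top of a notebook:

.. code:: python

    # Hedged sketch: keep the metadata database and the serialized-value
    # location in sync at the start of the session. Values are placeholders.
    import lineapy

    lineapy.options.set('database_url', 'postgresql://user:password@your-db-host:5432/lineapy_db')
    lineapy.options.set('artifact_storage_dir', 's3://your-bucket/your-artifact-folder')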
