
Commit 89f5356

kfettich authored and gitbook-bot committed

GitBook: [#3] Developer docs initial commit
1 parent 7001861 commit 89f5356

24 files changed: +576 -36 lines changed

.gitbook/assets/image (1).png

156 KB

.gitbook/assets/image (2).png

69.9 KB

.gitbook/assets/image (3).png

69.9 KB

.gitbook/assets/image (4).png

65.2 KB

.gitbook/assets/image.png

98.7 KB

README.md

Lines changed: 2 additions & 34 deletions
**Removed:**

# Project Overview

## [The Philadelphia Animal Welfare Society (PAWS)](https://github.com/CodeForPhilly/paws-data-pipeline/blob/master/phillypaws.org)

As the city's largest animal rescue partner and no-kill animal shelter, the [Philadelphia Animal Welfare Society (PAWS)](https://github.com/CodeForPhilly/paws-data-pipeline/blob/master/phillypaws.org) is working to make Philadelphia a place where every healthy and treatable pet is guaranteed a home. Since its inception over 10 years ago, PAWS has rescued and placed 27,000+ animals in adoptive and foster homes, and has worked to prevent pet homelessness by providing 86,000+ low-cost spay/neuter services and affordable vet care to 227,000+ clinic patients. PAWS is funded 100% through donations, with 91 cents of every dollar collected going directly to the animals. PAWS' rescue work (including 3 shelters and all rescue and animal care programs), administration, and development efforts are therefore coordinated by only about 70 staff members, complemented by over 1,500 volunteers.

This project seeks to provide PAWS with an easy-to-use and easy-to-support tool to extract data from multiple source systems, confirm accuracy and appropriateness, clean/validate data where necessary (a data hygiene and wrangling step), and then load relevant data into one or more repositories to facilitate (1) a highly accurate and rich 360-degree view of PAWS constituents (Salesforce, already in use at PAWS, is a likely candidate target system) and (2) flexible ongoing data analysis and insights discovery (e.g. a data lake / data warehouse).

Through all of its operational and service activities, PAWS accumulates data regarding donations, adoptions, fosters, volunteers, merchandise sales, and event attendees (to name a few), each in its own system and/or manual (Google Sheets) tally. This vital data that could drive insights remains siloed and is usually difficult to extract, manipulate, and analyze. Making all of this data readily available and drawing inferences through analysis can drive many benefits: PAWS operations can be better informed and use data-driven decisions to guide programs and maximize effectiveness; supporters can be further engaged by suggesting additional opportunities for involvement based on pattern analysis; and multi-dimensional supporters can be consistently (and accurately) acknowledged for all the ways they support PAWS (e.g. a volunteer who donates and also fosters kittens), not to mention opportunities to further tap the potential of these enthusiastic supporters.

### [The Data Pipeline](https://codeforphilly.org/projects/paws_data_pipeline)

Through all of its operational and service activities, PAWS accumulates data regarding donations, adoptions, fosters, volunteers, merchandise sales, and event attendees (to name a few), each in its own system and/or manual tally. This vital data that could drive insights remains siloed and is usually difficult to extract, manipulate, and analyze.

This project provides PAWS with an easy-to-use and easy-to-support tool to extract constituent data from multiple source systems, standardize extracted data, match constituents across data sources, load relevant data into Salesforce, and run an automation in Salesforce to produce an RFM score. Through these processes, the PAWS data pipeline has laid the groundwork for an up-to-date 360-degree view of PAWS constituents and for flexible ongoing data analysis and insights discovery.

### Uses

* The pipeline can inform the PAWS development team of new constituents through volunteer or foster engagement
* Instead of manually matching constituents from volunteering, donations, and fosters/adoptions, PAWS staff only need to upload the volunteer dataset into the pipeline, and the pipeline handles the matching
* Volunteer and foster data are automatically loaded into the constituent's Salesforce profile
* An RFM score is calculated for each constituent using the most recent data
* Data analyses can use the output of the PDP matching logic to join datasets from different sources; PAWS can benefit from such analyses in the following ways:
  * PAWS operations can be better informed and use data-driven decisions to guide programs and maximize effectiveness
  * Supporters can be further engaged by suggesting additional opportunities for involvement based on pattern analysis
  * Multi-dimensional supporters can be consistently (and accurately) acknowledged for all the ways they support PAWS (e.g. a volunteer who donates and also fosters kittens), not to mention opportunities to further tap the potential of these enthusiastic supporters

### [Code of Conduct](https://codeforphilly.org/pages/code_of_conduct)

**Added:**

# Architecture

This section contains the architecture for the PAWS data pipeline project.

SUMMARY.md

Lines changed: 18 additions & 2 deletions
# Table of contents

**Removed:**

* [Project Overview](README.md)
* [End User Manual](end-user-manual.md)

**Added:**

* [Architecture](README.md)
  * [Async on the cheap (for MVP)](architecture/async-on-the-cheap-for-mvp.md)
  * [Execution status stages](architecture/execution-status-stages.md)
  * [User management and authorization](architecture/user-management-and-authorization.md)
  * [RFM](architecture/rfm.md)
  * [Data Flow](architecture/data-flow.md)
  * [Database Schema](architecture/database-schema.md)
* [Setup](setup/README.md)
  * [Getting Started](setup/getting-started.md)
  * [Local Setup](setup/local-setup.md)
  * [Kubernetes Setup](setup/kubernetes-setup.md)
  * [Accessing APIs without React](setup/accessing-apis-without-react.md)
* [Deployment](deployment/README.md)
  * [Using GitHub actions](deployment/using-github-actions.md)
  * [Deploying PDP within the Code for Philly cluster](deployment/deploying-pdp-within-the-code-for-philly-cluster.md)
  * [Kubernetes logs](deployment/kubernetes-logs.md)
* [Troubleshooting](troubleshooting/README.md)
  * [Dups Problem](troubleshooting/dups-problem.md)
architecture/async-on-the-cheap-for-mvp.md

Lines changed: 34 additions & 0 deletions
# Async on the cheap (for MVP)

### Introduction

It's recognized [1, 2] that the best way to handle long-running tasks is to use a task queue, allowing separation of the middle layer (API server) and the execution server. But as we're trying to get an MVP out for feedback, it's not unreasonable to use a less-than-perfect solution in the interim. Here are a few ideas for discussion:
6+
7+
### _Continue to treat execute() as synchronous but stream back status information_
8+
9+
We've been operating (at the API server) with a model of _receive request, do work, return() with data_. But both Flask and JS support streaming data in chunks from server to client:\
10+
Flask: [Streaming Contents](https://flask.palletsprojects.com/en/1.1.x/patterns/streaming/)\
11+
JS: [Using readable streams](https://developer.mozilla.org/en-US/docs/Web/API/Streams\_API/Using\_readable\_streams)\
12+
\
13+
From the Flask side, the data it streams back would be status updates (_e.g._, every 100 rows processed) which the React client would use to update the display. When the server sends back "complete", React displays a nice completion message and the user proceeds to the 360 view.
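As an illustration, here is a minimal sketch of this model, assuming a Flask route that wraps the work in a generator; the route name, row counts, and message format are placeholders, not the actual PDP code:

```python
from flask import Flask, Response

app = Flask(__name__)

@app.route("/api/execute_streaming")  # hypothetical route name
def execute_streaming():
    def stream_status():
        total_rows = 1000  # placeholder; the real count comes from the upload
        for done in range(100, total_rows + 1, 100):
            # ... process the next 100 rows here ...
            yield f"processed {done} of {total_rows} rows\n"
        yield "complete\n"

    # Passing a generator to Response makes Flask stream each yielded chunk.
    return Response(stream_status(), mimetype="text/plain")
```

On the client side, the React app would read the response body incrementally via the Streams API linked above.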
#### **Evaluation**

Doesn't appear to require much heavy lifting at server or client (we would need to figure out how to feed the generator on the server), but it may be a bit brittle; if there's any kind of network hiccup (or the user reloads the page?), the stream would be broken and we wouldn't be able to tell the user anything useful.
### _Client aborts Fetch, polls status API until completion_

In this idea, instead of waiting for the execute() Fetch to complete, the React client uses an [AbortController](https://developer.mozilla.org/en-US/docs/Web/API/AbortController/abort) to cancel the pending Fetch. It then starts polling the API execution status endpoint, displaying updates until that endpoint reports that the operation is complete.
**Evaluation**

Using SQLAlchemy's `engine.dispose()` and two uWSGI processes, I've got `/api/get_execution_status/<job_id>` working correctly. I'd probably want to have it find the latest job

![](https://user-images.githubusercontent.com/11001850/112061042-4ceb9580-8b34-11eb-8dc7-fb9eede44d7d.png)

instead of having to specify it (although we could use the streaming model above to send back the job_id). We need to figure out what side effects there might be to cancelling the Fetch. I presume the browser would drop the connection; will Flask assume it can kill the request?\
The client could check status when the page loads to see if there's a running job, so it would be more robust in the face of network issues or reloads.
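For illustration, here is the polling loop sketched in Python rather than React (the real client would use Fetch); the endpoint path comes from the text above, while the JSON shape and the terminal status value are assumptions:

```python
import time

import requests

def poll_until_complete(base_url: str, job_id: int) -> dict:
    """Poll the execution status endpoint until the job reports completion."""
    while True:
        resp = requests.get(f"{base_url}/api/get_execution_status/{job_id}")
        resp.raise_for_status()
        status = resp.json()  # assumed shape: {"status": "...", ...}
        if status.get("status") == "complete":  # assumed terminal value
            return status
        time.sleep(2)  # wait a couple of seconds between polls
```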
[1] [https://flask.palletsprojects.com/en/1.1.x/patterns/celery/](https://flask.palletsprojects.com/en/1.1.x/patterns/celery/)\
[2] [https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-xxii-background-jobs](https://blog.miguelgrinberg.com/post/the-flask-mega-tutorial-part-xxii-background-jobs)

architecture/data-flow.md

Lines changed: 7 additions & 0 deletions
# Data Flow

![](../.gitbook/assets/image.png)

[flow chart](https://app.lucidchart.com/invitations/accept/0602fccf-18f9-48d4-84ff-ffe5f0b03e7a)

**ShelterLuv People**: This data is pulled via a script that calls ShelterLuv and saves the data as a CSV into a Dropbox folder via an "app". It is set up to use a config file plus a cron job, although this is not yet active in deployment. Every time it pulls data, it pulls everything, because the API doesn't support pagination. To configure automation, the config file needs to contain the app ID.
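A rough sketch of that pull-everything script follows; the ShelterLuv endpoint, header name, and response shape are assumptions for illustration, and the real script reads the app ID/API key from its config file:

```python
import csv

import requests

SHELTERLUV_PEOPLE_URL = "https://www.shelterluv.com/api/v1/people"  # assumed endpoint

def pull_people(api_key: str, out_path: str) -> None:
    # No pagination support, so every run fetches the full dataset.
    resp = requests.get(SHELTERLUV_PEOPLE_URL, headers={"X-Api-Key": api_key})
    resp.raise_for_status()
    people = resp.json().get("people", [])  # assumed response shape
    if not people:
        return
    # Write the records as a CSV into the Dropbox-synced folder.
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(people[0].keys()))
        writer.writeheader()
        writer.writerows(people)
```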

architecture/database-schema.md

Lines changed: 8 additions & 0 deletions
# Database Schema

TODO: fix link

[https://app.diagrams.net/#G1X4KbjYf7vcrfbeJLfyCj8xUPp8zGcV2k](https://app.diagrams.net/#G1X4KbjYf7vcrfbeJLfyCj8xUPp8zGcV2k)
architecture/execution-status-stages.md

Lines changed: 5 additions & 0 deletions
# Execution status stages

The execution_status table will be updated for a given job_id through the stages in the diagram.

![](<../.gitbook/assets/image (1).png>)
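As a minimal sketch of recording a stage transition, assuming the table has `job_id` and `stage` columns (the actual column names and stage values come from the schema and the diagram):

```python
import sqlalchemy as sa

def set_stage(engine: sa.engine.Engine, job_id: int, stage: str) -> None:
    # Record the job's current stage; called once per transition in the diagram.
    with engine.begin() as conn:
        conn.execute(
            sa.text(
                "UPDATE execution_status SET stage = :stage WHERE job_id = :job_id"
            ),
            {"stage": stage, "job_id": job_id},
        )
```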

architecture/rfm.md

Lines changed: 68 additions & 0 deletions
# RFM

## RFM Data Flows

![](<../.gitbook/assets/image (2).png>)

## RFM Database Tables

![](<../.gitbook/assets/image (4).png>)

## RFME Bin Logic

### Recency:

If a person's last donation was:

* In the last 180 days: R = 5
* 180-365 days ago: R = 4
* 365-728 days ago: R = 3
* 728-1093 days ago: R = 2
* More than 1093 days ago: R = 1
* Never given: R = 0
### Frequency:

If in the last 24 months someone has made a total of:

* 24 or more donations: F = 5
* 12 - 23 donations: F = 4
* 3 - 11 donations: F = 3
* 2 donations: F = 2
* 1 donation: F = 1
* 0 donations: F = 0
### Monetary value:

If someone's cumulative giving in the past 24 months is:

* $2,001 or more: M = 5
* $501 - $2,000: M = 4
* $250 - $500: M = 3
* $101 - $249: M = 2
* $25 - $100: M = 1
* $0 - $25: M = 0
### Impact labels:

* High impact: (F+M)/2 is between 4 and 5
* Low impact: (F+M)/2 is between 1 and 3

### Engagement labels:

* Engaged: R = 5
* Slipping: R is 3-4
* Disengaged: R is 1-2
### Can we integrate scoring for fosters/volunteers?

"RFME" (E for Engagement):

* Volunteered or fostered in the past 30 days: E = 5
* Volunteered or fostered in the past 6 months: E = 4
* Volunteered or fostered in the past year: E = 3
* Volunteered or fostered in the past 2 years: E = 2
* Ever volunteered or fostered: E = 1
* Never volunteered or fostered: E = 0

(Modified from Lauren's request of E = 5 (current), E = 4 (within the past year), E = 3 (within the past two years), E = 2 (ever), E = 0 (never), because the "1" value was missing and "current" needed a more specific definition.)
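The bins above translate directly into code. A sketch, treating each range's upper bound as inclusive, reading 6 months/1 year/2 years as 182/365/730 days, and treating any average below 4 as low impact (all interpretations of the cutoffs, not project code):

```python
from typing import Optional

def recency_score(days_since_last_donation: Optional[int]) -> int:
    """R bin from days since the person's last donation (None = never given)."""
    if days_since_last_donation is None:
        return 0
    if days_since_last_donation <= 180:
        return 5
    if days_since_last_donation <= 365:
        return 4
    if days_since_last_donation <= 728:
        return 3
    if days_since_last_donation <= 1093:
        return 2
    return 1

def frequency_score(donations_last_24_months: int) -> int:
    """F bin from the number of donations in the last 24 months."""
    n = donations_last_24_months
    if n >= 24:
        return 5
    if n >= 12:
        return 4
    if n >= 3:
        return 3
    return n  # 2 -> 2, 1 -> 1, 0 -> 0

def monetary_score(dollars_last_24_months: float) -> int:
    """M bin from cumulative giving in the past 24 months."""
    d = dollars_last_24_months
    if d >= 2001:
        return 5
    if d >= 501:
        return 4
    if d >= 250:
        return 3
    if d >= 101:
        return 2
    if d >= 25:
        return 1
    return 0

def engagement_score(days_since_last_activity: Optional[int]) -> int:
    """E bin from days since last volunteer/foster activity (None = never)."""
    if days_since_last_activity is None:
        return 0
    if days_since_last_activity <= 30:
        return 5
    if days_since_last_activity <= 182:
        return 4
    if days_since_last_activity <= 365:
        return 3
    if days_since_last_activity <= 730:
        return 2
    return 1

def impact_label(f: int, m: int) -> str:
    return "High impact" if (f + m) / 2 >= 4 else "Low impact"

def engagement_label(r: int) -> str:
    if r == 5:
        return "engaged"
    return "slipping" if r >= 3 else "disengaged"
```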
architecture/user-management-and-authorization.md

Lines changed: 55 additions & 0 deletions
# User management and authorization

### Intro

Because the 360 view gives access to sensitive personal information, we need to ensure that only authorized users can access PDP pages.
### Roles

There are three authorization levels/user roles:

* User: Can use the **Common API** to view 360 data but not make any changes
* Editor: User role, plus can use the **Editor API** to manually link existing contacts
* Admin: Editor role, plus can use the **Admin API** to upload data and manage users
### Login

Upon login, the user API shall return a JSON Web [Access] Token (JWT) with a limited lifetime [1]. The JWT includes the user's role.

### Authorization

The React client shall render only resources that are authorized by the current user's role. The React client shall present the JWT (using the **Authorization: Bearer** header) to the API server when making a request.\
The API server shall verify that the user represented by the JWT is authorized to access the requested API endpoint. The server shall return a 403 status if the user is not authorized to access the endpoint.
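A sketch of what the server-side check might look like, assuming `flask_jwt_extended` (the docs don't name a JWT library) and a `role` claim placed in the token at login:

```python
from functools import wraps

from flask_jwt_extended import get_jwt, verify_jwt_in_request

ROLE_RANK = {"user": 1, "editor": 2, "admin": 3}

def role_required(minimum_role: str):
    """Reject the request with 403 unless the JWT's role is high enough."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            verify_jwt_in_request()  # rejects a missing or invalid JWT
            role = get_jwt().get("role", "user")  # assumed claim name
            if ROLE_RANK.get(role, 0) < ROLE_RANK[minimum_role]:
                return {"message": "forbidden"}, 403
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

An Admin endpoint such as `/api/admin/user/create` would then be decorated with `@role_required("admin")`.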
### Implementation

User roles are stored in the database `pdp_user_roles` table, and per-user data is stored in the `pdp_users` table.
### API

**No authorization required**

| Endpoint | Description |
| --------------------- | --------------------------------- |
| `/api/user/test` | Liveness test, always returns 200 |
| `/api/user/test_fail` | Always fails with 401 |
| `/api/user/login` | Login |

**Valid JWT required**

| Endpoint | Description |
| --------------------- | ------------------------------------------- |
| `/api/user/test_auth` | Returns 200 if valid JWT presented |
| `/api/user/logout` | Logout (optional, as client can delete JWT) |

**Admin role required**

| Endpoint | Description |
| -------------------------------- | ------------------------------ |
| `/api/admin/user/create` | Create user |
| `/api/admin/user/get_user_count` | Get count of all users in DB |
| `/api/admin/user/get_users` | Get list of users with details |
[1] _We need to decide on a lifetime that provides an appropriate balance between convenience and security. An expired Access token will require the user to log in again. There is a Refresh-type token that allows automatic renewal of Access tokens without requiring the user to log in, but the power of this kind of token poses additional security concerns._

deployment/README.md

Lines changed: 3 additions & 0 deletions
# Deployment

This section contains deployment instructions for the PAWS data pipeline project.
deployment/deploying-pdp-within-the-code-for-philly-cluster.md

Lines changed: 37 additions & 0 deletions
# Deploying PDP within the Code for Philly cluster

## PDP hosting

The PAWS Data Pipeline runs on a Kubernetes cluster donated by [Linode](https://www.linode.com) to the Code for Philly (CfP) project and is managed by the CfP [civic-cloud](https://forum.codeforphilly.org/c/public-development/civic-cloud/17) team.
The code and configurations for the various projects running on the cluster are managed using [hologit](https://github.com/JarvusInnovations/hologit), which

> _lets you declaratively define virtual sub-branches (called holobranches) within any Git branch that mix together content from their host branch, content from other repositories/branches, and executable-driven transformations._ [1]

The pieces for the sandbox clusters can be found in the `.holo` directory in the PDP repository and the [sandbox](https://github.com/CodeForPhilly/cfp-sandbox-cluster) or [live](https://github.com/CodeForPhilly/cfp-live-cluster) cluster repos, as appropriate.

The branch (within the PDP repo) that holds the `.holo` directory is specified at [paws-data-pipeline.toml](https://github.com/CodeForPhilly/cfp-sandbox-cluster/blob/main/.holo/sources/paws-data-pipeline.toml).

RBAC roles and rights are defined at [admins](https://github.com/CodeForPhilly/cfp-sandbox-cluster/blob/main/admins/paws-data-pipeline.yaml).
### Updating deployed code

To deploy new code:

* Bump the image tag versions in **paws-data-pipeline/src/helm-chart/values.yaml** to the value you'll use for this deployment (e.g. v2.3.4)
* Commit to master, tag with the above value, and push to GitHub with `--follow-tags`
* Open a PR against [cfp-sandbox-cluster/.holo/sources/paws-data-pipeline.toml](https://github.com/CodeForPhilly/cfp-sandbox-cluster/blob/main/.holo/sources/paws-data-pipeline.toml) setting `ref = "refs/tags/v2.3.4"`
* The sysadmin folks hang out at [https://forum.codeforphilly.org/c/project-support-center/sysadmin/20](https://forum.codeforphilly.org/c/project-support-center/sysadmin/20), and you can ask for help there
### Ingress controller

CfP uses the [ingress-nginx](https://kubernetes.github.io/ingress-nginx) ingress controller (_not to be confused with an entirely different project called **nginx-ingress**_).

The list of settings can be found here: [Settings](https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/)\
To update settings, edit [release-values.yaml](https://github.com/CodeForPhilly/cfp-sandbox-cluster/blob/main/paws-data-pipeline/release-values.yaml) and create a pull request.

SSL cert configuration can also be found in [release-values.yaml](https://github.com/CodeForPhilly/cfp-sandbox-cluster/blob/main/paws-data-pipeline/release-values.yaml).

1. _"Any sufficiently advanced technology is indistinguishable from magic."_ (Arthur C. Clarke)

deployment/kubernetes-logs.md

Lines changed: 8 additions & 0 deletions
# Kubernetes logs

Database logs are visible by attaching to paws-datapipeline-db and viewing `/var/lib/postgresql/data/log/`.

Since Kubernetes performs liveness tests, there are a lot of test lines in the logs, which you'll want to filter out:

* On paws-datapipeline-server, filter out lines matching `/api/user/test`
* On paws-datapipeline-client, filter out lines matching `GET /`

deployment/using-github-actions.md

Lines changed: 9 additions & 0 deletions
# Using GitHub actions

To run the CI/CD action:

* Ensure you have `release-containers.yml` in `/paws-data-pipeline/.github/workflows`
* Tag your code: `git tag -fa v1.4 -m "Still testing Actions"`
* Push with `git push -f --tags`

Check the [Actions](https://github.com/CodeForPhilly/paws-data-pipeline/actions) page to see the progress.

setup/README.md

Lines changed: 3 additions & 0 deletions
# Setup

This section contains setup instructions for the PAWS data pipeline project.

setup/accessing-apis-without-react.md

Lines changed: 35 additions & 0 deletions
# Accessing APIs without React

As of [c863c77](https://github.com/CodeForPhilly/paws-data-pipeline/commit/c863c77cfb79901f65936a851834ec298aec5ec1), a valid JWT is needed to access API endpoints. If you don't want to use the normal route (React), there are a few options:

* Programmatically through JS: See examples in `Login.js`, `Admin.js`, and `Search.js`
* Programmatically through Python: See `/server/test_api.py` (a minimal sketch follows this list)
* Using Postman
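For the Python route, here is a minimal `requests` sketch (see `/server/test_api.py` for the project's real examples; the base URL and the credential field names are placeholders):

```python
import requests

BASE = "http://localhost:5000"  # assumed local server address

# Log in to obtain a JWT; the payload field names are assumptions.
login = requests.post(
    f"{BASE}/api/user/login",
    json={"username": "base_user", "password": "..."},
)
login.raise_for_status()
token = login.json()["access_token"]

# Present the JWT via the Authorization: Bearer header on later calls.
resp = requests.get(
    f"{BASE}/api/user/test_auth",
    headers={"Authorization": f"Bearer {token}"},
)
print(resp.status_code)  # 200 if the JWT is valid
```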
### Using Postman

To use Postman:

* Get a valid JWT, which is returned by `/api/user/login`
  * You can do this through Postman or by capturing the returned `access_token` value using browser devtools

![Postman_login](https://user-images.githubusercontent.com/11001850/114760059-f0dbf180-9d2c-11eb-83d9-27ea69ceaa66.png)

* Tell Postman to use that value for API calls
  * Copy the value
  * Edit the collection (three dots that appear when hovering to the right of the collection name, then Edit)

![Postman_view_more](https://user-images.githubusercontent.com/11001850/114760490-592ad300-9d2d-11eb-935b-2a67220e903c.png)

  * Choose Authorization, Type: Bearer Token
  * Paste the value into the Token field and save

![Postman_token](https://user-images.githubusercontent.com/11001850/114760547-69db4900-9d2d-11eb-8e2c-779060b81205.png)

Start issuing API calls.
### Error codes

401 - Bad login credentials\
403 - Tried to access an Admin endpoint with user-level credentials\
422 - JWT value was corrupted/failed validation
