Added pseudo-code for dataset merge functions
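
A minimal sketch of the matching rule described in the `fuzzy_merge` pseudo-code in this commit, offered as one possible reading rather than the planned implementation. It assumes plain dicts stand in for DataFrame rows so it runs with only the standard library, uses `difflib.SequenceMatcher` as the name-similarity score, and uses hypothetical field names (`name`, `zip_code`, `email`, `phone`) and a hypothetical `id_field` argument that do not come from the PAWS data.

```python
from difflib import SequenceMatcher

# Hypothetical field names; the real column names in the PAWS sources may differ.
CONTACT_FIELDS = ("zip_code", "email", "phone")


def name_score(a, b):
    """Return a 0-100 similarity score for two names, ignoring case and outer spaces."""
    return 100 * SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_match(new_row, master_row, threshold=90):
    """Match when the name score reaches the threshold AND a contact field agrees."""
    if name_score(new_row["name"], master_row["name"]) < threshold:
        return False
    # Empty values are skipped so two blank emails do not count as a match.
    return any(
        new_row.get(field) and new_row.get(field) == master_row.get(field)
        for field in CONTACT_FIELDS
    )


def fuzzy_merge(new_rows, master_rows, id_field):
    """Link each new row to a matching master row, or append it as a new record."""
    merged = [dict(row) for row in master_rows]  # copy so the input is not mutated
    for new_row in new_rows:
        for master_row in merged:
            if is_match(new_row, master_row):
                # Record the new dataset's ID on the existing master record.
                master_row[id_field] = new_row[id_field]
                break
        else:
            # No master record matched: the new row becomes its own record.
            merged.append(dict(new_row))
    return merged
```

This keeps the line-by-line comparison from the pseudo-code, so it is quadratic in the number of records; blocking on zip code or email before scoring names would be one way to cut the number of comparisons.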

dtromero · dtromero · commit 853142895e20 · 2020-01-06T16:46:28.000-05:00
diff --git a/load_paws_data.ipynb b/load_paws_data.ipynb
@@ -1,5 +1,29 @@
 {
  "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# PAWS Data Pipeline\n",
+    "The objective of this script is to create a master data table that links all the PAWS data sources together.\n",
+    "## Pipeline sections\n",
+    "0. Import libraries\n",
+    "1. Create & populate database \n",
+    "2. Create ***metadata master table*** schema to link all source tables together & populate with one of the datasets (e.g. Salesforce)\n",
+    "3. For each dataset, merge each record with the ***metadata master table***. If a match is found, link the two sources. If not, create a new record. <br/>\n",
+    "    a. Petpoint<br/>\n",
+    "    b. Volgistics<br/>\n",
+    "    c. Other - TBD<br/>\n",
+    "4. Write the new table to the database"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 0. Import libraries"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -11,6 +35,13 @@
     "import re"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 1. Create & populate database "
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -94,6 +125,119 @@
     "load_to_sqlite('./sample_data/CfP_PDP_salesforceDonations_deidentified.csv', 'salesforcedonations', conn)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 2. Create ***metadata master table*** schema to link all source tables together & populate with one of the datasets (e.g. Salesforce)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def create_user_master_df():\n",
+    "    \"\"\"\n",
+    "    Creates a pandas dataframe placeholder with key meta-data to fuzzy-match\n",
+    "    the users from different datasets.\n",
+    "    \n",
+    "    Pseudo-code:\n",
+    "        Create a blank pandas dataframe (e.g. pd.DataFrame) with columns for\n",
+    "        Name (first, last), address, zip code, phone number, email, etc.\n",
+    "        \n",
+    "        Include \"ID\" fields for each of the datasets that will be merged.\n",
+    "        \n",
+    "        Populate/Initialize the dataframe with data from one of the datasets\n",
+    "        (e.g. Salesforce)\n",
+    "    \"\"\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 3. For each dataset, merge each record with the ***metadata master table***\n",
+    "If a match is found, link the two sources. If not, create a new record. <br/>"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def fuzzy_merge(new_df, master_df):\n",
+    "    \"\"\"\n",
+    "    Merges a new dataset into the metadata master table by going line-by-line\n",
+    "    through the new dataset and looking for a match in the master table. A match\n",
+    "    links the two records; a non-match is added to the master table as a new record.\n",
+    "    \n",
+    "    Pseudo-code:\n",
+    "        LOOP: For each line in the new_df, compare that line against all lines in \n",
+    "        the master_df. \n",
+    "        \n",
+    "        LOGIC: For each comparison, generate (a) a fuzzy-match score on name,\n",
+    "        (b) T/F on whether zip-code matches, (c) T/F on whether email matches,\n",
+    "        (d) T/F on whether phone number matches.\n",
+    "        \n",
+    "        OUTPUT: For each comparison, if the fuzzy-match score meets a threshold (e.g. >=90%)\n",
+    "        and at least one of (b), (c) or (d) matches, consider it a match and add the new\n",
+    "        dataset id to the existing record. If no row in the master dataset matches, create\n",
+    "        a new record in the master dataset.\n",
+    "        \n",
+    "    Note: there's probably a more efficient way to do this (vs. going line-by-line)\n",
+    "    \"\"\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### 3.A Petpoint merge \n",
+    "Apply the function above to the Petpoint dataset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### 3.B Volgistics merge\n",
+    "Apply the function above to the Volgistics dataset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### 3.C Other - TBD - Merge"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 4. Write the new table to the database"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# load_to_sqlite(master_df, master_table, conn)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Other - placeholder - graveyard\n",
+    "Graveyard/placeholder code from previous sections"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -162,7 +306,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.7.3"
+   "version": "3.7.4"
   }
  },
  "nbformat": 4,