|
1 | 1 | {
|
2 | 2 | "cells": [
|
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# PAWS Data Pipeline\n", |
| 8 | + "The objective of this script is to create a master data table that links all the PAWS data sources together.\n", |
| 9 | + "## Pipeline sections\n", |
| 10 | + "0. Import libraries\n", |
| 11 | + "1. Create & populate database \n", |
| 12 | + "2. Create ***metadata master table*** schema to link all source tables together & populate with one of the datasets (e.g. SalesForce)\n", |
| 13 | + "3. For each dataset, merge each record with the ***metadata master table***. If a match is found, link the two sources. If not, create a new record. <br/>\n", |
| 14 | + " a. Petpoint<br/>\n", |
| 15 | + " b. Volgistics<br/>\n", |
| 16 | + " c. Other - TBD<br/>\n", |
| 17 | + "4. Write the new table to the database" |
| 18 | + ] |
| 19 | + }, |
| 20 | + { |
| 21 | + "cell_type": "markdown", |
| 22 | + "metadata": {}, |
| 23 | + "source": [ |
| 24 | + "### 0. Import libraries" |
| 25 | + ] |
| 26 | + }, |
3 | 27 | {
|
4 | 28 | "cell_type": "code",
|
5 | 29 | "execution_count": null,
|
|
11 | 35 | "import re"
|
12 | 36 | ]
|
13 | 37 | },
|
| 38 | + { |
| 39 | + "cell_type": "markdown", |
| 40 | + "metadata": {}, |
| 41 | + "source": [ |
| 42 | + "### 1. Create & populate database " |
| 43 | + ] |
| 44 | + }, |
14 | 45 | {
|
15 | 46 | "cell_type": "code",
|
16 | 47 | "execution_count": null,
|
|
94 | 125 | "load_to_sqlite('./sample_data/CfP_PDP_salesforceDonations_deidentified.csv', 'salesforcedonations', conn)"
|
95 | 126 | ]
|
96 | 127 | },
|
| 128 | + { |
| 129 | + "cell_type": "markdown", |
| 130 | + "metadata": {}, |
| 131 | + "source": [ |
| 132 | + "### 2. Create ***metadata master table*** schema to link all source tables together & populate with one of the datasets (e.g. SalesForce)" |
| 133 | + ] |
| 134 | + }, |
| 135 | + { |
| 136 | + "cell_type": "code", |
| 137 | + "execution_count": null, |
| 138 | + "metadata": {}, |
| 139 | + "outputs": [], |
| 140 | + "source": [ |
| 141 | + "def create_user_master_df():\n", |
| 142 | + " \"\"\"\n", |
| 143 | + " Creates a pandas dataframe placeholder with key meta-data to fuzzy-match\n", |
| 144 | + " the users from different datasets.\n", |
| 145 | + " \n", |
| 146 | + " Pseudo-code:\n", |
| 147 | + " Create a blank pandas dataframe (e.g. pd.DataFrame) with columns for\n", |
| 148 | + " Name (first, last), address, zip code, phone number, email, etc.\n", |
| 149 | + " \n", |
| 150 | + " Include \"ID\" fields for each of the datasets that will be merged.\n", |
| 151 | + " \n", |
| 152 | + " Populate/Initialize the dataframe with data from one of the datasets\n", |
| 153 | + " (e.g. Salesforce)\n", |
| 154 | + " \"\"\"" |
| 155 | + ] |
| 156 | + }, |
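The pseudo-code in `create_user_master_df` could be fleshed out roughly as follows. This is a minimal sketch, not the notebook's implementation: the column names in `MASTER_COLUMNS`, the optional `seed_df` and `column_map` parameters, and the Salesforce-style source column names in the usage example are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical schema: contact fields used for matching, plus one ID column
# per source dataset. The notebook does not fix these names.
MASTER_COLUMNS = [
    "first_name", "last_name", "address", "zip_code", "phone", "email",
    "salesforce_id", "petpoint_id", "volgistics_id",
]


def create_user_master_df(seed_df=None, column_map=None):
    """Return an empty master table, or one seeded from `seed_df`.

    `column_map` (an assumption of this sketch) maps column names in
    `seed_df` to names in MASTER_COLUMNS.
    """
    if seed_df is None:
        return pd.DataFrame(columns=MASTER_COLUMNS)
    renamed = seed_df.rename(columns=column_map or {})
    # reindex keeps only the master columns and fills absent ones with NaN
    return renamed.reindex(columns=MASTER_COLUMNS).reset_index(drop=True)


# Usage with made-up source column names
seed = pd.DataFrame(
    {"Id": ["sf1"], "FirstName": ["Ann"], "LastName": ["Lee"], "Email": ["ann@example.org"]}
)
master_df = create_user_master_df(
    seed,
    {"Id": "salesforce_id", "FirstName": "first_name",
     "LastName": "last_name", "Email": "email"},
)
```

Seeding through `reindex` leaves the ID columns of the other datasets empty, so later merges can fill them in as matches are found.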
| 157 | + { |
| 158 | + "cell_type": "markdown", |
| 159 | + "metadata": {}, |
| 160 | + "source": [ |
| 161 | + "### 3. For each dataset, merge each record with the ***metadata master table***\n", |
| 162 | + "If a match is found, link the two sources. If not, create a new record. <br/>" |
| 163 | + ] |
| 164 | + }, |
| 165 | + { |
| 166 | + "cell_type": "code", |
| 167 | + "execution_count": null, |
| 168 | + "metadata": {}, |
| 169 | + "outputs": [], |
| 170 | + "source": [ |
| 171 | + "def fuzzy_merge(new_df, master_df):\n", |
| 172 | + " \"\"\"\n", |
| 173 | + " This function merges each new dataset with the metadata master table by\n", |
| 174 | + " going line-by-line on the new dataset and looking for a match in the \n", |
| 175 | + " existing metadata master dataset. If a match is found, the two records are linked; if not, a new record is created.\n", |
| 176 | + " \n", |
| 177 | + " Pseudo-code:\n", |
| 178 | + " LOOP: For each line in the new_df, compare that line against all lines in \n", |
| 179 | + " the master_df. \n", |
| 180 | + " \n", |
| 181 | + " LOGIC: For each comparison, generate (a) a fuzzy-match score on name,\n", |
| 182 | + " (b) T/F on whether zip-code matches, (c) T/F on whether email matches,\n", |
| 183 | + " (d) T/F on whether phone number matches.\n", |
| 184 | + " \n", |
| 185 | + " OUTPUT: For each comparison, if the fuzzy-match score meets a threshold (e.g. >=90%)\n", |
| 186 | + " and at least one of (b), (c) or (d) matches, consider it a match and add the new dataset \n", |
| 187 | + " id to the existing record. If it doesn't match, create a new record in the\n", |
| 188 | + " master dataset.\n", |
| 189 | + " \n", |
| 190 | + " Note: there's probably a more efficient way to do this (vs. going line-by-line)\n", |
| 191 | + " \"\"\"" |
| 192 | + ] |
| 193 | + }, |
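The LOOP / LOGIC / OUTPUT steps in the `fuzzy_merge` docstring might look like the sketch below. To keep the logic visible it works on plain lists of dictionaries (for example, what `DataFrame.to_dict("records")` returns) rather than on DataFrames, and it uses the standard library's `difflib.SequenceMatcher` as a stand-in for whichever fuzzy-matching library the pipeline ends up using. The field names (`name`, `zip_code`, `email`, `phone`), the `id_field` parameter, and the helpers `name_score`, `same`, and `is_match` are assumptions of this sketch.

```python
from difflib import SequenceMatcher

NAME_THRESHOLD = 0.90  # the ">=90%" example threshold from the docstring


def name_score(a, b):
    """Similarity in [0, 1] between two names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def same(a, b):
    """True when both values are present and equal after trimming and lowercasing."""
    if not a or not b:
        return False
    return str(a).strip().lower() == str(b).strip().lower()


def is_match(new_row, master_row, threshold=NAME_THRESHOLD):
    """Name score meets the threshold AND zip code, email, or phone agrees."""
    score = name_score(new_row.get("name", ""), master_row.get("name", ""))
    exact = any(
        same(new_row.get(key), master_row.get(key))
        for key in ("zip_code", "email", "phone")
    )
    return score >= threshold and exact


def fuzzy_merge(new_rows, master_rows, id_field):
    """Link each new row to the first matching master row, else append it.

    Mutates and returns `master_rows`. New rows are compared only against
    the records that were in the master list before this merge started.
    """
    candidates = list(master_rows)
    for new_row in new_rows:
        for master_row in candidates:
            if is_match(new_row, master_row):
                master_row[id_field] = new_row[id_field]
                break
        else:  # no candidate matched: create a new master record
            master_rows.append(dict(new_row))
    return master_rows


# Usage with made-up records
master = [{"name": "Jane Doe", "zip_code": "19104", "email": "jane@example.org",
           "phone": "", "salesforce_id": "sf1"}]
petpoint = [
    {"name": "Jane Do", "zip_code": "", "email": "JANE@example.org",
     "phone": "", "petpoint_id": "pp1"},
    {"name": "John Smith", "zip_code": "19104", "email": "john@example.org",
     "phone": "", "petpoint_id": "pp2"},
]
merged = fuzzy_merge(petpoint, master, "petpoint_id")
```

This is the line-by-line approach the docstring describes, so it compares every new record with every master record. Grouping candidates by zip code or email before scoring names would be one way to cut down the number of comparisons.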
| 194 | + { |
| 195 | + "cell_type": "markdown", |
| 196 | + "metadata": {}, |
| 197 | + "source": [ |
| 198 | + "#### 3.A Petpoint merge \n", |
| 199 | + "Apply the function above to the Petpoint dataset" |
| 200 | + ] |
| 201 | + }, |
| 202 | + { |
| 203 | + "cell_type": "markdown", |
| 204 | + "metadata": {}, |
| 205 | + "source": [ |
| 206 | + "#### 3.B Volgistics merge\n", |
| 207 | + "Apply the function above to the Volgistics dataset" |
| 208 | + ] |
| 209 | + }, |
| 210 | + { |
| 211 | + "cell_type": "markdown", |
| 212 | + "metadata": {}, |
| 213 | + "source": [ |
| 214 | + "#### 3.C Other - TBD - Merge" |
| 215 | + ] |
| 216 | + }, |
| 217 | + { |
| 218 | + "cell_type": "markdown", |
| 219 | + "metadata": {}, |
| 220 | + "source": [ |
| 221 | + "### 4. Write the new table to the database" |
| 222 | + ] |
| 223 | + }, |
| 224 | + { |
| 225 | + "cell_type": "code", |
| 226 | + "execution_count": 4, |
| 227 | + "metadata": {}, |
| 228 | + "outputs": [], |
| 229 | + "source": [ |
| 230 | + "# load_to_sqlite(master_df, master_table, conn)" |
| 231 | + ] |
| 232 | + }, |
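Earlier in the notebook `load_to_sqlite` is called with a CSV path, so the commented-out call above may need a variant that accepts a DataFrame. One possible shape, assuming the master table is a pandas DataFrame and `conn` is a `sqlite3` connection, is sketched here. The helper name `write_master_table` and the table name `master_table` are hypothetical.

```python
import sqlite3

import pandas as pd


def write_master_table(master_df, table_name, conn):
    """Write `master_df` to `table_name`, replacing any existing table of that name."""
    master_df.to_sql(table_name, conn, if_exists="replace", index=False)


# Usage against an in-memory database with a made-up one-row master table
conn = sqlite3.connect(":memory:")
master_df = pd.DataFrame({"first_name": ["Ann"], "salesforce_id": ["sf1"]})
write_master_table(master_df, "master_table", conn)
rows = conn.execute("SELECT first_name, salesforce_id FROM master_table").fetchall()
```

Using `if_exists="replace"` rebuilds the table on every run, which suits a pipeline that regenerates the master table from scratch. Switching to `"append"` would instead add rows to an existing table.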
| 233 | + { |
| 234 | + "cell_type": "markdown", |
| 235 | + "metadata": {}, |
| 236 | + "source": [ |
| 237 | + "## Other - placeholder - graveyard\n", |
| 238 | + "Graveyard/placeholder code from previous sections" |
| 239 | + ] |
| 240 | + }, |
97 | 241 | {
|
98 | 242 | "cell_type": "code",
|
99 | 243 | "execution_count": null,
|
|
162 | 306 | "name": "python",
|
163 | 307 | "nbconvert_exporter": "python",
|
164 | 308 | "pygments_lexer": "ipython3",
|
165 |
| - "version": "3.7.3" |
| 309 | + "version": "3.7.4" |
166 | 310 | }
|
167 | 311 | },
|
168 | 312 | "nbformat": 4,
|
|