
Commit 8531428

committed
Added pseudo-code for dataset merge functions
1 parent 4f01ab3 commit 8531428

File tree

1 file changed, +145 −1 lines changed


load_paws_data.ipynb

Lines changed: 145 additions & 1 deletion
@@ -1,5 +1,29 @@
 {
  "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# PAWS Data Pipeline\n",
+    "The objective of this script is to create a master data table that links all the PAWS data sources together.\n",
+    "## Pipeline sections\n",
+    "0. Import libraries\n",
+    "1. Create & populate database \n",
+    "2. Create ***metadata master table*** schema to link all source tables together & populate it with one of the datasets (e.g. Salesforce)\n",
+    "3. For each dataset, merge each record with the ***metadata master table***. If a match is found, link the two sources. If not, create a new record. <br/>\n",
+    " a. Petpoint<br/>\n",
+    " b. Volgistics<br/>\n",
+    " c. Other - TBD<br/>\n",
+    "4. Write the new table to the database"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 0. Import libraries"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -11,6 +35,13 @@
     "import re"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 1. Create & populate database "
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -94,6 +125,119 @@
     "load_to_sqlite('./sample_data/CfP_PDP_salesforceDonations_deidentified.csv', 'salesforcedonations', conn)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 2. Create ***metadata master table*** schema to link all source tables together & populate it with one of the datasets (e.g. Salesforce)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def create_user_master_df():\n",
+    "    \"\"\"\n",
+    "    Creates a pandas dataframe placeholder with the key metadata needed to\n",
+    "    fuzzy-match users across the different datasets.\n",
+    "    \n",
+    "    Pseudo-code:\n",
+    "    Create a blank pandas dataframe (e.g. pd.DataFrame) with columns for\n",
+    "    name (first, last), address, zip code, phone number, email, etc.\n",
+    "    \n",
+    "    Include \"ID\" fields for each of the datasets that will be merged.\n",
+    "    \n",
+    "    Populate/initialize the dataframe with data from one of the datasets\n",
+    "    (e.g. Salesforce).\n",
+    "    \"\"\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 3. For each dataset, merge each record with the ***metadata master table***\n",
+    "If a match is found, link the two sources. If not, create a new record. <br/>"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def fuzzy_merge(new_df, master_df):\n",
+    "    \"\"\"\n",
+    "    Merges each new dataset with the metadata master table by going\n",
+    "    line-by-line through the new dataset and looking for a match in the\n",
+    "    existing master dataset. If a match is found, link them; if not, create a new record.\n",
+    "    \n",
+    "    Pseudo-code:\n",
+    "    LOOP: For each line in the new_df, compare that line against all lines in\n",
+    "    the master_df.\n",
+    "    \n",
+    "    LOGIC: For each comparison, generate (a) a fuzzy-match score on name,\n",
+    "    (b) T/F on whether zip code matches, (c) T/F on whether email matches,\n",
+    "    (d) T/F on whether phone number matches.\n",
+    "    \n",
+    "    OUTPUT: For each comparison, if the fuzzy-match score is above a threshold\n",
+    "    (e.g. >= 90%) and any of (b), (c), or (d) matches, consider it a match and\n",
+    "    add the new dataset id to the existing record. If it doesn't match, create\n",
+    "    a new record in the master dataset.\n",
+    "    \n",
+    "    Note: there's probably a more efficient way to do this (vs. going line-by-line).\n",
+    "    \"\"\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### 3.A Petpoint merge\n",
+    "Apply the function above to the Petpoint dataset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### 3.B Volgistics merge\n",
+    "Apply the function above to the Volgistics dataset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### 3.C Other - TBD - Merge"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 4. Write the new table to the database"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# load_to_sqlite(master_df, master_table, conn)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Other - placeholder - graveyard\n",
+    "Graveyard/placeholder code from previous sections"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -162,7 +306,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.7.3"
+   "version": "3.7.4"
   }
  },
  "nbformat": 4,
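The `create_user_master_df` pseudocode in the diff could be sketched as runnable Python along these lines. This is a minimal illustration only: the column names and the per-dataset ID fields (`salesforce_id`, `petpoint_id`, `volgistics_id`), as well as the `seed_df` parameter, are assumptions, not the notebook's actual schema.

```python
import pandas as pd

def create_user_master_df(seed_df=None):
    """Create the metadata master table, optionally seeded from one dataset.

    Column names and the seed_df parameter are illustrative assumptions.
    """
    columns = [
        # key metadata used for fuzzy matching
        "first_name", "last_name", "address", "zip_code", "phone", "email",
        # one ID column per dataset to be merged
        "salesforce_id", "petpoint_id", "volgistics_id",
    ]
    master_df = pd.DataFrame(columns=columns)
    if seed_df is not None:
        # initialize from one dataset (e.g. Salesforce);
        # columns absent from the seed become NaN
        master_df = seed_df.reindex(columns=columns)
    return master_df
```

`reindex(columns=...)` keeps only the master schema's columns and fills the missing ones, so any seed dataset ends up in the same shape as the master table.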
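Likewise, the `fuzzy_merge` pseudocode could be sketched as below. The notebook doesn't name a matching library, so this uses stdlib `difflib` for the name score; the column names (`name`, `zip_code`, `email`, `phone`) and the `id_column` parameter are assumptions for illustration. It is deliberately the O(n×m) line-by-line approach the docstring describes, including its note that a more efficient approach likely exists.

```python
import pandas as pd
from difflib import SequenceMatcher

def name_score(a, b):
    """Fuzzy-match score between two names, on a 0-100 scale."""
    return 100 * SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

def fuzzy_merge(new_df, master_df, id_column, threshold=90):
    """Merge new_df into master_df line-by-line (sketch; columns are assumed).

    A row matches if its name score >= threshold AND at least one of
    zip_code/email/phone matches; otherwise it is appended as a new record.
    """
    master_df = master_df.copy()
    for _, row in new_df.iterrows():
        matched = False
        for idx, cand in master_df.iterrows():
            name_ok = name_score(row["name"], cand["name"]) >= threshold
            field_ok = any(
                pd.notna(row[f]) and row[f] == cand[f]
                for f in ("zip_code", "email", "phone")
            )
            if name_ok and field_ok:
                # link: record the new dataset's ID on the existing row
                master_df.loc[idx, id_column] = row[id_column]
                matched = True
                break
        if not matched:
            # no match: append the row as a new master record
            master_df = pd.concat(
                [master_df, row.to_frame().T], ignore_index=True
            )
    return master_df
```

The 90% threshold mirrors the docstring's example; in practice it would need tuning against real PAWS data, and the nested loop would likely be replaced by blocking (e.g. comparing only within matching zip codes).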
