upload/match pipeline refactor #480
Merged
This PR changes the file upload pipeline to shovel raw-ish data directly to the database, rather than writing it out to intermediate csvs. We then use the raw data tables to implement a faster, simpler match that is executed almost entirely by postgres.
Recall that in the existing pipeline, we:

- write each uploaded file out to an intermediate csv.

Then, after the user clicks an "execute" button, we:

- read those csvs, along with the current pdp_contacts rows, back into pandas,
- look up normalization instructions for each data source from dictionaries keyed on source name, and
- run the match in pandas.
There are several issues with this. Most significantly, the match is very slow when adding a large number of rows. Also, care needs to be taken to make sure that pandas picks the same datatypes when reading from the database as it does when reading from the csv. Finally, this implementation contains (in my opinion) excessive indirection; we have multiple dictionary objects keyed on data source name from which various bits of normalization instructions are looked up, which makes it difficult for a noob to quickly answer questions like "what exactly happens to a salesforce contacts file when it's uploaded?"
In this implementation, we:

- insert the (relatively) raw rows from each uploaded file directly into a source-specific table.

When the user clicks "execute", we:

- normalize the rows from each source table into pdp_contacts, and
- run the match as a join inside postgres (a sketch of the kind of query this enables is below).
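To make "executed almost entirely by postgres" concrete, the match step can now be a single set-based statement along these lines. This is only a sketch: the connection string, the `matched_id`/`email`/`mobile`/`last_name` columns, and the match keys are illustrative assumptions, not the actual schema or match rules in this PR.

```python
# Sketch only: a set-based match run inside postgres rather than row-by-row in pandas.
# Column names and match keys below are assumptions for illustration.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/paws")  # hypothetical URL

MATCH_UNMATCHED_ROWS = text("""
    UPDATE pdp_contacts AS incoming
    SET matched_id = existing.matched_id
    FROM pdp_contacts AS existing
    WHERE incoming.matched_id IS NULL
      AND existing.matched_id IS NOT NULL
      AND (
            lower(incoming.email) = lower(existing.email)
         OR (incoming.mobile = existing.mobile
             AND lower(incoming.last_name) = lower(existing.last_name))
      )
""")

with engine.begin() as conn:
    conn.execute(MATCH_UNMATCHED_ROWS)
```

With indexes on the joined columns (or expression indexes on the lowered values), postgres can resolve this join without the per-row work the pandas match had to do.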
This addresses the aforementioned issues. First of all, the match is faster because it can use indexes on pdp_contacts to speed up the join. Second of all, we only need to think about types in one place: when inserting the raw data. Finally, all "interesting" information about, say, salesforcecontacts data, is stored in a SalesForceContacts class. That class knows about the database columns, the source column names, and how to normalize data as it comes in from the file and goes out to pdp_contacts. New processing can be added for a single file type without having to touch code that would affect the processing of other file types.
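For a sense of the shape this takes, here is a minimal sketch of what such a class might look like (plain SQLAlchemy for illustration). The specific column names, spreadsheet headers, and helper methods are assumptions made for the example, not the exact code in this PR.

```python
# Illustrative sketch of a per-source class that owns its table definition,
# its source column names, and its normalization logic. Names are assumed.
import pandas as pd
from sqlalchemy import Column, Integer, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class SalesForceContacts(Base):
    __tablename__ = "salesforcecontacts"

    _id = Column(Integer, primary_key=True)
    contact_id = Column(Text)
    first_name = Column(Text)
    last_name = Column(Text)
    email = Column(Text)

    # How headers in the uploaded file map onto the columns above (assumed headers).
    column_mapping = {
        "Contact ID": "contact_id",
        "First Name": "first_name",
        "Last Name": "last_name",
        "Email": "email",
    }

    @classmethod
    def clean_raw(cls, raw_df: pd.DataFrame) -> pd.DataFrame:
        """Normalize a freshly uploaded file before inserting it into this table."""
        df = raw_df.rename(columns=cls.column_mapping)[list(cls.column_mapping.values())]
        df["email"] = df["email"].str.strip().str.lower()
        return df

    @classmethod
    def pdp_contacts_select(cls) -> str:
        """SQL projecting this table's rows into the pdp_contacts shape at execute time."""
        return (
            "SELECT 'salesforcecontacts' AS source_type, contact_id AS source_id, "
            "first_name, last_name, email FROM salesforcecontacts"
        )
```

Adding a new file type then means adding one more class of this shape, without touching the classes for the existing sources.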
One further nice effect of this design is that we now store the history of the (relatively) raw data in each source-specific table, rather than storing the normalized histories in pdp_contacts data. This would (theoretically) make it easier to run studies on historical data from one source.
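As a toy example of the kind of per-source study this enables, assuming each raw row carries something like a created_date timestamp (an assumed detail, not necessarily the column name used here):

```python
# Sketch: query the raw salesforcecontacts history directly, without going
# through pdp_contacts. The created_date column is an assumed detail.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/paws")  # hypothetical URL

ROWS_PER_MONTH = text("""
    SELECT date_trunc('month', created_date) AS upload_month, count(*) AS n_rows
    FROM salesforcecontacts
    GROUP BY 1
    ORDER BY 1
""")

with engine.connect() as conn:
    for month, n_rows in conn.execute(ROWS_PER_MONTH):
        print(month, n_rows)
```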