
upload/match pipeline refactor #480

Merged: 7 commits into CodeForPhilly:master on Jun 30, 2022
Conversation

@carlos-dominguez (Collaborator) commented Feb 8, 2022

This PR changes the file upload pipeline to shovel raw-ish data directly to the database, rather than writing it out to intermediate csvs. We then use the raw data tables to implement a faster, simpler match that is executed almost entirely by postgres.

Recall that in the existing pipeline, we:

  1. get a file from the user
  2. inspect it to "guess" what the data source is (volgistics, salesforce, etc.)
  3. convert the data to a csv, encoding our guess into the filename (e.g. "volgistics-[...].csv")

Then, after the user clicks an "execute" button, we:

  1. read the csv
  2. use our "guess" to look up rules for how to massage the data to fit into pdp_contacts
  3. load the data from pdp_contacts
  4. diff the csv data against the pdp_contacts data to see what's new/old/changed (using pandas)
  5. loop through pdp_contacts and attempt to re-match contacts that might have a new match (using pandas)
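
To make the cost concrete, steps 4–5 amount to something like the following. This is an illustrative sketch, not the repo's actual code; the connection URL and column names like source_id and first_name are placeholders:

```python
# Illustrative sketch of the old pandas-based diff; all names are placeholders,
# not the repo's actual schema.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@localhost/paws")  # placeholder URL

incoming = pd.read_csv("volgistics-upload.csv", dtype=str)
# Types must be coerced so the csv-derived and database-derived frames compare
# equal; this is the kind of care the old pipeline demands.
existing = pd.read_sql("SELECT * FROM pdp_contacts", engine).astype(str)

merged = incoming.merge(existing, on="source_id", how="outer",
                        suffixes=("_new", "_old"), indicator=True)
new_rows = merged[merged["_merge"] == "left_only"]
dropped_rows = merged[merged["_merge"] == "right_only"]
changed_rows = merged[(merged["_merge"] == "both") &
                      (merged["first_name_new"] != merged["first_name_old"])]
# Re-matching then loops over these frames row by row, which is what gets
# slow for large uploads.
```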

There are several issues with this. Most significantly, the match is very slow when adding a large number of rows. Also, care needs to be taken to make sure that pandas picks the same datatypes when reading from the database as reading from the csv. Finally, this implementation contains (in my opinion) excessive indirection; we have multiple dictionary objects keyed on data source name from which various bits of normalization instructions are looked up, which makes it difficult for a noob to quickly answer questions like "what exactly happens to a salesforce contacts file when it's uploaded?"
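
For a flavor of that indirection (paraphrased; these are not the literal dictionaries in the codebase):

```python
# Paraphrased illustration of the dictionary-driven indirection. Answering
# "what happens to a salesforce file?" means chasing the same key through
# several mappings like these (keys and values here are made up).
SOURCE_COLUMN_RENAMES = {
    "salesforcecontacts": {"FirstName": "first_name", "LastName": "last_name"},
    "volgistics": {"First name": "first_name", "Last name": "last_name"},
}
SOURCE_ID_COLUMNS = {
    "salesforcecontacts": "contact_id",
    "volgistics": "number",
}
SOURCE_DATE_FORMATS = {
    "salesforcecontacts": "%Y-%m-%d",
    "volgistics": "%m/%d/%Y",
}
```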

In this implementation, we:

  1. get a file from the user
  2. inspect it to "guess" what the data source is
  3. branch on that guess to call an appropriate function to upload that data into a source-specific table with minimal normalization.
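
A minimal sketch of that upload branch, assuming hypothetical helper names, signature columns, and table schema (only one source shown; the real code has a branch per known source):

```python
# Minimal sketch of the new upload flow; helper names, signature columns, and
# the table schema are assumptions for illustration.
import csv
import io
from sqlalchemy import text

def handle_upload(file_obj, conn):
    reader = csv.DictReader(io.TextIOWrapper(file_obj, encoding="utf-8"))
    header = reader.fieldnames or []
    if "FirstName" in header and "LastName" in header:  # made-up signature check
        insert_salesforce_rows(reader, conn)
    # ...elif branches for volgistics, etc...
    else:
        raise ValueError("could not guess the data source")

def insert_salesforce_rows(reader, conn):
    # Raw-ish rows go straight into a source-specific table; types and minimal
    # normalization are decided once, right here.
    stmt = text("INSERT INTO salesforcecontacts (contact_id, first_name, last_name) "
                "VALUES (:id, :first, :last)")
    for row in reader:
        conn.execute(stmt, {"id": row["Contact ID 18"],
                            "first": row["FirstName"].strip(),
                            "last": row["LastName"].strip()})
```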

When the user clicks "execute", we:

  1. delete everything from pdp_contacts
  2. copy the latest data (for each id) from each source-specific table into pdp_contacts, doing any appropriate normalization
  3. compute a match by JOINing pdp_contacts with itself
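
In SQL terms, the whole execute step looks roughly like this. It is a sketch: everything other than pdp_contacts (the source schema, the pdp_matches output table, the email match key) is an assumption for illustration:

```python
# Sketch of the execute step; table/column names other than pdp_contacts are
# assumptions for illustration.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/paws")  # placeholder URL

with engine.begin() as conn:
    # 1. delete everything from pdp_contacts
    conn.execute(text("DELETE FROM pdp_contacts"))

    # 2. copy the latest row per id from each source table, normalizing on
    #    the way out (one INSERT ... SELECT per source)
    conn.execute(text("""
        INSERT INTO pdp_contacts (source_type, source_id, first_name, last_name, email)
        SELECT DISTINCT ON (contact_id)
               'salesforcecontacts', contact_id,
               lower(trim(first_name)), lower(trim(last_name)), lower(email)
        FROM salesforcecontacts
        ORDER BY contact_id, created_date DESC
    """))

    # 3. compute the match by JOINing pdp_contacts with itself; indexes on
    #    the join columns are what make this fast
    conn.execute(text("""
        INSERT INTO pdp_matches (left_id, right_id)
        SELECT a.id, b.id
        FROM pdp_contacts a
        JOIN pdp_contacts b ON a.email = b.email AND a.id < b.id
    """))
```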

This addresses the aforementioned issues. First of all, the match is faster because it can use indexes on pdp_contacts to speed up the join. Second of all, we only need to think about types in one place: when inserting the raw data. Finally, all "interesting" information about, say, salesforcecontacts data, is stored in a SalesForceContacts class. That class knows about the database columns, the source column names, and how to normalize data as it comes in from the file and goes out to pdp_contacts. New processing can be added for a single file type without having to touch code that would affect the processing of other file types.
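
A condensed sketch of the shape of such a class (the real SalesForceContacts has more columns, different normalization rules, and different method names; this just shows the idea):

```python
# Condensed sketch of the per-source class idea; the real SalesForceContacts
# differs in columns, normalization rules, and method names.
class SalesForceContacts:
    table_name = "salesforcecontacts"
    # source csv header -> database column (illustrative names)
    column_map = {
        "Contact ID 18": "contact_id",
        "FirstName": "first_name",
        "LastName": "last_name",
    }

    @classmethod
    def normalize_in(cls, row: dict) -> dict:
        """Normalization applied as a row comes in from the uploaded file."""
        return {db: (row.get(src) or "").strip()
                for src, db in cls.column_map.items()}

    @classmethod
    def contacts_select(cls) -> str:
        """SQL projecting this source's latest rows into pdp_contacts shape,
        with the outgoing normalization applied."""
        return f"""
            SELECT DISTINCT ON (contact_id)
                   '{cls.table_name}' AS source_type,
                   contact_id AS source_id,
                   lower(first_name) AS first_name,
                   lower(last_name) AS last_name
            FROM {cls.table_name}
            ORDER BY contact_id, created_date DESC
        """
```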

One further nice effect of this design is that we now store the history of the (relatively) raw data in each source-specific table, rather than storing normalized histories in pdp_contacts. This would (theoretically) make it easier to run studies on historical data from one source.
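
For instance, under the illustrative schema sketched above, a historical study over one source reduces to an ordinary query against its raw table:

```python
# Assuming the illustrative salesforcecontacts schema from the sketches above:
# the full per-contact history lives in the raw table, independent of
# pdp_contacts.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/paws")  # placeholder URL
with engine.connect() as conn:
    history = conn.execute(text(
        "SELECT contact_id, first_name, last_name, created_date "
        "FROM salesforcecontacts "
        "ORDER BY contact_id, created_date"
    )).fetchall()
```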

@@ -243,7 +243,7 @@ export default function Admin(props) {
<Typography variant="h5" styles={{paddingBottom: 5}}>Run New Analysis</Typography>
<form onSubmit={handleExecute}>
<Button type="submit" variant="contained" color="primary"
disabled={statistics === 'Running' || isNewFileExist === false}>
@carlos-dominguez (Collaborator, Author) commented:
The UI is a bit wonky with this change; the whole "new files" concept is now kind of meaningless.

@carlos-dominguez (Collaborator, Author) commented:

Just added an alembic script, so I think this should all work if you just pull, build, and bring up in Docker.

@sposerina merged commit 9cd4010 into CodeForPhilly:master on Jun 30, 2022