upload/match pipeline refactor #480
Merged
This PR changes the file upload pipeline to shovel raw-ish data directly to the database, rather than writing it out to intermediate csvs. We then use the raw data tables to implement a faster, simpler match that is executed almost entirely by postgres.
Recall that in the existing pipeline, we:

- write each uploaded file out to an intermediate csv.

Then, after the user clicks an "execute" button, we:

- read those csvs, along with the current pdp_contacts rows, back into pandas,
- look up normalization instructions for each data source from dictionaries keyed on source name, and
- run the match in pandas.
There are several issues with this. Most significantly, the match is very slow when adding a large number of rows. Also, care needs to be taken to make sure that pandas picks the same datatypes when reading from the database as it does when reading from the csv. Finally, this implementation contains (in my opinion) excessive indirection; we have multiple dictionary objects keyed on data source name from which various bits of normalization instructions are looked up, which makes it difficult for a noob to quickly answer questions like "what exactly happens to a salesforce contacts file when it's uploaded?"
In this implementation, we:

- insert the (relatively) raw rows from each uploaded file directly into a source-specific table.

When the user clicks "execute", we:

- normalize the rows from each source table into pdp_contacts, and
- run the match as a join inside postgres (a sketch of the kind of query this enables is below).
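To make "executed almost entirely by postgres" concrete, the match step can now be a single set-based statement along these lines. This is only a sketch: the connection string, the `matched_id`/`email`/`mobile`/`last_name` columns, and the match keys are illustrative assumptions, not the actual schema or match rules in this PR.

```python
# Sketch only: a set-based match run inside postgres rather than row-by-row in pandas.
# Column names and match keys below are assumptions for illustration.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/paws")  # hypothetical URL

MATCH_UNMATCHED_ROWS = text("""
    UPDATE pdp_contacts AS incoming
    SET matched_id = existing.matched_id
    FROM pdp_contacts AS existing
    WHERE incoming.matched_id IS NULL
      AND existing.matched_id IS NOT NULL
      AND (
            lower(incoming.email) = lower(existing.email)
         OR (incoming.mobile = existing.mobile
             AND lower(incoming.last_name) = lower(existing.last_name))
      )
""")

with engine.begin() as conn:
    conn.execute(MATCH_UNMATCHED_ROWS)
```

With indexes on the joined columns (or expression indexes on the lowered values), postgres can resolve this join without the per-row work the pandas match had to do.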
This addresses the aforementioned issues. First of all, the match is faster because it can use indexes on pdp_contacts to speed up the join. Second of all, we only need to think about types in one place: when inserting the raw data. Finally, all "interesting" information about, say, salesforcecontacts data, is stored in a SalesForceContacts class. That class knows about the database columns, the source column names, and how to normalize data as it comes in from the file and goes out to pdp_contacts. New processing can be added for a single file type without having to touch code that would affect the processing of other file types.
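For a sense of the shape this takes, here is a minimal sketch of what such a class might look like (plain SQLAlchemy for illustration). The specific column names, spreadsheet headers, and helper methods are assumptions made for the example, not the exact code in this PR.

```python
# Illustrative sketch of a per-source class that owns its table definition,
# its source column names, and its normalization logic. Names are assumed.
import pandas as pd
from sqlalchemy import Column, Integer, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class SalesForceContacts(Base):
    __tablename__ = "salesforcecontacts"

    _id = Column(Integer, primary_key=True)
    contact_id = Column(Text)
    first_name = Column(Text)
    last_name = Column(Text)
    email = Column(Text)

    # How headers in the uploaded file map onto the columns above (assumed headers).
    column_mapping = {
        "Contact ID": "contact_id",
        "First Name": "first_name",
        "Last Name": "last_name",
        "Email": "email",
    }

    @classmethod
    def clean_raw(cls, raw_df: pd.DataFrame) -> pd.DataFrame:
        """Normalize a freshly uploaded file before inserting it into this table."""
        df = raw_df.rename(columns=cls.column_mapping)[list(cls.column_mapping.values())]
        df["email"] = df["email"].str.strip().str.lower()
        return df

    @classmethod
    def pdp_contacts_select(cls) -> str:
        """SQL projecting this table's rows into the pdp_contacts shape at execute time."""
        return (
            "SELECT 'salesforcecontacts' AS source_type, contact_id AS source_id, "
            "first_name, last_name, email FROM salesforcecontacts"
        )
```

Adding a new file type then means adding one more class of this shape, without touching the classes for the existing sources.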
One further nice effect of this design is that we now store the history of the (relatively) raw data in each source-specific table, rather than storing the normalized histories in pdp_contacts data. This would (theoretically) make it easier to run studies on historical data from one source.
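As a toy example of the kind of per-source study this enables, assuming each raw row carries something like a created_date timestamp (an assumed detail, not necessarily the column name used here):

```python
# Sketch: query the raw salesforcecontacts history directly, without going
# through pdp_contacts. The created_date column is an assumed detail.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/paws")  # hypothetical URL

ROWS_PER_MONTH = text("""
    SELECT date_trunc('month', created_date) AS upload_month, count(*) AS n_rows
    FROM salesforcecontacts
    GROUP BY 1
    ORDER BY 1
""")

with engine.connect() as conn:
    for month, n_rows in conn.execute(ROWS_PER_MONTH):
        print(month, n_rows)
```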