Skip to content

DRAFT: 285 rfm model #422

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 39 commits into from
Oct 3, 2021
Merged

DRAFT: 285 rfm model #422

merged 39 commits into from
Oct 3, 2021

Conversation

bdeck8317
Copy link
Collaborator

Meet to discuss integration into pipeline. Will need a var passed into create_scores.py which is a str(query_date)

Updates the rfm_scores table in postgres database.

Adds a number of module dependencies to requirements.txt

bdeck8317 added 19 commits July 15, 2021 19:49
…bins func and writes txt file with bins, needs testing
…l LLCs from RFM score analysis based on recommendations
…ction which pulls data from tables and creates scores based on tabeled bins
…fo. Edges_dicts moved into rfm_functions.py. Updated requirements.txt to import modules necessary to run funcs. Data is uploaded to rfm_database once scores are calculated.
c-simpson and others added 2 commits August 17, 2021 20:24
…able/renaming var lines 85- 90. date_difference calculation now uses max close donation date instead of query date.
@bdeck8317 bdeck8317 marked this pull request as draft August 24, 2021 13:55
@bdeck8317 bdeck8317 changed the title 285 rfm model DRAFT 285 rfm model Aug 24, 2021
@bdeck8317 bdeck8317 changed the title DRAFT 285 rfm model DRAFT: 285 rfm model Aug 24, 2021
@c-simpson
Copy link
Collaborator

I'm getting this error running create_scores():

  File "C:\Projects\paws-data-pipeline\src\server\rfm_funcs\create_scores.py", line 73, in create_scores
    grouped_past_year['recency_score'] = pd.cut(grouped_past_year[('days_since','min')], bins= recency_bins, labels=recency_labels, include_lowest = True)
  File "C:\Python38\lib\site-packages\pandas\core\reshape\tile.py", line 262, in cut
    raise ValueError("bins must increase monotonically.")
ValueError: bins must increase monotonically.

with rfm_edges as:

{
"r": {"5": 0, "4": 262, "3": 1097, "2": 1910, "1": 2851}, 
"f": {"1": 0, "2": 1, "3": 2, "4": 3, "5": 4}, 
"m": {"1": 0.0, "2": 50.0, "3": 75.0, "4": 100.0, "5": 210.0}
}

It looks like the issue is with the 'r' set. Do the values need to be reversed wrt the keys? 

@bdeck8317
Copy link
Collaborator Author

bdeck8317 commented Sep 14, 2021

@c-simpson looks like a pandas issue with handling numpy.diff. Could try updating pandas to 1.3.2

Apparently, the bug is fixed in that version as I haven't had this issue. I also ran the script using the API and it seemed okay....

I'll keep digging around on this issue, but my temporary solution is to just update pandas to 1.3.2

pandas-dev/pandas#40969

@c-simpson
Copy link
Collaborator

I'm still getting it with 1.3.2

@c-simpson
Copy link
Collaborator

c-simpson commented Sep 14, 2021

We poked in this while you weren't around and noticed that

recency_bins.append(grouped_past_year[('days_since', 'min')].max())

adds 43 to the end of recency_bins, which causes the 'monotonic' issue. In the debugger, if I put 43 in the proper sort position, we don't get the monotonic complaint. We didn't know exactly what you were doing so we stopped there.

@c-simpson
Copy link
Collaborator

c-simpson commented Sep 15, 2021

@bdeck8317 I created a branch ( 285-sort-after-append ) off yours that does nothing more than sort recency_bins after you append. It runs but only creates scores for 754 matching_ids. So I'll stop messing with things I don't understand and will let you carry on!

c-simpson and others added 3 commits September 16, 2021 18:19
…ed out a bug which was appending improper max values. Now it is robust to max values which are either max for data if it is below max bin edge or will use pre-existing max bin edge
@bdeck8317 bdeck8317 marked this pull request as ready for review September 17, 2021 00:31
@bdeck8317
Copy link
Collaborator Author

@c-simpson @sposerina ready for review and initial integration. There is one conflict. Not sure how to address it.

@c-simpson
Copy link
Collaborator

The 'append 43' is making me think about robustness:

  • What happens if date_differences() gets a badly-formatted date (shouldn't happen)? A future date(could) ? Can callers handle?
  • In create_scores(), what should happen if read_rfm_edges() returns None?
  • It might be cleaner to put the r,f,m processing in a separate function for each and put a try/catch in each. What to return if a failure?
  • If one of those fails, what should we do? We don't want to push bad data. @kfettich Any thoughts?

@c-simpson
Copy link
Collaborator

Spoke to BD (who is swamped) - I'll make the above changes this week.

@kfettich
Copy link
Collaborator

Sorry, I missed the mention! I think we discussed briefly, but my thoughts if any of the 3 failures happens:

  • What happens if date_differences() gets a badly-formatted date (shouldn't happen)? A future date(could) ? Can callers handle? -> future dates should be filtered out; meanwhile, a user who has donation data with badly formatted dates should display some kind of warning (either an "error" label instead of an RFM label, or something else to notify the user that there was a problem calculating the score)

  • In create_scores(), what should happen if read_rfm_edges() returns None? -> revert to the previous set of edges that worked, and alert dev team

  • It might be cleaner to put the r,f,m processing in a separate function for each and put a try/catch in each. What to return if a failure? ->return last RFM score for person, but with some indication that it is not the most recent score

If one of those fails, what should we do? We don't want to push bad data. @kfettich Any thoughts? -> see above; I don't expect there to be dramatic changes from one cycle to the other, so using outdated RFM scores from the previous cycle should be ok; doing it more than once will be problematic though

@c-simpson c-simpson merged commit 57a6426 into master Oct 3, 2021
@c-simpson c-simpson deleted the 285-rfm-model branch October 12, 2021 01:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants