Update k_means_clust.py #8996

Merged: 2 commits, Sep 27, 2023
23 changes: 10 additions & 13 deletions machine_learning/k_means_clust.py
@@ -11,10 +11,10 @@
- initial_centroids , initial centroid values generated by utility function(mentioned
in usage).
- maxiter , maximum number of iterations to process.
- - heterogeneity , empty list that will be filled with hetrogeneity values if passed
+ - heterogeneity , empty list that will be filled with heterogeneity values if passed
to kmeans func.
Usage:
- 1. define 'k' value, 'X' features array and 'hetrogeneity' empty list
+ 1. define 'k' value, 'X' features array and 'heterogeneity' empty list
2. create initial_centroids,
initial_centroids = get_initial_centroids(
X,
@@ -31,8 +31,8 @@
record_heterogeneity=heterogeneity,
verbose=True # whether to print logs in console or not.(default=False)
)
- 4. Plot the loss function, hetrogeneity values for every iteration saved in
- hetrogeneity list.
+ 4. Plot the loss function and heterogeneity values for every iteration saved in
+ heterogeneity list.
plot_heterogeneity(
heterogeneity,
k
@@ -198,13 +198,10 @@ def report_generator(
df: pd.DataFrame, clustering_variables: np.ndarray, fill_missing_report=None
) -> pd.DataFrame:
"""
- Function generates easy-erading clustering report. It takes 2 arguments as an input:
- DataFrame - dataframe with predicted cluester column;
- FillMissingReport - dictionary of rules how we are going to fill missing
- values of for final report generate (not included in modeling);
- in order to run the function following libraries must be imported:
- import pandas as pd
- import numpy as np
+ Generates a clustering report. This function takes 2 arguments as input:
+ df - dataframe with predicted cluster column
+ fill_missing_report - dictionary of rules on how we are going to fill in missing
+ values for final generated report (not included in modelling);
>>> data = pd.DataFrame()
>>> data['numbers'] = [1, 2, 3]
>>> data['col1'] = [0.5, 2.5, 4.5]
@@ -306,10 +303,10 @@ def report_generator(
a.columns = report.columns # rename columns to match report
report = report.drop(
report[report.Type == "count"].index
- ) # drop count values except cluster size
+ ) # drop count values except for cluster size
report = pd.concat(
[report, a, clustersize, clusterproportion], axis=0
- ) # concat report with clustert size and nan values
+ ) # concat report with cluster size and nan values
report["Mark"] = report["Features"].isin(clustering_variables)
cols = report.columns.tolist()
cols = cols[0:2] + cols[-1:] + cols[2:-1]
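
For readers following the last hunk, here is a minimal sketch of what the `Mark` column and the column reordering do. It runs on a made-up report frame: the `cluster_0`/`cluster_1` columns, the sample values, and the final `report[cols]` reindexing step are assumptions for illustration and are not taken from the diff.

```python
import pandas as pd

# Hypothetical miniature report; only "Features", "Type" and "Mark" mirror the diff.
report = pd.DataFrame(
    {
        "Features": ["col1", "col2", "numbers"],
        "Type": ["mean", "mean", "mean"],
        "cluster_0": [0.5, 100.0, 1.0],
        "cluster_1": [4.5, 300.0, 3.0],
    }
)
clustering_variables = ["col1", "col2"]  # assumed variables used for clustering

# Flag the rows whose feature was used as a clustering variable.
report["Mark"] = report["Features"].isin(clustering_variables)

# Move the last column ("Mark") so it sits right after the first two columns.
cols = report.columns.tolist()
cols = cols[0:2] + cols[-1:] + cols[2:-1]
report = report[cols]  # assumed follow-up step; not visible in the hunk

print(report.columns.tolist())
```

The slice arithmetic keeps the first two columns in place, pulls the final column forward, and appends everything that was in between, so no column is dropped or duplicated.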