Forecast #3219

Merged
merged 40 commits into from
Oct 24, 2020
Changes from 31 commits
722584e
add forecasting code
Oct 12, 2020
0b823da
add statsmodel
Oct 12, 2020
aab3a71
sort import
Oct 12, 2020
8f69418
sort import fix
Oct 12, 2020
31b8926
fixing black
Oct 12, 2020
e3ba4fa
sort requirement
Oct 12, 2020
058889b
optimize code
Oct 12, 2020
7a58104
try with limited data
Oct 12, 2020
6f0b775
sort again
Oct 12, 2020
5fd4b05
sort fix
Oct 12, 2020
4864b41
sort fix
Oct 12, 2020
122fdc6
delete warning and black
Oct 12, 2020
4785ddf
resolve error merge
Oct 12, 2020
b3df925
add code for forecasting
Oct 12, 2020
cfa6a2f
use black
Oct 12, 2020
cefc5f4
add more hints to describe
Oct 13, 2020
85b37cc
add doctest
Oct 13, 2020
13d23b6
finding whitespace
Oct 13, 2020
aa0daa5
fixing doctest
Oct 13, 2020
1e6f923
delete
Oct 13, 2020
ab98e75
revert back
Oct 13, 2020
3851388
revert back
Oct 13, 2020
6fb8a22
revert back again
Oct 13, 2020
89e60a8
Merge branch 'forecast' of github.com:nandiya/Python into forecast
Oct 13, 2020
574d25d
revert back again
Oct 13, 2020
c151dd5
revert back again
Oct 13, 2020
bf45d77
try trimming whitespace
Oct 13, 2020
4cdec16
try adding doctypeand etc
Oct 13, 2020
43c9d4c
fixing reviews
Oct 13, 2020
14ccdc8
deleting all the space
Oct 13, 2020
8194902
fixing the build
Oct 13, 2020
eab1d3b
delete x
Oct 13, 2020
d34853f
add description for safety checker
Oct 13, 2020
81effc5
deleting subscription integer
Oct 13, 2020
6c8f1af
fix docthint
Oct 13, 2020
070cba4
make def to use function parameters and return values
Oct 24, 2020
28ac649
make def to use function parameters and return values
Oct 24, 2020
da69f83
type hints on data safety checker
Oct 24, 2020
3b18a48
optimize code
Oct 24, 2020
85723b6
Update run.py
cclauss Oct 24, 2020
114 changes: 114 additions & 0 deletions machine_learning/forecasting/ex_data.csv
@@ -0,0 +1,114 @@
total_user,total_events,days
18231,0.0,1
22621,1.0,2
15675,0.0,3
23583,1.0,4
68351,5.0,5
34338,3.0,6
19238,0.0,0
24192,0.0,1
70349,0.0,2
103510,0.0,3
128355,1.0,4
148484,6.0,5
153489,3.0,6
162667,1.0,0
311430,3.0,1
435663,7.0,2
273526,0.0,3
628588,2.0,4
454989,13.0,5
539040,3.0,6
52974,1.0,0
103451,2.0,1
810020,5.0,2
580982,3.0,3
216515,0.0,4
134694,10.0,5
93563,1.0,6
55432,1.0,0
169634,1.0,1
254908,4.0,2
315285,3.0,3
191764,0.0,4
514284,7.0,5
181214,4.0,6
78459,2.0,0
161620,3.0,1
245610,4.0,2
326722,5.0,3
214578,0.0,4
312365,5.0,5
232454,4.0,6
178368,1.0,0
97152,1.0,1
222813,4.0,2
285852,4.0,3
192149,1.0,4
142241,1.0,5
173011,2.0,6
56488,3.0,0
89572,2.0,1
356082,2.0,2
172799,0.0,3
142300,1.0,4
78432,2.0,5
539023,9.0,6
62389,1.0,0
70247,1.0,1
89229,0.0,2
94583,1.0,3
102455,0.0,4
129270,0.0,5
311409,1.0,6
1837026,0.0,0
361824,0.0,1
111379,2.0,2
76337,2.0,3
96747,0.0,4
92058,0.0,5
81929,2.0,6
143423,0.0,0
82939,0.0,1
74403,1.0,2
68234,0.0,3
94556,1.0,4
80311,0.0,5
75283,3.0,6
77724,0.0,0
49229,2.0,1
65708,2.0,2
273864,1.0,3
1711281,0.0,4
1900253,5.0,5
343071,1.0,6
1551326,0.0,0
56636,1.0,1
272782,2.0,2
1785678,0.0,3
241866,0.0,4
461904,0.0,5
2191901,2.0,6
102925,0.0,0
242778,1.0,1
298608,0.0,2
322458,10.0,3
216027,9.0,4
916052,12.0,5
193278,12.0,6
263207,8.0,0
672948,10.0,1
281909,1.0,2
384562,1.0,3
1027375,2.0,4
828905,9.0,5
624188,22.0,6
392218,8.0,0
292581,10.0,1
299869,12.0,2
769455,20.0,3
316443,8.0,4
1212864,24.0,5
1397338,28.0,6
223249,8.0,0
191264,14.0,1
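The CSV above starts with a header row, so a reader that treats the first line as column names is the natural way to load it. A minimal sketch using only the standard library, assuming the three columns mean what run.py later describes (total users per day, online events per day, day of the week as 0-6); the helper name `load_rows` and the inline `SAMPLE` string are illustrative, not part of the PR:

```python
import csv
import io

# A few rows copied from ex_data.csv; a real script would open the file instead.
SAMPLE = """total_user,total_events,days
18231,0.0,1
22621,1.0,2
15675,0.0,3
23583,1.0,4
"""


def load_rows(handle):
    """Parse rows into (total_user, total_events, day_of_week) tuples."""
    reader = csv.DictReader(handle)  # the first line is used as the header
    return [
        (int(r["total_user"]), float(r["total_events"]), int(r["days"]))
        for r in reader
    ]


rows = load_rows(io.StringIO(SAMPLE))
```

Converting each field explicitly keeps the user counts as integers and the event counts as floats, matching how they appear in the file.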
149 changes: 149 additions & 0 deletions machine_learning/forecasting/run.py
@@ -0,0 +1,149 @@
"""
This is code for forecasting,
but I modified it and used it as a safety checker for data.
For example: you have an online shop and for some reason some data are
missing (the amount of data is not what you expected),
then we can use it.
*ps : 1. Of course we can use a normal statistical method, but in this case
         the data is quite absurd and there is only a little of it.
      2. Of course you can use this and modify it for forecasting purposes,
         e.g. the next 3 months of sales;
         you can just adjust it for your own purpose.
"""

import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVR
from statsmodels.tsa.statespace.sarimax import SARIMAX


def lin_reg_pred(train_dt, train_usr, train_mtch, test_dt, test_mtch):
Member: Type hints? Doctests? See CONTRIBUTING.md.

    """
    First method: linear regression
    input : training data (date, total_user, total_event) in list of float
    output : total user prediction in float
    >>> n = lin_reg_pred([2,3,4,5], [5,3,4,6], [3,1,2,4], [2,1], [2,2])
    >>> bool(abs(n - 4.0) < 1e-6)
    True
    """
    x = np.array([[1, item, train_mtch[i]] for i, item in enumerate(train_dt)])
    y = np.array(train_usr)
    beta = np.dot(np.dot(np.linalg.inv(np.dot(x.transpose(), x)), x.transpose()), y)
    prediction = abs(beta[0] + test_dt[0] * beta[1] + test_mtch[0] * beta[2])
    return prediction
Member suggested change:
    prediction = abs(beta[0] + test_dt[0] * beta[1] + test_mtch[0] * beta[2])
    return prediction
becomes
    return abs(beta[0] + test_dt[0] * beta[1] + test_mtch[0] * beta[2])

Contributor Author: done
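The fit in `lin_reg_pred` solves the normal equations by inverting `x.T @ x` explicitly. The same least-squares coefficients can be obtained with `np.linalg.lstsq`, which avoids forming the inverse. A minimal sketch under that assumption; the helper names `fit_linear` and `predict_linear` are illustrative and not part of the PR, and unlike the PR code the prediction is not wrapped in `abs()`:

```python
import numpy as np


def fit_linear(train_dt, train_mtch, train_usr):
    """Least-squares coefficients [intercept, date coef, match coef]."""
    # Design matrix: a column of ones for the intercept, then the two features.
    x = np.column_stack([np.ones(len(train_dt)), train_dt, train_mtch])
    y = np.asarray(train_usr, dtype=float)
    beta, *_ = np.linalg.lstsq(x, y, rcond=None)
    return beta


def predict_linear(beta, dt, mtch):
    # Each feature is multiplied by its own coefficient.
    return float(beta[0] + beta[1] * dt + beta[2] * mtch)


# Same toy data as the doctest: dates, matches, users.
beta = fit_linear([2, 3, 4, 5], [3, 1, 2, 4], [5, 3, 4, 6])
prediction = predict_linear(beta, 2, 2)  # approximately 4.0 for this toy data
```

On the doctest data the fit is exact (intercept 2, date coefficient 0, match coefficient 1), so the prediction for date 2 and match 2 comes out at about 4.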



def sarimax_predictor(train_user, train_match, test_match):
Member: Type hints?

    """
    Second method: SARIMAX
    SARIMAX is a statistical method which uses previous input
    and learns its pattern to predict future data
    input : training data (total_user, with exog data = total_event) in list of float
    output : total user prediction in float
    >>> sarimax_predictor([4,2,6,8], [3,1,2,4], [2])
    3.0000424034255513
    """
    order = (1, 2, 1)
    seasonal_order = (1, 1, 0, 7)
    model = SARIMAX(
        train_user, exog=train_match, order=order, seasonal_order=seasonal_order
    )
    model_fit = model.fit(disp=False, maxiter=600, method="nm")
    result = model_fit.predict(1, len(test_match), exog=[test_match])
    return result[0]


def support_machine_regressor(x_train, x_test, train_user):
Member: Type hints?

    """
    Third method: support vector regressor
    SVR is much the same as SVM (support vector machine):
    it uses the same principles as the SVM for classification,
    with only a few minor differences, and it is better suited
    to regression
    input : training data (date, total_user, total_event) in list of float
    where x = list of set (date and total event)
    output : total user prediction in float
    >>> support_machine_regressor([[5,2],[1,5],[6,2]], [[3,2]], [2,1,4])
    1.634932078116079
    """
    regressor = SVR(kernel="rbf", C=1, gamma=0.1, epsilon=0.1)
    regressor.fit(x_train, train_user)
    y_pred = regressor.predict(x_test)
    return y_pred[0]


def interquartile_range_checker(train_user):
Member: Type hints?

    """
    Optional method: interquartile range
    input : list of total user in float
    output : low limit of input in float
    this method can be used to check whether some data is an outlier or not
    >>> interquartile_range_checker([1,2,3,4,5,6,7,8,9,10])
    2.8
    """
    train_user.sort()
    q1 = np.percentile(train_user, 25)
    q3 = np.percentile(train_user, 75)
    iqr = q3 - q1
    low_lim = q1 - (iqr * 0.1)
    return low_lim
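The lower limit computed by `interquartile_range_checker` is `q1 - 0.1 * iqr`. A dependency-free sketch of the same calculation, assuming percentiles are taken with linear interpolation between neighbouring ranks (the default behaviour of `np.percentile`); the names `percentile`, `lower_limit` and `factor` are illustrative. The conventional Tukey fence uses a factor of 1.5, so the 0.1 used in the PR gives a much tighter limit:

```python
import math


def percentile(values, pct):
    """Percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)  # sorted() leaves the caller's list untouched
    pos = (len(ordered) - 1) * pct / 100
    lower = math.floor(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def lower_limit(values, factor=0.1):
    """Return q1 - factor * IQR; values below it are flagged as low outliers."""
    q1 = percentile(values, 25)
    q3 = percentile(values, 75)
    return q1 - factor * (q3 - q1)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
pr_limit = lower_limit(data)            # about 2.8, as in the doctest
tukey_limit = lower_limit(data, 1.5)    # conventional 1.5 * IQR fence
```

Using `sorted()` instead of `list.sort()` also avoids reordering the list passed in by the caller.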


def data_safety_checker(list_vote, actual_result):
Member: Type hints?

Contributor Author: arent the hints input and output that i described?? or did i do it wrongly?

Member: https://docs.python.org/3/library/typing.html is described in CONTRIBUTING.md. Or look at other Python files in this repo.

    safe = 0
    not_safe = 0
    for i in list_vote:
        if i > actual_result[0]:
            safe = safe + 1
        else:
            if abs(abs(i) - abs(actual_result[0])) <= 0.1:
                safe = safe + 1
            else:
                not_safe = not_safe + 1
    if safe > not_safe:
        print("today's data is safe")
    else:
        print("today's data is not safe")
Member suggested change:
    if safe > not_safe:
        print("today's data is safe")
    else:
        print("today's data is not safe")
becomes
    print("today's data is {'not ' if safe <= not_safe else ''}safe")

Contributor Author: idk whether that code is working, but i got ur point to use inline if print

--done
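The inline conditional suggested in the review only interpolates when the string literal carries an `f` prefix; without it the braces are printed literally. A minimal sketch of the voting check with type hints in the style the reviewer points to, assuming each forecast simply increments one of two counters and that the actual value is passed as a single float; the `tolerance` parameter and the `report` helper are illustrative, not part of the PR:

```python
from typing import List


def data_safety_checker(
    list_vote: List[float], actual_result: float, tolerance: float = 0.1
) -> bool:
    """Return True when more forecasts vote "safe" than "not safe"."""
    safe = 0
    not_safe = 0
    for vote in list_vote:
        # A forecast above the actual value, or within `tolerance` of it,
        # counts as a "safe" vote; anything else counts against.
        if vote > actual_result or abs(abs(vote) - abs(actual_result)) <= tolerance:
            safe += 1
        else:
            not_safe += 1
    return safe > not_safe


def report(is_safe: bool) -> str:
    # The f prefix is what makes the braces interpolate.
    return f"today's data is {'' if is_safe else 'not '}safe"


message = report(data_safety_checker([0.9, 0.95, 0.2], 0.92))
```

Returning a bool and formatting the message separately keeps the check usable in doctests, which cannot easily assert on printed side effects.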



# data_input_df = pd.read_csv("ex_data.csv", header=None)
list_data = [[18231, 0.0, 1], [22621, 1.0, 2], [15675, 0.0, 3], [23583, 1.0, 4]]
Member: Please provide a more self-documenting variable name than list_data. What is this a list of? Are these stock quotes or swine flu patients or food orders? Help the reader understand what this data is so they can understand why all this analysis is worth doing.

Contributor Author: done, please check it if the current one can be used or not

data_input_df = pd.DataFrame(list_data, columns=["total_user", "total_even", "days"])

"""
data columns = total users in a day, how many online events were held that day,
what day of the week it is (sunday-saturday)
"""

# start normalization
normalize_df = Normalizer().fit_transform(data_input_df.values)
# split data
total_date = normalize_df[:, 2].tolist()
total_user = normalize_df[:, 0].tolist()
total_match = normalize_df[:, 1].tolist()

# for svr (input variable = total date and total match)
x = normalize_df[:, [1, 2]].tolist()
x_train = x[: len(x) - 1]
x_test = x[len(x) - 1 :]

# for linear regression & sarimax
trn_date = total_date[: len(total_date) - 1]
trn_user = total_user[: len(total_user) - 1]
trn_match = total_match[: len(total_match) - 1]

tst_date = total_date[len(total_date) - 1 :]
tst_user = total_user[len(total_user) - 1 :]
tst_match = total_match[len(total_match) - 1 :]


# voting system with forecasting
res_vote = []
res_vote.append(lin_reg_pred(trn_date, trn_user, trn_match, tst_date, tst_match))
res_vote.append(sarimax_predictor(trn_user, trn_match, tst_match))
res_vote.append(support_machine_regressor(x_train, x_test, trn_user))

# check the safety of today's data
data_safety_checker(res_vote, tst_user)
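The script normalizes with scikit-learn's `Normalizer`, which rescales each sample (row) to unit norm, L2 by default, rather than scaling each column. A dependency-free sketch of that row-wise operation applied to the four sample rows from run.py; the helper name `l2_normalize_rows` is illustrative and not part of the PR:

```python
import math


def l2_normalize_rows(rows):
    """Scale each row to unit Euclidean length, leaving all-zero rows as is."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm for v in row] if norm else list(row))
    return out


# Rows are [total_user, total_event, day], as in the script above.
user_data = [[18231, 0.0, 1], [22621, 1.0, 2], [15675, 0.0, 3], [23583, 1.0, 4]]
normalized = l2_normalize_rows(user_data)
```

Because `total_user` dominates every row, its normalized value ends up very close to 1 in each of them, which is worth keeping in mind when reading the 0.1 tolerance used by the safety checker.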
1 change: 1 addition & 0 deletions requirements.txt
@@ -10,6 +10,7 @@ pillow
requests
scikit-fuzzy
sklearn
statsmodels
sympy
tensorflow
xgboost