Optimal coefficients, fixes #8847 #8885
Conversation
iterations = 250000
alpha = 0.000326
This won't give the optimal coefficients. Like I said in my original issue, gradient descent is an approximation algorithm by definition, so tweaking the parameters will only give you a better approximation. My point with the original issue is that we should add a different algorithm, one that is a direct method rather than an iterative method.
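To make the suggestion concrete, here is a hedged sketch of what a direct method for linear regression could look like (the function name `ols_direct` is illustrative, not from the repo): ordinary least squares has a closed-form solution via the normal equations, which `np.linalg.lstsq` solves in one step.

```python
import numpy as np

def ols_direct(data_x: np.ndarray, data_y: np.ndarray) -> np.ndarray:
    """Solve least squares exactly (up to floating point) in one step.

    Solves the normal equations X^T X theta = X^T y; np.linalg.lstsq
    handles rank-deficient X more robustly than inverting X^T X directly.
    """
    theta, *_ = np.linalg.lstsq(data_x, data_y, rcond=None)
    return theta

# Example: y = 2*x + 1 is recovered without iterations or a learning rate.
x = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # column of ones = intercept
y = np.array([1.0, 3.0, 5.0])
print(ols_direct(x, y))  # close to [1., 2.]
```

No hyperparameters to tune, and the answer is exact up to floating-point round-off.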
This also doesn't fix the fact that the code is still calculating SSE incorrectly.
Can you elaborate on what you mean by a direct method?
A direct method gives an exact solution. By contrast, an iterative method (such as gradient descent) gives increasingly accurate approximations. With an iterative method, you get closer and closer to the exact solution but you may never actually reach that exact solution.
For example, say you want to find the roots of a quadratic polynomial: the quadratic formula computes the exact roots in one step (a direct method), while something like Newton's method only refines a guess until it is close enough (an iterative method).
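That contrast can be sketched in code (a minimal illustration; the helper names `roots_direct` and `root_newton` are made up for this example):

```python
import math

def roots_direct(a: float, b: float, c: float) -> tuple[float, float]:
    """Direct method: the quadratic formula gives the exact roots at once."""
    d = math.sqrt(b * b - 4 * a * c)  # assumes real roots (b^2 - 4ac >= 0)
    return ((-b + d) / (2 * a), (-b - d) / (2 * a))

def root_newton(a: float, b: float, c: float, x0: float, steps: int) -> float:
    """Iterative method: Newton's method only approaches a root."""
    x = x0
    for _ in range(steps):
        # x_{k+1} = x_k - f(x_k) / f'(x_k) for f(x) = a x^2 + b x + c
        x -= (a * x * x + b * x + c) / (2 * a * x + b)
    return x

# x^2 - 2 = 0: the direct method lands on sqrt(2) immediately, while
# Newton's method gets closer with every step but never "finishes".
print(roots_direct(1, 0, -2)[0])      # 1.4142135623730951
print(root_newton(1, 0, -2, 1.0, 5))  # approaches sqrt(2)
```

The iterative answer improves with `steps`, which is exactly the situation with gradient descent and its `iterations` parameter.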
Also, explain with facts why the SSE is calculated incorrectly. The code makes the predictions, finds the deviations, squares them, and sums the squared errors, which is the definition of SSE.
I already explained this with my code sample in the original issue, but sure, let's go over the math for the SSE function:
Python/machine_learning/linear_regression.py
Lines 52 to 64 in 9e08c77
def sum_of_square_error(data_x, data_y, len_data, theta):
    """Return sum of square error for error calculation
    :param data_x   : contains our dataset
    :param data_y   : contains the output (result vector)
    :param len_data : len of the dataset
    :param theta    : contains the feature vector
    :return         : sum of square error computed from given feature's
    """
    prod = np.dot(theta, data_x.transpose())
    prod -= data_y.transpose()
    sum_elem = np.sum(np.square(prod))
    error = sum_elem / (2 * len_data)
    return error
Lines 60–62 are fine because they calculate the SSE exactly as we expect: they compute the differences between the predicted and actual response values, square them, and sum them.
However, for some reason, the function then divides the SSE by 2 * len_data. Dividing the SSE (sum of squared errors) by the number of data points gives you the MSE (mean squared error). Naturally, it follows that this "SSE" function is actually calculating half of the MSE, because it divides the SSE by 2 * len_data rather than returning it unchanged.
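A quick numeric check makes the discrepancy concrete (a sketch using the same shapes the quoted function expects; the dataset here is made up):

```python
import numpy as np

# Tiny dataset: 3 points, 2 features (bias column + x)
data_x = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
data_y = np.array([2.0, 2.0, 5.0])
theta = np.array([[0.0, 1.0]])  # predictions: 1, 2, 3
len_data = 3

residuals = np.dot(theta, data_x.transpose()) - data_y.transpose()
sse = np.sum(np.square(residuals))   # (1-2)^2 + (2-2)^2 + (3-5)^2 = 5
mse = sse / len_data                 # 5 / 3
half_mse = sse / (2 * len_data)      # 5 / 6 -- what the quoted function returns

print(sse, mse, half_mse)
```

The quoted function's output matches `half_mse`, not `sse`, so its name and docstring disagree with what it computes.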
I appreciate your response.
Please add the issue number to the description of your PR. You'll know it worked when you see the issue appear in this section on the right-hand side of this PR.
As far as the calculation of the SSE, this is what I found: the division by 2 * len_data is a significant factor. It is related to the derivative of the cost function during optimization, particularly when using gradient descent. By dividing the SSE by 2 * len_data, the derivative of the cost function with respect to the model parameters (theta) becomes simpler and results in a cleaner expression.

Additionally, it is worth noting that the cost function commonly used in linear regression is the Mean Squared Error (MSE), which is calculated as SSE divided by 2 * len_data:

MSE = SSE / (2 * len_data)

When taking the derivative of MSE with respect to the model parameters (theta), you get:

d(MSE)/d(theta) = (1/len_data) * (theta.dot(data_x.transpose()) - data_y.transpose()).dot(data_x)

Here, the division by 2 in the original SSE equation cancels out during differentiation, and the 1/len_data term helps scale the gradients appropriately to handle datasets of different sizes. Although the division by 2 * len_data doesn't directly impact the optimization process, it offers a more interpretable representation of the derivative of the cost function, which can be advantageous in certain contexts.

Whether to include the division by 2 * len_data in the cost function is a matter of preference. Some implementations choose to omit it and adjust the learning rate accordingly to compensate for the scale change. However, to ensure consistency throughout the optimization process and related calculations, it's essential to make a deliberate choice and adhere to it. So I don't believe your speculation is correct, and the implementation of the SSE stands. That's from me.
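The one uncontested piece of that argument, that the 2 cancels in the derivative of the halved loss, can be sanity-checked numerically. This is a sketch (random data, symbols mirroring the comment above), comparing the claimed analytic gradient against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
data_x = rng.normal(size=(5, 2))
data_y = rng.normal(size=5)
theta = rng.normal(size=2)
n = len(data_y)

def half_mse(t):
    """The halved loss: ||X t - y||^2 / (2 n)."""
    r = data_x @ t - data_y
    return np.sum(r * r) / (2 * n)

# Analytic gradient claimed in the comment: (1/n) * X^T (X theta - y)
analytic = data_x.T @ (data_x @ theta - data_y) / n

# Central finite differences as an independent check
eps = 1e-6
numeric = np.array([
    (half_mse(theta + eps * e) - half_mse(theta - eps * e)) / (2 * eps)
    for e in np.eye(2)
])

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```

The gradients agree, so the disagreement in this thread is purely about what the function should be *called*, not about how gradient descent uses it.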
It's perfectly fine to scale the loss function to make the math cleaner, but again the problem here is that this scaled version is no longer the SSE. SSE has a precise mathematical definition, so why should we call the scaled version the SSE when it's clearly not? Why not just call it
This is just factually wrong. MSE is defined to be SSE divided by the number of data points, not by twice the number of data points. While I have seen half of the MSE used in the context of ML (because, as you pointed out, it can make the math cleaner), we should definitely avoid misleading people with a different definition because this repo is for educational purposes.

Anyway, you missed the whole point of my issue. My whole point was that the output of the function sum_of_square_error is not actually the SSE.
Closing this PR since, as I've stated already, it doesn't actually resolve the linked issue.
Describe your change:
Fixes #8847

I found the optimal iterations and learning rate for achieving the closest coefficients.
Checklist: