Linear Regression
Linear Regression, Best Fit Line, Least Squares, Continuous Variables, Predictive Modeling
Linear Regression is the foundational algorithm for predicting continuous numbers. While classification algorithms predict categories (like 'Spam' or 'Not Spam'), regression predicts exact values (like forecasting a student's exam score based on how many hours they studied). It works by drawing a straight 'Best-Fit Line' right through the middle of your dataset, allowing you to estimate future outcomes based on past trends.
Equation of the Best-Fit Line
Y = mx + b
What do these variables mean?
- Y: The predicted output (Dependent Variable). This is what you are trying to find.
- x: The input value (Independent Variable).
- m: The Slope (Weight). Tells you how much Y changes for every 1-unit increase in x. Calculated as m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)².
- b: The Y-Intercept (Bias). The base value of Y when x is exactly 0. Calculated as b = ȳ − m·x̄.
- Deviation: The difference between a single data point and the average (mean) of all points. E.g., the deviation of x is x − x̄.
How Does it Work?
1. Calculate the Mean (average) of all your 'x' values and all your 'y' values.
2. For every single row, calculate the deviations: subtract the mean of x from the row's x value, and the mean of y from the row's y value.
3. Multiply the x and y deviations together for each row, and also calculate the square of just the x deviations.
4. Sum up all the multiplied deviations, and sum up all the squared x deviations.
5. Divide the sum of the multiplied deviations by the sum of the squared x deviations to find your slope (m).
6. Plug your slope (m) and your means into the intercept formula to find 'b'.
7. Finally, plug your new 'x' query into Y = mx + b to get your prediction!
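The steps above can be sketched in plain Python. The dataset here (hours studied vs. exam score) is made up purely for illustration:

```python
# Manual least-squares walkthrough on a small, made-up dataset:
# x = hours studied, y = exam score.
xs = [1, 2, 3, 4, 5]
ys = [52, 58, 61, 67, 72]

# Step 1: means of x and y
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Steps 2-3: deviations, their products, and squared x deviations
dev_x = [x - mean_x for x in xs]
dev_y = [y - mean_y for y in ys]

# Steps 4-5: sum the products and the squared x deviations, then divide
sum_xy = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
sum_xx = sum(dx * dx for dx in dev_x)
m = sum_xy / sum_xx

# Step 6: intercept from the means and the slope
b = mean_y - m * mean_x

# Step 7: predict for a new x query
def predict(x):
    return m * x + b

print(m, b, predict(6))  # slope, intercept, forecast for 6 hours
```

For this toy data the slope works out to 4.9 and the intercept to 47.3, so each extra hour of study is predicted to add about 4.9 points.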
Important Rules & Conventions
- Exam Trick 1: Draw a 6-column table! Label them: X, Y, X-Mean, Y-Mean, (X-Mean)², and (X-Mean)*(Y-Mean). This makes the math infinitely easier to track and prevents silly calculator mistakes.
- Exam Trick 2: The sum of your simple deviations (X-Mean) should ALWAYS equal exactly 0. If you add up that column and get 3.5, you calculated your mean incorrectly. Stop and fix it!
- The math you are doing manually is called the 'Least Squares Method'. It guarantees that the line you draw minimizes the Mean Squared Error (MSE) across all points.
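The 6-column exam table and the deviation-sum sanity check can both be automated. This is a minimal sketch using the same made-up dataset as before:

```python
# Build the 6-column table from Exam Trick 1 and apply the
# sanity check from Exam Trick 2 (deviation columns sum to 0).
xs = [1, 2, 3, 4, 5]
ys = [52, 58, 61, 67, 72]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

rows = []
for x, y in zip(xs, ys):
    dx, dy = x - mean_x, y - mean_y
    rows.append((x, y, dx, dy, dx * dx, dx * dy))

# Sanity check: the simple deviation columns must sum to (almost) zero;
# if they don't, the mean was miscalculated.
assert abs(sum(r[2] for r in rows)) < 1e-9
assert abs(sum(r[3] for r in rows)) < 1e-9

header = ("X", "Y", "X-Mean", "Y-Mean", "(X-M)^2", "(X-M)(Y-M)")
print(("{:>11}" * 6).format(*header))
for r in rows:
    print(("{:>11}" * 6).format(*r))
```

The column totals of the last two columns are exactly the numerator and denominator of the slope formula.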
Advantages
- ✓ Extremely simple to implement, calculate manually, and explain the results to non-technical people.
- ✓ Trains incredibly fast and doesn't require massive computational power.
- ✓ The slope (m) gives you instant insight: its magnitude and sign tell you how strongly, and in which direction, that feature drives the outcome.
Disadvantages
- × Assumes Linearity: It forces a straight line. If the real-world relationship is curved (like exponential population growth), this model will perform terribly.
- × Sensitivity to Outliers: Because it minimizes squared errors, one massive anomaly (like a billionaire in a dataset of average incomes) will drag the entire line away from the actual trend.
- × Struggles with 'Multicollinearity' in its multi-feature form (Multiple Linear Regression). If several input features are highly correlated with each other, the model cannot tell which feature is actually causing the output to change, and the individual slopes become unreliable.
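The outlier problem is easy to demonstrate: refitting the same least-squares formulas after adding one extreme point drags the slope far from the true trend. A minimal sketch with fabricated numbers:

```python
# Illustration of outlier sensitivity: one extreme point
# drastically changes the least-squares slope.
def fit(xs, ys):
    """Least-squares slope and intercept via the deviation formulas."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    m = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return m, my - m * mx

xs = [1, 2, 3, 4, 5]
ys = [10, 20, 30, 40, 50]      # perfectly linear: true slope is 10
m_clean, _ = fit(xs, ys)

xs_out = xs + [6]
ys_out = ys + [500]            # one massive anomaly
m_out, _ = fit(xs_out, ys_out)

print(m_clean, m_out)  # the single outlier inflates the slope severely
```

On the clean data the slope is exactly 10; with the single anomaly added it jumps to roughly 72.9, pulling the whole line away from the other five points.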
Summary
Linear Regression is the 'Hello World' of predictive math. By using the Least Squares method, it finds the one straight line that minimizes the squared prediction errors. While it struggles with complex, curved, or outlier-heavy data, its absolute simplicity makes it an essential tool for trend analysis and forecasting.