- Data Preprocessing
  - Import the dataset
  - Clean the dataset
    - Handle missing data
    - Encode categorical data
  - Split the dataset into the Training set and Test set
  - Feature scaling
- Modelling
  - Build a model
  - Train the model
  - Test the model
  - Make Predictions
- Evaluation
  - Calculate performance metrics
    - Confusion Matrix
    - Accuracy
    - Precision
  - Make a verdict
To split the data into training and test sets, we can use the train_test_split function from the sklearn.model_selection module. Here's an example:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

This call splits the dataset into two parts: the training set and the test set. The test_size parameter specifies the proportion of the dataset that should be included in the test set. In this case, we are using 20% of the data for testing. The random_state parameter seeds the shuffling so that the data is split in the same way each time the code is run.
The training set is used to train the machine learning model. It contains a set of input features (X_train) and the corresponding target labels (y_train). For example: 80% of the data is used for training the model.
- We use the training set to build the model and adjust its parameters to make accurate predictions. Then, we make a linear regression model that can predict the price of a house based on its size.
The test set is used to evaluate the performance of the trained model. It contains a set of input features (X_test) and the corresponding target labels (y_test). For example: 20% of the data is used for testing the model.
- We use the test set to evaluate the model's performance on unseen data. This helps us understand how well the model generalizes to new data and whether it is overfitting or underfitting.
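The split described above can be tried on a small array. This is a minimal sketch; the toy data (10 samples with 2 features) is hypothetical and only serves to show the 80/20 proportions and the effect of random_state.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)  # 8 training rows and 2 test rows

# Repeating the call with the same random_state reproduces the same split
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
print(np.array_equal(X_test, X_test2))
```

Changing random_state (or omitting it) would generally produce a different selection of rows.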
Equation: y = b0 + b1 * x1 + b2 * x2 + ... + bn * xn
Note: Based on the comparison between the predicted value of y and the actual value of y, we can evaluate the performance of the model. We can use metrics such as the Mean Squared Error (MSE) or the R-squared value to assess the model's accuracy.
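As a rough illustration of the two metrics mentioned in the note, the sketch below computes MSE and R-squared by hand with NumPy. The actual and predicted values are made-up numbers, not results from any model in these notes.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # hypothetical actual values
y_pred = np.array([2.5, 5.5, 7.0, 8.0])  # hypothetical model predictions

# Mean Squared Error: average of the squared prediction errors
mse = np.mean((y_true - y_pred) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, r2)
```

A lower MSE and an R-squared closer to 1 both indicate predictions that are closer to the actual values.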
Feature scaling is a technique used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
Example:
| X1 | X2 | X3 | X4 |
|---|---|---|---|
| 1000 | 1 | 0.5 | 10 |
| 2000 | 2 | 1 | 20 |
| 3000 | 3 | 1.5 | 30 |
| 4000 | 4 | 2 | 40 |
| ... | ... | ... | ... |
Feature scaling always applies to the columns of the dataset, not the rows. It is important to apply feature scaling to ensure that all features contribute equally to the model training process and prevent any one feature from dominating the others.
- Standardization (Z-score normalization)
Standardization scales the data to have a mean of 0 and a standard deviation of 1. It is calculated as follows:
X_std = (X - X_mean) / X_std_dev
Here X is the feature value, X_mean is the mean (average) of the feature values, and X_std_dev is their standard deviation, a measure of how much the values vary around the mean:
mean = sum(X) / n, std_dev = sqrt(sum((X - mean)^2) / n)
- The result is not bounded. For roughly normally distributed data, about 99.7% of the standardized values fall between -3 and +3.
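The standardization formula can be sketched directly in NumPy. The values below are the X1 column from the earlier example table; the variable names are illustrative only.

```python
import numpy as np

x = np.array([1000.0, 2000.0, 3000.0, 4000.0])  # X1 column from the example

mean = x.sum() / x.size
std_dev = np.sqrt(((x - mean) ** 2).sum() / x.size)  # population standard deviation

x_std = (x - mean) / std_dev
print(x_std)
print(x_std.mean(), x_std.std())  # approximately 0 and 1
```

After the transformation the column has mean 0 and standard deviation 1, regardless of its original units.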
- Normalization (Min-Max scaling)
Normalization scales the data to a fixed range, usually between 0 and 1. It is calculated as follows:
X_norm = (X - X_min) / (X_max - X_min)
- Its value lies between 0 and 1, where X is the feature value, X_min is the minimum value of the feature, and X_max is the maximum value of the feature.
Min-Max scaling is often used when the data does not follow a Gaussian distribution. It is sensitive to outliers, because a single extreme value sets X_min or X_max and compresses all other values into a narrow band.
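To make the outlier sensitivity concrete, here is a small sketch. The helper min_max and the extreme income of 1,000,000 are assumptions added for illustration; they are not part of the original dataset.

```python
import numpy as np

def min_max(x):
    """Scale a 1-D array to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

incomes = np.array([50000.0, 60000.0, 67000.0, 70000.0, 80000.0, 90000.0])
with_outlier = np.append(incomes, 1_000_000.0)  # hypothetical extreme value

print(min_max(incomes))       # values spread over the whole [0, 1] range
print(min_max(with_outlier))  # original values squeezed close to 0
```

With the outlier present, the six original incomes all land below 0.05, so differences between them become much harder for a model to use.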
Let's consider a simple dataset containing the annual income and age of several people.
| Annual Income (in $) | Age |
|---|---|
| 50000 | 25 |
| 60000 | 30 |
| 67000 | 32 |
| 70000 | 39 |
| 80000 | 25 |
| 90000 | 35 |
The task is to identify which people in the dataset are most similar to a given person, say one with a salary of 60000 and an age of 30. This is where clustering algorithms such as K-means are used to group similar data points together.
We can use feature scaling here to ensure that both the features (Annual Income and Age) contribute equally to the clustering process. If we don't scale the features, the model may give more weight to the feature with a larger range, which can affect the clustering results.
After applying Normalization, the data is scaled to a fixed range between 0 and 1. This ensures that both features contribute equally to the clustering process.
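Applying the formula
X_norm = (X - X_min) / (X_max - X_min)
to each column gives the following result (rounded to three decimals):
| Annual Income (normalized) | Age (normalized) |
|---|---|
| 0.000 | 0.000 |
| 0.250 | 0.357 |
| 0.425 | 0.500 |
| 0.500 | 1.000 |
| 0.750 | 0.000 |
| 1.000 | 0.714 |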
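The normalized values for the income and age data can be computed column by column, as in the sketch below. The nearest-neighbour lookup at the end is an added illustration of "similarity" using plain Euclidean distance; it is not K-means itself.

```python
import numpy as np

# Columns: annual income, age (the six people from the example)
data = np.array([
    [50000, 25],
    [60000, 30],
    [67000, 32],
    [70000, 39],
    [80000, 25],
    [90000, 35],
], dtype=float)

col_min = data.min(axis=0)
col_max = data.max(axis=0)
scaled = (data - col_min) / (col_max - col_min)
print(np.round(scaled, 3))

# Which person is closest to the one with income 60000 and age 30 (row 1)?
query = 1
dist = np.linalg.norm(scaled - scaled[query], axis=1)
dist[query] = np.inf  # ignore the query row itself
nearest = int(np.argmin(dist))
print(data[nearest])
```

On the scaled data both columns contribute to the distance; on the raw data the income differences (thousands) would swamp the age differences (single digits).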
- Categorical Variables
- Nominal: No order (e.g., colors, countries)
- Ordinal: Order matters (e.g., low, medium, high)
- Binary: Two categories (e.g., yes/no, true/false)
- Dummy: Represents the presence or absence of a category with a binary indicator
  - 0: Absence of a category
  - 1: Presence of a category
- Numerical Variables
- Discrete: Countable (e.g., number of cars)
- Continuous: Can take any value within a range (e.g., height, weight)
- Interval: No true zero (e.g., temperature in Celsius)
- Ratio: True zero (e.g., weight, height)
- Text Variables
- Unstructured data (e.g., reviews, comments)
- Requires text preprocessing (e.g., tokenization, stemming)
- Date/Time Variables
- Time series data (e.g., stock prices, weather data)
- Extract features (e.g., day of the week, month)
- Mixed Variables
- Combination of different types (e.g., address, name)
- Requires feature engineering (e.g., one-hot encoding, feature extraction)
Sample Data:
Data.csv
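| Country | Age | Salary | Purchased |
|---|---|---|---|
| France | 44 | 72000 | No |
| Spain | 27 | 48000 | Yes |
| Germany | 30 | 54000 | No |
| Spain | 38 | 61000 | No |
| Germany | 40 | 63777.78 | Yes |
| France | 35 | 58000 | Yes |
| Spain | 38.77777778 | 52000 | No |
| France | 48 | 79000 | Yes |
| Germany | 50 | 83000 | No |
| France | 37 | 67000 | Yes |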
- Independent Variables (Features)
- Input variables
- Used to predict the dependent variable
- Denoted as X
- Example: In the above example, Country, Age, and Salary are independent variables.
- Dependent Variable (Target)
- Output variable
- Predicted by the independent variables
- Denoted as y
- Example: In the above example, Purchased is the dependent variable. We want to predict whether a person will purchase a product based on the independent variables (Country, Age, Salary).
import pandas as pd

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values  # select all rows and every column except the last one
y = dataset.iloc[:, -1].values  # select all rows and only the last column

Missing data is a common problem in real-world datasets. It can occur for various reasons, such as data corruption, human error, or system failures. Handling missing data is an essential step in the data preprocessing pipeline to ensure the quality and reliability of the data.
- Remove Rows with Missing Data
- Pros: Simple and straightforward
- Cons: May lose valuable information
- Impute Missing Data
- Pros: Retains valuable information
- Cons: May introduce bias or inaccuracies
- Use Advanced Techniques
- Pros: Handles missing data more effectively
- Cons: Requires more computational resources
- Mean/Median Imputation
- Replace missing values with the mean or median of the column
- Suitable for continuous numerical data
Example:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X[:, 1:] = imputer.fit_transform(X[:, 1:])
- Mode Imputation
- Replace missing values with the mode (most frequent value) of the column
- Suitable for categorical data
Example:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
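X[:, [0]] = imputer.fit_transform(X[:, [0]])  # SimpleImputer expects a 2D array, so keep the column dimension
- K-Nearest Neighbors (KNN) Imputation
- Replace missing values with the average of the K-nearest neighbors
- Suitable for numerical data (scikit-learn's KNNImputer does not accept string categories, so categorical columns must be encoded first)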
Example:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
X[:, 1:] = imputer.fit_transform(X[:, 1:])  # numeric columns only; the Country column is skipped
- Multiple Imputation
- Generate multiple imputed datasets and combine the results
- Suitable for complex datasets with missing data
Example:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(max_iter=10, random_state=0)
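X[:, 1:] = imputer.fit_transform(X[:, 1:])  # numeric columns only; this returns a single imputed dataset
# To obtain several imputed datasets, rerun with sample_posterior=True and different random_state values
- Predictive Imputation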
- Use machine learning models to predict missing values
- Suitable for datasets with complex relationships
Example:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(max_iter=10, random_state=0)
X[:, 1:] = imputer.fit_transform(X[:, 1:])  # numeric columns only; each feature is predicted from the others

Categorical data refers to data that represents categories or groups. It can take on a limited number of values and is often represented as text or labels. Most machine learning models require numerical input, so we need to encode categorical data into a numerical format.
Example:
| Country | Age | Salary | Purchased |
|---|---|---|---|
| France | 44 | 72000 | No |
| Spain | 27 | 48000 | Yes |
| Germany | 30 | 54000 | No |
| Spain | 38 | 61000 | No |
| Germany | 40 | 63777.78 | Yes |
| France | 35 | 58000 | Yes |
| Spain | 38.77777778 | 52000 | No |
| France | 48 | 79000 | Yes |
In the above example, Country and Purchased are categorical variables, so they must be encoded numerically before they can be used in a machine learning model. The country names (France, Spain, Germany) are nominal categories with no inherent order, so the appropriate technique for them is One-Hot Encoding.
One-Hot Encoding is a technique used to convert categorical data into a numerical format. It creates binary columns for each category and assigns a 1 or 0 to indicate the presence of a category.
from sklearn.compose import ColumnTransformer
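from sklearn.preprocessing import OneHotEncoder

Example:
| Country_France | Country_Spain | Country_Germany | Purchased |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 1 |
| 1 | 0 | 0 | 1 |
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 |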
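To show what One-Hot Encoding does mechanically, here is a minimal NumPy-only sketch. It is not how scikit-learn implements the encoder; note also that np.unique sorts the categories alphabetically, so the columns come out in the order France, Germany, Spain.

```python
import numpy as np

countries = np.array(
    ["France", "Spain", "Germany", "Spain", "Germany", "France", "Spain", "France"]
)

# categories: sorted unique labels; codes: index of each row's label in categories
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(one_hot)
```

Each row contains exactly one 1, in the column that corresponds to that row's country.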
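After applying One-Hot Encoding to the categorical variables, we need to watch out for multicollinearity. Multicollinearity occurs when two or more independent variables are highly correlated, leading to redundancy in the data. The binary columns created by One-Hot Encoding always sum to 1, so any one of them can be derived from the others.
Including all of these columns in a model with an intercept is known as the dummy variable trap. To avoid it, drop one of the binary columns, for example by creating the encoder as OneHotEncoder(drop='first'). The example that follows keeps all three columns so that the full encoding is visible.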
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 0 (Country) and pass the remaining columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

Output:
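| Country_France | Country_Germany | Country_Spain | Age | Salary | Purchased |
|---|---|---|---|---|---|
| 1.0 | 0.0 | 0.0 | 44 | 72000 | No |
| 0.0 | 0.0 | 1.0 | 27 | 48000 | Yes |
| 0.0 | 1.0 | 0.0 | 30 | 54000 | No |
| 0.0 | 0.0 | 1.0 | 38 | 61000 | No |
| 0.0 | 1.0 | 0.0 | 40 | 63777.78 | Yes |
| 1.0 | 0.0 | 0.0 | 35 | 58000 | Yes |
| 0.0 | 0.0 | 1.0 | 38.77777778 | 52000 | No |
| 1.0 | 0.0 | 0.0 | 48 | 79000 | Yes |
| 0.0 | 1.0 | 0.0 | 50 | 83000 | No |
| 1.0 | 0.0 | 0.0 | 37 | 67000 | Yes |
OneHotEncoder orders the categories alphabetically (France, Germany, Spain). The Purchased column is shown for reference only; it is held in y, not in X.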
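The dummy variable trap mentioned earlier can be avoided by asking the encoder to drop one column. The sketch below uses OneHotEncoder with drop="first" on a small hypothetical Country column; the four sample rows are chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column input (the encoder expects a 2D array)
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder(drop="first")  # drops the first category (France)
encoded = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # categories in sorted order
print(encoded)                 # remaining columns: Germany, Spain
```

France is now represented by a row of zeros, so no information is lost while the redundant column is removed.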
For the dependent variable, we can use Label Encoding to convert categorical labels into numerical format.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

Output:
0
1
0
0
1
1
0
1
0
1
Exercise:
titanic.csv
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0,D56,S
24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,A6,S
28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
53,1,1,"Harper, Mrs. Henry Sleeper (Myna Haxtun)",female,49.0,1,0,PC 17572,76.7292,D33,C
55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C
# Importing the necessary libraries
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# Load the dataset
dataset = pd.read_csv('titanic.csv')
# Identify the categorical data
categorical_features = ['Sex', 'Embarked', 'Pclass']
# categorical_indices = [dataset.columns.get_loc(col) for col in categorical_features]
# Implement an instance of the ColumnTransformer class
ct = ColumnTransformer(
transformers=[
('encoder', OneHotEncoder(), categorical_features)
],
remainder='passthrough'
)
# Apply the fit_transform method on the instance of ColumnTransformer
X = ct.fit_transform(dataset)
# Convert the output into a NumPy array
X = np.array(X)
# Use LabelEncoder to encode binary categorical data
le = LabelEncoder()
y = le.fit_transform(dataset['Survived'])
# Print the updated matrix of features and the dependent variable vector
print("Updated matrix of features: \n", X)
print("Updated dependent variable vector: \n", y)

To evaluate the performance of a machine learning model, we need to split the dataset into two parts: the training set and the test set. The training set is used to train the model, while the test set is used to evaluate its performance on unseen data.
Q. Do we need to apply feature scaling before splitting the dataset into the training set and test set or after splitting the dataset?
A. Apply feature scaling after splitting the dataset into the training set and test set. The scaler should be fitted on the training set only and then used to transform both the training set and the test set. If the scaler is fitted before the split, statistics computed from the test rows (such as the mean and standard deviation) influence the training data. This leaks information from the test set and leads to optimistically biased results.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Exercise:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Load the Iris dataset
data = pd.read_csv('iris.csv')
# Separate features and target
X = data.iloc[:, :-1]
Y = data.iloc[:, -1]
# Split the dataset into an 80-20 training-test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
# Apply feature scaling on the training and test sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on the training set only: learn its mean and standard deviation, then scale it
X_test = scaler.transform(X_test)
# Print the scaled training and test sets
print(X_train)
print(X_test)

Feature scaling is a technique used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
- Machine Learning Algorithms
- Many machine learning algorithms use Euclidean distance to make predictions.
- Features with larger scales may dominate the distance calculation.
- Feature scaling ensures that all features contribute equally to the model.
- Convergence Speed
- Gradient descent converges faster on scaled features.
- It reduces the number of iterations required to reach the minimum.
- Regularization
- Regularization techniques like Lasso and Ridge are sensitive to feature scales.
- Feature scaling puts the coefficients on a comparable scale, so the regularization penalty treats all features evenly.
- Distance-Based Algorithms
- Distance-based algorithms like K-Nearest Neighbors (KNN) are sensitive to feature scales.
- Feature scaling ensures that the algorithm is not biased towards features with larger scales.
- Standardization (Z-score normalization)
- Scales the data to have a mean of 0 and a standard deviation of 1.
- Suitable for algorithms that assume normally distributed data.
- Equation:
z = (x - mean) / std_dev
- mean is the mean of the data.
- std_dev is the standard deviation of the data. It is calculated as the square root of the variance, which is the average of the squared differences from the mean.
- The result is unbounded; for roughly Gaussian data, most values fall between -3 and +3.
- It works well when the data follows a Gaussian distribution. It is less sensitive to outliers than Min-Max scaling, although extreme values still shift the mean and the standard deviation.
- Normalization (Min-Max scaling)
- Scales the data to a fixed range, usually between 0 and 1.
- Suitable for algorithms that require data to be on the same scale.
- Equation:
x_norm = (x - min) / (max - min)
- min is the minimum value of the data.
- max is the maximum value of the data.
- On the data used to compute min and max, the result always lies in the range 0 to 1.
- Min-Max scaling is a common choice when the data does not follow a Gaussian distribution. It is sensitive to outliers, that is, data points that differ significantly from the rest of the dataset, because they determine min and max.
- Robust Scaling
- Scales the data based on the interquartile range (IQR).
- Suitable for datasets with outliers.
- Equation:
x_robust = (x - median) / (Q3 - Q1)
- median is the second quartile (50th percentile).
- Q1 is the first quartile (25th percentile).
- Q3 is the third quartile (75th percentile).
- MaxAbs Scaling
- Scales the data based on the maximum absolute value.
- Suitable for sparse datasets.
- Equation:
x_maxabs = x / max(abs(x))
- max(abs(x)) is the maximum absolute value of the data.
- Quantile Transformation
- Maps the data to a uniform or normal distribution.
- Suitable for non-Gaussian data.
- Equation:
x_quantile = F(x)
- F(x) is the cumulative distribution function of the data.
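Two of the less common scalers listed above can be sketched in a few lines of NumPy. This sketch centres robust scaling on the median and divides by the interquartile range, which is also how scikit-learn's RobustScaler behaves by default; the sample feature with one outlier is hypothetical.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # hypothetical feature with one outlier

# Robust scaling: centre on the median, divide by the interquartile range
q1, median, q3 = np.percentile(x, [25, 50, 75])
x_robust = (x - median) / (q3 - q1)

# MaxAbs scaling: divide by the largest absolute value
x_maxabs = x / np.max(np.abs(x))

print(x_robust)
print(x_maxabs)
```

With robust scaling the four ordinary values stay evenly spaced around 0 and only the outlier is far away, whereas MaxAbs scaling squeezes the ordinary values towards 0.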
Example:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:]) # Apply feature scaling to the training set. It will scale the columns from 3 to the end.
X_test[:, 3:] = sc.transform(X_test[:, 3:])  # Apply feature scaling to the test set. It will scale the columns from 3 to the end.

Note:
We are performing fit_transform on the training set and transform on the test set. This is because we want to scale the test set based on the parameters learned from the training set. This ensures that the scaling is consistent across both sets. If we were to use fit_transform on the test set, it would learn new parameters and the scaling would be different from the training set.
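The difference between fit_transform and transform can be illustrated without scikit-learn. The sketch below is a hand-written stand-in for what a standard scaler does, with hypothetical one-column data: the mean and standard deviation are learned from the training set and then reused for the test set.

```python
import numpy as np

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])  # hypothetical training feature
X_test = np.array([[5.0], [6.0]])                 # hypothetical test feature

# "fit": learn the parameters from the training set only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# "transform": apply the same parameters to both sets
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```

The scaled test values are not centred on 0, which is expected: they are expressed in the units of the training distribution, exactly as unseen data would be in production.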
Coding exercise: Feature scaling for Machine Learning
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load the Wine Quality Red dataset
dataset = pd.read_csv('winequality-red.csv', delimiter=';')
# Separate features and target
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]
# Split the dataset into an 80-20 training-test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an instance of the StandardScaler class
scaler = StandardScaler()
# Fit the StandardScaler on the features from the training set and transform it
X_train = scaler.fit_transform(X_train)
# Apply the transform to the test set
X_test = scaler.transform(X_test)
# Print the scaled training and test datasets
print(X_train)
print(X_test)
Regression is a fundamental concept in statistics and machine learning, used to model and analyze relationships between variables. The primary goal of regression is to predict the value of a dependent variable (often called the "target" or "outcome") based on one or more independent variables (also known as "predictors" or "features"). Regression models are particularly useful when the dependent variable is continuous, like predicting house prices, stock prices, or temperatures.
- Dependent Variable: The outcome or the variable you are trying to predict. It's also known as the response variable.
- Independent Variables: The predictors or features that are used to predict the dependent variable.
- Regression Line: A line that best fits the data points in a regression model. In simple linear regression, it is represented as y=mx+b, where m is the slope, and b is the y-intercept.
- Residuals: The differences between observed and predicted values. Residuals are used to assess the accuracy of the regression model.
- One independent variable
- Linear relationship between variables
- Equation:
y = b0 + b1 * x1
  - y is the dependent variable
  - x1 is the independent variable
  - b0 is the y-intercept
  - b1 is the slope of the line
- Ordinary Least Squares (OLS) minimizes the sum of squared differences between observed and predicted values
- It finds the best-fitting line, that is, the line with the smallest sum of squared residuals
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)  # estimates b0 and b1 by ordinary least squares

Formula for OLS
b1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
b0 = y_mean - b1 * x_mean
y = b0 + b1 * x1 where b0 is the y-intercept and b1 is the slope of the line.
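The OLS formulas can be checked numerically. The sketch below applies them to a small made-up dataset and compares the result with np.polyfit, which fits the same straight line; the data values are hypothetical.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor values
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # hypothetical observed values

x_mean, y_mean = x.mean(), y.mean()

# Closed-form OLS estimates for simple linear regression
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b0 = y_mean - b1 * x_mean

# Cross-check against NumPy's least-squares polynomial fit of degree 1
slope, intercept = np.polyfit(x, y, 1)

print(b0, b1)
print(intercept, slope)
```

Both approaches should agree up to floating-point rounding, since they minimize the same sum of squared residuals.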
Multiple Linear Regression is a statistical method used to understand the relationship between one dependent variable (the outcome you want to predict) and two or more independent variables (the predictors or factors that influence the outcome). It’s like extending simple linear regression, which deals with just one predictor, to handle multiple predictors.
- Dependent Variable (Target/Outcome): This is the variable you want to predict or explain. For example, predicting house prices.
- Independent Variables (Predictors/Features): These are the variables that are used to predict the dependent variable. For example, predicting house prices based on the size of the house, number of bedrooms, location, etc.
- Equation: The relationship can be represented with a linear equation:
\[ y = b_0 + b_1x_1 + b_2x_2 + \ldots + b_nx_n + \epsilon \]
  - \( y \): Dependent variable (e.g., house price)
  - \( x_1, x_2, \ldots, x_n \): Independent variables (e.g., size, bedrooms, location)
  - \( b_0 \): Intercept (the value of \( y \) when all \( x \)'s are zero)
  - \( b_1, b_2, \ldots, b_n \): Coefficients (they show how much \( y \) changes with a change in each \( x \))
  - \( \epsilon \): Error term (captures the variation in \( y \) not explained by the predictors)
- Objective: The goal is to find the best-fitting line (or plane, or hyperplane) that minimizes the difference between the predicted values and the actual values of the dependent variable. This is done by adjusting the coefficients \( b_1, b_2, \ldots, b_n \).
- Fit the Model: The model is trained on a dataset by finding the best coefficients that minimize the difference between the predicted values and actual values. This is usually done using a method called Least Squares, which minimizes the sum of squared errors.
- Prediction: Once the coefficients are determined, you can use the model to make predictions on new data by plugging the values of the independent variables into the equation.
Suppose you want to predict the price of a house based on its size (in square feet), the number of bedrooms, and the distance to the city center.
| Size (sq ft) | Bedrooms | Distance to City Center (miles) | House Price ($) |
|---|---|---|---|
| 2000 | 3 | 5 | 300,000 |
| 1600 | 2 | 3 | 250,000 |
| 2400 | 4 | 8 | 350,000 |
| 3000 | 4 | 6 | 400,000 |
- Define the variables:
  - Dependent variable: House Price \( y \)
  - Independent variables: Size \( x_1 \), Bedrooms \( x_2 \), Distance to City Center \( x_3 \)
- Formulate the equation:
\[ \text{House Price} = b_0 + b_1 \times \text{Size} + b_2 \times \text{Bedrooms} + b_3 \times \text{Distance to City Center} \]
- Training the Model:
  - Using the given data, the model learns the coefficients \( b_0, b_1, b_2, b_3 \) that best predict the house prices.
- Making Predictions:
  - For a house that is 2200 sq ft, with 3 bedrooms, and 4 miles from the city center, plug the values into the equation:
\[ \text{Predicted Price} = b_0 + b_1 \times 2200 + b_2 \times 3 + b_3 \times 4 \]
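The steps above can be sketched with NumPy's least-squares solver on the four houses from the table. This is only an illustration: with four rows and four coefficients the fit passes exactly through the training data, so the resulting coefficients say little about real housing markets.

```python
import numpy as np

# Columns: size (sq ft), bedrooms, distance to city center (miles)
features = np.array([
    [2000, 3, 5],
    [1600, 2, 3],
    [2400, 4, 8],
    [3000, 4, 6],
], dtype=float)
prices = np.array([300000, 250000, 350000, 400000], dtype=float)

# Prepend a column of ones so that the first coefficient is the intercept b0
A = np.column_stack([np.ones(len(features)), features])
coef, *_ = np.linalg.lstsq(A, prices, rcond=None)
b0, b1, b2, b3 = coef

# Predict the price of a 2200 sq ft house with 3 bedrooms, 4 miles from the center
new_house = np.array([1.0, 2200, 3, 4])
predicted_price = new_house @ coef

print(coef)
print(predicted_price)
```

In practice the same fit would usually be done with many more rows than coefficients, for example via LinearRegression from scikit-learn.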
- Linearity: The relationship between dependent and independent variables is linear.
- Independence: The observations are independent of each other.
- Homoscedasticity: The variance of residual errors is constant across all levels of the independent variables.
- Normality: The residuals (errors) of the model should be normally distributed.
- No Multicollinearity: The independent variables should not be too highly correlated with each other.
- Advantages:
  - Easy to understand and implement.
  - Works well when the relationship is linear and the assumptions are met.
  - Can handle multiple predictors.
- Limitations:
  - Not suitable for non-linear relationships.
  - Performance can be affected if the assumptions are violated.
  - Sensitive to outliers, which can skew the results.
- Linearity (linear relationship between Y and each X): The relationship between the independent and dependent variables is linear.
  - The relationship between the independent and dependent variables can be represented by a straight line.
  - It can be checked using a scatter plot of the data.
- Homoscedasticity (equal variance): The variance of the residuals is constant.
  - The residuals are equally spread across all values of the independent variables.
  - The spread of the residuals does not depend on the independent variables.
  - It can be checked using a scatter plot of the residuals against the predicted values.
- Multivariate Normality (normality of error distribution): The residuals are normally distributed.
  - The residuals have a mean of zero.
  - It can be checked using a Q-Q plot or a histogram of the residuals.
- Independence (of observations; includes "no autocorrelation"): The residuals are independent of each other.
  - There is no correlation between the residuals.
  - It is important to check for autocorrelation in time series data.
  - It can be checked by plotting the residuals in observation order or with the Durbin-Watson test.
- Lack of Multicollinearity (predictors are not correlated with each other): The independent variables are not highly correlated with each other.
  - Multicollinearity can lead to unstable coefficients and inaccurate predictions.
  - It can be checked using a correlation matrix or a variance inflation factor (VIF).
- The outlier check: The presence of outliers can significantly affect the regression model.
  - Outliers are data points that are significantly different from other data points in the dataset.
  - Outliers can affect the regression coefficients and the model's accuracy.
  - It can be checked using box plots or scatter plots of the data.
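As one example of these checks, the variance inflation factor mentioned for multicollinearity can be computed with NumPy alone. The function vif below is a hypothetical helper written for this sketch, and the three synthetic predictors are generated so that x3 is nearly a copy of x1.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns (with an intercept)
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        residuals = target - A @ coef
        r2 = 1 - residuals.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)  # nearly a copy of x1

vif_values = vif(np.column_stack([x1, x2, x3]))
print(vif_values)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity; here x1 and x3 should stand out while x2 stays close to 1.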
Multiple Linear Regression is an extension of Simple Linear Regression that allows us to model the relationship between two or more independent variables and a continuous dependent variable. It is used to predict the value of the dependent variable based on the values of the independent variables.
Table:
| Profit | R&D Spend | Administration | Marketing Spend | State |
|---|---|---|---|---|
| 192261.83 | 165349.20 | 136897.80 | 471784.10 | New York |
| 191792.06 | 162597.70 | 151377.59 | 443898.53 | California |
| 191050.39 | 153441.51 | 101145.55 | 407934.54 | New York |
| 182901.99 | 144372.41 | 118671.85 | 383199.62 | New York |
| 166187.94 | 142107.34 | 91391.77 | 366168.42 | New York |
| 156991.12 | 131876.90 | 99814.71 | 362861.36 | New York |
| 156122.51 | 130298.13 | 145530.06 | 127716.82 | California |
| 155752.60 | 120542.52 | 148718.95 | 311613.29 | California |
Here, profit is the dependent variable, and R&D Spend, Administration, Marketing Spend, and State are the independent variables. The goal is to predict the profit based on the values of the independent variables.
- Expanding the table for categorical data (State) using One-Hot Encoding:
| Profit | R&D Spend | Administration | Marketing Spend | New York | California |
|---|---|---|---|---|---|
| 192261.83 | 165349.20 | 136897.80 | 471784.10 | 1 | 0 |
| 191792.06 | 162597.70 | 151377.59 | 443898.53 | 0 | 1 |
| 191050.39 | 153441.51 | 101145.55 | 407934.54 | 1 | 0 |
| 182901.99 | 144372.41 | 118671.85 | 383199.62 | 1 | 0 |
| 166187.94 | 142107.34 | 91391.77 | 366168.42 | 1 | 0 |
| 156991.12 | 131876.90 | 99814.71 | 362861.36 | 1 | 0 |
| 156122.51 | 130298.13 | 145530.06 | 127716.82 | 0 | 1 |
| 155752.60 | 120542.52 | 148718.95 | 311613.29 | 0 | 1 |
Here, New York and California are dummy variables created using One-Hot Encoding. They represent the presence or absence of the State in the dataset. The model will learn the coefficients for these variables to predict the profit based on the State.
The dummy variable trap is a situation in which the dummy variables are perfectly collinear: because every row belongs to exactly one category, the dummy columns always sum to 1, so each one can be derived from the others and from the intercept. This multicollinearity makes the regression coefficients unstable or non-unique and can affect the model's accuracy. To avoid the dummy variable trap, one of the dummy variables should be dropped from the model. This can be done by using the drop_first=True parameter in the get_dummies function in pandas.
# Drop the first dummy variable to avoid the dummy variable trap
X = pd.get_dummies(X, drop_first=True)

By dropping one of the dummy variables, we ensure that the model does not include redundant information and avoids multicollinearity.
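The redundancy behind the dummy variable trap shows up as a rank-deficient design matrix. The sketch below builds the intercept and State dummy columns for the eight rows of the table and compares the matrix rank with and without the California column; it is an illustration, not part of the original workflow.

```python
import numpy as np

# 1 = New York, 0 = California, in the row order of the table
new_york = np.array([1, 0, 1, 1, 1, 1, 0, 0], dtype=float)
california = 1 - new_york
intercept = np.ones_like(new_york)

full = np.column_stack([intercept, new_york, california])  # all dummies kept
reduced = np.column_stack([intercept, new_york])           # one dummy dropped

print(np.linalg.matrix_rank(full), full.shape[1])        # rank is lower than the column count
print(np.linalg.matrix_rank(reduced), reduced.shape[1])  # rank equals the column count
```

When the rank is lower than the number of columns, least squares has no unique solution for the coefficients, which is exactly what dropping one dummy variable fixes.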
The p-value is a statistical measure that helps us decide whether to reject a null hypothesis in a hypothesis test. It's essentially a way to measure the strength of the evidence against the null hypothesis.
- Null Hypothesis (H₀): This is the default assumption that there is no effect or no difference. In regression analysis, the null hypothesis for a coefficient is that it equals zero (meaning the independent variable has no effect on the dependent variable).
- Alternative Hypothesis (H₁): This is the opposite of the null hypothesis. It suggests that there is an effect or a difference. For a regression coefficient, it means the coefficient is not zero (the independent variable does affect the dependent variable).
- P-value: The p-value is the probability, ranging from 0 to 1, of observing data at least as extreme as what was actually observed, assuming the null hypothesis is true. It helps us determine whether the observed data is consistent with the null hypothesis.
  - Low p-value (< 0.05): This suggests that the observed data is unlikely under the null hypothesis, providing strong evidence against the null hypothesis. We usually reject the null hypothesis in this case.
  - High p-value (> 0.05): This suggests that the observed data is consistent with the null hypothesis, so there is not enough evidence against it. We usually fail to reject the null hypothesis in this case.
Let's say we want to test if a coin is fair (i.e., it has an equal chance of landing heads or tails).
- Null Hypothesis (H₀): The coin is fair (p = 0.5 for heads).
- Alternative Hypothesis (H₁): The coin is biased (p ≠ 0.5).
Suppose you toss the coin 100 times, and it lands heads 65 times. An exact two-sided binomial test gives a p-value of about 0.0035:
- Since 0.0035 is less than 0.05, you reject the null hypothesis and conclude that the coin is biased.
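For readers who want to see where a p-value for the coin example comes from, here is a minimal sketch of an exact two-sided binomial test using only the standard library. The function name binomial_two_sided_p is hypothetical, and the "at least as unlikely as the observed count" rule is one common convention for two-sided tests.

```python
from math import comb

def binomial_two_sided_p(heads, tosses, p=0.5):
    """Exact two-sided p-value: total probability of all outcomes
    that are no more likely than the observed number of heads."""
    pmf = [comb(tosses, k) * p**k * (1 - p) ** (tosses - k) for k in range(tosses + 1)]
    observed = pmf[heads]
    # Small relative tolerance so that ties are counted despite rounding
    return sum(q for q in pmf if q <= observed * (1 + 1e-9))

p_value = binomial_two_sided_p(65, 100)
print(p_value)
print(p_value < 0.05)  # True means the fair-coin hypothesis is rejected at the 5% level
```

The same function can be reused for other counts, for example binomial_two_sided_p(55, 100), to see how quickly the evidence weakens as the result moves closer to 50 heads.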
Suppose you're studying how marketing spend and R&D spend affect company profits using multiple linear regression.
Regression Equation: \[ \text{Profit} = \beta_0 + \beta_1 (\text{R\&D Spend}) + \beta_2 (\text{Marketing Spend}) + \epsilon \]
- Null Hypothesis (H₀) for \(\beta_1\): R&D Spend does not affect profit (\(\beta_1 = 0\)).
- Alternative Hypothesis (H₁) for \(\beta_1\): R&D Spend affects profit (\(\beta_1 \neq 0\)).
| Coefficient | Estimate | p-value |
|---|---|---|
| Intercept (\(\beta_0\)) | 50,000 | 0.04 |
| R&D Spend (\(\beta_1\)) | 0.8 | 0.001 |
| Marketing Spend (\(\beta_2\)) | 0.5 | 0.2 |
Interpretation:
- Intercept: The p-value is 0.04, which is less than 0.05, indicating that the intercept is statistically significant.
- R&D Spend: The p-value is 0.001, which is much less than 0.05, indicating that R&D Spend has a statistically significant effect on profit. Because the estimated coefficient is positive, profit tends to increase as R&D Spend increases.
- Marketing Spend: The p-value is 0.2, which is greater than 0.05, indicating that Marketing Spend is not statistically significant in predicting profit. The data do not provide enough evidence that changes in Marketing Spend affect profit.
When performing regression analysis or any statistical test, the p-value helps determine which variables (coefficients) significantly impact the outcome. It guides decisions on whether to keep or discard variables from the model based on their impact.
For instance, in the above regression example, you might consider focusing more on R&D spend as it shows a significant positive effect on profits, while re-evaluating the role of marketing spend in your model since its impact isn't statistically clear.
- A low p-value (< 0.05) suggests strong evidence against the null hypothesis, making it statistically significant.
- A high p-value (> 0.05) suggests weak evidence against the null hypothesis, making it not statistically significant.
- Always use the p-value in the context of your hypothesis and data to make informed decisions about the significance of your results.
5 methods for building a model:
- Backward Elimination: Remove the least significant variable.
- Forward Selection: Add the most significant variable.
- Bidirectional Elimination: Remove or add variables based on p-values.
- Score Comparison: Compare the performance of different models.
- All-in: Use all the variables.
Backward Elimination is a step-by-step process used in multiple linear regression to find the most significant variables that impact the dependent variable (the variable we are trying to predict). It helps simplify the model by removing the least significant predictors one at a time until only the important ones remain.
- Set a Significance Level: Decide on a significance level, which determines how strict you want to be when deciding whether a variable should stay in the model. A common choice is 0.05 (or 5%). This means that only variables with p-values less than 0.05 are considered statistically significant.
- Fit the Full Model: Start by fitting the model with all possible predictors (independent variables).
- Identify the Least Significant Predictor: Look at the p-values of all predictors. Find the predictor with the highest p-value (the least statistically significant).
- Remove the Predictor if it's Not Significant: If the highest p-value is greater than your chosen significance level (e.g., 0.05), remove that predictor from the model.
- Repeat the Process: Refit the model without the removed predictor and repeat the process. Continue removing the least significant predictor one at a time until no remaining predictor has a p-value above the significance level.
- Final Model: Once no remaining predictor has a p-value above the significance level, you have your final model.
Let's say you have a dataset with the following variables:
- Dependent Variable (Target): House Price
- Independent Variables (Predictors): Number of Bedrooms, Size of the House, Age of the House, Distance to Nearest School, Crime Rate, and Number of Bathrooms.
- Set SL = 0.05.
- Fit the model using all predictors.
| Predictor | Coefficient | p-value |
|---|---|---|
| Intercept | - | 0.03 |
| Number of Bedrooms | 2000 | 0.02 |
| Size of the House | 5000 | 0.001 |
| Age of the House | -300 | 0.15 |
| Distance to Nearest School | -100 | 0.20 |
| Crime Rate | -2000 | 0.10 |
| Number of Bathrooms | 3000 | 0.04 |
- The predictor with the highest p-value is Distance to Nearest School (p = 0.20), which is greater than 0.05. Remove this predictor.
Refit the model without "Distance to Nearest School".
| Predictor | Coefficient | p-value |
|---|---|---|
| Intercept | - | 0.03 |
| Number of Bedrooms | 2000 | 0.02 |
| Size of the House | 5000 | 0.001 |
| Age of the House | -300 | 0.10 |
| Crime Rate | -2000 | 0.06 |
| Number of Bathrooms | 3000 | 0.04 |
- The new highest p-value is Age of the House (p = 0.10). Remove this predictor.
Refit the model again without "Age of the House".
| Predictor | Coefficient | p-value |
|---|---|---|
| Intercept | - | 0.02 |
| Number of Bedrooms | 2000 | 0.01 |
| Size of the House | 5000 | 0.001 |
| Crime Rate | -2000 | 0.05 |
| Number of Bathrooms | 3000 | 0.03 |
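- No remaining predictor has a p-value above 0.05, so the process stops. Crime Rate sits exactly at the threshold (p = 0.05) and is kept, because a predictor is removed only when its p-value is greater than the significance level.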
The final model includes:
- Number of Bedrooms
- Size of the House
- Crime Rate
- Number of Bathrooms
Result: By using Backward Elimination, you've removed predictors that weren't significantly contributing to the model's ability to predict house prices. The remaining predictors have a stronger statistical relationship with the house prices, making the model simpler and easier to interpret.
- Backward Elimination helps reduce the complexity of a regression model by systematically removing the least significant predictors.
- It relies on p-values to determine which predictors are statistically significant.
- The goal is to improve model interpretability and ensure that only the most meaningful variables are included, without compromising the model's predictive power.
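The Backward Elimination loop described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the data are synthetic, the names `ols_p_values` and `backward_elimination` are hypothetical, and the p-values use a normal approximation to the t-distribution, which is reasonable only for large samples. A statistics package would normally supply exact p-values.

```python
import math
import numpy as np

def ols_p_values(design, y):
    """Two-sided p-values for every column of `design` in an OLS fit.

    Assumption: the t-distribution is approximated by a standard normal,
    which is reasonable only when rows greatly outnumber columns.
    """
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    sigma2 = residuals @ residuals / (n - k)              # residual variance
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
    return np.array([math.erfc(abs(z) / math.sqrt(2)) for z in beta / std_err])

def backward_elimination(X, y, names, sl=0.05):
    """Drop the predictor with the highest p-value until none exceeds `sl`."""
    kept = list(range(X.shape[1]))
    while kept:
        design = np.column_stack([np.ones(len(y)), X[:, kept]])
        p = ols_p_values(design, y)[1:]                   # skip the intercept
        worst = int(np.argmax(p))
        if p[worst] <= sl:                                # nothing left to remove
            break
        kept.pop(worst)
    return [names[i] for i in kept]

# Synthetic data (hypothetical): only "size" and "bedrooms" drive the price.
rng = np.random.default_rng(0)
names = ["size", "age", "bedrooms", "distance"]
X = rng.normal(size=(500, len(names)))
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(size=500)

selected = backward_elimination(X, y, names)
print(selected)
```

The two real predictors survive every round; a pure-noise column can occasionally survive too, since about 5% of such columns pass a 0.05 test by chance.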
Forward Selection is a stepwise approach used in multiple linear regression to build a model incrementally by adding the most significant variables one at a time. This method starts with no predictors and gradually adds them based on their statistical significance, ensuring each addition improves the model's predictive ability.
The Forward Selection process works as follows, illustrated afterwards with a practical example:
- Set a Significance Level: Choose a significance level (SL) for including variables in the model, e.g., SL = 0.05. This level determines which variables are statistically significant enough to be added.
- Fit Simple Regression Models: Fit a simple regression model for each predictor independently, i.e., regress the dependent variable $y$ on each predictor $x_n$.
- Select the Predictor with the Lowest p-value: Among all the simple models, select the predictor with the lowest p-value (must be below the chosen SL). This predictor is added to the model.
- Fit Models with Additional Predictors: With the selected predictor(s) from the previous step, fit new models by adding one additional predictor at a time.
- Repeat the Process: Add the predictor with the lowest p-value (below the SL) to the model. Repeat the process of fitting models with additional predictors until no remaining predictors have a p-value below the significance level.
- Finalize the Model: When no more predictors can be added that meet the significance level, the model is complete.
Suppose you have a dataset with the following variables:
- Dependent Variable (Target): House Price
- Independent Variables (Predictors): Number of Bedrooms, Size of the House, Age of the House, Distance to Nearest School, Crime Rate, and Number of Bathrooms.
- Set SL = 0.05.
- Fit separate models for each predictor:
| Predictor | p-value |
|---|---|
| Number of Bedrooms | 0.10 |
| Size of the House | 0.001 |
| Age of the House | 0.12 |
| Distance to Nearest School | 0.15 |
| Crime Rate | 0.07 |
| Number of Bathrooms | 0.04 |
- The predictor with the lowest p-value is Size of the House (p = 0.001). Add this predictor to the model.
Now, add one predictor at a time to the model that already includes "Size of the House".
| Model | Added Predictor | p-value (Added Predictor) |
|---|---|---|
| House Price ~ Size of the House + Bedrooms | Bedrooms | 0.08 |
| House Price ~ Size of the House + Age | Age | 0.15 |
| House Price ~ Size of the House + Distance | Distance | 0.12 |
| House Price ~ Size of the House + Crime Rate | Crime Rate | 0.03 |
| House Price ~ Size of the House + Bathrooms | Bathrooms | 0.04 |
- Add Crime Rate (p = 0.03) to the model, as it has the lowest p-value below SL.
Continue this process by testing models with "Size of the House" and "Crime Rate", plus one additional predictor at a time:
| Model | Added Predictor | p-value (Added Predictor) |
|---|---|---|
| House Price ~ Size of the House + Crime Rate + Bedrooms | Bedrooms | 0.09 |
| House Price ~ Size of the House + Crime Rate + Age | Age | 0.10 |
| House Price ~ Size of the House + Crime Rate + Distance | Distance | 0.08 |
| House Price ~ Size of the House + Crime Rate + Bathrooms | Bathrooms | 0.02 |
- The predictor with the lowest p-value is Number of Bathrooms (p = 0.02). Add this predictor.
Continue this process until no remaining predictor can be added with a p-value below the significance level. Assume that every other predictor now has a p-value above 0.05 when added to this model, so the process stops.
Final Model Includes:
- Size of the House
- Crime Rate
- Number of Bathrooms
Result: Forward Selection has helped build a model by only including variables that significantly contribute to predicting house prices, based on the statistical evidence provided by p-values.
Key Notes:
- Forward Selection is efficient in identifying significant predictors, especially when dealing with a large number of variables.
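- Because a variable is added only when it is statistically significant, the model stays small, which helps limit overfitting. It does not eliminate it: testing many candidates at every step increases the chance that a variable enters by coincidence.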
- The method stops when no further predictors meet the significance level, resulting in a concise and effective predictive model.
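The Forward Selection steps can be sketched the same way. As before, this is an illustration under stated assumptions: synthetic data, hypothetical function names, and p-values from a large-sample normal approximation rather than the exact t-distribution.

```python
import math
import numpy as np

def ols_p_values(design, y):
    """Two-sided p-values for each column of `design` in an OLS fit.
    Assumption: normal approximation to the t-distribution (large sample)."""
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    sigma2 = residuals @ residuals / (n - k)
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
    return np.array([math.erfc(abs(z) / math.sqrt(2)) for z in beta / std_err])

def forward_selection(X, y, names, sl=0.05):
    """Add the candidate with the lowest p-value while it stays below `sl`."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining:
        candidate_p = {}
        for j in remaining:
            # Current model plus one candidate; the candidate is the last column.
            design = np.column_stack([np.ones(len(y)), X[:, selected + [j]]])
            candidate_p[j] = ols_p_values(design, y)[-1]
        best = min(candidate_p, key=candidate_p.get)
        if candidate_p[best] >= sl:                       # no candidate qualifies
            break
        selected.append(best)
        remaining.remove(best)
    return [names[i] for i in selected]

# Synthetic data (hypothetical): "size" has the strongest effect, then "bathrooms".
rng = np.random.default_rng(0)
names = ["size", "age", "bathrooms", "distance"]
X = rng.normal(size=(500, len(names)))
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(size=500)

selected = forward_selection(X, y, names)
print(selected)
```

The returned list is in order of entry, so it also shows which predictor was judged most significant at each step.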
Bidirectional Elimination (also known as Stepwise Selection) is a combination of Forward Selection and Backward Elimination methods. It adds significant predictors and removes insignificant ones within the same procedure, alternating between the two steps. This approach builds the model by iteratively testing which variables should enter or exit based on their significance.
The process works as follows, illustrated afterwards with an example:
- Set Significance Levels: Define two significance levels:
  - SLENTER: Significance level to enter the model (e.g., 0.05).
  - SLSTAY: Significance level to stay in the model (e.g., 0.05).
- Forward Selection Step: Start by applying Forward Selection:
  - Add predictors one by one to the model if their p-value is less than SLENTER.
- Backward Elimination Step: After each addition, apply Backward Elimination:
  - Check all the predictors currently in the model and remove any with a p-value greater than SLSTAY.
- Repeat: Continue alternating between Forward Selection and Backward Elimination:
  - Add new predictors if they meet SLENTER.
  - Remove predictors if they do not meet SLSTAY.
- Stop When Stable: The process stops when no new predictors can enter and no current predictors need to be removed.
- Final Model: The resulting model is your final set of predictors.
Suppose you have a dataset with the following variables:
- Dependent Variable (Target): Sales
- Independent Variables (Predictors): TV Advertising Spend, Radio Advertising Spend, Newspaper Advertising Spend, Store Size, and Number of Competitors.
- SLENTER = 0.05
- SLSTAY = 0.05
- Fit Simple Regression Models: Fit individual models for each predictor to find p-values:
| Predictor | p-value |
|---|---|
| TV Advertising Spend | 0.01 |
| Radio Advertising Spend | 0.03 |
| Newspaper Advertising | 0.20 |
| Store Size | 0.08 |
| Number of Competitors | 0.15 |
- TV Advertising Spend (p = 0.01) is added to the model, as it has the lowest p-value and is below SLENTER.
- With "TV Advertising Spend" in the model, there is no other variable to remove since it is the first addition.
Forward Step:
- Fit new models adding one more predictor at a time to "TV Advertising Spend":
| Model | Added Predictor | p-value (Added Predictor) |
|---|---|---|
| Sales ~ TV Advertising + Radio Advertising | Radio Advertising Spend | 0.04 |
| Sales ~ TV Advertising + Newspaper | Newspaper Advertising | 0.18 |
| Sales ~ TV Advertising + Store Size | Store Size | 0.07 |
| Sales ~ TV Advertising + Competitors | Number of Competitors | 0.12 |
- Radio Advertising Spend (p = 0.04) is added since it's below SLENTER.
Backward Step:
- Now, check all predictors in the model:
| Predictor | p-value |
|---|---|
| TV Advertising Spend | 0.02 |
| Radio Advertising Spend | 0.04 |
- Both predictors remain in the model as their p-values are below SLSTAY.
Repeat until no more predictors meet the criteria for entering or exiting.
- Continue this process, adding or removing predictors based on SLENTER and SLSTAY.
- The process stops when no remaining predictor has a p-value below SLENTER and no predictor in the model has a p-value above SLSTAY.
Result: The final model includes only those predictors that significantly contribute to predicting sales. Bidirectional Elimination efficiently balances adding significant variables and removing those that do not contribute much.
Key Notes:
- Adaptable: The method allows for flexibility in model building by adding and removing predictors.
- Optimal Model: Helps in identifying the most effective predictors without manually trying each combination.
- Efficient: Reduces the chances of overfitting by only including significant variables.
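The alternation between a forward step and a backward step can be sketched as below. The data are synthetic and the names `predictor_p_values` and `stepwise_selection` are hypothetical; p-values again use a large-sample normal approximation. The iteration cap is an added safeguard against add/remove cycles and is not part of the procedure described above.

```python
import math
import numpy as np

def predictor_p_values(X, y, columns):
    """Two-sided p-values of the chosen columns in an OLS fit with intercept.
    Assumption: normal approximation to the t-distribution (large sample)."""
    design = np.column_stack([np.ones(len(y)), X[:, columns]])
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    sigma2 = residuals @ residuals / (n - k)
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
    z = (beta / std_err)[1:]                              # drop the intercept
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])

def stepwise_selection(X, y, names, sl_enter=0.05, sl_stay=0.05):
    selected = []
    for _ in range(2 * X.shape[1]):                       # cap guards against cycling
        changed = False
        # Forward step: try each remaining predictor as the last column.
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        candidate_p = {j: predictor_p_values(X, y, selected + [j])[-1]
                       for j in remaining}
        if candidate_p:
            best = min(candidate_p, key=candidate_p.get)
            if candidate_p[best] < sl_enter:
                selected.append(best)
                changed = True
        # Backward step: re-check every predictor currently in the model.
        if selected:
            p = predictor_p_values(X, y, selected)
            worst = int(np.argmax(p))
            if p[worst] > sl_stay:
                selected.pop(worst)
                changed = True
        if not changed:                                   # stable: nothing enters or leaves
            break
    return [names[i] for i in selected]

# Synthetic data (hypothetical): sales depend on TV and radio spend only.
rng = np.random.default_rng(0)
names = ["tv", "radio", "newspaper", "store_size"]
X = rng.normal(size=(500, len(names)))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=500)

selected = stepwise_selection(X, y, names)
print(selected)
```

With independent predictors the backward step rarely removes anything; it matters most when predictors are correlated and a later addition makes an earlier one redundant.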
Score Comparison is a model selection technique that evaluates all possible combinations of predictors to find the best-fitting model based on a specific criterion, such as the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), or Adjusted R-squared. This method exhaustively searches through all potential models and selects the one that optimizes the chosen criterion.
The steps are as follows, with an example afterwards:
- Select a Goodness of Fit Criterion: Choose a statistical measure to evaluate the quality of each model. Common criteria include:
  - Akaike Information Criterion (AIC): Measures the quality of a model by penalizing the likelihood based on the number of parameters. Lower values indicate better models.
  - Bayesian Information Criterion (BIC): Similar to AIC but includes a stronger penalty for models with more parameters.
  - Adjusted R-squared: Adjusts the R-squared value for the number of predictors, balancing fit with complexity.
- Fit All Possible Regression Models: With $n$ predictors, fit $2^n - 1$ models, as each predictor can either be included or excluded independently (the model with no predictors is not counted):
  - For example, with 3 predictors, you would fit $2^3 - 1 = 7$ models.
- Compare Models Using the Criterion: Evaluate each model based on the selected criterion. For AIC or BIC, lower scores indicate a better balance of model fit and complexity; for Adjusted R-squared, higher is better.
- Select the Best Model: Choose the model with the best (optimal) criterion score as your final model.
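For reference, the three criteria listed above have standard closed forms. The symbols are chosen here for illustration and do not come from the notes: $N$ is the number of observations, $k$ the number of estimated parameters, $p$ the number of predictors (excluding the intercept), and $\hat{L}$ the maximized likelihood of the model.

```latex
\begin{aligned}
\mathrm{AIC} &= 2k - 2\ln\hat{L} \\
\mathrm{BIC} &= k\ln(N) - 2\ln\hat{L} \\
\bar{R}^2 &= 1 - \left(1 - R^2\right)\frac{N - 1}{N - p - 1}
\end{aligned}
```

Each extra parameter costs 2 under AIC and $\ln(N)$ under BIC, so BIC penalizes complexity more heavily whenever $N \geq 8$. Adjusted R-squared rises only if a new predictor improves $R^2$ by more than would be expected by chance.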
Suppose you have the following predictors for predicting house prices:
- Predictors: Size of the house, Number of bedrooms, Age of the house, and Proximity to the city center.
- Choose Akaike Information Criterion (AIC) as the measure of goodness of fit.
- With 4 predictors, there are $2^4 - 1 = 15$ possible models:
| Model | Predictors Included |
|---|---|
| 1 | Size |
| 2 | Bedrooms |
| 3 | Age |
| 4 | Proximity |
| 5 | Size, Bedrooms |
| 6 | Size, Age |
| 7 | Size, Proximity |
| ... | ... |
| 15 | Size, Bedrooms, Age, Proximity |
- Compute the AIC score for each model:
| Model | Predictors Included | AIC Score |
|---|---|---|
| 1 | Size | 150 |
| 2 | Bedrooms | 170 |
| 3 | Age | 160 |
| 4 | Proximity | 140 |
| 5 | Size, Bedrooms | 130 |
| 6 | Size, Age | 125 |
| 7 | Size, Proximity | 120 |
| ... | ... | ... |
| 15 | Size, Bedrooms, Age, Proximity | 110 |
- From the table above, Model 15 has the lowest AIC score (110), indicating it is the best model based on the AIC criterion.
Result: The final model includes all four predictors (Size, Bedrooms, Age, Proximity), as this combination provided the best trade-off between goodness of fit and model complexity.
Key Points:
- Exhaustive Search: Evaluates every possible combination, ensuring the optimal model is chosen.
- Criterion-Based: Relies on a statistical criterion to balance fit and complexity, avoiding overfitting.
- Computationally Intensive: This method can be computationally expensive for datasets with a large number of predictors due to the exponential number of combinations.
Advantages:
- Provides a thorough evaluation of all possible models.
- Ensures the best model according to the selected criterion.
Disadvantages:
- Computationally expensive for large numbers of predictors.
- May require significant processing power and time for complex datasets.
- Score Comparison is a comprehensive method for selecting the best model by evaluating all possible combinations based on a chosen criterion.
- It helps balance model fit and complexity, ensuring the final model is optimal for the given dataset.
- The method is computationally intensive for datasets with many predictors but provides a thorough evaluation of model performance.
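An exhaustive search of this kind is short to write with `itertools.combinations`. The sketch below is illustrative only: the data are synthetic, the function names are hypothetical, and the AIC is the Gaussian least-squares form $N \ln(\mathrm{RSS}/N) + 2k$, which drops an additive constant shared by all models and therefore ranks them the same way as the full AIC.

```python
from itertools import combinations
import math
import numpy as np

def gaussian_aic(design, y):
    """AIC of an OLS fit, up to an additive constant shared by all models."""
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * math.log(rss / n) + 2 * k

def best_subset(X, y, names):
    """Score every non-empty subset of predictors and return the lowest AIC."""
    n_predictors = X.shape[1]
    scores = {}
    for size in range(1, n_predictors + 1):
        for subset in combinations(range(n_predictors), size):
            design = np.column_stack([np.ones(len(y)), X[:, list(subset)]])
            scores[tuple(names[i] for i in subset)] = gaussian_aic(design, y)
    best = min(scores, key=scores.get)
    return best, scores

# Synthetic data (hypothetical): only "size" and "age" affect the price.
rng = np.random.default_rng(0)
names = ["size", "bedrooms", "age", "proximity"]
X = rng.normal(size=(300, len(names)))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=300)

best, scores = best_subset(X, y, names)
print(f"{len(scores)} models scored; best subset: {best}")
```

With 4 predictors this fits 15 models; with 20 it would fit over a million, which is why the method is usually reserved for small predictor sets.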
All Possible Models follows the same exhaustive approach as Score Comparison: it evaluates every combination of predictors to determine the best model based on a chosen criterion, such as the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), or Adjusted R-squared. The selected model is the one with the best score on that criterion, which weighs fit against model complexity.
The steps are as follows, with an example afterwards:
- Select a Goodness of Fit Criterion: Choose a statistical measure to evaluate the quality of each model. Common choices include:
  - Akaike Information Criterion (AIC): Balances model fit with complexity; lower values indicate better models.
  - Bayesian Information Criterion (BIC): Similar to AIC but with a stronger penalty for additional predictors.
  - Adjusted R-squared: Adjusts R-squared to account for the number of predictors, aiming for a better balance of fit and simplicity.
- Construct All Possible Regression Models: For $n$ predictors, create $2^n - 1$ models. Each model represents a unique combination of predictors.
  - Example: With 3 predictors, there are $2^3 - 1 = 7$ possible models.
- Compare Models Using the Criterion: Evaluate each model based on the chosen criterion. The goal is to identify the model that best optimizes this criterion.
- Select the Best Model: Choose the model with the best score according to the criterion.
Suppose you have data on car prices and want to predict prices using the following predictors:
- Predictors: Horsepower, Engine size, Mileage, Age of the car.
- Choose Adjusted R-squared to measure model fit, adjusting for the number of predictors.
- With 4 predictors, there are $2^4 - 1 = 15$ possible models:
| Model | Predictors Included |
|---|---|
| 1 | Horsepower |
| 2 | Engine size |
| 3 | Mileage |
| 4 | Age of the car |
| 5 | Horsepower, Engine size |
| 6 | Horsepower, Mileage |
| 7 | Horsepower, Age of the car |
| ... | ... |
| 15 | Horsepower, Engine size, Mileage, Age |
- Calculate the Adjusted R-squared value for each model:
| Model | Predictors Included | Adjusted R-squared |
|---|---|---|
| 1 | Horsepower | 0.60 |
| 2 | Engine size | 0.58 |
| 3 | Mileage | 0.55 |
| 4 | Age of the car | 0.59 |
| 5 | Horsepower, Engine size | 0.65 |
| 6 | Horsepower, Mileage | 0.63 |
| 7 | Horsepower, Age of the car | 0.61 |
| ... | ... | ... |
| 15 | Horsepower, Engine size, Mileage, Age | 0.70 |
- From the table, Model 15 with all predictors has the highest Adjusted R-squared value (0.70), indicating it provides the best balance of fit and complexity.
Result: The selected model includes all four predictors: Horsepower, Engine size, Mileage, and Age of the car, as this combination offers the best Adjusted R-squared value.
Key Points:
- Exhaustive Evaluation: Considers every possible model, ensuring the most comprehensive evaluation of predictor combinations.
- Criterion-Driven Selection: Uses a statistical measure to avoid overfitting by balancing model fit with the number of predictors.
- Optimal Model: Guarantees the selection of the best-fitting model according to the chosen criterion.
Advantages:
- Thoroughly explores all potential predictor combinations.
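- Ensures that no better-scoring combination of the available predictors is missed. The score is still computed on the data used for fitting, so generalization should be confirmed on held-out data.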
Disadvantages:
- Computationally expensive, especially for large numbers of predictors due to the exponential growth in model combinations.
- Requires significant computational resources and time for large datasets.
- All Possible Models is a comprehensive method for selecting the best model by evaluating every combination of predictors based on a chosen criterion.
- It selects the model with the best score on the chosen criterion, which weighs fit against complexity and so reduces the risk of overfitting and underfitting.
- The method is computationally intensive for datasets with many predictors but provides a thorough evaluation of model performance.
Note: The choice of criterion (AIC, BIC, Adjusted R-squared) is crucial in determining the best model. It should reflect the trade-off between model fit and complexity that best suits the dataset and research question.
### Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
### Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
print(X)
### Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)
### Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
### Training the Multiple Linear Regression model on the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
### Predicting the Test set results
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
In multiple linear regression fitted by ordinary least squares, we don't have to apply feature scaling: rescaling a feature only rescales its coefficient by the inverse factor, so the predictions are unchanged.
For example:
- If we have two variables, one ranging from 0 to 10 and the other from 0 to 1000, the coefficient of the second will simply be correspondingly smaller, and the fitted predictions are the same as they would be after scaling.
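The claim about feature scaling can be checked numerically. The sketch below fits ordinary least squares twice, once on raw features and once on standardized features, and compares the predictions. The data are synthetic and the helper `fit_ols` is a hypothetical name; it uses NumPy's least-squares solver rather than the `LinearRegression` class used above, which solves the same problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
small = rng.uniform(0, 10, size=n)        # feature on a 0-10 scale
large = rng.uniform(0, 1000, size=n)      # feature on a 0-1000 scale
y = 5.0 + 2.0 * small + 0.03 * large + rng.normal(scale=1.0, size=n)

def fit_ols(features, target):
    """Fit OLS with an intercept; return the coefficients and fitted values."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef, design @ coef

raw = np.column_stack([small, large])
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)   # standardize each column

coef_raw, pred_raw = fit_ols(raw, y)
coef_scaled, pred_scaled = fit_ols(scaled, y)

print("raw coefficients:   ", np.round(coef_raw, 3))
print("scaled coefficients:", np.round(coef_scaled, 3))
print("max prediction difference:", np.max(np.abs(pred_raw - pred_scaled)))
```

The coefficients differ between the two fits, but the predictions agree up to floating-point error. This holds for plain least squares; models with a penalty on coefficient size, such as ridge or lasso, do depend on feature scale.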
Polynomial Regression is a form of regression analysis in which the relationship between the independent variable $x$ and the dependent variable $y$ is modeled as an $n$-th degree polynomial. It is used when the relationship between the variables is not linear and can be better approximated by a polynomial function.
y = b0 + b1*x1 + b2*x1^2 + b3*x1^3 + ... + bn*x1^n
- Motivation: Linear regression is great for straight-line relationships, but what if the data shows a curved trend? For instance, predicting the growth of a plant based on time, where growth might accelerate and then plateau.
- Problem: Linear regression struggles with such non-linear data patterns, often leading to inaccurate predictions.
- Imagine you’re riding a bike up and down hills (representing the curve of the data). Linear regression is like trying to draw a straight path from start to finish, cutting through the hills without following the ups and downs. Polynomial regression, however, lets you trace the actual path over the hills, accounting for every rise and fall.
- Overview:
  - Polynomial Regression is a type of regression that models the relationship between the independent variable (x) and the dependent variable (y) as an nth-degree polynomial. It allows for curves in the relationship, fitting the data more flexibly than a straight line.
- Detailed Explanation:
  - Linear Regression Recap: In linear regression, the model predicts $y$ as $y = b_0 + b_1x$. This works for straight-line trends.
  - Polynomial Regression: Extends this idea to include powers of x. For example, a quadratic (second-degree) polynomial regression models $y = b_0 + b_1x + b_2x^2$. For a cubic (third-degree), it’s $y = b_0 + b_1x + b_2x^2 + b_3x^3$, and so on.
  - Key Intuition: The additional terms ($x^2, x^3, \ldots$) enable the curve to bend and fit the data better than a straight line.
-
Visuals:
- To be added later
- Example: Predicting housing prices where the effect of square footage on price is not just linear; the price of larger houses might rise at an accelerating rate. Polynomial regression captures this non-linear relationship.
- Example: Modeling the trajectory of a ball thrown in the air, where gravity makes the path parabolic (a curve).
- Versus Linear Regression: Linear regression is simpler but less flexible for non-linear trends. Polynomial regression handles curves by fitting multiple terms.
- Versus Other Non-Linear Models: Unlike complex models like neural networks, polynomial regression remains interpretable and easier to implement, but can struggle with extreme curves or overfitting on too high degrees.
- Be cautious of overfitting: as the degree of the polynomial increases, the model might fit the training data too closely, missing the true trend. It’s like drawing a path that tries too hard to hit every small bump.
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# Example data: X (years of experience), y (salary)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([15000, 18000, 21000, 26000, 33000, 40000, 48000, 58000, 69000, 81000])
# Fit linear regression
linear_model = LinearRegression()
linear_model.fit(X, y)
linear_predictions = linear_model.predict(X)
# Fit polynomial regression (degree 2 for a curve)
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
poly_predictions = poly_model.predict(X)
# Plot results
plt.scatter(X, y, color='blue', label='Data Points')
plt.plot(X, linear_predictions, color='red', label='Linear Fit')
plt.plot(X, poly_predictions, color='green', label='Polynomial Fit (Degree 2)')
plt.title('Linear vs Polynomial Regression')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend()
plt.show()
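The overfitting caution above can be checked by comparing error on the training points with error on held-out points as the degree grows. This sketch uses synthetic data whose true curve is quadratic (an assumption made for the illustration) and NumPy's `polyfit` rather than the scikit-learn pipeline, to keep it short.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, size=40))
# True relationship is quadratic; the noise stands in for measurement error.
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=x.size)

# Alternate points between the training set and the held-out set.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def mse(coeffs, xs, ys):
    """Mean squared error of the fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 2, 5, 10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse[degree] = mse(coeffs, x_train, y_train)
    test_mse[degree] = mse(coeffs, x_test, y_test)
    print(f"degree {degree:2d}: train MSE = {train_mse[degree]:.4f}, "
          f"test MSE = {test_mse[degree]:.4f}")
```

Training error can only go down as the degree increases, because each higher-degree model contains the lower-degree ones. Held-out error typically stops improving once the degree matches the true curve and may get worse beyond that, which is the signal to stop adding terms.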