Machine Learning A-Z

The Machine Learning Process

Data Preprocessing

Import the dataset
Clean the dataset
- Import the dataset
- Handle missing data
- Encode categorical data
- Split the dataset into the Training set and Test set
- Feature scaling

Modelling

Build a model
Train the model
Test the model
Make Predictions

Evaluation

Calculate performance metrics
- Confusion Matrix
- Accuracy
- Precision
Make a verdict

Splitting data into training set and test set

To split the data into training and test sets, we can use the train_test_split function from the sklearn.model_selection module. Here's an example:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In this process, we will split the dataset into two parts: the training set and the test set. The test_size parameter specifies the proportion of the dataset that should be included in the test set. In this case, we are using 20% of the data for testing. The random_state parameter is used to ensure that the data is split in the same way each time the code is run.

Training Set

The training set is used to train the machine learning model. It contains a set of input features (X_train) and the corresponding target labels (y_train). For example: 80% of the data is used for training the model.

We use the training set to build the model and adjust its parameters to make accurate predictions. Then, we make a linear regression model that can predict the price of a house based on its size.

Test Set

The test set is used to evaluate the performance of the trained model. It contains a set of input features (X_test) and the corresponding target labels (y_test). For example: 20% of the data is used for testing the model.

We use the test set to evaluate the model's performance on unseen data. This helps us understand how well the model generalizes to new data and whether it is overfitting or underfitting.

Equation: y = b0 + b1 * x1 + b2 * x2 + ... + bn * xn

Note: Based on the comparison between the predicted value of y and the actual value of y, we can evaluate the performance of the model. We can use metrics such as the Mean Squared Error (MSE) or the R-squared value to assess the model's accuracy.

Feature Scaling

Feature scaling is a technique used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

Example:

X1          X2          X3          X4

1000        1           0.5         10
2000        2           1           20
3000        3           1.5         30
4000        4           2           40
....
,...

Feature scaling always applies to the columns of the dataset, not the rows. It is important to apply feature scaling to ensure that all features contribute equally to the model training process and prevent any one feature from dominating the others.

Types of Feature Scaling

Standardization (Z-score normalization)

Standardization scales the data to have a mean of 0 and a standard deviation of 1. It is calculated as follows:

X_std = (X - X_mean) / X_std_dev

Its value lies between -3 and +3. Where X is the feature value, X_mean is the mean of the feature values, and X_std_dev is the standard deviation of the feature values.

X_mean is the mean of the feature values, and X_std_dev is the standard deviation of the feature values. mean = sum(X) / n, std_dev = sqrt(sum((X - mean)^2) / n). Standard deviation is a measure of the amount of variation or dispersion of a set of values. Mean is the average of the values.

Normalization (Min-Max scaling)

Normalization scales the data to a fixed range, usually between 0 and 1. It is calculated as follows:

X_norm = (X - X_min) / (X_max - X_min)

Its value lies between 0 and 1. Where X is the feature value, X_min is the minimum value of the feature, and X_max is the maximum value of the feature.

X_min is the minimum value of the feature, and X_max is the maximum value of the feature. Min-Max scaling is used when the data does not follow a Gaussian distribution. It is sensitive to outliers in the data.

Lets consider a simple data set where we have annual income of different persons and age of different persons.

Annual Income (in $)    Age
50000                    25
60000                    30
67000                    32
70000                    39
80000                    25
90000                    35

The task here is to identify whether a person with lets say salary 60000 and age 30 is similar to which other person in the dataset. This is where we use clustering algorithms like K-means to group similar data points together.

We can use feature scaling here to ensure that both the features (Annual Income and Age) contribute equally to the clustering process. If we don't scale the features, the model may give more weight to the feature with a larger range, which can affect the clustering results.

After applying Normalization, the data is scaled to a fixed range between 0 and 1. This ensures that both features contribute equally to the clustering process.

The value will look like:

Formula:

X_norm = (X - X_min) / (X_max - X_min)

Result:

Annual Income (in $)    Age
0.0                     0.0
0.2                     0.2
0.4                     0.3
0.5                     0.6
0.7                     0.0
1.0                     0.8

Types of Variables in Data Set

Categorical Variables

Nominal: No order (e.g., colors, countries)
Ordinal: Order matters (e.g., low, medium, high)
Binary: Two categories (e.g., yes/no, true/false)
Dummy: Represent
- 0: Absence of a category
- 1: Presence of a category

Numerical Variables

Discrete: Countable (e.g., number of cars)
Continuous: Infinite values (e.g., height, weight)
Interval: No true zero (e.g., temperature in Celsius)
Ratio: True zero (e.g., weight, height)

Text Variables

Unstructured data (e.g., reviews, comments)
Requires text preprocessing (e.g., tokenization, stemming)

Date/Time Variables

Time series data (e.g., stock prices, weather data)
Extract features (e.g., day of the week, month)

Mixed Variables

Combination of different types (e.g., address, name)
Requires feature engineering (e.g., one-hot encoding, feature extraction)

Independent and Dependent Variables

Sample Data:

Data.csv

Country          Age          Salary          Purchased 

France           44           72000           No
Spain            27           48000           Yes
Germany          30           54000           No
Spain            38           61000           No
Germany          40           63777.78        Yes
France           35           58000           Yes
Spain            38.77777778 52000           No
France           48           79000           Yes
Germany          50           83000           No
France           37           67000           Yes

Independent Variables (Features)

Input variables
Used to predict the dependent variable
Denoted as X
Example: In above example, Country, Age, and Salary are independent variables.

Dependent Variable (Target)

Output variable
Predicted by the independent variables
Denoted as y
Example: In above example, Purchased is the dependent variable. We want to predict whether a person will purchase a product based on the independent variables (Country, Age, Salary).

Importing dataset

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values # This will select all rows and all columns except the last one
y = dataset.iloc[:, -1].values # This will select all rows and the last column

Handling Missing Data

Missing data is a common problem in real-world datasets. It can occur due to various reasons, such as data corruption, human error, or system failures. Handling missing data is an essential step in the data preprocessing pipeline to ensure the quality and reliability of the data.

Strategies for Handling Missing Data

Remove Rows with Missing Data

Pros: Simple and straightforward
Cons: May lose valuable information

Impute Missing Data

Pros: Retains valuable information
Cons: May introduce bias or inaccuracies

Use Advanced Techniques

Pros: Handles missing data more effectively
Cons: Requires more computational resources

Techniques for Imputing Missing Data

Mean/Median Imputation

Replace missing values with the mean or median of the column
Suitable for continuous numerical data

Example:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X[:, 1:] = imputer.fit_transform(X[:, 1:])

Mode Imputation

Replace missing values with the mode (most frequent value) of the column
Suitable for categorical data

Example:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='most_frequent')
X[:, 0] = imputer.fit_transform(X[:, 0])

K-Nearest Neighbors (KNN) Imputation

Replace missing values with the average of the K-nearest neighbors
Suitable for both numerical and categorical data

Example:

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)
X = imputer.fit_transform(X)

Multiple Imputation

Generate multiple imputed datasets and combine the results
Suitable for complex datasets with missing data

Example:

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=0)
X = imputer.fit_transform(X)

Predictive Imputation

Use machine learning models to predict missing values
Suitable for datasets with complex relationships

Example:

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10, random_state=0)
X = imputer.fit_transform(X)

Encoding Categorical Data Step 1

Categorical data refers to data that represents categories or groups. It can take on a limited number of values and is often represented as text or labels. Machine learning models require numerical input, so we need to encode categorical data into a numerical format.

Example:

Country          Age          Salary          Purchased

France           44           72000           No
Spain            27           48000           Yes
Germany          30           54000           No
Spain            38           61000           No
Germany          40           63777.78        Yes
France           35           58000           Yes
Spain            38.77777778 52000           No
France           48           79000           Yes

In above example, Country and Purchased are categorical variables. We need to encode these variables into numerical format before using them in a machine learning model. The country names (France, Spain, Germany) are nominal categorical variables and to use it in a machine learning model, we need to encode them into numerical format. THe way to encode these variables is to use One-Hot Encoding.

One-Hot Encoding

One-Hot Encoding is a technique used to convert categorical data into a numerical format. It creates binary columns for each category and assigns a 1 or 0 to indicate the presence of a category.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

Example:

Country_France    Country_Spain    Country_Germany    Purchased

1                 0                0                  0
0                 1                0                  1
0                 0                1                  0
0                 1                0                  0
0                 0                1                  1
1                 0                0                  1
0                 1                0                  0
1                 0                0                  1

Encoding Categorical Data Step 2

After applying One-Hot Encoding to the categorical variables, we need to handle the issue of multicollinearity. Multicollinearity occurs when two or more independent variables are highly correlated, leading to redundancy in the data.

Encoding Independent Variables

To avoid multicollinearity, we need to drop one of the binary columns created by One-Hot Encoding. This is known as the dummy variable trap.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

Output:

    Country_Spain    Country_Germany    Country_France        Age    Salary     Purchased
    
    1.0                0.0                  0.0                    44     72000      No
    0.0                0.0                  1.0                    27     48000      Yes
    0.0                1.0                  0.0                    30     54000      No
    0.0                0.0                  1.0                    38     61000      No
    0.0                1.0                  0.0                    40     63777.78   Yes
    1.0                0.0                  0.0                    35     58000      Yes
    0.0                0.0                  1.0                    38.77777778 52000 No
    1.0                0.0                  0.0                    48     79000      Yes
    0.0                1.0                  0.0                    50     83000      No
    1.0                0.0                  0.0                    37     67000      Yes

Encoding Dependent Variables

For the dependent variable, we can use Label Encoding to convert categorical labels into numerical format.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)

Output:

Excercise:

titanic.csv

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0,D56,S
24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,A6,S
28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0,C23 C25 C27,S
53,1,1,"Harper, Mrs. Henry Sleeper (Myna Haxtun)",female,49.0,1,0,PC 17572,76.7292,D33,C
55,0,1,"Ostby, Mr. Engelhart Cornelius",male,65.0,0,1,113509,61.9792,B30,C

# Importing the necessary libraries
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Load the dataset
dataset = pd.read_csv('titanic.csv')


# Identify the categorical data
categorical_features = ['Sex', 'Embarked', 'Pclass']
# categorical_indices = [dataset.columns.get_loc(col) for col in categorical_columns]


# Implement an instance of the ColumnTransformer class
ct = ColumnTransformer(
    transformers=[
        ('encoder', OneHotEncoder(), categorical_features)
    ],
    remainder='passthrough'
)


# Apply the fit_transform method on the instance of ColumnTransformer
X = ct.fit_transform(dataset)


# Convert the output into a NumPy array
X = np.array(X)


# Use LabelEncoder to encode binary categorical data

le = LabelEncoder()
y = le.fit_transform(dataset['Survived'])


# Print the updated matrix of features and the dependent variable vector

print("Updated matrix of features: \n", X)
print("Updated dependent variable vector: \n", y)

Splitting the dataset into the Training set and Test set

To evaluate the performance of a machine learning model, we need to split the dataset into two parts: the training set and the test set. The training set is used to train the model, while the test set is used to evaluate its performance on unseen data.

Q. Do we need to apply feature scaling before splitting the dataset into the training set and test set or after splitting the dataset?

A. It is recommended to apply feature scaling after splitting the dataset into the training set and test set. This is because feature scaling should be done on the training set only to prevent data leakage from the test set. If feature scaling is applied before splitting the dataset, information from the test set may leak into the training set, leading to biased results.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Excercise:

# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load the Iris dataset
data = pd.read_csv('iris.csv')

# Separate features and target
X = data.iloc[:, :-1]
Y = data.iloc[:, -1]

# Split the dataset into an 80-20 training-test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Apply feature scaling on the training and test sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # scaler.fit_transform is used for training set. it will learn the parameters from the training set.
X_test = scaler.transform(X_test)

# Print the scaled training and test sets
print(X_train)
print(X_test)

Feature Scaling

Why do we need Feature Scaling?

Machine Learning Algorithms

Many machine learning algorithms use Euclidean distance to make predictions.
Features with larger scales may dominate the distance calculation.
Feature scaling ensures that all features contribute equally to the model.

Convergence Speed

Gradient descent converges faster on scaled features.
It reduces the number of iterations required to reach the minimum.

Regularization

Regularization techniques like Lasso and Ridge are sensitive to feature scales.
Feature scaling helps in regularization by penalizing large coefficients.

Distance-Based Algorithms

Distance-based algorithms like K-Nearest Neighbors (KNN) are sensitive to feature scales.
Feature scaling ensures that the algorithm is not biased towards features with larger scales.

Types of Feature Scaling

Standardization (Z-score normalization)

Scales the data to have a mean of 0 and a standard deviation of 1.
Suitable for algorithms that assume normally distributed data.
Equation: z = (x - mean) / std_dev
- mean is the mean of the data.
- std_dev is the standard deviation of the data. It is calculated as the square root of the variance which is the average of the squared differences from the mean.
Its value will be in the range of +3 and -3
It works well when the data follows a Gaussian distribution. It is not sensitive to outliers in the data.

Normalization (Min-Max scaling)

Scales the data to a fixed range, usually between 0 and 1.
Suitable for algorithms that require data to be on the same scale.
Equation: x_norm = (x - min) / (max - min)
- min is the minimum value of the data.
- max is the maximum value of the data.
It always result in the range of 0 to 1
It is recommended to use Min-Max scaling when the data follows a Gaussian distribution. It is sensitive to outliers in the data which means that it can be affected by outliers in the data. Outliers are data points that are significantly different from other data points in the dataset.

Robust Scaling

Scales the data based on the interquartile range (IQR).
Suitable for datasets with outliers.
Equation: x_robust = (x - Q1) / (Q3 - Q1)
- Q1 is the first quartile (25th percentile).
- Q3 is the third quartile (75th percentile).

MaxAbs Scaling

Scales the data based on the maximum absolute value.
Suitable for sparse datasets.
Equation: x_maxabs = x / max(abs(x))
- max(abs(x)) is the maximum absolute value of the data.

Quantile Transformation

Maps the data to a uniform or normal distribution.
Suitable for non-Gaussian data.
Equation: x_quantile = F(x)
- F(x) is the cumulative distribution function of the data.

Example:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:]) # Apply feature scaling to the training set. It will scale the columns from 3 to the end.
X_test[:, 3:] = sc.transform(X_test[:, 3:]) # Apply feature scaling to the test set. It will scale the columns from 3 to the end.

Note: We are performing fit_transform on the training set and transform on the test set. This is because we want to scale the test set based on the parameters learned from the training set. This ensures that the scaling is consistent across both sets. If we were to use fit_transform on the test set, it would learn new parameters and the scaling would be different from the training set.

Excercise:

Coding exercise: Feature scaling for Machine Learning

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the Wine Quality Red dataset
# dataset = pd.read_csv('wineqality-red.csv', sep=';')
dataset = pd.read_csv('winequality-red.csv', delimiter=';')

# Separate features and target
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]

# Split the dataset into an 80-20 training-test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create an instance of the StandardScaler class
scaler = StandardScaler()

# Fit the StandardScaler on the features from the training set and transform it
X_train = scaler.fit_transform(X_train)

# Apply the transform to the test set
X_test = scaler.transform(X_test)

# Print the scaled training and test datasets
print(X_train)
print(X_test)

Regressions

Regression is a fundamental concept in statistics and machine learning, used to model and analyze relationships between variables. The primary goal of regression is to predict the value of a dependent variable (often called the "target" or "outcome") based on one or more independent variables (also known as "predictors" or "features"). Regression models are particularly useful when the dependent variable is continuous, like predicting house prices, stock prices, or temperatures.

Key Concepts in Regression

Dependent Variable: The outcome or the variable you are trying to predict. It's also known as the response variable.
Independent Variables: The predictors or features that are used to predict the dependent variable.
Regression Line: A line that best fits the data points in a regression model. In simple linear regression, it is represented as y=mx+b, where m is the slope, and b is the y-intercept.
Residuals: The differences between observed and predicted values. Residuals are used to assess the accuracy of the regression model.

1. Simple Linear Regression

One independent variable
Linear relationship between variables
Equation: y = b0 + b1 * x1
- y is the dependent variable
- x1 is the independent variable
- b0 is the y-intercept
- b1 is the slope of the line

Ordinary Least Squares (OLS) Method

Minimizes the sum of squared differences between observed and predicted values
Finds the best-fitting line that minimizes the sum of squared residuals

from sklearn.linear_model import LinearRegression

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

Formula for OLS

b1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
b0 = y_mean - b1 * x_mean

y = b0 + b1 * x1 where b0 is the y-intercept and b1 is the slope of the line.

2. Multiple Linear Regression

Multiple Linear Regression is a statistical method used to understand the relationship between one dependent variable (the outcome you want to predict) and two or more independent variables (the predictors or factors that influence the outcome). It’s like extending simple linear regression, which deals with just one predictor, to handle multiple predictors.

Key Concepts of Multiple Linear Regression

Dependent Variable (Target/Outcome): This is the variable you want to predict or explain. For example, predicting house prices.
Independent Variables (Predictors/Features): These are the variables that are used to predict the dependent variable. For example, predicting house prices based on the size of the house, number of bedrooms, location, etc.
Equation: The relationship can be represented with a linear equation:

[ y = b_0 + b_1x_1 + b_2x_2 + \ldots + b_nx_n + \epsilon ]
- ( y ): Dependent variable (e.g., house price)
- ( x_1, x_2, \ldots, x_n ): Independent variables (e.g., size, bedrooms, location)
- ( b_0 ): Intercept (the value of ( y ) when all ( x )'s are zero)
- ( b_1, b_2, \ldots, b_n ): Coefficients (they show how much ( y ) changes with a change in each ( x ))
- ( \epsilon ): Error term (captures the variation in ( y ) not explained by the predictors)
Objective: The goal is to find the best-fitting line (or plane, or hyperplane) that minimizes the difference between the predicted values and the actual values of the dependent variable. This is done by adjusting the coefficients (( b_1, b_2, \ldots, b_n )).

How It Works

Fit the Model: The model is trained on a dataset by finding the best coefficients that minimize the difference between the predicted values and actual values. This is usually done using a method called Least Squares which minimizes the sum of squared errors.
Prediction: Once the coefficients are determined, you can use the model to make predictions on new data by plugging the values of independent variables into the equation.

Example

Suppose you want to predict the price of a house based on its size (in square feet), the number of bedrooms, and the distance to the city center.

Data Example

Size (sq ft)	Bedrooms	Distance to City Center (miles)	House Price ($)
2000	3	5	300,000
1600	2	3	250,000
2400	4	8	350,000
3000	4	6	400,000

Building the Model

Define the variables:
- Dependent variable: House Price (( y ))
- Independent variables: Size (( x_1 )), Bedrooms (( x_2 )), Distance to City Center (( x_3 ))
Formulate the equation:

[ \text{House Price} = b_0 + b_1 \times \text{Size} + b_2 \times \text{Bedrooms} + b_3 \times \text{Distance to City Center} ]
Training the Model:
- Using the given data, the model learns the coefficients (( b_0, b_1, b_2, b_3 )) that best predict the house prices.
Making Predictions:
- For a house that is 2200 sq ft, with 3 bedrooms, and 4 miles from the city center, plug in the values into the equation:
[ \text{Predicted Price} = b_0 + b_1 \times 2200 + b_2 \times 3 + b_3 \times 4 ]

Assumptions of Multiple Linear Regression

Linearity: The relationship between dependent and independent variables is linear.
Independence: The observations are independent of each other.
Homoscedasticity: The variance of residual errors is constant across all levels of the independent variables.
Normality: The residuals (errors) of the model should be normally distributed.
No Multicollinearity: The independent variables should not be too highly correlated with each other.

Advantages and Limitations

Advantages:
- Easy to understand and implement.
- Works well when the relationship is linear and the assumptions are met.
- Can handle multiple predictors.
Limitations:
- Not suitable for non-linear relationships.
- Performance can be affected if the assumptions are violated.
- Sensitive to outliers which can skew the results.

Assumptions of linear regression

Linearity(Linear relationship between Y and each X): The relationship between the independent and dependent variables is linear.
- The relationship between the independent and dependent variables can be represented by a straight line.
- It can be checked using a scatter plot of the data.
Homoscedasticity(Equal Variance): The variance of the residuals is constant.
- The residuals are equally spread across all values of the independent variables.
- The residuals are not dependent on the independent variables.
- It can be checked using a scatter plot of the residuals.
MultiVariate Normality(Normality of error distribution): The residuals are normally distributed.
- The residuals are normally distributed.
- The residuals have a mean of zero.
- The residuals are independent of each other.
- It can be checked using a Q-Q plot or a histogram of the residuals.
Independence(of observations. includes "no autocorrelation"): The residuals are independent of each other.
- The residuals are independent of each other.
- There is no correlation between the residuals.
- It is important to check for autocorrelation in time series data.
- It can be checked using a scatter plot of the residuals.
Lack of Multicollienarity(Predictors are not correlated with each other): The independent variables are not highly correlated with each other.
- The independent variables are not highly correlated with each other.
- Multicollinearity can lead to unstable coefficients and inaccurate predictions.
- It can be checked using a correlation matrix or a variance inflation factor (VIF).
The outlier check: The presence of outliers can significantly affect the regression model.
- Outliers are data points that are significantly different from other data points in the dataset.
- Outliers can affect the regression coefficients and the model's accuracy.
- It can be checked using box plots or scatter plots of the data.

Multiple Linear Regression Intuition

Multiple Linear Regression is an extension of Simple Linear Regression that allows us to model the relationship between two or more independent variables and a continuous dependent variable. It is used to predict the value of the dependent variable based on the values of the independent variables.

Table:

Profit	R&D Spend	Administration	Marketing Spend	State
192261.83	165349.20	136897.80	471784.10	New York
191792.06	162597.70	151377.59	443898.53	California
191050.39	153441.51	101145.55	407934.54	New York
182901.99	144372.41	118671.85	383199.62	New York
166187.94	142107.34	91391.77	366168.42	New York
156991.12	131876.90	99814.71	362861.36	New York
156122.51	130298.13	145530.06	127716.82	California
155752.60	120542.52	148718.95	311613.29	California

Here, profit is the dependent variable, and R&D Spend, Administration, Marketing Spend, and State are the independent variables. The goal is to predict the profit based on the values of the independent variables.

Expanding the table for categorical data (State) using One-Hot Encoding:

Profit	R&D Spend	Administration	Marketing Spend	New York	California
192261.83	165349.20	136897.80	471784.10	1	0
191792.06	162597.70	151377.59	443898.53	0	1
191050.39	153441.51	101145.55	407934.54	1	0
182901.99	144372.41	118671.85	383199.62	1	0
166187.94	142107.34	91391.77	366168.42	1	0
156991.12	131876.90	99814.71	362861.36	1	0
156122.51	130298.13	145530.06	127716.82	0	1
155752.60	120542.52	148718.95	311613.29	0	1

Here, New York and California are dummy variables created using One-Hot Encoding. They represent the presence or absence of the State in the dataset. The model will learn the coefficients for these variables to predict the profit based on the State.

Dummy variable trap

The dummy variable trap is a situation in which two or more dummy variables are highly correlated. This can lead to multicollinearity, which can affect the regression coefficients and the model's accuracy. To avoid the dummy variable trap, one of the dummy variables should be dropped from the model. This can be done by using the drop_first=True parameter in the get_dummies function in pandas.

# Drop the first dummy variable to avoid the dummy variable trap
X = pd.get_dummies(X, drop_first=True)

By dropping one of the dummy variables, we ensure that the model does not include redundant information and avoids multicollinearity.

Understanding P-values

The p-value is a statistical measure that helps us decide whether to reject a null hypothesis in a hypothesis test. It's essentially a way to measure the strength of the evidence against the null hypothesis.

Key Concepts

Null Hypothesis (H₀): This is the default assumption that there is no effect or no difference. In regression analysis, the null hypothesis for a coefficient is that it equals zero (meaning the independent variable has no effect on the dependent variable).
Alternative Hypothesis (H₁): This is the opposite of the null hypothesis. It suggests that there is an effect or a difference. For a regression coefficient, it means the coefficient is not zero (the independent variable does affect the dependent variable).
P-value: The p-value helps us determine whether the observed data is consistent with the null hypothesis. It's a probability value that ranges from 0 to 1.
- Low p-value (< 0.05): This suggests that the observed data is unlikely under the null hypothesis, providing strong evidence against the null hypothesis. We usually reject the null hypothesis in this case.
- High p-value (> 0.05): This suggests that the observed data is likely under the null hypothesis, providing weak evidence against the null hypothesis. We usually fail to reject the null hypothesis in this case.

Example: Coin Tossing Experiment

Let's say we want to test if a coin is fair (i.e., it has an equal chance of landing heads or tails).

Null Hypothesis (H₀): The coin is fair (p = 0.5 for heads).
Alternative Hypothesis (H₁): The coin is biased (p ≠ 0.5).

Suppose you toss the coin 100 times, and it lands heads 65 times. You calculate a p-value of 0.03:

Since 0.03 is less than 0.05, you reject the null hypothesis and conclude that the coin is biased.

Example in Regression Analysis

Suppose you're studying how marketing spend and R&D spend affect company profits using multiple linear regression.

Regression Equation: [ \text{Profit} = \beta_0 + \beta_1 (\text{R&D Spend}) + \beta_2 (\text{Marketing Spend}) + \epsilon ]

Null Hypothesis (H₀) for (\beta_1): R&D Spend does not affect profit ((\beta_1 = 0)).
Alternative Hypothesis (H₁) for (\beta_1): R&D Spend affects profit ((\beta_1 \neq 0)).

Analysis:

Coefficient	Estimate	p-value
Intercept ((\beta_0))	50,000	0.04
R&D Spend ((\beta_1))	0.8	0.001
Marketing Spend ((\beta_2))	0.5	0.2

Interpretation:

Intercept: The p-value is 0.04, which is less than 0.05, indicating that the intercept is statistically significant.
R&D Spend: The p-value is 0.001, which is much less than 0.05, indicating that R&D Spend has a statistically significant effect on profit. This suggests that as R&D Spend increases, profit is likely to increase significantly.
Marketing Spend: The p-value is 0.2, which is greater than 0.05, indicating that Marketing Spend is not statistically significant in predicting profit. This suggests that changes in Marketing Spend do not significantly affect profit.

Practical Implications

When performing regression analysis or any statistical test, the p-value helps determine which variables (coefficients) significantly impact the outcome. It guides decisions on whether to keep or discard variables from the model based on their impact.

For instance, in the above regression example, you might consider focusing more on R&D spend as it shows a significant positive effect on profits, while re-evaluating the role of marketing spend in your model since its impact isn't statistically clear.

Key Notes

A low p-value (< 0.05) suggests strong evidence against the null hypothesis, making it statistically significant.
A high p-value (> 0.05) suggests weak evidence against the null hypothesis, making it not statistically significant.
Always use the p-value in the context of your hypothesis and data to make informed decisions about the significance of your results.

Multiple Linear Regression Intuition cont...

Building a Model

5 methods for building a model:

Backward Elimination: Remove the least significant variable.
Forward Selection: Add the most significant variable.
Bidirectional Elimination: Remove or add variables based on p-values.
Score Comparison: Compare the performance of different models.
All-in: Use all the variables.

1. Backward Elimination

Backward Elimination is a step-by-step process used in multiple linear regression to find the most significant variables that impact the dependent variable (the variable we are trying to predict). It helps simplify the model by removing the least significant predictors one at a time until only the important ones remain.

Step-by-Step Process

Set a Significance Level: Decide on a significance level, which determines how strict you want to be when deciding whether a variable should stay in the model. A common choice is 0.05 (or 5%). This means that only variables with p-values less than 0.05 are considered statistically significant.
Fit the Full Model: Start by fitting the model with all possible predictors (independent variables).
Identify the Least Significant Predictor: Look at the p-values of all predictors. Find the predictor with the highest p-value (the least statistically significant).
Remove the Predictor if it's Not Significant: If the highest p-value is greater than your chosen significance level (e.g., 0.05), remove that predictor from the model.
Repeat the Process: Refit the model without the removed predictor and repeat the process. Continue removing the least significant predictor one at a time until all remaining predictors have p-values less than the significance level.
Final Model: Once all predictors have p-values less than the significance level, you have your final model.

Example: Predicting House Prices

Let's say you have a dataset with the following variables:

Dependent Variable (Target): House Price
Independent Variables (Predictors): Number of Bedrooms, Size of the House, Age of the House, Distance to Nearest School, Crime Rate, and Number of Bathrooms.

Applying Backward Elimination

Step 1: Choose Significance Level

Set SL = 0.05.

Step 2: Fit the Full Model

Fit the model using all predictors.

Step 3: Check P-values

Predictor	Coefficient	p-value
Intercept	-	0.03
Number of Bedrooms	2000	0.02
Size of the House	5000	0.001
Age of the House	-300	0.15
Distance to Nearest School	-100	0.20
Crime Rate	-2000	0.10
Number of Bathrooms	3000	0.04

Step 4: Remove the Predictor with the Highest P-value

The predictor with the highest p-value is Distance to Nearest School (p = 0.20), which is greater than 0.05. Remove this predictor.

Step 5: Fit the Model Again

Refit the model without "Distance to Nearest School".

New P-values:

Predictor	Coefficient	p-value
Intercept	-	0.03
Number of Bedrooms	2000	0.02
Size of the House	5000	0.001
Age of the House	-300	0.10
Crime Rate	-2000	0.06
Number of Bathrooms	3000	0.04

The new highest p-value is Age of the House (p = 0.10). Remove this predictor.

Repeat the Process:

Refit the model again without "Age of the House".

New P-values:

Predictor	Coefficient	p-value
Intercept	-	0.02
Number of Bedrooms	2000	0.01
Size of the House	5000	0.001
Crime Rate	-2000	0.05
Number of Bathrooms	3000	0.03

All remaining predictors now have p-values less than 0.05, so the process stops.

Final Model

The final model includes:

Number of Bedrooms
Size of the House
Crime Rate
Number of Bathrooms

Analysis

Result: By using Backward Elimination, you've removed predictors that weren't significantly contributing to the model's ability to predict house prices. The remaining predictors have a stronger statistical relationship with the house prices, making the model simpler and more reliable.

Key Notes

Backward Elimination helps reduce the complexity of a regression model by systematically removing the least significant predictors.
It relies on p-values to determine which predictors are statistically significant.
The goal is to improve model interpretability and ensure that only the most meaningful variables are included, without compromising the model's predictive power.

2. Forward Selection

Forward Selection is a stepwise approach used in multiple linear regression to build a model incrementally by adding the most significant variables one at a time. This method starts with no predictors and gradually adds them based on their statistical significance, ensuring each addition improves the model's predictive ability.

Detailed explanation of the Forward Selection process, using simple language and a practical example:

Step-by-Step Process

Set a Significance Level: Choose a significance level (SL) for including variables in the model, e.g., SL = 0.05. This level determines which variables are statistically significant enough to be added.
Fit Simple Regression Models: Fit a simple regression model for each predictor independently, i.e., regress the dependent variable ( y ) on each predictor ( x_n ).
Select the Predictor with the Lowest p-value: Among all the simple models, select the predictor with the lowest p-value (must be below the chosen SL). This predictor is added to the model.
Fit Models with Additional Predictors: With the selected predictor(s) from the previous step, fit new models by adding one additional predictor at a time.
Repeat the Process: Add the predictor with the lowest p-value (below the SL) to the model. Repeat the process of fitting models with additional predictors until no remaining predictors have a p-value below the significance level.
Finalize the Model: When no more predictors can be added that meet the significance level, the model is complete.

Example: Predicting House Prices

Suppose you have a dataset with the following variables:

Dependent Variable (Target): House Price
Independent Variables (Predictors): Number of Bedrooms, Size of the House, Age of the House, Distance to Nearest School, Crime Rate, and Number of Bathrooms.

Applying Forward Selection

Step 1: Choose Significance Level

Set SL = 0.05.

Step 2: Fit Simple Regression Models

Fit separate models for each predictor:

Predictor	p-value
Number of Bedrooms	0.10
Size of the House	0.001
Age of the House	0.12
Distance to Nearest School	0.15
Crime Rate	0.07
Number of Bathrooms	0.04

Step 3: Select the Predictor with the Lowest p-value

The predictor with the lowest p-value is Size of the House (p = 0.001). Add this predictor to the model.

Step 4: Fit Models with One Extra Predictor

Now, add one predictor at a time to the model that already includes "Size of the House".

Model	Added Predictor	p-value (Added Predictor)
House Price ~ Size of the House + Bedrooms	Bedrooms	0.08
House Price ~ Size of the House + Age	Age	0.15
House Price ~ Size of the House + Distance	Distance	0.12
House Price ~ Size of the House + Crime Rate	Crime Rate	0.03
House Price ~ Size of the House + Bathrooms	Bathrooms	0.04

Step 5: Add the Predictor with the Lowest p-value

Add Crime Rate (p = 0.03) to the model, as it has the lowest p-value below SL.

Repeat the Process

Continue this process by testing models with "Size of the House" and "Crime Rate", plus one additional predictor at a time:

Model	Added Predictor	p-value (Added Predictor)
House Price ~ Size of the House + Crime Rate + Bedrooms	Bedrooms	0.09
House Price ~ Size of the House + Crime Rate + Age	Age	0.10
House Price ~ Size of the House + Crime Rate + Distance	Distance	0.08
House Price ~ Size of the House + Crime Rate + Bathrooms	Bathrooms	0.02

The predictor with the lowest p-value is Number of Bathrooms (p = 0.02). Add this predictor.

Final Model

Continue this process until no remaining predictors can be added with a p-value below the significance level. Assume now all other predictors have p-values above 0.05 when added to this model:

Final Model Includes:

Size of the House
Crime Rate
Number of Bathrooms

Analysis

Result: Forward Selection has helped build a model by only including variables that significantly contribute to predicting house prices, based on the statistical evidence provided by p-values.

Key Notes:

Forward Selection is efficient in identifying significant predictors, especially when dealing with a large number of variables.
By incrementally adding variables, it prevents overfitting by only including predictors that show statistical significance.
The method stops when no further predictors meet the significance level, resulting in a concise and effective predictive model.

3. Bidirectional Elimination

Bidirectional Elimination (also known as Stepwise Selection) is a combination of Forward Selection and Backward Elimination methods. It allows for adding significant predictors and removing insignificant ones simultaneously during model building. This approach helps in constructing an optimal model by iteratively testing which variables should enter or exit the model based on their significance.

Detailed explanation of the process with an example:

Step-by-Step Process

Set Significance Levels: Define two significance levels:
- SLENTER: Significance level to enter the model (e.g., 0.05).
- SLSTAY: Significance level to stay in the model (e.g., 0.05).
Forward Selection Step: Start by applying Forward Selection:
- Add predictors one by one to the model if their p-value is less than SLENTER.
Backward Elimination Step: After each addition, apply Backward Elimination:
- Check all the predictors currently in the model and remove any with a p-value greater than SLSTAY.
Repeat: Continue alternating between Forward Selection and Backward Elimination:
- Add new predictors if they meet SLENTER.
- Remove predictors if they do not meet SLSTAY.
Stop When Stable: The process stops when no new predictors can enter and no current predictors need to be removed.
Final Model: The resulting model is your final set of predictors.

Example: Predicting Sales Based on Advertising and Market Data

Suppose you have a dataset with the following variables:

Dependent Variable (Target): Sales
Independent Variables (Predictors): TV Advertising Spend, Radio Advertising Spend, Newspaper Advertising Spend, Store Size, and Number of Competitors.

Applying Bidirectional Elimination

Step 1: Set Significance Levels

SLENTER = 0.05
SLSTAY = 0.05

Step 2: Perform Forward Selection

Fit Simple Regression Models:
- Fit individual models for each predictor to find p-values:
Predictor p-value

TV Advertising Spend 0.01

Radio Advertising Spend 0.03

Newspaper Advertising 0.20

Store Size 0.08

Number of Competitors 0.15
- TV Advertising Spend (p = 0.01) is added to the model, as it has the lowest p-value and is below SLENTER.

Predictor	p-value
TV Advertising Spend	0.01
Radio Advertising Spend	0.03
Newspaper Advertising	0.20
Store Size	0.08
Number of Competitors	0.15

Step 3: Perform Backward Elimination

With "TV Advertising Spend" in the model, there is no other variable to remove since it is the first addition.

Step 4: Repeat Forward and Backward Steps

Forward Step:

Fit new models adding one more predictor at a time to "TV Advertising Spend":

Model	Added Predictor	p-value (Added Predictor)
Sales ~ TV Advertising + Radio Advertising	Radio Advertising Spend	0.04
Sales ~ TV Advertising + Newspaper	Newspaper Advertising	0.18
Sales ~ TV Advertising + Store Size	Store Size	0.07
Sales ~ TV Advertising + Competitors	Number of Competitors	0.12

Radio Advertising Spend (p = 0.04) is added since it's below SLENTER.

Backward Step:

Now, check all predictors in the model:

Predictor p-value

TV Advertising Spend 0.02

Radio Advertising Spend 0.04
Both predictors remain in the model as their p-values are below SLSTAY.

Predictor	p-value
TV Advertising Spend	0.02
Radio Advertising Spend	0.04

Repeat until no more predictors meet the criteria for entering or exiting.

Final Model

Continue this process, adding or removing predictors based on SLENTER and SLSTAY.
The process stops when adding more variables does not significantly improve the model, or if removing variables is not required.

Analysis

Result: The final model includes only those predictors that significantly contribute to predicting sales. Bidirectional Elimination efficiently balances adding significant variables and removing those that do not contribute much.

Key Notes:

Adaptable: The method allows for flexibility in model building by adding and removing predictors.
Optimal Model: Helps in identifying the most effective predictors without manually trying each combination.
Efficient: Reduces the chances of overfitting by only including significant variables.

4. Score Comparison

Score Comparison is a model selection technique that evaluates all possible combinations of predictors to find the best-fitting model based on a specific criterion, such as the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), or Adjusted R-squared. This method exhaustively searches through all potential models and selects the one that optimizes the chosen criterion.

Here’s a detailed explanation of the steps involved with an example:

Step-by-Step Process

Select a Goodness of Fit Criterion: Choose a statistical measure to evaluate the quality of each model. Common criteria include:
- Akaike Information Criterion (AIC): Measures the quality of a model by penalizing the likelihood based on the number of parameters. Lower values indicate better models.
- Bayesian Information Criterion (BIC): Similar to AIC but includes a stronger penalty for models with more parameters.
- Adjusted R-squared: Adjusts the R-squared value for the number of predictors, balancing fit with complexity.
Fit All Possible Regression Models: With ( n ) predictors, fit ( 2^n - 1 ) models, as each predictor can either be included or excluded independently:
- For example, with 3 predictors, you would fit ( 2^3 - 1 = 7 ) models.
Compare Models Using the Criterion: Evaluate each model based on the selected criterion. For AIC or BIC, lower scores indicate a better balance of model fit and complexity.
Select the Best Model: Choose the model with the best (optimal) criterion score as your final model.

Example: Predicting House Prices

Suppose you have the following predictors for predicting house prices:

Predictors: Size of the house, Number of bedrooms, Age of the house, and Proximity to the city center.

Step 1: Select a Criterion

Choose Akaike Information Criterion (AIC) as the measure of goodness of fit.

Step 2: Fit All Possible Regression Models

With 4 predictors, there are ( 2^4 - 1 = 15 ) possible models:

Model Predictors Included

1 Size

2 Bedrooms

3 Age

4 Proximity

5 Size, Bedrooms

6 Size, Age

7 Size, Proximity

... ...

15 Size, Bedrooms, Age, Proximity

Model	Predictors Included
1	Size
2	Bedrooms
3	Age
4	Proximity
5	Size, Bedrooms
6	Size, Age
7	Size, Proximity
...	...
15	Size, Bedrooms, Age, Proximity

Step 3: Compare Using AIC

Compute the AIC score for each model:

Model	Predictors Included	AIC Score
1	Size	150
2	Bedrooms	170
3	Age	160
4	Proximity	140
5	Size, Bedrooms	130
6	Size, Age	125
7	Size, Proximity	120
...	...	...
15	Size, Bedrooms, Age, Proximity	110

Step 4: Select the Best Model

From the table above, Model 15 has the lowest AIC score (110), indicating it is the best model based on the AIC criterion.

Analysis

Result: The final model includes all four predictors (Size, Bedrooms, Age, Proximity), as this combination provided the best trade-off between goodness of fit and model complexity.

Key Points:

Exhaustive Search: Evaluates every possible combination, ensuring the optimal model is chosen.
Criterion-Based: Relies on a statistical criterion to balance fit and complexity, avoiding overfitting.
Computationally Intensive: This method can be computationally expensive for datasets with a large number of predictors due to the exponential number of combinations.

Advantages:

Provides a thorough evaluation of all possible models.
Ensures the best model according to the selected criterion.

Disadvantages:

Computationally expensive for large numbers of predictors.
May require significant processing power and time for complex datasets.

Key Notes

Score Comparison is a comprehensive method for selecting the best model by evaluating all possible combinations based on a chosen criterion.
It helps balance model fit and complexity, ensuring the final model is optimal for the given dataset.
The method is computationally intensive for datasets with many predictors but provides a thorough evaluation of model performance.

5. All Possible Models

All Possible Models in Multiple Linear Regression

All Possible Models is a comprehensive approach to model selection that evaluates every combination of predictors to determine the best model based on a chosen criterion, such as the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), or Adjusted R-squared. This method ensures that the selected model provides the best fit while considering model complexity.

Here’s a detailed explanation of the steps involved with an example:

Step-by-Step Process

Select a Goodness of Fit Criterion: Choose a statistical measure to evaluate the quality of each model. Common choices include:
- Akaike Information Criterion (AIC): Balances model fit with complexity; lower values indicate better models.
- Bayesian Information Criterion (BIC): Similar to AIC but with a stronger penalty for additional predictors.
- Adjusted R-squared: Adjusts R-squared to account for the number of predictors, aiming for a better balance of fit and simplicity.
Construct All Possible Regression Models: For ( n ) predictors, create ( 2^n - 1 ) models. Each model represents a unique combination of predictors.
- Example: With 3 predictors, there are ( 2^3 - 1 = 7 ) possible models.
Compare Models Using the Criterion: Evaluate each model based on the chosen criterion. The goal is to identify the model that best optimizes this criterion.
Select the Best Model: Choose the model with the best score according to the criterion.

Example: Predicting Car Prices

Suppose you have data on car prices and want to predict prices using the following predictors:

Predictors: Horsepower, Engine size, Mileage, Age of the car.

Step 1: Select a Criterion

Choose Adjusted R-squared to measure model fit, adjusting for the number of predictors.

Step 2: Construct All Possible Models

With 4 predictors, there are ( 2^4 - 1 = 15 ) possible models:

Model	Predictors Included
1	Horsepower
2	Engine size
3	Mileage
4	Age of the car
5	Horsepower, Engine size
6	Horsepower, Mileage
7	Horsepower, Age of the car
...	...
15	Horsepower, Engine size, Mileage, Age

Step 3: Compare Using Adjusted R-squared

Calculate the Adjusted R-squared value for each model:

Model	Predictors Included	Adjusted R-squared
1	Horsepower	0.60
2	Engine size	0.58
3	Mileage	0.55
4	Age of the car	0.59
5	Horsepower, Engine size	0.65
6	Horsepower, Mileage	0.63
7	Horsepower, Age of the car	0.61
...	...	...
15	Horsepower, Engine size, Mileage, Age	0.70

Step 4: Select the Best Model

From the table, Model 15 with all predictors has the highest Adjusted R-squared value (0.70), indicating it provides the best balance of fit and complexity.

Analysis

Result: The selected model includes all four predictors: Horsepower, Engine size, Mileage, and Age of the car, as this combination offers the best Adjusted R-squared value.

Key Points:

Exhaustive Evaluation: Considers every possible model, ensuring the most comprehensive evaluation of predictor combinations.
Criterion-Driven Selection: Uses a statistical measure to avoid overfitting by balancing model fit with the number of predictors.
Optimal Model: Guarantees the selection of the best-fitting model according to the chosen criterion.

Advantages:

Thoroughly explores all potential predictor combinations.
Provides a high degree of confidence in the selected model’s fit and generalizability.

Disadvantages:

Computationally expensive, especially for large numbers of predictors due to the exponential growth in model combinations.
Requires significant computational resources and time for large datasets.

Key Notes

All Possible Models is a comprehensive method for selecting the best model by evaluating every combination of predictors based on a chosen criterion.
It ensures that the final model is optimal in terms of fit and complexity, avoiding overfitting and underfitting.
The method is computationally intensive for datasets with many predictors but provides a thorough evaluation of model performance.

Note: The choice of criterion (AIC, BIC, Adjusted R-squared) is crucial in determining the best model. It should reflect the trade-off between model fit and complexity that best suits the dataset and research question.

Multiple Linear Regression

### Importing the libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Importing the dataset

dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

print(X)

### Encoding categorical data

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

print(X)

### Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

### Training the Multiple Linear Regression model on the Training set

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

### Predicting the Test set results

y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

In multilinear regression, we don't have to apply feature scaling because the coefficients will adjust to balance the scales of the variables.

For example:

If we have two variables, one ranging from 0 to 10 and the other from 0 to 1000, the coefficients will adjust to balance the scales of the variables.

Polynomial Regression

Polynomial Regression is a form of regression analysis in which the relationship between the independent variable ( x ) and the dependent variable ( y ) is modeled as an ( n )-th degree polynomial. It is used when the relationship between the variables is not linear and can be better approximated by a polynomial function.

y = b0 + b1*x1 + b2*x1^2 + b3*x1^3 + ... + bn*x1^n

Understanding Polynomial Regression

Problem and Motivation:

Motivation: Linear regression is great for straight-line relationships, but what if the data shows a curved trend? For instance, predicting the growth of a plant based on time, where growth might accelerate and then plateau.
Problem: Linear regression struggles with such non-linear data patterns, often leading to inaccurate predictions.

Analogy:

Imagine you’re riding a bike up and down hills (representing the curve of the data). Linear regression is like trying to draw a straight path from start to finish, cutting through the hills without following the ups and downs. Polynomial regression, however, lets you trace the actual path over the hills, accounting for every rise and fall.

Explanation:

Overview:
- Polynomial Regression is a type of regression that models the relationship between the independent variable (x) and the dependent variable (y) as an nth-degree polynomial. It allows for curves in the relationship, fitting the data more flexibly than a straight line.
Detailed Explanation:
1. Linear Regression Recap: In linear regression, the model predicts ( y ) as ( y = b_0 + b_1x ). This works for straight-line trends.
2. Polynomial Regression: Extends this idea to include powers of x. For example, a quadratic (second-degree) polynomial regression models ( y = b_0 + b_1x + b_2x^2 ). For a cubic (third-degree), it’s ( y = b_0 + b_1x + b_2x^2 + b_3x^3 ), and so on.
3. Key Intuition: The additional terms (( x^2, x^3, \ldots )) enable the curve to bend and fit the data better than a straight line.
Visuals:
- To be added later

Practical Applications:

Example: Predicting housing prices where the effect of square footage on price is not just linear; larger houses might increase in value exponentially. Polynomial regression captures this non-linear relationship.
Example: Modeling the trajectory of a ball thrown in the air, where gravity makes the path parabolic (a curve).

Comparison and Contrast with Similar Concepts:

Versus Linear Regression: Linear regression is simpler but less flexible for non-linear trends. Polynomial regression handles curves by fitting multiple terms.
Versus Other Non-Linear Models: Unlike complex models like neural networks, polynomial regression remains interpretable and easier to implement, but can struggle with extreme curves or overfitting on too high degrees.

Practical Tip:

Be cautious of overfitting: as the degree of the polynomial increases, the model might fit the training data too closely, missing the true trend. It’s like drawing a path that tries too hard to hit every small bump.

Example of Polynomial Regression:

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Example data: X (years of experience), y (salary)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([15000, 18000, 21000, 26000, 33000, 40000, 48000, 58000, 69000, 81000])

# Fit linear regression
linear_model = LinearRegression()
linear_model.fit(X, y)
linear_predictions = linear_model.predict(X)

# Fit polynomial regression (degree 2 for a curve)
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
poly_predictions = poly_model.predict(X)

# Plot results
plt.scatter(X, y, color='blue', label='Data Points')
plt.plot(X, linear_predictions, color='red', label='Linear Fit')
plt.plot(X, poly_predictions, color='green', label='Polynomial Fit (Degree 2)')
plt.title('Linear vs Polynomial Regression')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend()
plt.show()

FilesExpand file tree

machine_learning.md

Latest commit

History

machine_learning.md

File metadata and controls

Machine Learning A-Z

The Machine Learning Process

Splitting data into training set and test set

Training Set

Test Set

Feature Scaling

Types of Feature Scaling

Types of Variables in Data Set

Independent and Dependent Variables

Importing dataset

Handling Missing Data

Strategies for Handling Missing Data

Techniques for Imputing Missing Data

Encoding Categorical Data Step 1

One-Hot Encoding

Encoding Categorical Data Step 2

Encoding Independent Variables

Encoding Dependent Variables

Splitting the dataset into the Training set and Test set

Feature Scaling

Why do we need Feature Scaling?

Types of Feature Scaling

Excercise:

Regressions

Key Concepts in Regression

1. Simple Linear Regression

Ordinary Least Squares (OLS) Method

2. Multiple Linear Regression

Key Concepts of Multiple Linear Regression

How It Works

Example

Data Example

Building the Model

Assumptions of Multiple Linear Regression

Advantages and Limitations

Assumptions of linear regression

Multiple Linear Regression Intuition

Dummy variable trap

Understanding P-values

Key Concepts

Example: Coin Tossing Experiment

Example in Regression Analysis

Analysis:

Practical Implications

Key Notes

Multiple Linear Regression Intuition cont...

Building a Model

1. Backward Elimination

Step-by-Step Process

Example: Predicting House Prices

Applying Backward Elimination

Step 1: Choose Significance Level

Step 2: Fit the Full Model

Step 3: Check P-values

Step 4: Remove the Predictor with the Highest P-value

Step 5: Fit the Model Again

New P-values:

Repeat the Process:

New P-values:

Final Model

Analysis

Key Notes

2. Forward Selection

Step-by-Step Process

Example: Predicting House Prices

Applying Forward Selection

Step 1: Choose Significance Level

Step 2: Fit Simple Regression Models

Step 3: Select the Predictor with the Lowest p-value

Step 4: Fit Models with One Extra Predictor

Step 5: Add the Predictor with the Lowest p-value

Repeat the Process

Final Model

Analysis