Machine Learning: Linear Regression With a Single Variable

Linear regression with a single variable is a statistical method for modeling the relationship between two variables: one independent variable (the predictor) and one dependent variable (the target). The goal is to find the straight line that best fits the data points, so that the target can be predicted from the predictor. The line has the form y = mx + b, where m is the slope (how much y changes for a unit change in x) and b is the y-intercept (the value of y when x = 0). The best-fit line is the one that minimizes the sum of squared differences between the actual and predicted values (ordinary least squares), which is what scikit-learn's LinearRegression computes. Once fitted, the line can be used to predict the target for new values of the predictor.
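
To make the "best fit" idea concrete, here is a minimal sketch of the closed-form least-squares solution for a single predictor: m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2) and b = mean(y) - m * mean(x). The toy data and the helper name fit_line are assumptions made for illustration; they are not part of the Canada dataset or of scikit-learn.

```python
import numpy as np

# Hypothetical toy data (not the Canada dataset): points lying exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])


def fit_line(x, y):
    """Return (m, b) of the least-squares line y = m*x + b (illustrative helper)."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point (x_mean, y_mean)
    b = y_mean - m * x_mean
    return m, b


m, b = fit_line(x, y)
print(m, b)  # slope 2.0 and intercept 1.0 for this toy data
```

Because the toy points lie exactly on a line, the recovered slope and intercept match it. On real data the same formulas give the line with the smallest sum of squared errors.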

#!/usr/bin/env python
# coding: utf-8

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# Load the data from a CSV file
# The dataset contains per capita income data for Canada over the years.
df = pd.read_csv("canada_per_capita_income.csv")

# Display the first 3 rows of the data to understand its structure
# Output:
#    year  per capita income (US$)
# 0  1970                     3399
# 1  1971                     3768
# 2  1972                     4251
df.head(3)

# Check the column names of the dataset
# Output:
# Index(['year', 'per capita income (US$)'], dtype='object')
df.columns

# Rename the column 'per capita income (US$)' to 'per_capita_income_usd' for easier reference
newdf = df.rename(columns={'per capita income (US$)': 'per_capita_income_usd'})

# Confirm that the column was renamed successfully
# Output:
# Index(['year', 'per_capita_income_usd'], dtype='object')
newdf.columns

# Create a scatter plot to visualize the data
# X-axis: Year
# Y-axis: Per Capita Income USD
# Data points: Red stars
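# In a Jupyter notebook, run `%matplotlib inline` first so the plot renders inline.
# Labels are set on the Axes object; assigning to plt.xlabel / plt.ylabel would
# overwrite those functions instead of labeling the plot.
fig, ax = plt.subplots()
ax.set_xlabel("Year")
ax.set_ylabel("Per Capita Income USD")
ax.scatter(newdf.year, newdf.per_capita_income_usd, color='red', marker='*')
plt.show()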

# Create a linear regression model
reg = linear_model.LinearRegression()

# Define the feature (X) and target (y) for the model
# X: The years (as a 2D array)
# y: The per capita income in USD
X = newdf[['year']]
y = newdf['per_capita_income_usd']

# Train the linear regression model using the data
reg.fit(X, y)

# Predict the per capita income for the year 2030
input_df = pd.DataFrame({'year': [2030]})
predicted_income = reg.predict(input_df)

# Display the prediction
# Output (approximate; follows from the coefficient and intercept printed below): [49573.34]
print(predicted_income)

# The model finds a linear relationship in the form y = mx + b
# Here, we read the slope m (reg.coef_) and the intercept b (reg.intercept_) of this equation
coef = reg.coef_
intercept = reg.intercept_

# Display the coefficient and intercept
# Output:
# coef= [828.46507522]
# intercept= -1632210.7578554575
print("coef=", coef)
print("intercept=", intercept)

# Plot the original data points along with the regression line
# The regression line shows the predicted values based on the model
# Labels are set on the Axes object; assigning to plt.xlabel / plt.ylabel would
# overwrite those functions instead of labeling the plot.
fig, ax = plt.subplots()
ax.set_xlabel('Year')
ax.set_ylabel('Per Capita Income USD')
ax.scatter(newdf.year, newdf.per_capita_income_usd, color='red', marker='+')
ax.plot(newdf.year, reg.predict(newdf[['year']]), color='blue')
plt.show()

# Predict per capita income for every 5 years from 2018 to 2093 (inclusive)
years5 = list(range(2018, 2094, 5))

# Convert the list of years to a DataFrame
input_year_df = pd.DataFrame(years5, columns=['year'])

# Predict the per capita income for these years
input_year_df['per_capita_usd'] = reg.predict(input_year_df)

# Display the DataFrame with the predicted values
# Output (rounded to 2 decimals; computed from the coefficient and intercept printed above):
#     year  per_capita_usd
# 0   2018        39631.76
# 1   2023        43774.09
# 2   2028        47916.41
# 3   2033        52058.74
# 4   2038        56201.07
# 5   2043        60343.39
# 6   2048        64485.72
# 7   2053        68628.04
# 8   2058        72770.37
# 9   2063        76912.69
# 10  2068        81055.02
# 11  2073        85197.34
# 12  2078        89339.67
# 13  2083        93481.99
# 14  2088        97624.32
# 15  2093       101766.64
print(input_year_df)

# Save the predictions to a CSV file
# This file will contain the predicted per capita income for every 5 years
input_year_df.to_csv("per_capita_prediction.csv", index=False)
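
As a sanity check on what reg.predict does, the fitted line can also be evaluated by hand. The sketch below plugs the coefficient and intercept printed above into y = mx + b. The helper name predict_income and the query year 2020 are assumptions chosen for illustration.

```python
# Slope (m) and intercept (b) copied from the printed output above
m = 828.46507522
b = -1632210.7578554575


def predict_income(year):
    """Evaluate the fitted line y = m*x + b for a given year (illustrative helper)."""
    return m * year + b


# Hypothetical query year, chosen only for illustration
print(round(predict_income(2020), 2))  # 41288.69
```

If reg.predict returns a noticeably different value for the same year, the model was probably fitted on different data than the coefficient and intercept shown here.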