Simple Linear Regression in Practice

Problem Statement:

What percentage of marks is a student expected to score, based on the number of hours they studied?



# Importing libraries:

import numpy as np

import pandas as pd
import matplotlib.pyplot as plt

# Import os and check the current working directory:
import os
os.getcwd()
# Load the data set:
df=pd.read_csv("D:\\Raj_DataScience\\Documents\\student_scores.csv")
df.head()

   Hours  Scores
0    2.5      21
1    5.1      47
2    3.2      27
3    8.5      75
4    3.5      30

df.describe()
           Hours     Scores
count  25.000000  25.000000
mean    5.012000  51.480000
std     2.525094  25.286887
min     1.100000  17.000000
25%     2.700000  30.000000
50%     4.800000  47.000000
75%     7.400000  75.000000
max     9.200000  95.000000

# Let's plot the data points on a 2-D graph:

df.plot(x='Hours',y='Scores',style='o')
plt.title('Hours vs Percentage')
plt.xlabel('Hours studied')
plt.ylabel('Percentage score')
plt.show()



From the above plot we can see a positive linear relation between the hours studied and the percentage score: students who studied longer tended to score higher.
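The visual impression can be checked numerically with the Pearson correlation coefficient. The sketch below uses only the five preview rows printed by df.head(), so the value is illustrative and is not a statistic for the full 25-row data set; the variable names hours, scores and r are chosen for this example.

```python
import numpy as np

# The five preview rows printed by df.head(); not the full data set.
hours = np.array([2.5, 5.1, 3.2, 8.5, 3.5])
scores = np.array([21, 47, 27, 75, 30])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry
# is the Pearson correlation between the two variables.
r = np.corrcoef(hours, scores)[0, 1]
print(f"Pearson r on the preview rows: {r:.3f}")
```

A value close to +1 indicates a strong positive linear relation. On the full data set, the same check could be done with df['Hours'].corr(df['Scores']).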

Preparing data:

For this, we need to divide the data into 'attributes' and 'labels':
  • attributes ---> independent variables
  • labels ---> dependent variables (whose values are to be predicted)
Here, hours studied is the attribute and the percentage score is the label.

x=df.iloc[:,:-1].values
y=df.iloc[:,1].values

* The x variable contains the attribute (as a 2-D array with one column) and the y variable contains the labels (as a 1-D array).
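The two slicing expressions return arrays of different shapes, which matters because scikit-learn estimators expect the attributes as a 2-D array of shape (n_samples, n_features). A small sketch, assuming a stand-in DataFrame built from the five preview rows instead of the CSV file:

```python
import pandas as pd

# The five preview rows from df.head(), used as a stand-in for the CSV file.
df = pd.DataFrame({"Hours": [2.5, 5.1, 3.2, 8.5, 3.5],
                   "Scores": [21, 47, 27, 75, 30]})

x = df.iloc[:, :-1].values   # every column except the last -> 2-D array
y = df.iloc[:, 1].values     # the second column only -> 1-D array

print(x.shape)  # (5, 1)
print(y.shape)  # (5,)
```

Slicing with :-1 keeps the column dimension even when only one attribute column remains, so x can be passed to fit() without reshaping.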

Split the data set into Train and Test:

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

The above script assigns 80% of the data to the training set and 20% to the test set.
The test_size parameter specifies the proportion of the data held out for testing, and random_state fixes the shuffle so that the split is reproducible.
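To make the 80/20 arithmetic concrete, here is a simplified sketch of a shuffled split over row indices. It is not scikit-learn's implementation and will not select the same rows as train_test_split; split_indices is a hypothetical helper written for this example.

```python
import math
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle the row indices and cut them into train and test parts."""
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # seeded, so the split is repeatable
    return indices[n_test:], indices[:n_test]

# 25 rows, as reported by df.describe()
train_idx, test_idx = split_indices(n_samples=25, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))  # 20 5
```

With 25 rows, the model is therefore trained on 20 observations and evaluated on 5.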


Training the Algorithm:

from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train,y_train)

Here we fit a linear regression model; the fit() method itself finds the best values of the slope and intercept for the training data.
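For a single attribute, "best" means the least-squares line, whose slope is sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2) and whose intercept is mean_y - slope * mean_x. The sketch below applies these formulas to the five preview rows only, so its numbers will differ from the model trained on the 20 training rows; fit_line is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

# The five preview rows printed by df.head(); not the training set.
hours = np.array([2.5, 5.1, 3.2, 8.5, 3.5])
scores = np.array([21.0, 47.0, 27.0, 75.0, 30.0])

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    slope = (x_centered * y_centered).sum() / (x_centered ** 2).sum()
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

slope, intercept = fit_line(hours, scores)
print(f"slope={slope:.4f}, intercept={intercept:.4f}")
```

LinearRegression arrives at the same kind of solution for one attribute and generalizes it to several.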

# To retrieve the intercept:
print(regressor.intercept_)
2.018160041434683

# To retrieve the slope:
print(regressor.coef_)
[9.91065648]

This means that each additional hour studied is associated with an increase of about 9.91 percentage points in the expected score.
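Together, the two values define the fitted line: score = intercept + slope × hours. A minimal sketch using the numbers printed above; predict_score is a hypothetical helper, and 9.25 hours is an illustrative input, not a row from the data set.

```python
# Coefficients as printed by the fitted model above.
intercept = 2.018160041434683
slope = 9.91065648

def predict_score(hours):
    """Predicted percentage score for a given number of study hours."""
    return intercept + slope * hours

# 9.25 hours is a hypothetical input, not a row from the data set.
print(round(predict_score(9.25), 2))  # 93.69
```

With the trained model, regressor.predict([[9.25]]) should give the same value up to rounding of the printed slope.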

Make predictions:

y_pred=regressor.predict(x_test)

# To compare actual and predicted values (a new name is used so the original df is not overwritten):
comparison=pd.DataFrame({'Actual':y_test,'Predicted':y_pred})
print(comparison)

   Actual  Predicted
0      20  16.884145
1      27  33.732261
2      69  75.357018
3      30  26.794801
4      62  60.491033

Model Evaluation:
The final step is to evaluate the performance of the algorithm. This step shows how well the model predicts scores for data it did not see during training, and it makes it possible to compare different models on the same data set.

from sklearn import metrics
print('Mean Absolute Error:',metrics.mean_absolute_error(y_test,y_pred))
print('Mean Squared Error:',metrics.mean_squared_error(y_test,y_pred))
print('Root Mean Square Error:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

Mean Absolute Error: 4.183859899002975
Mean Squared Error: 21.5987693072174
Root Mean Square Error: 4.6474476121003665
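The three error metrics can be reproduced by hand from the comparison table. The sketch below copies the five actual and predicted values printed earlier (rounded to six decimals, so the results agree with the output above only to about four decimal places).

```python
import math

# Actual and predicted values copied from the comparison table above.
actual = [20, 27, 69, 30, 62]
predicted = [16.884145, 33.732261, 75.357018, 26.794801, 60.491033]

errors = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                             # root mean squared error

print(f"MAE={mae:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}")
```

MAE and RMSE are in the same unit as the score, so an RMSE of about 4.65 means the predictions are typically off by a little under 5 percentage points.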

R-squared (coefficient of determination):
print('R_Square:', metrics.r2_score(y_test,y_pred))
R_Square: 0.9454906892105356
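R-squared is the share of the variance in the actual scores that the predictions explain: 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual scores from their mean. A sketch under the same assumption as before, with values copied (and rounded) from the comparison table:

```python
# Actual and predicted values copied from the comparison table above.
actual = [20, 27, 69, 30, 62]
predicted = [16.884145, 33.732261, 75.357018, 26.794801, 60.491033]

mean_actual = sum(actual) / len(actual)

# Sum of squared prediction errors.
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
# Sum of squared deviations of the actual values from their mean.
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r_squared = 1 - ss_res / ss_tot
print(f"R_Square: {r_squared:.4f}")
```

A value near 0.945 means the line explains roughly 95% of the variation in these five test scores; with so few test rows, the figure should be read as a rough indication only.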
