Problem Statement:

What percentage of marks is a student expected to score, based on the number of hours they studied?

# Importing libraries:

import numpy as np

import pandas as pd
import matplotlib.pyplot as plt
# import os:
import os
os.getcwd()
# import data set:
df=pd.read_csv("D:\\Raj_DataScience\\Documents\\student_scores.csv")
df.head()

	Hours	Scores
0	2.5	21
1	5.1	47
2	3.2	27
3	8.5	75
4	3.5	30

df.describe()

	Hours	Scores
count	25.000000	25.000000
mean	5.012000	51.480000
std	2.525094	25.286887
min	1.100000	17.000000
25%	2.700000	30.000000
50%	4.800000	47.000000
75%	7.400000	75.000000
max	9.200000	95.000000

# Let's plot our data points on a 2-D graph:

df.plot(x='Hours',y='Scores',style='o')
plt.title('Hours vs Percentage')
plt.xlabel('Hours studied')
plt.ylabel('Percentage score')
plt.show()

From the above plot we can see that there is a positive linear relationship between the hours studied and the percentage score.
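
One way to put a number on the trend seen in the plot is the Pearson correlation coefficient. The sketch below computes it with NumPy for only the five rows shown by df.head() above, so it illustrates the idea rather than giving a result for the full 25-row data set. The names hours and scores are introduced here for illustration.

```python
import numpy as np

# The five rows printed by df.head() above (not the full data set).
hours = np.array([2.5, 5.1, 3.2, 8.5, 3.5])
scores = np.array([21, 47, 27, 75, 30])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between hours and scores.
r = np.corrcoef(hours, scores)[0, 1]
print("Correlation on the sample rows:", r)
```

A value close to +1 is consistent with a strong positive linear relationship; on the real data you would pass df['Hours'] and df['Scores'] instead of the sample arrays.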

Preparing data:

For this we need to divide the data into 'attributes' and 'labels'.

attributes ---> independent variables
labels ----> dependent variables (whose values are to be predicted)

Here, hours studied is the attribute and the percentage score is the label.

x=df.iloc[:,:-1].values
y=df.iloc[:,1].values

* The x variable contains the attribute and the y variable contains the labels.
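
The two iloc expressions return arrays of different shapes, which matters because scikit-learn expects the attributes as a 2-D array and the labels as a 1-D array. A minimal sketch, assuming a small stand-in frame (called sample here, a name not in the original) built from the rows shown by df.head():

```python
import pandas as pd

# Small stand-in frame built from the rows shown by df.head() above.
sample = pd.DataFrame({"Hours": [2.5, 5.1, 3.2, 8.5, 3.5],
                       "Scores": [21, 47, 27, 75, 30]})

x_sample = sample.iloc[:, :-1].values  # every column except the last
y_sample = sample.iloc[:, 1].values    # the second column only

print(x_sample.shape)  # 2-D: one row per student, one column per attribute
print(y_sample.shape)  # 1-D: one label per student
```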

Split the data set into Train and Test:

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
The above script assigns 80% of the data to the training set and 20% to the test set.
The test_size parameter is where we specify the proportion of the data held out as the test set.
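
To see what test_size=0.2 does to a data set of this size, the sketch below splits placeholder arrays with 25 rows (the row count reported by df.describe()). The arrays x_demo and y_demo hold made-up values and exist only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 25 rows; the values are made up for illustration.
x_demo = np.arange(25, dtype=float).reshape(-1, 1)
y_demo = np.arange(25, dtype=float)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)

# With 25 rows and test_size=0.2, 5 rows go to the test set and 20 to training.
print(len(x_tr), len(x_te))
```

Fixing random_state makes the shuffle reproducible, so the same rows land in the test set on every run.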

Training Algorithm:
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train,y_train)

Here we fit a linear regression model; the fitting step itself finds the best values of the slope and intercept.
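
For a single attribute, the "best" slope and intercept are the ordinary least squares values, which have a simple closed form. The sketch below applies that formula to the five rows shown by df.head() only, so its numbers will differ from those of the model trained on the 20-row training set. The variable names are chosen here for illustration.

```python
import numpy as np

# The five rows printed by df.head() above (not the training set).
hours = np.array([2.5, 5.1, 3.2, 8.5, 3.5])
scores = np.array([21.0, 47.0, 27.0, 75.0, 30.0])

h_mean, s_mean = hours.mean(), scores.mean()

# Least squares slope: covariance of hours and scores over variance of hours.
slope = np.sum((hours - h_mean) * (scores - s_mean)) / np.sum((hours - h_mean) ** 2)
# The fitted line passes through the point of means.
intercept = s_mean - slope * h_mean

print("slope:", slope)
print("intercept:", intercept)
```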

# To retrieve the intercept:

print(regressor.intercept_)

2.018160041434683

# To retrieve the slope:

print(regressor.coef_)

[9.91065648]

This means that each additional hour studied is associated with an increase of about 9.91 percentage points in the predicted score.
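
With the intercept and slope in hand, a prediction is just intercept + slope * hours. The sketch below plugs in the two values printed above; the helper predict_score and the 1.5-hour input are hypothetical and only illustrate the arithmetic.

```python
# Values printed by regressor.intercept_ and regressor.coef_ above.
intercept = 2.018160041434683
slope = 9.91065648

def predict_score(hours):
    """Predicted percentage score for a given number of hours studied."""
    return intercept + slope * hours

example_hours = 1.5  # hypothetical input, chosen for illustration
predicted = predict_score(example_hours)
print(predicted)  # roughly 16.88
```

Increasing example_hours by one raises the prediction by exactly the slope, which is what the interpretation above describes.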

Make predictions:

y_pred=regressor.predict(x_test)

# to compare actual and predicted values:
df=pd.DataFrame({'Actual':y_test,'Predicted':y_pred})
print(df)

   Actual  Predicted
0      20  16.884145
1      27  33.732261
2      69  75.357018
3      30  26.794801
4      62  60.491033

Model Evaluation:
The final step is to evaluate the performance of the algorithm. This step matters because it measures how closely the predictions match the actual scores on the test set, which the model did not see during training.

from sklearn import metrics
print('Mean Absolute Error:',metrics.mean_absolute_error(y_test,y_pred))
print('Mean Squared Error:',metrics.mean_squared_error(y_test,y_pred))
print('Root Mean Square Error:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

Mean Absolute Error: 4.183859899002975
Mean Squared Error: 21.5987693072174
Root Mean Square Error: 4.6474476121003665
R-squared (goodness of fit):
print('R_Square:', metrics.r2_score(y_test,y_pred))
R_Square: 0.9454906892105356
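
To make the four metrics concrete, the sketch below recomputes them from their definitions with NumPy, using the Actual and Predicted columns printed earlier. Because the predicted values there are rounded to six decimals, the results agree with the scikit-learn output only to about four decimal places. The array names actual and predicted are introduced here for illustration.

```python
import numpy as np

# Actual and predicted values copied from the comparison table above.
actual = np.array([20, 27, 69, 30, 62], dtype=float)
predicted = np.array([16.884145, 33.732261, 75.357018, 26.794801, 60.491033])

errors = actual - predicted

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error

# R-squared: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R-squared:", r2)
```

An R-squared of about 0.945 means the model explains roughly 94.5% of the variance of the test scores around their mean; it is a goodness-of-fit measure rather than a classification-style accuracy.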

Data science with_Raj


Simple Linear regression in practice
