Problem Statement:
What percentage score is a student expected to achieve, given the number of hours they studied?
# Importing libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import os:
import os
os.getcwd()
# import data set:
df=pd.read_csv("D:\\Raj_DataScience\\Documents\\student_scores.csv")
df.head()
 | Hours | Scores |
---|---|---|
0 | 2.5 | 21 |
1 | 5.1 | 47 |
2 | 3.2 | 27 |
3 | 8.5 | 75 |
4 | 3.5 | 30 |
df.describe()
 | Hours | Scores |
---|---|---|
count | 25.000000 | 25.000000 |
mean | 5.012000 | 51.480000 |
std | 2.525094 | 25.286887 |
min | 1.100000 | 17.000000 |
25% | 2.700000 | 30.000000 |
50% | 4.800000 | 47.000000 |
75% | 7.400000 | 75.000000 |
max | 9.200000 | 95.000000 |
# Let's plot our data points on a 2-D graph:
df.plot(x='Hours',y='Scores',style='o')
plt.title('Hours vs Percentage')
plt.xlabel('Hours studied')
plt.ylabel('Percentage score')
plt.show()
From the above plot we can see that there is a positive linear relationship between the hours studied and the percentage score.
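One way to put a number on the visual impression is the Pearson correlation coefficient. The sketch below is only illustrative: it uses just the five rows shown by df.head() rather than the full 25-row file, and the variable names are chosen here for the example.

```python
import numpy as np

# The five rows displayed by df.head(); the full data set has 25 rows.
hours = np.array([2.5, 5.1, 3.2, 8.5, 3.5])
scores = np.array([21, 47, 27, 75, 30])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the correlation between the two variables.
r = np.corrcoef(hours, scores)[0, 1]
print("Pearson r on the first five rows:", r)
```

A value close to 1 on these rows is consistent with the strong positive linear trend seen in the plot.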
Preparing data:
For this we need to divide the data into 'attributes' and 'labels':
- attributes ---> independent variables
- labels ---> dependent variables (whose values are to be predicted)
Here, hours studied is the attribute and the percentage score is the label.
x=df.iloc[:,:-1].values
y=df.iloc[:,1].values
* The x-variable contains the attribute and y-variable contains the labels.
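The two selections return arrays of different shapes, which matters because scikit-learn expects the attributes as a 2-D array and the labels as a 1-D array. A minimal sketch of the same slicing on a plain NumPy array, assuming only the five rows shown by df.head():

```python
import numpy as np

# The five rows shown by df.head(): columns are Hours, Scores.
data = np.array([
    [2.5, 21],
    [5.1, 47],
    [3.2, 27],
    [8.5, 75],
    [3.5, 30],
])

x = data[:, :-1]   # a slice keeps two dimensions: (n_samples, 1)
y = data[:, 1]     # an integer index drops one: (n_samples,)

print(x.shape)  # (5, 1)
print(y.shape)  # (5,)
```

This is why x is taken with the slice `:-1` rather than the integer `0`: the slice preserves the column dimension.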
Split the data set into Train and Test:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
The above script assigns 80% of the data to the training set and the remaining 20% to the test set.
The test_size parameter is where we specify the proportion of the data held out for the test set.
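To see what these proportions mean for a 25-row data set, here is a rough sketch of a shuffled 80/20 split written with NumPy only. It is not scikit-learn's implementation and will not select the same rows as train_test_split; the seed and variable names are assumptions made for the example.

```python
import math
import numpy as np

n_samples = 25     # row count reported by df.describe()
test_size = 0.2

# A fixed seed plays the role that random_state plays above:
# the shuffle is random but repeatable.
rng = np.random.default_rng(0)
shuffled = rng.permutation(n_samples)

n_test = math.ceil(test_size * n_samples)
test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]

print(len(train_idx), len(test_idx))  # 20 5
```

With 25 rows, the model is therefore trained on 20 observations and evaluated on 5.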
Training Algorithm:
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train,y_train)
Here we fit a linear regression model; the fitting step itself finds the best values of the slope and the intercept.
# To retrieve the intercept:
print(regressor.intercept_)
2.018160041434683
# To retrieve the slope:
print(regressor.coef_)
[9.91065648]
This means that for every additional hour studied, the predicted score increases by about 9.91 percentage points.
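The intercept and slope together define the fitted line, score = intercept + slope * hours. A small sketch that applies the printed values by hand; the function name predict_score and the input of 1.5 hours are chosen here purely for illustration.

```python
# Values printed by regressor.intercept_ and regressor.coef_ above.
intercept = 2.018160041434683
slope = 9.91065648

def predict_score(hours):
    """Predicted percentage score for a given number of study hours."""
    return intercept + slope * hours

print(predict_score(1.5))
```

For 1.5 hours the line gives roughly 16.88, and each extra hour adds the slope, about 9.91 points, to the prediction.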
Make predictions:
y_pred=regressor.predict(x_test)
# to compare actual and predicted values:
df=pd.DataFrame({'Actual':y_test,'Predicted':y_pred})
print(df)
Actual Predicted
0 20 16.884145
1 27 33.732261
2 69 75.357018
3 30 26.794801
4 62 60.491033
Model Evaluation:
The final step is to evaluate the performance of the algorithm. This step is particularly important for judging how well the model performs on the test set, which it did not see during training.
from sklearn import metrics
print('Mean Absolute Error:',metrics.mean_absolute_error(y_test,y_pred))
print('Mean Squared Error:',metrics.mean_squared_error(y_test,y_pred))
print('Root Mean Square Error:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
Mean Absolute Error: 4.183859899002975
Mean Squared Error: 21.5987693072174
Root Mean Square Error: 4.6474476121003665
Accuracy:
print('R_Square:', metrics.r2_score(y_test,y_pred))
R_Square: 0.9454906892105356
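The four metrics can also be computed by hand from the Actual and Predicted table, which makes their definitions explicit. This is a sketch in plain Python using the rounded predictions as printed, so the results agree with the scikit-learn output only to about five decimal places.

```python
import math

# Actual and predicted scores copied from the comparison table above.
actual = [20, 27, 69, 30, 62]
predicted = [16.884145, 33.732261, 75.357018, 26.794801, 60.491033]

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / n    # mean absolute error
mse = sum(e * e for e in errors) / n     # mean squared error
rmse = math.sqrt(mse)                    # root mean squared error

# R squared: 1 minus (residual sum of squares / total sum of squares).
mean_actual = sum(actual) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R_Square:", r2)
```

An R_Square of about 0.945 means the fitted line accounts for roughly 94.5% of the variance in these five test scores, and an RMSE of about 4.65 means a typical prediction is off by under five percentage points.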