Problem Statement:
What percentage of marks is a student expected to score, based on the number of hours they studied?
# Importing libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import os:
import os
os.getcwd()
# import data set:
df=pd.read_csv("D:\\Raj_DataScience\\Documents\\student_scores.csv")
df.head()
| | Hours | Scores |
|---|---|---|
| 0 | 2.5 | 21 |
| 1 | 5.1 | 47 |
| 2 | 3.2 | 27 |
| 3 | 8.5 | 75 |
| 4 | 3.5 | 30 |
df.describe()
| | Hours | Scores |
|---|---|---|
| count | 25.000000 | 25.000000 |
| mean | 5.012000 | 51.480000 |
| std | 2.525094 | 25.286887 |
| min | 1.100000 | 17.000000 |
| 25% | 2.700000 | 30.000000 |
| 50% | 4.800000 | 47.000000 |
| 75% | 7.400000 | 75.000000 |
| max | 9.200000 | 95.000000 |
# Let's plot our data points on a 2-D graph:
df.plot(x='Hours',y='Scores',style='o')
plt.title('Hours vs Percentage')
plt.xlabel('Hours studied')
plt.ylabel('Percentage score')
plt.show()

From the above plot we can see that there is a positive linear relation between the hours studied and the percentage score.
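The visual impression can also be checked numerically. The sketch below computes the Pearson correlation coefficient with NumPy, using only the five rows shown by df.head() above (retyped by hand, so this is an illustration and not a result for the full 25-row data set). A value close to +1 indicates a strong positive linear relation.

```python
import numpy as np

# The five rows shown by df.head() in the post, retyped for illustration.
hours = np.array([2.5, 5.1, 3.2, 8.5, 3.5])
scores = np.array([21, 47, 27, 75, 30], dtype=float)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between hours and scores.
r = np.corrcoef(hours, scores)[0, 1]
print(r)
```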
Preparing data:
For this we need to divide the data into 'attributes' and 'labels':
- attribute ---> independent variable
- label ---> dependent variable (whose values are to be predicted)
Here, hours studied is the attribute and percentage score is the label.
x=df.iloc[:,:-1].values
y=df.iloc[:,1].values
* The x-variable contains the attribute and y-variable contains the labels.
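One detail worth checking is the shape of the two arrays: scikit-learn expects the attributes as a 2-D array (rows by columns) and the labels as a 1-D array. A minimal sketch, assuming a small DataFrame built from the five rows shown by df.head() above instead of the CSV file:

```python
import pandas as pd

# The five rows shown by df.head() in the post, retyped for illustration.
df = pd.DataFrame({"Hours": [2.5, 5.1, 3.2, 8.5, 3.5],
                   "Scores": [21, 47, 27, 75, 30]})

x = df.iloc[:, :-1].values  # every column except the last -> 2-D array
y = df.iloc[:, 1].values    # the second column only -> 1-D array

print(x.shape, y.shape)
```

Selecting with the slice `:-1` keeps x two-dimensional even though there is only one attribute column, which is why it can be passed to the model directly.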
Split the data set into Train and Test:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
The above script assigns 80% of the data to the training set and the remaining 20% to the test set.
The test_size parameter is where we specify the proportion of the data to hold out as the test set.
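To see what the split does to the row counts, here is a small sketch that uses placeholder arrays with 25 rows (the same count that df.describe() reports above) in place of the real data. The values are made up; only the sizes are of interest.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 25 rows; the values themselves are arbitrary.
x = np.arange(25, dtype=float).reshape(-1, 1)
y = np.arange(25, dtype=float)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# With 25 rows and test_size=0.2, 20 rows go to training and 5 to testing.
print(len(x_train), len(x_test))
```

Fixing random_state makes the shuffle reproducible, so the same rows end up in the test set on every run.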
Training Algorithm:
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train,y_train)
Here, we fit a linear regression model; the fitting step itself finds the best values of the slope and intercept.
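For a single attribute, the "best" slope and intercept are the ones that minimise the sum of squared errors, and they have a closed form. The sketch below applies that textbook formula with NumPy to the five rows shown by df.head() above. The helper name fit_line is hypothetical, and this is not a description of how scikit-learn computes the fit internally; the numbers will also differ from the model above, which is trained on a different subset of the data.

```python
import numpy as np

# The five rows shown by df.head() in the post, retyped for illustration.
hours = np.array([2.5, 5.1, 3.2, 8.5, 3.5])
scores = np.array([21, 47, 27, 75, 30], dtype=float)

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept

slope, intercept = fit_line(hours, scores)
print(slope, intercept)
```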
# To retrieve the intercept:
print(regressor.intercept_)
2.018160041434683
# To retrieve the slope:
print(regressor.coef_)
[9.91065648]
This means that for every one-unit change in hours studied, the score changes by about 9.91 percentage points.
Make predictions:
y_pred=regressor.predict(x_test)
# To compare actual and predicted values:
df=pd.DataFrame({'Actual':y_test,'Predicted':y_pred})
print(df)
   Actual  Predicted
0      20  16.884145
1      27  33.732261
2      69  75.357018
3      30  26.794801
4      62  60.491033
Model Evaluation:
The final step is to evaluate the performance of the algorithm. This step is particularly important for seeing how well our algorithm performed on the test set.
from sklearn import metrics
print('Mean Absolute Error:',metrics.mean_absolute_error(y_test,y_pred))
print('Mean Squared Error:',metrics.mean_squared_error(y_test,y_pred))
print('Root Mean Square Error:', np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
Mean Absolute Error: 4.183859899002975
Mean Squared Error: 21.5987693072174
Root Mean Square Error: 4.6474476121003665
Accuracy:
print('R_Square:', metrics.r2_score(y_test,y_pred))
R_Square: 0.9454906892105356
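To make the prediction and evaluation steps concrete, the sketch below redoes them by hand with NumPy. It uses the intercept and slope printed above and the actual and predicted values from the comparison table, all retyped by hand. The helper name predict and the 1.5-hour input are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Intercept and slope as printed by the fitted model in the post.
intercept = 2.018160041434683
slope = 9.91065648

def predict(hours):
    """Apply the fitted line: score = intercept + slope * hours."""
    return intercept + slope * np.asarray(hours, dtype=float)

# Actual and predicted scores copied from the comparison table in the post.
actual = np.array([20, 27, 69, 30, 62], dtype=float)
predicted = np.array([16.884145, 33.732261, 75.357018, 26.794801, 60.491033])

errors = actual - predicted
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error
# R squared: 1 minus (residual sum of squares / total sum of squares)
r2 = 1.0 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

print(predict(1.5))  # 1.5 hours is an illustrative input
print(mae, mse, rmse, r2)
```

The hand-computed metrics agree with the scikit-learn values reported above to within rounding of the table entries. Note that R squared measures the share of variance in the test scores explained by the model, which is not the same thing as classification accuracy.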
