Main Challenges of Machine Learning:
In Machine Learning we generally split our data into two sets: training data and test data. The training set is used to train the algorithm, and the test set is used to validate the model and check its performance. Typically we use 70% or 80% of the data as the training set and the remaining 30% or 20% as the test set. There is no fixed rule for the split; it depends on how much data is available, the requirements of the project, and so on.
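As a quick illustration, here is a minimal sketch of an 80/20 split. It uses scikit-learn's train_test_split on randomly generated data; both the library choice and the toy data are just assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1,000 samples with 5 features and one target value each.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# Hold out 20% of the data as the test set; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 800 200
```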
Your main task is to select a learning algorithm and train it on some data, so the two things that can go wrong are "bad algorithm" and "bad data". Let's look at some of these challenges in ML:
- Insufficient Quantity of Training Data
- Nonrepresentative Training Data
- Poor-Quality Data
- Irrelevant Features
- Overfitting the Training Data
- Underfitting the Training Data
⇒ Insufficient Quantity of Training Data:
In general, Machine Learning algorithms work well on large training sets. If you feed insufficient training data to an algorithm, there is a high chance of errors and poor model performance on real-world data.
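One way to see this effect is a learning curve: the validation score as a function of training-set size. Below is a rough sketch using scikit-learn's learning_curve on a synthetic regression problem; the dataset, model, and split sizes are placeholders chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic regression problem with some noise.
X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=42)

# Train the same model on growing fractions of the training data.
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.05, 1.0, 5), cv=5
)

# The cross-validated score generally improves as more training data is used.
for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{size:5d} samples -> R^2 = {score:.3f}")
```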
⇒ Nonrepresentative Training Data:
In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. Otherwise, the model may learn patterns that do not hold for the cases it will actually see.
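One common safeguard, when the target classes are imbalanced, is stratified sampling, so that the training and test sets keep the same class proportions as the full dataset. A minimal sketch with scikit-learn; the imbalanced toy labels are an assumption made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with an imbalanced class distribution (about 90% class 0, 10% class 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.1).astype(int)

# stratify=y keeps the class proportions the same in both subsets,
# so the training set stays representative of the data as a whole.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.mean(), y_test.mean())  # both close to 0.1
```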
⇒ Poor-Quality Data:
Obviously, if your training data is full of errors, noise, and outliers, it will be harder for the model to detect the underlying patterns, so your system is less likely to perform well. Hence, it is essential to spend time cleaning up your training data.
For example: removing outliers and filling in missing values.
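As a small illustration, here is a sketch of both steps with pandas on a hypothetical column containing one missing value and one obvious outlier. The median fill and the 1.5 × IQR rule are common choices, not the only ones.

```python
import numpy as np
import pandas as pd

# Small table with a missing value and an obvious outlier (hypothetical data).
df = pd.DataFrame({"age": [25, 32, np.nan, 41, 29, 300]})

# Fill missing values with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Keep only rows within 1.5 * IQR of the quartiles, a common rule of thumb for outliers.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]

print(df_clean)  # the row with age 300 is dropped
```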
⇒ Irrelevant Features:
A critical part of the success of a Machine Learning project is coming up with a good set of relevant features to train on. This process, called Feature Engineering, typically involves (see the sketch after this list):
- Feature Selection: selecting the most useful features to train on among existing features.
- Feature Extraction: combining existing features to produce a more useful one.
- Creating new features by gathering new data.
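For instance, feature selection can be done with a simple univariate test. Here is a hedged sketch using scikit-learn's SelectKBest on synthetic data; the choice of k=3 and the ANOVA F-test scorer are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, only 3 of which are actually informative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, random_state=42)

# Keep the 3 features that score highest on a univariate ANOVA F-test.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print("kept feature indices:", selector.get_support(indices=True))
print("new shape:", X_selected.shape)  # (500, 3)
```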
⇒ Overfitting the Training Data:
Here the model performs well on the training data, but its performance declines on the test or validation set; this is known as overfitting. Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. The possible solutions are (see the sketch after this list):
- Gather more training data
- Reduce the noise in the training data
- Reduce the number of attributes in the training data (i.e. simplify the model)
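As a rough sketch of the last point: on a small, noisy dataset, an overly complex model fits the training set well but generalizes poorly, and simplifying it typically narrows the gap between training and test scores. The polynomial model, scikit-learn, and synthetic data below are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Small, noisy dataset: an easy setting for an overly complex model to overfit.
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(60, 1))
y = X[:, 0] ** 2 + rng.normal(scale=2.0, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# A degree-15 polynomial has far more attributes than the data supports;
# reducing it to degree 2 typically shrinks the train/test gap.
for degree in (15, 2):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree:2d}  train R^2={model.score(X_train, y_train):.2f}"
          f"  test R^2={model.score(X_test, y_test):.2f}")
```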
⇒ Underfitting the Training Data:
Here the model does not fit the training data well, which leads to low performance; its predictions are bound to be inaccurate, even on the training examples. The main options to fix the problem are (see the sketch after this list):
- Selecting a more powerful model, with more parameters
- Feeding better features to the learning algorithm (Feature Engineering)
- Reducing the constraints on the model
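For example, a plain linear model underfits data with a quadratic pattern; feeding it better features fixes the problem. The polynomial features, scikit-learn, and synthetic data below are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Quadratic data that a plain linear model is too simple to fit.
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + rng.normal(scale=0.5, size=200)

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

# The richer feature set fits even the training data much better.
print("linear model  R^2:", round(linear.score(X, y), 2))
print("poly features R^2:", round(poly.score(X, y), 2))
```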
⇒ Hyperparameter Tuning and Model Selection: