Last Updated on July 15, 2022

A loss metric is essential for neural networks. Every machine learning model is an optimization problem of one kind or another, and the loss is the objective function to be minimized. In neural networks, the optimization is done with gradient descent and backpropagation. But what are loss functions, and how do they affect your neural networks?

In this post, we will cover what loss functions are, go over some commonly used loss functions, and see how you can apply them to your neural networks.

After reading this article, you will learn:
- What loss functions are and how they differ from metrics
- Common loss functions for regression and classification problems
- How to use loss functions in your TensorFlow model

Let's get started!

Loss functions in TensorFlow.
Photo by Ian Taylor. Some rights reserved.

Overview

This article is split into five parts; they are:

- What are loss functions?
- Mean absolute error
- Mean squared error
- Categorical cross-entropy
- Loss functions in practice

What are loss functions?

In neural networks, loss functions help optimize the performance of the model. They are usually used to measure some penalty that the model incurs on its predictions, such as the deviation of the prediction from the ground truth label. Loss functions are usually differentiable across their domain (although it is allowed for the gradient to be undefined at a few specific points, such as x = 0, which is basically ignored in practice). In the training loop, they are differentiated with respect to the parameters, and these gradients are used in the backpropagation and gradient descent steps that optimize the model on the training set.

Loss functions are also slightly different from metrics. While loss functions can tell you the performance of your model, they might not be of direct interest or easily explainable to humans. This is where metrics come in. Metrics such as accuracy are much more useful for humans to understand the performance of a neural network, even though they might not be good choices for loss functions since they might not be differentiable.
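To make the distinction concrete, here is a minimal sketch of how a loss and a metric are specified separately in Keras; the loss drives gradient descent, while the metric is only reported for human inspection. The layer sizes and input shape here are arbitrary, purely for illustration.

```python
import tensorflow as tf

# A tiny illustrative model; the sizes are arbitrary.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="softmax", input_shape=(4,)),
])

# The loss (cross-entropy) is differentiable and is what gets minimized.
# The metric (accuracy) is not differentiable but is easier to interpret.
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=["accuracy"],
)
```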

In the following, let's explore some common loss functions: the mean absolute error, the mean squared error, and categorical cross-entropy.

Mean Absolute Error

The mean absolute error (MAE) measures the absolute difference between predicted values and the ground truth labels and takes the mean of that difference across all training examples. Mathematically, it is equal to $\frac{1}{m}\sum_{i=1}^m \lvert \hat{y}_i - y_i \rvert$, where $m$ is the number of training examples and $y_i$ and $\hat{y}_i$ are the ground truth and predicted values, respectively, averaged over all training examples. The MAE is never negative and is zero only if the prediction matches the ground truth perfectly. It is an intuitive loss function and can also be used as one of your metrics, particularly for regression problems, since you want to minimize the error in your predictions.
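As a quick sanity check of the definition, the sketch below computes the MAE directly from the formula with NumPy and compares it with the built-in Keras loss; the sample values are made up.

```python
import numpy as np
import tensorflow as tf

y_true = np.array([1., 0.])
y_pred = np.array([2., 3.])

# MAE from the definition: mean of |y_hat - y| over all examples
manual_mae = np.mean(np.abs(y_pred - y_true))

keras_mae = tf.keras.losses.MeanAbsoluteError()(y_true, y_pred).numpy()
print(manual_mae, keras_mae)  # both should be 2.0
```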

Let's look at what the mean absolute error loss function looks like graphically:

Mean absolute error loss function, ground truth at x = 0, and the x-axis represents the predicted value

Similar to activation functions, you are usually also interested in what the gradient of the loss function looks like, since the gradient is what you use later for backpropagation to train your model's parameters.

There is a discontinuity in the gradient function of the mean absolute error loss, but it is typically ignored since it occurs only at x = 0, which rarely happens in practice as it is the probability of a single point in a continuous distribution.
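If you want to inspect this gradient numerically, a sketch like the following uses tf.GradientTape with arbitrary example values; away from the discontinuity, the gradient is +1 or -1 depending on the sign of the error.

```python
import tensorflow as tf

y_true = tf.constant([0.0])
y_pred = tf.Variable([2.0])  # prediction above the ground truth

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.abs(y_pred - y_true))

# d|e|/de is +1 for e > 0 and -1 for e < 0
print(tape.gradient(loss, y_pred).numpy())  # [1.]
```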

Let's take a look at how to implement this loss function in TensorFlow using the Keras losses module:

```python
import tensorflow as tf
from tensorflow.keras.losses import MeanAbsoluteError

y_true = [1., 0.]
y_pred = [2., 3.]

mae_loss = MeanAbsoluteError()
print(mae_loss(y_true, y_pred).numpy())
```

This gives us 2.0 as the output, as expected, since $\frac{1}{2}(\lvert 2-1 \rvert + \lvert 3-0 \rvert) = \frac{1}{2}(4) = 2$. Next, let's explore another loss function for regression models with slightly different properties, the mean squared error.

Mean Squared Error

Another popular loss function for regression models is the mean squared error (MSE), which is equal to $\frac{1}{m}\sum_{i=1}^m (\hat{y}_i - y_i)^2$. It is similar to the mean absolute error as it also measures the deviation of the predicted value from the ground truth value. However, the mean squared error squares this difference (which is always non-negative, since squares of real numbers are always non-negative), which gives it slightly different properties.
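The same kind of sanity check works for the MSE definition; again, the values are made up for illustration.

```python
import numpy as np
import tensorflow as tf

y_true = np.array([1., 0.])
y_pred = np.array([2., 3.])

# MSE from the definition: mean of (y_hat - y)^2 over all examples
manual_mse = np.mean((y_pred - y_true) ** 2)

keras_mse = tf.keras.losses.MeanSquaredError()(y_true, y_pred).numpy()
print(manual_mse, keras_mse)  # both should be 5.0
```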

One notable property is that the mean squared error favors a large number of small errors over a small number of large errors, which leads to models with fewer outliers, or at least outliers that are less severe, than models trained with the mean absolute error. This is because a large error has a significantly larger impact on the error, and consequently on the gradient of the error, compared to a small error.

Graphically,

Mean squared error loss function, ground truth at x = 0, and the x-axis represents the predicted value

Then, looking at the gradient,

Notice that larger errors lead to a larger magnitude of the gradient as well as a larger loss. Hence, for example, two training examples that each deviate from their ground truths by 1 unit contribute a total loss of 2, while a single training example that deviates from its ground truth by 2 units contributes a loss of 4, and therefore has a larger impact.

Let's look at how to implement the mean squared loss in TensorFlow.

```python
import tensorflow as tf
from tensorflow.keras.losses import MeanSquaredError

y_true = [1., 0.]
y_pred = [2., 3.]

mse_loss = MeanSquaredError()
print(mse_loss(y_true, y_pred).numpy())
```

This gives the output 5.0 as expected, since $\frac{1}{2}[(2-1)^2 + (3-0)^2] = \frac{1}{2}(10) = 5$. Notice that the second example, with a predicted value of 3 and an actual value of 0, contributes 90% of the error under the mean squared error versus 75% of the error under the mean absolute error.

Sometimes you may see people use the root mean squared error (RMSE) as a metric. It is the square root of the MSE. From the perspective of a loss function, MSE and RMSE are equivalent.
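For example, one way to report RMSE while still training on MSE is to add it as a metric. A minimal sketch follows; the model here is a stand-in with arbitrary sizes, assuming a regression setup.

```python
import tensorflow as tf

# A tiny illustrative regression model (layer sizes are arbitrary).
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.MeanSquaredError(),             # optimized during training
    metrics=[tf.keras.metrics.RootMeanSquaredError()],   # reported for readability
)
```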

Both MAE and MSE measure values in a continuous range, hence they are for regression problems. For classification problems, you can use categorical cross-entropy.

Categorical Cross-entropy

The previous two loss functions are for regression models, where the output can be any real number. However, for classification problems, there is a small, discrete set of numbers that the output can take. Furthermore, the numbers used to label-encode the classes are arbitrary and carry no semantic meaning (e.g., if we used the labels 0 for cat, 1 for dog, and 2 for horse, it does not mean that a dog is half cat and half horse). Therefore, the labeling should not have an impact on the performance of the model.

In a classification problem, the model's output is a vector of probabilities for each class. In Keras models, this vector is usually expected to be either "logits", i.e., real numbers to be transformed into probabilities using the softmax function, or the output of a softmax activation function.
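In practice, this means you should tell the loss which form it is receiving. As a sketch (the probabilities and logits below are arbitrary made-up values), Keras cross-entropy losses accept a from_logits flag for the raw, pre-softmax case:

```python
import tensorflow as tf

y_true = [[0, 1, 0]]

# If the model ends with a softmax, pass probabilities (default from_logits=False).
probs = [[0.15, 0.75, 0.10]]
loss_from_probs = tf.keras.losses.CategoricalCrossentropy()

# If the model outputs raw scores (logits), set from_logits=True instead.
logits = [[1.0, 2.6, 0.6]]
loss_from_logits = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

print(loss_from_probs(y_true, probs).numpy())
print(loss_from_logits(y_true, logits).numpy())
```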

The cross-entropy between two probability distributions is a measure of the difference between the two distributions. Precisely, it is $-\sum_i P(X = x_i) \log Q(X = x_i)$ for probability distributions $P$ and $Q$. In machine learning, we usually have the probability $P$ provided by the training data and $Q$ predicted by the model, where $P$ is 1 for the correct class and 0 for every other class. The predicted probability $Q$, however, usually takes values between 0 and 1. Hence, when used for classification problems in machine learning, this formula can be simplified into: $$\text{categorical cross-entropy} = -\log p_{gt}$$ where $p_{gt}$ is the model-predicted probability of the ground truth class for that particular sample.
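As a quick numerical check of this simplification (the probability 0.75 is just an example), the per-sample loss is simply the negative log of the probability the model assigns to the true class:

```python
import numpy as np

p_gt = 0.75  # model-predicted probability of the ground truth class
print(-np.log(p_gt))  # about 0.2877
```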

Cross-entropy carries a negative sign because $\log(x)$ tends to negative infinity as $x$ tends to zero. We want a higher loss when the probability approaches 0 and a lower loss when the probability approaches 1. Graphically,

Categorical cross-entropy loss function, where x is the predicted probability of the ground truth class

Notice that the loss is exactly 0 if the probability of the ground truth class is 1, as desired. Also, as the probability of the ground truth class tends to 0, the loss tends to positive infinity, substantially penalizing bad predictions. You might recognize this loss function from logistic regression; they are similar, except the logistic regression loss is specific to the case of binary classes.

Now, looking at the gradient of the cross-entropy loss,

Looking at the gradient, we can see that it is generally negative, which is also expected, since to decrease this loss, we want the probability of the ground truth class to be as high as possible. Recall that gradient descent goes in the opposite direction of the gradient.
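A small sketch with tf.GradientTape (using 0.75 as an arbitrary predicted probability) confirms the sign: the gradient of $-\log p$ with respect to $p$ is $-1/p$, which is negative for all $0 < p \le 1$.

```python
import tensorflow as tf

p = tf.Variable(0.75)  # predicted probability of the ground truth class

with tf.GradientTape() as tape:
    loss = -tf.math.log(p)  # categorical cross-entropy for a single sample

print(tape.gradient(loss, p).numpy())  # -1/0.75, i.e., about -1.33, which is negative
```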

There are two different ways to implement categorical cross-entropy in TensorFlow. The first method takes one-hot vectors as input:

```python
import tensorflow as tf
from tensorflow.keras.losses import CategoricalCrossentropy

# using one-hot vector representation
y_true = [[0, 1, 0], [1, 0, 0]]
y_pred = [[0.15, 0.75, 0.1], [0.75, 0.15, 0.1]]

cross_entropy_loss = CategoricalCrossentropy()
print(cross_entropy_loss(y_true, y_pred).numpy())
```

This gives the output 0.2876821, which is equal to $-\log(0.75)$, as expected. The other way of implementing the categorical cross-entropy loss in TensorFlow is to use a label-encoded representation of the class, where the class is represented by a single non-negative integer indicating the ground truth class.

```python
import tensorflow as tf
from tensorflow.keras.losses import SparseCategoricalCrossentropy

y_true = [1, 0]
y_pred = [[0.15, 0.75, 0.1], [0.75, 0.15, 0.1]]

cross_entropy_loss = SparseCategoricalCrossentropy()
print(cross_entropy_loss(y_true, y_pred).numpy())
```

which likewise gives the output 0.2876821.

Now that we have explored loss functions for both regression and classification models, let's take a look at how to use loss functions in your machine learning models.

Loss Functions in Practice

Let's explore how to use loss functions in practice. We will do this through a simple dense model on the MNIST digit classification dataset.

First, download the data from the Keras datasets module:

```python
import tensorflow.keras as keras

(trainX, trainY), (testX, testY) = keras.datasets.mnist.load_data()
```

Then, build the model:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input, Flatten

model = Sequential([
    Input(shape=(28, 28, 1)),
    Flatten(),
    Dense(units=84, activation="relu"),
    Dense(units=10, activation="softmax"),
])

print(model.summary())
```

And look at the model architecture output from the above code:

```
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 flatten_1 (Flatten)         (None, 784)               0

 dense_2 (Dense)             (None, 84)                65940

 dense_3 (Dense)             (None, 10)                850

=================================================================
Total params: 66,790
Trainable params: 66,790
Non-trainable params: 0
_________________________________________________________________
```

You can then compile the model, which is also where you introduce the loss function. Since this is a classification problem, use the cross-entropy loss. In particular, since the MNIST dataset in Keras datasets is represented as labels instead of one-hot vectors, use the SparseCategoricalCrossentropy loss.

```python
import tensorflow as tf

model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics="acc")
```
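Equivalently, Keras also accepts the loss by its built-in string name, which is a common shorthand; a sketch of the same compile call, continuing with the model defined above:

```python
# Same configuration, using the string identifier for the loss.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics="acc")
```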

And finally, we train our model:

```python
history = model.fit(x=trainX, y=trainY, batch_size=256, epochs=10,
                    validation_data=(testX, testY))
```

Our model trains successfully with the following output:

```
Epoch 1/10
235/235 [==============================] - 2s 6ms/step - loss: 7.8607 - acc: 0.8184 - val_loss: 1.7445 - val_acc: 0.8789
Epoch 2/10
235/235 [==============================] - 1s 6ms/step - loss: 1.1011 - acc: 0.8854 - val_loss: 0.9082 - val_acc: 0.8821
Epoch 3/10
235/235 [==============================] - 1s 6ms/step - loss: 0.5729 - acc: 0.8998 - val_loss: 0.6689 - val_acc: 0.8927
Epoch 4/10
235/235 [==============================] - 1s 5ms/step - loss: 0.3911 - acc: 0.9203 - val_loss: 0.5406 - val_acc: 0.9097
Epoch 5/10
235/235 [==============================] - 1s 6ms/step - loss: 0.3016 - acc: 0.9306 - val_loss: 0.5024 - val_acc: 0.9182
Epoch 6/10
235/235 [==============================] - 1s 6ms/step - loss: 0.2443 - acc: 0.9405 - val_loss: 0.4571 - val_acc: 0.9242
Epoch 7/10
235/235 [==============================] - 1s 5ms/step - loss: 0.2076 - acc: 0.9469 - val_loss: 0.4173 - val_acc: 0.9282
Epoch 8/10
235/235 [==============================] - 1s 5ms/step - loss: 0.1852 - acc: 0.9514 - val_loss: 0.4335 - val_acc: 0.9287
Epoch 9/10
235/235 [==============================] - 1s 6ms/step - loss: 0.1576 - acc: 0.9577 - val_loss: 0.4217 - val_acc: 0.9342
Epoch 10/10
235/235 [==============================] - 1s 5ms/step - loss: 0.1455 - acc: 0.9597 - val_loss: 0.4151 - val_acc: 0.9344
```

And that's one example of how to use a loss function in a TensorFlow model.

Further Reading

The TensorFlow/Keras documentation for the tf.keras.losses module describes these loss functions in more detail.

Conclusion

In this post, you have seen loss functions and the role they play in a neural network. You have also seen some popular loss functions used in regression and classification models, as well as how to use the cross-entropy loss function in a TensorFlow model.

Specifically, you learned:

- What loss functions are and how they differ from metrics
- Common loss functions for regression and classification problems
- How to use loss functions in your TensorFlow model