Cross-entropy loss explanation


Suppose I build a NN for classification. The last layer is a Dense layer with softmax activation. I have five different classes to classify. Suppose for a single training example, the true label is [1 0 0 0 0] while the predictions are [0.1 0.5 0.1 0.1 0.2]. How would I calculate the cross-entropy loss for this example?










      machine-learning deep-learning






asked Jul 10 '17 at 10:26 by Nain
          6 Answers
The cross-entropy formula for two distributions over a discrete variable $x$, where $q(x)$ is the estimate for the true distribution $p(x)$, is given by

$$H(p,q) = -\sum_{\forall x} p(x) \log(q(x))$$

For a neural network, the calculation is independent of these parts:

• What kind of layer was used.

• What kind of activation you use - although many activations will not be compatible with the calculation, because they can produce nonsense values if the sum of probabilities is not equal to 1. Softmax is often used for multiclass classification because it guarantees a well-behaved probability distribution function.

For a neural network, you will usually see the equation written in a form where $\mathbf{y}$ is the ground truth vector and $\mathbf{\hat{y}}$ (or some other value taken directly from the last layer's output) is the estimate. For a single example, it looks like this:

$$L = - \mathbf{y} \cdot \log(\mathbf{\hat{y}})$$

where $\cdot$ is the vector dot product.

Your example ground truth $\mathbf{y}$ gives all probability to the first value, and the other values are zero, so we can ignore them and just use the matching term from your estimates $\mathbf{\hat{y}}$:

$L = -(1 \times \log(0.1) + 0 \times \log(0.5) + \ldots)$

$L = -\log(0.1) \approx 2.303$

An important point from the comments:

That means the loss would be the same no matter whether the predictions are $[0.1, 0.5, 0.1, 0.1, 0.2]$ or $[0.1, 0.6, 0.1, 0.1, 0.1]$?

Yes, this is a key feature of multiclass log loss: it rewards/penalises the probabilities of correct classes only. The value is independent of how the remaining probability is split between incorrect classes.

You will often see this equation averaged over all examples as a cost function. The distinction is not always strictly adhered to in descriptions, but usually a loss function is lower level and describes how a single instance or component determines an error value, whilst a cost function is higher level and describes how a complete system is evaluated for optimisation. A cost function based on multiclass log loss for a data set of size $N$ might look like this:

$$J = - \frac{1}{N}\left(\sum_{i=1}^{N} \mathbf{y_i} \cdot \log(\mathbf{\hat{y}_i})\right)$$

Many implementations will require your ground truth values to be one-hot encoded (with a single true class), because that allows for some extra optimisation. However, in principle the cross-entropy loss can be calculated - and optimised - when this is not the case.
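To make the arithmetic easy to check, here is a minimal NumPy sketch of the single-example loss and the averaged cost described above (NumPy is used purely for illustration; the single-example arrays are the ones from the question, the two-example batch is made up):

    import numpy as np

    def cross_entropy(y_true, y_pred):
        """Single-example multiclass cross-entropy: L = -y . log(y_hat)."""
        return -np.sum(y_true * np.log(y_pred))

    y_true = np.array([1, 0, 0, 0, 0], dtype=float)
    y_pred = np.array([0.1, 0.5, 0.1, 0.1, 0.2])

    print(cross_entropy(y_true, y_pred))   # ~2.3026, i.e. -log(0.1)

    # Cost over a (tiny, made-up) batch: average the per-example losses.
    Y_true = np.array([[1, 0, 0, 0, 0],
                       [0, 1, 0, 0, 0]], dtype=float)
    Y_pred = np.array([[0.1, 0.5, 0.1, 0.1, 0.2],
                       [0.2, 0.6, 0.1, 0.05, 0.05]])
    J = -np.mean(np.sum(Y_true * np.log(Y_pred), axis=1))
    print(J)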






answered Jul 10 '17 at 11:56 by Neil Slater, edited Sep 30 '17 at 8:23 (score: 35)
• Okay. That means, the loss would be same no matter if the predictions are [0.1 0.5 0.1 0.1 0.2] or [0.1 0.6 0.1 0.1 0.1]? – Nain, Jul 10 '17 at 14:48

• @Nain: That is correct for your example. The cross-entropy loss does not depend on what the values of incorrect class probabilities are. – Neil Slater, Jul 10 '17 at 15:25
The answer from Neil is correct. However, I think it's important to point out that while the loss does not depend on the distribution between the incorrect classes (only the distribution between the correct class and the rest), the gradient of this loss function does affect the incorrect classes differently depending on how wrong they are. So when you use cross-entropy in machine learning, you will change weights differently for [0.1 0.5 0.1 0.1 0.2] and [0.1 0.6 0.1 0.1 0.1]. This is because the score of the correct class is normalized by the scores of all the other classes to turn it into a probability.
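As a concrete illustration of this point, here is a minimal NumPy sketch. It assumes the usual setup where these probabilities come out of a softmax over logits, in which case the gradient of the loss with respect to the logits is $p - y$; under that assumption the loss is identical for the two prediction vectors, but the signal that gets backpropagated is not:

    import numpy as np

    y = np.array([1, 0, 0, 0, 0], dtype=float)        # one-hot ground truth
    p1 = np.array([0.1, 0.5, 0.1, 0.1, 0.2])           # softmax outputs, case 1
    p2 = np.array([0.1, 0.6, 0.1, 0.1, 0.1])           # softmax outputs, case 2

    # Losses are identical: only the probability of the true class matters.
    print(-np.sum(y * np.log(p1)), -np.sum(y * np.log(p2)))   # both ~2.3026

    # Gradient of the loss w.r.t. the logits (pre-softmax scores) is p - y,
    # so the backpropagated signal differs between the two cases.
    print(p1 - y)   # [-0.9  0.5  0.1  0.1  0.2]
    print(p2 - y)   # [-0.9  0.6  0.1  0.1  0.1]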






answered Nov 13 '17 at 23:25 by Lucas Adams (score: 6)
• Can you elaborate it with a proper example? – Nain, Nov 14 '17 at 4:42

• @Lucas Adams, can you give an example please? – koryakinp, Jul 28 '18 at 17:15

• The derivative of EACH y_i (softmax output) w.r.t EACH logit z (or the parameter w itself) depends on EVERY y_i. medium.com/@aerinykim/… – Aaron, Oct 9 '18 at 8:55
Let's see how the gradient of the loss behaves... We have the cross-entropy as a loss function, which is given by

$$
H(p,q) = -\sum_{i=1}^n p(x_i) \log(q(x_i)) = -\big(p(x_1)\log(q(x_1)) + \ldots + p(x_n)\log(q(x_n))\big)
$$

Going from here, we would like to know the derivative with respect to some $x_i$:
$$
\frac{\partial}{\partial x_i} H(p,q) = -\frac{\partial}{\partial x_i} p(x_i)\log(q(x_i)),
$$
since all the other terms cancel under the differentiation. We can take this equation one step further to
$$
\frac{\partial}{\partial x_i} H(p,q) = -p(x_i)\frac{1}{q(x_i)}\frac{\partial q(x_i)}{\partial x_i}.
$$

From this we can see that we are still only penalizing the true classes (those for which $p(x_i)$ is nonzero); otherwise we just have a gradient of zero.

I do wonder how software packages deal with a predicted value of 0 while the true value is larger than zero, since we are dividing by zero in that case.
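One common way to avoid the division by zero (and the corresponding $\log(0)$) is to clip the predicted probabilities away from 0 before taking the log. The sketch below shows the general idea only; it is not a claim about how any particular library implements this internally, and the epsilon value is just illustrative:

    import numpy as np

    EPS = 1e-7  # illustrative value; real frameworks pick their own epsilon

    def safe_cross_entropy(y_true, y_pred, eps=EPS):
        """Cross-entropy with predictions clipped away from 0 and 1,
        so that log() never sees an exact zero."""
        y_pred = np.clip(y_pred, eps, 1.0 - eps)
        return -np.sum(y_true * np.log(y_pred))

    y_true = np.array([1.0, 0.0, 0.0])
    y_pred = np.array([0.0, 0.7, 0.3])   # "impossible" prediction for the true class

    print(safe_cross_entropy(y_true, y_pred))   # large but finite (~16.1) instead of inf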






answered May 22 '18 at 7:15 by zwep (score: 2)
• I think what you want is to take derivative w.r.t. the parameter, not w.r.t. x_i. – Aaron, Oct 9 '18 at 6:17
          I disagree with Lucas. The values above are already probabilities. Note that the original post indicated that the values had a softmax activation.



          The error is only propagated back on the "hot" class and the probability Q(i) does not change if the probabilities within the other classes shift between each other.






answered Feb 2 '18 at 1:50 by bluemonkey (score: 0)
• Lucas is correct. With the architecture described by the OP, the gradient at all the logits (as opposed to outputs) is not zero, because the softmax function connects them all. So the [gradient of the] error at the "hot" class propagates to all output neurons. – Neil Slater, May 22 '18 at 7:24

• +1 for Neil and Lucas – Aaron, Oct 9 '18 at 6:30
I had a small doubt: does PyTorch's built-in CrossEntropyLoss convert integer labels to 0/1 (one-hot) vectors before calculating the loss?
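For context, a minimal sketch of the usual call pattern: F.cross_entropy (the functional form of nn.CrossEntropyLoss) takes raw logits together with integer class indices, and the result matches what you get by one-hot encoding the label yourself and applying the cross-entropy formula above. The tensor values here are made up for illustration:

    import torch
    import torch.nn.functional as F

    logits = torch.tensor([[2.0, -1.0, 0.5, 0.1, -0.3]])   # raw scores, shape (batch=1, classes=5)
    target = torch.tensor([0])                              # integer class index, not one-hot

    # Built-in: applies log_softmax internally, then picks out the target class.
    loss_builtin = F.cross_entropy(logits, target)

    # Equivalent "manual" version with an explicit one-hot label.
    one_hot = F.one_hot(target, num_classes=5).float()
    loss_manual = -(one_hot * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

    print(loss_builtin.item(), loss_manual.item())          # same value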





answered 7 mins ago by user68826 (new contributor, score: 0)
The problem is that the probabilities come from a 'complicated' function that incorporates the other outputs into the given value. The outcomes are inter-connected, so we are not differentiating with respect to the actual output alone, but with respect to all the inputs of the last activation function (softmax), for each and every outcome.

I have found a very nice description at deepnotes.io/softmax-crossentropy where the author shows that the actual derivative is $p_i - y_i$.

Another neat description is at gombru.github.io/2018/05/23/cross_entropy_loss.

I think that using a simple sigmoid as the last activation layer would lead to the approved answer, but using softmax gives a different answer.
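As a quick numerical sanity check of that claim (a NumPy sketch; the finite-difference estimate and the logit values are purely illustrative), the analytic derivative $p_i - y_i$ of the softmax cross-entropy with respect to the logits matches a numerical estimate:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())          # shift for numerical stability
        return e / e.sum()

    def loss(z, y):
        return -np.sum(y * np.log(softmax(z)))

    z = np.array([2.0, -1.0, 0.5, 0.1, -0.3])   # arbitrary logits
    y = np.array([1.0, 0.0, 0.0, 0.0, 0.0])     # one-hot target

    analytic = softmax(z) - y                   # claimed derivative p_i - y_i

    # Central finite differences, one logit at a time.
    numeric = np.zeros_like(z)
    h = 1e-5
    for i in range(len(z)):
        zp, zm = z.copy(), z.copy()
        zp[i] += h
        zm[i] -= h
        numeric[i] = (loss(zp, y) - loss(zm, y)) / (2 * h)

    print(np.max(np.abs(analytic - numeric)))   # tiny: the two gradients agree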






answered Sep 25 '18 at 15:16 by guyko (score: -1)
• Welcome to Stack Exchange. However what you wrote does not seem to be an answer of the OP's question about calculating cross-entropy loss. – user12075, Sep 25 '18 at 17:05