Simulating a probability of 1 of 2^N with less than N random bitsIs the number of coin tosses of a...

Hostile work environment after whistle-blowing on coworker and our boss. What do I do?

How can I raise concerns with a new DM about XP splitting?

How to deal with or prevent idle in the test team?

"lassen" in meaning "sich fassen"

Is there a good way to store credentials outside of a password manager?

The most efficient algorithm to find all possible integer pairs which sum to a given integer

I'm in charge of equipment buying but no one's ever happy with what I choose. How to fix this?

Female=gender counterpart?

Can I Retrieve Email Addresses from BCC?

Is there enough fresh water in the world to eradicate the drinking water crisis?

How can a jailer prevent the Forge Cleric's Artisan's Blessing from being used?

Teaching indefinite integrals that require special-casing

What if somebody invests in my application?

My boss asked me to take a one-day class, then signs it up as a day off

No idea how to draw this using tikz

Can I rely on these GitHub repository files?

Golf game boilerplate

Can a malicious addon access internet history and such in chrome/firefox?

How to check participants in at events?

A known event to a history junkie

Adding empty element to declared container without declaring type of element

Was the picture area of a CRT a parallelogram (instead of a true rectangle)?

Greatest common substring

Should my PhD thesis be submitted under my legal name?



Simulating a probability of 1 of 2^N with less than N random bits


Is the number of coin tosses of a probabilistic Turing machine a Blum complexity measure?Prove that inserting $n$ sorted values in to an AVL using AVL insertion is $Thetaleft (n log left ( n right ) right )$2SUM with a weightIs transitivity required for a sorting algorithmHow to compare conditional entropy and mutual information?How to state a recurrence that expresses the worst case for good pivots?Counting words that satisfy SAT-like constraints with BDDsHigher order empirical entropy is not the entropy of the empirical distribution?How do these alternative definitions of one-way functions compare?Average-case analysis of linear search given that the desired element appears $k$ times













24












$begingroup$


Say I need to simulate the following discrete distribution:



$$
P(X = k) =
begin{cases}
frac{1}{2^N}, & text{if $k = 1$} \
1 - frac{1}{2^N}, & text{if $k = 0$}
end{cases}
$$



The most obvious way is to draw $N$ random bits and check if all of them equals to $0$ (or $1$). However, information theory says



$$
begin{align}
S & = - sum_{i} P_i log{P_i} \
& = - frac{1}{2^N} log{frac{1}{2^N}} - left(1 - frac{1}{2^N}right) log{left(1 - frac{1}{2^N}right)} \
& = frac{1}{2^N} log{2^N} + left(1 - frac{1}{2^N}right) log{frac{2^N}{2^N - 1}} \
& rightarrow 0
end{align}
$$



So the minimum number of random bits required actually decreases as $N$ goes large. How is this possible?



Please assume that we are running on a computer where bits is your only source of randomness, so you can't just tose a biased coin.










share|cite|improve this question











$endgroup$












  • $begingroup$
    This is closely related to coding theory and Kolmogorov complexity, if you're looking for keywords to dig deeper. The technique of counting repeating runs of the same bit which D.W. mentions below comes up a lot - these lecture notes touch on it for example people.cs.uchicago.edu/~fortnow/papers/kaikoura.pdf
    $endgroup$
    – Brian Gordon
    13 hours ago
















24












$begingroup$


Say I need to simulate the following discrete distribution:



$$
P(X = k) =
begin{cases}
frac{1}{2^N}, & text{if $k = 1$} \
1 - frac{1}{2^N}, & text{if $k = 0$}
end{cases}
$$



The most obvious way is to draw $N$ random bits and check if all of them equals to $0$ (or $1$). However, information theory says



$$
begin{align}
S & = - sum_{i} P_i log{P_i} \
& = - frac{1}{2^N} log{frac{1}{2^N}} - left(1 - frac{1}{2^N}right) log{left(1 - frac{1}{2^N}right)} \
& = frac{1}{2^N} log{2^N} + left(1 - frac{1}{2^N}right) log{frac{2^N}{2^N - 1}} \
& rightarrow 0
end{align}
$$



So the minimum number of random bits required actually decreases as $N$ goes large. How is this possible?



Please assume that we are running on a computer where bits is your only source of randomness, so you can't just tose a biased coin.










share|cite|improve this question











$endgroup$












  • $begingroup$
    This is closely related to coding theory and Kolmogorov complexity, if you're looking for keywords to dig deeper. The technique of counting repeating runs of the same bit which D.W. mentions below comes up a lot - these lecture notes touch on it for example people.cs.uchicago.edu/~fortnow/papers/kaikoura.pdf
    $endgroup$
    – Brian Gordon
    13 hours ago














24












24








24


6



$begingroup$


Say I need to simulate the following discrete distribution:



$$
P(X = k) =
begin{cases}
frac{1}{2^N}, & text{if $k = 1$} \
1 - frac{1}{2^N}, & text{if $k = 0$}
end{cases}
$$



The most obvious way is to draw $N$ random bits and check if all of them equals to $0$ (or $1$). However, information theory says



$$
begin{align}
S & = - sum_{i} P_i log{P_i} \
& = - frac{1}{2^N} log{frac{1}{2^N}} - left(1 - frac{1}{2^N}right) log{left(1 - frac{1}{2^N}right)} \
& = frac{1}{2^N} log{2^N} + left(1 - frac{1}{2^N}right) log{frac{2^N}{2^N - 1}} \
& rightarrow 0
end{align}
$$



So the minimum number of random bits required actually decreases as $N$ goes large. How is this possible?



Please assume that we are running on a computer where bits is your only source of randomness, so you can't just tose a biased coin.










share|cite|improve this question











$endgroup$




Say I need to simulate the following discrete distribution:



$$
P(X = k) =
begin{cases}
frac{1}{2^N}, & text{if $k = 1$} \
1 - frac{1}{2^N}, & text{if $k = 0$}
end{cases}
$$



The most obvious way is to draw $N$ random bits and check if all of them equals to $0$ (or $1$). However, information theory says



$$
begin{align}
S & = - sum_{i} P_i log{P_i} \
& = - frac{1}{2^N} log{frac{1}{2^N}} - left(1 - frac{1}{2^N}right) log{left(1 - frac{1}{2^N}right)} \
& = frac{1}{2^N} log{2^N} + left(1 - frac{1}{2^N}right) log{frac{2^N}{2^N - 1}} \
& rightarrow 0
end{align}
$$



So the minimum number of random bits required actually decreases as $N$ goes large. How is this possible?



Please assume that we are running on a computer where bits is your only source of randomness, so you can't just tose a biased coin.







algorithms information-theory randomness pseudo-random-generators entropy






share|cite|improve this question















share|cite|improve this question













share|cite|improve this question




share|cite|improve this question








edited 10 hours ago









Discrete lizard

4,43011537




4,43011537










asked yesterday









nalzoknalzok

542416




542416












  • $begingroup$
    This is closely related to coding theory and Kolmogorov complexity, if you're looking for keywords to dig deeper. The technique of counting repeating runs of the same bit which D.W. mentions below comes up a lot - these lecture notes touch on it for example people.cs.uchicago.edu/~fortnow/papers/kaikoura.pdf
    $endgroup$
    – Brian Gordon
    13 hours ago


















  • $begingroup$
    This is closely related to coding theory and Kolmogorov complexity, if you're looking for keywords to dig deeper. The technique of counting repeating runs of the same bit which D.W. mentions below comes up a lot - these lecture notes touch on it for example people.cs.uchicago.edu/~fortnow/papers/kaikoura.pdf
    $endgroup$
    – Brian Gordon
    13 hours ago
















$begingroup$
This is closely related to coding theory and Kolmogorov complexity, if you're looking for keywords to dig deeper. The technique of counting repeating runs of the same bit which D.W. mentions below comes up a lot - these lecture notes touch on it for example people.cs.uchicago.edu/~fortnow/papers/kaikoura.pdf
$endgroup$
– Brian Gordon
13 hours ago




$begingroup$
This is closely related to coding theory and Kolmogorov complexity, if you're looking for keywords to dig deeper. The technique of counting repeating runs of the same bit which D.W. mentions below comes up a lot - these lecture notes touch on it for example people.cs.uchicago.edu/~fortnow/papers/kaikoura.pdf
$endgroup$
– Brian Gordon
13 hours ago










2 Answers
2






active

oldest

votes


















24












$begingroup$

Wow, great question! Let me try to explain the resolution. It'll take three distinct steps.



The first thing to note is that the entropy is focused more on the average number of bits needed per draw, not the maximum number of bits needed.



With your sampling procedure, the maximum number of random bits needed per draw is $N$ bits, but the average number of bits needed is 2 bits (the average of a geometric distribution with $p=1/2$) -- this is because there is a $1/2$ probability that you only need 1 bit (if the first bit turns out to be 1), a $1/4$ probability that you only need 2 bits (if the first two bits turn out to be 01), a $1/8$ probability that you only need 3 bits (if the first three bits turn out to be 001), and so on.



The second thing to note is that the entropy doesn't really capture the average number of bits needed for a single draw. Instead, the entropy captures the amortized number of bits needed to sample $m$ i.i.d. draws from this distribution. Suppose we need $f(m)$ bits to sample $m$ draws; then the entropy is the limit of $f(m)/m$ as $m to infty$.



The third thing to note is that, with this distribution, you can sample $m$ i.i.d. draws with fewer bits than needed to repeatedly sample one draw. Suppose you naively decided to draw one sample (takes 2 random bits on average), then draw another sample (using 2 more random bits on average), and so on, until you've repeated this $m$ times. That would require about $2m$ random bits on average.



But it turns out there's a way to sample from $m$ draws using fewer than $2m$ bits. It's hard to believe, but it's true!



Let me give you the intuition. Suppose you wrote down the result of sampling $m$ draws, where $m$ is really large. Then the result could be specified as a $m$-bit string. This $m$-bit string will be mostly 0's, with a few 1's in it: in particular, on average it will have about $m/2^N$ 1's (could be more or less than that, but if $m$ is sufficiently large, usually the number will be close to that). The length of the gaps between the 1's are random, but will typically be somewhere vaguely in the vicinity of $2^N$ (could easily be half that or twice that or even more, but of that order of magnitude). Of course, instead of writing down the entire $m$-bit string, we could write it down more succinctly by writing down a list of the lengths of the gaps -- that carries all the same information, in a more compressed format. How much more succinct? Well, we'll usually need about $N$ bits to represent the length of each gap; and there will be about $m/2^N$ gaps; so we'll need in total about $mN/2^N$ bits (could be a bit more, could be a bit less, but if $m$ is sufficiently large, it'll usually be close to that). That's a lot shorter than a $m$-bit string.



And if there's a way to write down the string this succinctly, perhaps it won't be too surprising if that means there's a way to generate the string with a number of random bits comparable to the length of the string. In particular, you randomly generate the length of each gap; this is sampling from a geometric distribution with $p=1/2^N$, and that can be done with roughly $sim N$ random bits on average (not $2^N$). You'll need about $m/2^N$ i.i.d. draws from this geometric distribution, so you'll need in total roughly $sim Nm/2^N$ random bits. (It could be a small constant factor larger, but not too much larger.) And, notice is that this is much smaller than $2m$ bits.



So, we can sample $m$ i.i.d. draws from your distribution, using just $f(m) sim Nm/2^N$ random bits (roughly). Recall that the entropy is $lim_{m to infty} f(m)/m$. So this means that you should expect the entropy to be (roughly) $N/2^N$. That's off by a little bit, because the above calculation was sketchy and crude -- but hopefully it gives you some intuition for why the entropy is what it is, and why everything is consistent and reasonable.






share|cite|improve this answer











$endgroup$













  • $begingroup$
    Wow, great answer! But could you elaborate on why sampling from a geometric distribution with $p=frac{1}{2^N}$ takes $N$ bits on average? I know such a random variable would have a mean of $2^N$ , so it takes on average $N$ bits to store, but I suppose this doesn't mean you can generate one with $N$ bits.
    $endgroup$
    – nalzok
    yesterday












  • $begingroup$
    @nalzok, A fair question! Could you perhaps ask that as a separate question? I can see how to do it, but it's a bit messy to type up right now. If you ask perhaps someone will get to answering quicker than I can. The approach I'm thinking of is similar to arithmetic coding. Define $q_i = Pr[Xle i]$ (where $X$ is the geometric r.v.), then generate a random number $r$ in the interval $[0,1)$, and find $i$ such that $q_i le r < q_{i+1}$. If you write down the bits of the binary expension $r$ one at a time, usually after writing down $N+O(1)$ bits of $r$, $i$ will be fully determined.
    $endgroup$
    – D.W.
    23 hours ago






  • 1




    $begingroup$
    So you're basically using the inverse CDF method to convert a uniformly distributed random variable to an arbitrary distribution, combined with an idea similar to binary search? I'll need to analyze the quantile function of a geometric distribution to be sure, but this hint is enough. Thanks!
    $endgroup$
    – nalzok
    23 hours ago








  • 1




    $begingroup$
    @nalzok, ahh, yes, that's a nicer way to think about it -- lovely. Thank you for suggesting that. Yup, that's what I had in mind.
    $endgroup$
    – D.W.
    23 hours ago





















1












$begingroup$

You can think this backwards: consider the problem of binary encoding instead of generation. Suppose that you have a source that emits symbols $Xin {A,B}$ with $p(A)=2^{-N}$, $p(B)=1-2^{-N}$. For example, if $N=3$, we get $H(X)approx 0.54356$. So (Shannon tells us) there is an uniquely decodable binary encoding $X to Y$, where $Y in {0,1}$ (data bits), such that we need, on average, about $0.54356$ data bits for each original symbol $X$.



(In case you are wondering how such encoding can exists, given that we have only two source symbols, and it seems that we cannot do better that the trivial encoding , $Ato 0$, $Bto 1$ , with one bit per symbol, you need to understand that to approximate the Shannon bound we need to take "extensions" of the source, that is, to code sequences of inputs as a whole. See in particular arithmetic encoding).



Once the above is clear, if we assume we have an invertible mapping $X^n to Y^n$ , and noticing that, in the Shannon limit $Y^n$ must have maximum entropy (1 bit of information per bit of data), i.e., $Y^n$ has the statistics of a fair coin, then we have a generation scheme at hand: draw $n$ random bits (here $n$ has no relation with $N$) with a fair coin, interpret it as the output $Y^n$ of the encoder, and decode $X^n$ from it. In this way, $X^n$ will have the desired probability distribution, and we need (in average) $H(X)<1$ coins to generate each value of $X$.
coins






share|cite|improve this answer









$endgroup$













    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "419"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcs.stackexchange.com%2fquestions%2f106018%2fsimulating-a-probability-of-1-of-2n-with-less-than-n-random-bits%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    24












    $begingroup$

    Wow, great question! Let me try to explain the resolution. It'll take three distinct steps.



    The first thing to note is that the entropy is focused more on the average number of bits needed per draw, not the maximum number of bits needed.



    With your sampling procedure, the maximum number of random bits needed per draw is $N$ bits, but the average number of bits needed is 2 bits (the average of a geometric distribution with $p=1/2$) -- this is because there is a $1/2$ probability that you only need 1 bit (if the first bit turns out to be 1), a $1/4$ probability that you only need 2 bits (if the first two bits turn out to be 01), a $1/8$ probability that you only need 3 bits (if the first three bits turn out to be 001), and so on.



    The second thing to note is that the entropy doesn't really capture the average number of bits needed for a single draw. Instead, the entropy captures the amortized number of bits needed to sample $m$ i.i.d. draws from this distribution. Suppose we need $f(m)$ bits to sample $m$ draws; then the entropy is the limit of $f(m)/m$ as $m to infty$.



    The third thing to note is that, with this distribution, you can sample $m$ i.i.d. draws with fewer bits than needed to repeatedly sample one draw. Suppose you naively decided to draw one sample (takes 2 random bits on average), then draw another sample (using 2 more random bits on average), and so on, until you've repeated this $m$ times. That would require about $2m$ random bits on average.



    But it turns out there's a way to sample from $m$ draws using fewer than $2m$ bits. It's hard to believe, but it's true!



    Let me give you the intuition. Suppose you wrote down the result of sampling $m$ draws, where $m$ is really large. Then the result could be specified as a $m$-bit string. This $m$-bit string will be mostly 0's, with a few 1's in it: in particular, on average it will have about $m/2^N$ 1's (could be more or less than that, but if $m$ is sufficiently large, usually the number will be close to that). The length of the gaps between the 1's are random, but will typically be somewhere vaguely in the vicinity of $2^N$ (could easily be half that or twice that or even more, but of that order of magnitude). Of course, instead of writing down the entire $m$-bit string, we could write it down more succinctly by writing down a list of the lengths of the gaps -- that carries all the same information, in a more compressed format. How much more succinct? Well, we'll usually need about $N$ bits to represent the length of each gap; and there will be about $m/2^N$ gaps; so we'll need in total about $mN/2^N$ bits (could be a bit more, could be a bit less, but if $m$ is sufficiently large, it'll usually be close to that). That's a lot shorter than a $m$-bit string.



    And if there's a way to write down the string this succinctly, perhaps it won't be too surprising if that means there's a way to generate the string with a number of random bits comparable to the length of the string. In particular, you randomly generate the length of each gap; this is sampling from a geometric distribution with $p=1/2^N$, and that can be done with roughly $sim N$ random bits on average (not $2^N$). You'll need about $m/2^N$ i.i.d. draws from this geometric distribution, so you'll need in total roughly $sim Nm/2^N$ random bits. (It could be a small constant factor larger, but not too much larger.) And, notice is that this is much smaller than $2m$ bits.



    So, we can sample $m$ i.i.d. draws from your distribution, using just $f(m) sim Nm/2^N$ random bits (roughly). Recall that the entropy is $lim_{m to infty} f(m)/m$. So this means that you should expect the entropy to be (roughly) $N/2^N$. That's off by a little bit, because the above calculation was sketchy and crude -- but hopefully it gives you some intuition for why the entropy is what it is, and why everything is consistent and reasonable.






    share|cite|improve this answer











    $endgroup$













    • $begingroup$
      Wow, great answer! But could you elaborate on why sampling from a geometric distribution with $p=frac{1}{2^N}$ takes $N$ bits on average? I know such a random variable would have a mean of $2^N$ , so it takes on average $N$ bits to store, but I suppose this doesn't mean you can generate one with $N$ bits.
      $endgroup$
      – nalzok
      yesterday












    • $begingroup$
      @nalzok, A fair question! Could you perhaps ask that as a separate question? I can see how to do it, but it's a bit messy to type up right now. If you ask perhaps someone will get to answering quicker than I can. The approach I'm thinking of is similar to arithmetic coding. Define $q_i = Pr[Xle i]$ (where $X$ is the geometric r.v.), then generate a random number $r$ in the interval $[0,1)$, and find $i$ such that $q_i le r < q_{i+1}$. If you write down the bits of the binary expension $r$ one at a time, usually after writing down $N+O(1)$ bits of $r$, $i$ will be fully determined.
      $endgroup$
      – D.W.
      23 hours ago






    • 1




      $begingroup$
      So you're basically using the inverse CDF method to convert a uniformly distributed random variable to an arbitrary distribution, combined with an idea similar to binary search? I'll need to analyze the quantile function of a geometric distribution to be sure, but this hint is enough. Thanks!
      $endgroup$
      – nalzok
      23 hours ago








    • 1




      $begingroup$
      @nalzok, ahh, yes, that's a nicer way to think about it -- lovely. Thank you for suggesting that. Yup, that's what I had in mind.
      $endgroup$
      – D.W.
      23 hours ago


















    24












    $begingroup$

    Wow, great question! Let me try to explain the resolution. It'll take three distinct steps.



    The first thing to note is that the entropy is focused more on the average number of bits needed per draw, not the maximum number of bits needed.



    With your sampling procedure, the maximum number of random bits needed per draw is $N$ bits, but the average number of bits needed is 2 bits (the average of a geometric distribution with $p=1/2$) -- this is because there is a $1/2$ probability that you only need 1 bit (if the first bit turns out to be 1), a $1/4$ probability that you only need 2 bits (if the first two bits turn out to be 01), a $1/8$ probability that you only need 3 bits (if the first three bits turn out to be 001), and so on.



    The second thing to note is that the entropy doesn't really capture the average number of bits needed for a single draw. Instead, the entropy captures the amortized number of bits needed to sample $m$ i.i.d. draws from this distribution. Suppose we need $f(m)$ bits to sample $m$ draws; then the entropy is the limit of $f(m)/m$ as $m to infty$.



    The third thing to note is that, with this distribution, you can sample $m$ i.i.d. draws with fewer bits than needed to repeatedly sample one draw. Suppose you naively decided to draw one sample (takes 2 random bits on average), then draw another sample (using 2 more random bits on average), and so on, until you've repeated this $m$ times. That would require about $2m$ random bits on average.



    But it turns out there's a way to sample from $m$ draws using fewer than $2m$ bits. It's hard to believe, but it's true!



    Let me give you the intuition. Suppose you wrote down the result of sampling $m$ draws, where $m$ is really large. Then the result could be specified as a $m$-bit string. This $m$-bit string will be mostly 0's, with a few 1's in it: in particular, on average it will have about $m/2^N$ 1's (could be more or less than that, but if $m$ is sufficiently large, usually the number will be close to that). The length of the gaps between the 1's are random, but will typically be somewhere vaguely in the vicinity of $2^N$ (could easily be half that or twice that or even more, but of that order of magnitude). Of course, instead of writing down the entire $m$-bit string, we could write it down more succinctly by writing down a list of the lengths of the gaps -- that carries all the same information, in a more compressed format. How much more succinct? Well, we'll usually need about $N$ bits to represent the length of each gap; and there will be about $m/2^N$ gaps; so we'll need in total about $mN/2^N$ bits (could be a bit more, could be a bit less, but if $m$ is sufficiently large, it'll usually be close to that). That's a lot shorter than a $m$-bit string.



    And if there's a way to write down the string this succinctly, perhaps it won't be too surprising if that means there's a way to generate the string with a number of random bits comparable to the length of the string. In particular, you randomly generate the length of each gap; this is sampling from a geometric distribution with $p=1/2^N$, and that can be done with roughly $sim N$ random bits on average (not $2^N$). You'll need about $m/2^N$ i.i.d. draws from this geometric distribution, so you'll need in total roughly $sim Nm/2^N$ random bits. (It could be a small constant factor larger, but not too much larger.) And, notice is that this is much smaller than $2m$ bits.



    So, we can sample $m$ i.i.d. draws from your distribution, using just $f(m) sim Nm/2^N$ random bits (roughly). Recall that the entropy is $lim_{m to infty} f(m)/m$. So this means that you should expect the entropy to be (roughly) $N/2^N$. That's off by a little bit, because the above calculation was sketchy and crude -- but hopefully it gives you some intuition for why the entropy is what it is, and why everything is consistent and reasonable.






    share|cite|improve this answer











    $endgroup$













    • $begingroup$
      Wow, great answer! But could you elaborate on why sampling from a geometric distribution with $p=frac{1}{2^N}$ takes $N$ bits on average? I know such a random variable would have a mean of $2^N$ , so it takes on average $N$ bits to store, but I suppose this doesn't mean you can generate one with $N$ bits.
      $endgroup$
      – nalzok
      yesterday












    • $begingroup$
      @nalzok, A fair question! Could you perhaps ask that as a separate question? I can see how to do it, but it's a bit messy to type up right now. If you ask perhaps someone will get to answering quicker than I can. The approach I'm thinking of is similar to arithmetic coding. Define $q_i = Pr[Xle i]$ (where $X$ is the geometric r.v.), then generate a random number $r$ in the interval $[0,1)$, and find $i$ such that $q_i le r < q_{i+1}$. If you write down the bits of the binary expension $r$ one at a time, usually after writing down $N+O(1)$ bits of $r$, $i$ will be fully determined.
      $endgroup$
      – D.W.
      23 hours ago






    • 1




      $begingroup$
      So you're basically using the inverse CDF method to convert a uniformly distributed random variable to an arbitrary distribution, combined with an idea similar to binary search? I'll need to analyze the quantile function of a geometric distribution to be sure, but this hint is enough. Thanks!
      $endgroup$
      – nalzok
      23 hours ago








    • 1




      $begingroup$
      @nalzok, ahh, yes, that's a nicer way to think about it -- lovely. Thank you for suggesting that. Yup, that's what I had in mind.
      $endgroup$
      – D.W.
      23 hours ago
















    24












    24








    24





    $begingroup$

    Wow, great question! Let me try to explain the resolution. It'll take three distinct steps.



    The first thing to note is that the entropy is focused more on the average number of bits needed per draw, not the maximum number of bits needed.



    With your sampling procedure, the maximum number of random bits needed per draw is $N$ bits, but the average number of bits needed is 2 bits (the average of a geometric distribution with $p=1/2$) -- this is because there is a $1/2$ probability that you only need 1 bit (if the first bit turns out to be 1), a $1/4$ probability that you only need 2 bits (if the first two bits turn out to be 01), a $1/8$ probability that you only need 3 bits (if the first three bits turn out to be 001), and so on.



    The second thing to note is that the entropy doesn't really capture the average number of bits needed for a single draw. Instead, the entropy captures the amortized number of bits needed to sample $m$ i.i.d. draws from this distribution. Suppose we need $f(m)$ bits to sample $m$ draws; then the entropy is the limit of $f(m)/m$ as $m to infty$.



    The third thing to note is that, with this distribution, you can sample $m$ i.i.d. draws with fewer bits than needed to repeatedly sample one draw. Suppose you naively decided to draw one sample (takes 2 random bits on average), then draw another sample (using 2 more random bits on average), and so on, until you've repeated this $m$ times. That would require about $2m$ random bits on average.



    But it turns out there's a way to sample from $m$ draws using fewer than $2m$ bits. It's hard to believe, but it's true!



    Let me give you the intuition. Suppose you wrote down the result of sampling $m$ draws, where $m$ is really large. Then the result could be specified as a $m$-bit string. This $m$-bit string will be mostly 0's, with a few 1's in it: in particular, on average it will have about $m/2^N$ 1's (could be more or less than that, but if $m$ is sufficiently large, usually the number will be close to that). The length of the gaps between the 1's are random, but will typically be somewhere vaguely in the vicinity of $2^N$ (could easily be half that or twice that or even more, but of that order of magnitude). Of course, instead of writing down the entire $m$-bit string, we could write it down more succinctly by writing down a list of the lengths of the gaps -- that carries all the same information, in a more compressed format. How much more succinct? Well, we'll usually need about $N$ bits to represent the length of each gap; and there will be about $m/2^N$ gaps; so we'll need in total about $mN/2^N$ bits (could be a bit more, could be a bit less, but if $m$ is sufficiently large, it'll usually be close to that). That's a lot shorter than a $m$-bit string.



    And if there's a way to write down the string this succinctly, perhaps it won't be too surprising if that means there's a way to generate the string with a number of random bits comparable to the length of the string. In particular, you randomly generate the length of each gap; this is sampling from a geometric distribution with $p=1/2^N$, and that can be done with roughly $sim N$ random bits on average (not $2^N$). You'll need about $m/2^N$ i.i.d. draws from this geometric distribution, so you'll need in total roughly $sim Nm/2^N$ random bits. (It could be a small constant factor larger, but not too much larger.) And, notice is that this is much smaller than $2m$ bits.



    So, we can sample $m$ i.i.d. draws from your distribution, using just $f(m) sim Nm/2^N$ random bits (roughly). Recall that the entropy is $lim_{m to infty} f(m)/m$. So this means that you should expect the entropy to be (roughly) $N/2^N$. That's off by a little bit, because the above calculation was sketchy and crude -- but hopefully it gives you some intuition for why the entropy is what it is, and why everything is consistent and reasonable.






    share|cite|improve this answer











    $endgroup$



    Wow, great question! Let me try to explain the resolution. It'll take three distinct steps.



    The first thing to note is that the entropy is focused more on the average number of bits needed per draw, not the maximum number of bits needed.



    With your sampling procedure, the maximum number of random bits needed per draw is $N$ bits, but the average number of bits needed is 2 bits (the average of a geometric distribution with $p=1/2$) -- this is because there is a $1/2$ probability that you only need 1 bit (if the first bit turns out to be 1), a $1/4$ probability that you only need 2 bits (if the first two bits turn out to be 01), a $1/8$ probability that you only need 3 bits (if the first three bits turn out to be 001), and so on.



    The second thing to note is that the entropy doesn't really capture the average number of bits needed for a single draw. Instead, the entropy captures the amortized number of bits needed to sample $m$ i.i.d. draws from this distribution. Suppose we need $f(m)$ bits to sample $m$ draws; then the entropy is the limit of $f(m)/m$ as $m to infty$.



    The third thing to note is that, with this distribution, you can sample $m$ i.i.d. draws with fewer bits than needed to repeatedly sample one draw. Suppose you naively decided to draw one sample (takes 2 random bits on average), then draw another sample (using 2 more random bits on average), and so on, until you've repeated this $m$ times. That would require about $2m$ random bits on average.



    But it turns out there's a way to sample from $m$ draws using fewer than $2m$ bits. It's hard to believe, but it's true!



    Let me give you the intuition. Suppose you wrote down the result of sampling $m$ draws, where $m$ is really large. Then the result could be specified as a $m$-bit string. This $m$-bit string will be mostly 0's, with a few 1's in it: in particular, on average it will have about $m/2^N$ 1's (could be more or less than that, but if $m$ is sufficiently large, usually the number will be close to that). The length of the gaps between the 1's are random, but will typically be somewhere vaguely in the vicinity of $2^N$ (could easily be half that or twice that or even more, but of that order of magnitude). Of course, instead of writing down the entire $m$-bit string, we could write it down more succinctly by writing down a list of the lengths of the gaps -- that carries all the same information, in a more compressed format. How much more succinct? Well, we'll usually need about $N$ bits to represent the length of each gap; and there will be about $m/2^N$ gaps; so we'll need in total about $mN/2^N$ bits (could be a bit more, could be a bit less, but if $m$ is sufficiently large, it'll usually be close to that). That's a lot shorter than a $m$-bit string.



    And if there's a way to write down the string this succinctly, perhaps it won't be too surprising if that means there's a way to generate the string with a number of random bits comparable to the length of the string. In particular, you randomly generate the length of each gap; this is sampling from a geometric distribution with $p=1/2^N$, and that can be done with roughly $sim N$ random bits on average (not $2^N$). You'll need about $m/2^N$ i.i.d. draws from this geometric distribution, so you'll need in total roughly $sim Nm/2^N$ random bits. (It could be a small constant factor larger, but not too much larger.) And, notice is that this is much smaller than $2m$ bits.



    So, we can sample $m$ i.i.d. draws from your distribution, using just $f(m) sim Nm/2^N$ random bits (roughly). Recall that the entropy is $lim_{m to infty} f(m)/m$. So this means that you should expect the entropy to be (roughly) $N/2^N$. That's off by a little bit, because the above calculation was sketchy and crude -- but hopefully it gives you some intuition for why the entropy is what it is, and why everything is consistent and reasonable.







    share|cite|improve this answer














    share|cite|improve this answer



    share|cite|improve this answer








    edited 17 hours ago









    einpoklum

    373113




    373113










    answered yesterday









    D.W.D.W.

    102k12127292




    102k12127292












    • $begingroup$
      Wow, great answer! But could you elaborate on why sampling from a geometric distribution with $p=frac{1}{2^N}$ takes $N$ bits on average? I know such a random variable would have a mean of $2^N$ , so it takes on average $N$ bits to store, but I suppose this doesn't mean you can generate one with $N$ bits.
      $endgroup$
      – nalzok
      yesterday












    • $begingroup$
      @nalzok, A fair question! Could you perhaps ask that as a separate question? I can see how to do it, but it's a bit messy to type up right now. If you ask perhaps someone will get to answering quicker than I can. The approach I'm thinking of is similar to arithmetic coding. Define $q_i = Pr[Xle i]$ (where $X$ is the geometric r.v.), then generate a random number $r$ in the interval $[0,1)$, and find $i$ such that $q_i le r < q_{i+1}$. If you write down the bits of the binary expension $r$ one at a time, usually after writing down $N+O(1)$ bits of $r$, $i$ will be fully determined.
      $endgroup$
      – D.W.
      23 hours ago






    • 1




      $begingroup$
      So you're basically using the inverse CDF method to convert a uniformly distributed random variable to an arbitrary distribution, combined with an idea similar to binary search? I'll need to analyze the quantile function of a geometric distribution to be sure, but this hint is enough. Thanks!
      $endgroup$
      – nalzok
      23 hours ago








    • 1




      $begingroup$
      @nalzok, ahh, yes, that's a nicer way to think about it -- lovely. Thank you for suggesting that. Yup, that's what I had in mind.
      $endgroup$
      – D.W.
      23 hours ago




















    • $begingroup$
      Wow, great answer! But could you elaborate on why sampling from a geometric distribution with $p=frac{1}{2^N}$ takes $N$ bits on average? I know such a random variable would have a mean of $2^N$ , so it takes on average $N$ bits to store, but I suppose this doesn't mean you can generate one with $N$ bits.
      $endgroup$
      – nalzok
      yesterday












    • $begingroup$
      @nalzok, A fair question! Could you perhaps ask that as a separate question? I can see how to do it, but it's a bit messy to type up right now. If you ask perhaps someone will get to answering quicker than I can. The approach I'm thinking of is similar to arithmetic coding. Define $q_i = Pr[Xle i]$ (where $X$ is the geometric r.v.), then generate a random number $r$ in the interval $[0,1)$, and find $i$ such that $q_i le r < q_{i+1}$. If you write down the bits of the binary expension $r$ one at a time, usually after writing down $N+O(1)$ bits of $r$, $i$ will be fully determined.
      $endgroup$
      – D.W.
      23 hours ago






    • 1




      $begingroup$
      So you're basically using the inverse CDF method to convert a uniformly distributed random variable to an arbitrary distribution, combined with an idea similar to binary search? I'll need to analyze the quantile function of a geometric distribution to be sure, but this hint is enough. Thanks!
      $endgroup$
      – nalzok
      23 hours ago








    • 1




      $begingroup$
      @nalzok, ahh, yes, that's a nicer way to think about it -- lovely. Thank you for suggesting that. Yup, that's what I had in mind.
      $endgroup$
      – D.W.
      23 hours ago


















    $begingroup$
    Wow, great answer! But could you elaborate on why sampling from a geometric distribution with $p=frac{1}{2^N}$ takes $N$ bits on average? I know such a random variable would have a mean of $2^N$ , so it takes on average $N$ bits to store, but I suppose this doesn't mean you can generate one with $N$ bits.
    $endgroup$
    – nalzok
    yesterday






    $begingroup$
    Wow, great answer! But could you elaborate on why sampling from a geometric distribution with $p=frac{1}{2^N}$ takes $N$ bits on average? I know such a random variable would have a mean of $2^N$ , so it takes on average $N$ bits to store, but I suppose this doesn't mean you can generate one with $N$ bits.
    $endgroup$
    – nalzok
    yesterday














    $begingroup$
    @nalzok, A fair question! Could you perhaps ask that as a separate question? I can see how to do it, but it's a bit messy to type up right now. If you ask perhaps someone will get to answering quicker than I can. The approach I'm thinking of is similar to arithmetic coding. Define $q_i = Pr[Xle i]$ (where $X$ is the geometric r.v.), then generate a random number $r$ in the interval $[0,1)$, and find $i$ such that $q_i le r < q_{i+1}$. If you write down the bits of the binary expension $r$ one at a time, usually after writing down $N+O(1)$ bits of $r$, $i$ will be fully determined.
    $endgroup$
    – D.W.
    23 hours ago




    $begingroup$
    @nalzok, A fair question! Could you perhaps ask that as a separate question? I can see how to do it, but it's a bit messy to type up right now. If you ask perhaps someone will get to answering quicker than I can. The approach I'm thinking of is similar to arithmetic coding. Define $q_i = Pr[Xle i]$ (where $X$ is the geometric r.v.), then generate a random number $r$ in the interval $[0,1)$, and find $i$ such that $q_i le r < q_{i+1}$. If you write down the bits of the binary expension $r$ one at a time, usually after writing down $N+O(1)$ bits of $r$, $i$ will be fully determined.
    $endgroup$
    – D.W.
    23 hours ago




    1




    1




    $begingroup$
    So you're basically using the inverse CDF method to convert a uniformly distributed random variable to an arbitrary distribution, combined with an idea similar to binary search? I'll need to analyze the quantile function of a geometric distribution to be sure, but this hint is enough. Thanks!
    $endgroup$
    – nalzok
    23 hours ago






    $begingroup$
    So you're basically using the inverse CDF method to convert a uniformly distributed random variable to an arbitrary distribution, combined with an idea similar to binary search? I'll need to analyze the quantile function of a geometric distribution to be sure, but this hint is enough. Thanks!
    $endgroup$
    – nalzok
    23 hours ago






    1




    1




    $begingroup$
    @nalzok, ahh, yes, that's a nicer way to think about it -- lovely. Thank you for suggesting that. Yup, that's what I had in mind.
    $endgroup$
    – D.W.
    23 hours ago






    $begingroup$
    @nalzok, ahh, yes, that's a nicer way to think about it -- lovely. Thank you for suggesting that. Yup, that's what I had in mind.
    $endgroup$
    – D.W.
    23 hours ago













    1












    $begingroup$

    You can think this backwards: consider the problem of binary encoding instead of generation. Suppose that you have a source that emits symbols $Xin {A,B}$ with $p(A)=2^{-N}$, $p(B)=1-2^{-N}$. For example, if $N=3$, we get $H(X)approx 0.54356$. So (Shannon tells us) there is an uniquely decodable binary encoding $X to Y$, where $Y in {0,1}$ (data bits), such that we need, on average, about $0.54356$ data bits for each original symbol $X$.



    (In case you are wondering how such encoding can exists, given that we have only two source symbols, and it seems that we cannot do better that the trivial encoding , $Ato 0$, $Bto 1$ , with one bit per symbol, you need to understand that to approximate the Shannon bound we need to take "extensions" of the source, that is, to code sequences of inputs as a whole. See in particular arithmetic encoding).



    Once the above is clear, if we assume we have an invertible mapping $X^n to Y^n$ , and noticing that, in the Shannon limit $Y^n$ must have maximum entropy (1 bit of information per bit of data), i.e., $Y^n$ has the statistics of a fair coin, then we have a generation scheme at hand: draw $n$ random bits (here $n$ has no relation with $N$) with a fair coin, interpret it as the output $Y^n$ of the encoder, and decode $X^n$ from it. In this way, $X^n$ will have the desired probability distribution, and we need (in average) $H(X)<1$ coins to generate each value of $X$.
    coins






    share|cite|improve this answer









    $endgroup$


















      1












      $begingroup$

      You can think this backwards: consider the problem of binary encoding instead of generation. Suppose that you have a source that emits symbols $Xin {A,B}$ with $p(A)=2^{-N}$, $p(B)=1-2^{-N}$. For example, if $N=3$, we get $H(X)approx 0.54356$. So (Shannon tells us) there is an uniquely decodable binary encoding $X to Y$, where $Y in {0,1}$ (data bits), such that we need, on average, about $0.54356$ data bits for each original symbol $X$.



      (In case you are wondering how such encoding can exists, given that we have only two source symbols, and it seems that we cannot do better that the trivial encoding , $Ato 0$, $Bto 1$ , with one bit per symbol, you need to understand that to approximate the Shannon bound we need to take "extensions" of the source, that is, to code sequences of inputs as a whole. See in particular arithmetic encoding).



      Once the above is clear, if we assume we have an invertible mapping $X^n to Y^n$ , and noticing that, in the Shannon limit $Y^n$ must have maximum entropy (1 bit of information per bit of data), i.e., $Y^n$ has the statistics of a fair coin, then we have a generation scheme at hand: draw $n$ random bits (here $n$ has no relation with $N$) with a fair coin, interpret it as the output $Y^n$ of the encoder, and decode $X^n$ from it. In this way, $X^n$ will have the desired probability distribution, and we need (in average) $H(X)<1$ coins to generate each value of $X$.
      coins






      share|cite|improve this answer









      $endgroup$
















        1












        1








        1





        $begingroup$

        You can think this backwards: consider the problem of binary encoding instead of generation. Suppose that you have a source that emits symbols $Xin {A,B}$ with $p(A)=2^{-N}$, $p(B)=1-2^{-N}$. For example, if $N=3$, we get $H(X)approx 0.54356$. So (Shannon tells us) there is an uniquely decodable binary encoding $X to Y$, where $Y in {0,1}$ (data bits), such that we need, on average, about $0.54356$ data bits for each original symbol $X$.



        (In case you are wondering how such encoding can exists, given that we have only two source symbols, and it seems that we cannot do better that the trivial encoding , $Ato 0$, $Bto 1$ , with one bit per symbol, you need to understand that to approximate the Shannon bound we need to take "extensions" of the source, that is, to code sequences of inputs as a whole. See in particular arithmetic encoding).



        Once the above is clear, if we assume we have an invertible mapping $X^n to Y^n$ , and noticing that, in the Shannon limit $Y^n$ must have maximum entropy (1 bit of information per bit of data), i.e., $Y^n$ has the statistics of a fair coin, then we have a generation scheme at hand: draw $n$ random bits (here $n$ has no relation with $N$) with a fair coin, interpret it as the output $Y^n$ of the encoder, and decode $X^n$ from it. In this way, $X^n$ will have the desired probability distribution, and we need (in average) $H(X)<1$ coins to generate each value of $X$.
        coins






        share|cite|improve this answer









        $endgroup$



        You can think this backwards: consider the problem of binary encoding instead of generation. Suppose that you have a source that emits symbols $Xin {A,B}$ with $p(A)=2^{-N}$, $p(B)=1-2^{-N}$. For example, if $N=3$, we get $H(X)approx 0.54356$. So (Shannon tells us) there is an uniquely decodable binary encoding $X to Y$, where $Y in {0,1}$ (data bits), such that we need, on average, about $0.54356$ data bits for each original symbol $X$.



        (In case you are wondering how such encoding can exists, given that we have only two source symbols, and it seems that we cannot do better that the trivial encoding , $Ato 0$, $Bto 1$ , with one bit per symbol, you need to understand that to approximate the Shannon bound we need to take "extensions" of the source, that is, to code sequences of inputs as a whole. See in particular arithmetic encoding).



        Once the above is clear, if we assume we have an invertible mapping $X^n to Y^n$ , and noticing that, in the Shannon limit $Y^n$ must have maximum entropy (1 bit of information per bit of data), i.e., $Y^n$ has the statistics of a fair coin, then we have a generation scheme at hand: draw $n$ random bits (here $n$ has no relation with $N$) with a fair coin, interpret it as the output $Y^n$ of the encoder, and decode $X^n$ from it. In this way, $X^n$ will have the desired probability distribution, and we need (in average) $H(X)<1$ coins to generate each value of $X$.
        coins







        share|cite|improve this answer












        share|cite|improve this answer



        share|cite|improve this answer










        answered 13 hours ago









        leonbloyleonbloy

        22619




        22619






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Computer Science Stack Exchange!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            Use MathJax to format equations. MathJax reference.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcs.stackexchange.com%2fquestions%2f106018%2fsimulating-a-probability-of-1-of-2n-with-less-than-n-random-bits%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            is 'sed' thread safeWhat should someone know about using Python scripts in the shell?Nexenta bash script uses...

            How do i solve the “ No module named 'mlxtend' ” issue on Jupyter?

            Pilgersdorf Inhaltsverzeichnis Geografie | Geschichte | Bevölkerungsentwicklung | Politik | Kultur...