These are the solutions that accompany the original English edition of the textbook. The problem numbering differs somewhat from the translated Chinese edition, but the content is the same. The translated Chinese edition contains additional problems.
2.2 Entropy of functions. Let $X$ be a random variable taking on a finite number of values. What is the (general) inequality relationship between $H(X)$ and $H(Y)$ if (a) $Y = 2^X$? (b) $Y = \cos X$?

Solution: Let $y = g(x)$. Then
$$p(y) = \sum_{x:\,y=g(x)} p(x).$$
Consider any set of $x$'s that map onto a single $y$. For this set,
$$\sum_{x:\,y=g(x)} p(x)\log p(x) \le \sum_{x:\,y=g(x)} p(x)\log p(y) = p(y)\log p(y),$$
since $\log$ is a monotone increasing function and $p(x) \le \sum_{x:\,y=g(x)} p(x) = p(y)$. Extending this argument to the entire range of $X$ (and $Y$), we obtain
$$H(X) = -\sum_x p(x)\log p(x) = -\sum_y \sum_{x:\,y=g(x)} p(x)\log p(x) \ge -\sum_y p(y)\log p(y) = H(Y),$$
with equality iff $g$ is one-to-one with probability one.

(a) $Y = 2^X$ is one-to-one, and hence the entropy, which is just a function of the probabilities, does not change, i.e., $H(X) = H(Y)$.

(b) $Y = \cos X$ is not necessarily one-to-one. Hence all that we can say is that $H(X) \ge H(Y)$, with equality if cosine is one-to-one on the range of $X$.

2.16 Example of joint entropy. Let $p(x,y)$ be given by

          Y = 0    Y = 1
  X = 0    1/3      1/3
  X = 1     0       1/3

Find
(a) $H(X)$, $H(Y)$.
(b) $H(X|Y)$, $H(Y|X)$.
(c) $H(X,Y)$.
(d) $H(Y) - H(Y|X)$.
(e) $I(X;Y)$.
(f) Draw a Venn diagram for the quantities in (a) through (e).

Solution:
(a) $H(X) = \frac{2}{3}\log\frac{3}{2} + \frac{1}{3}\log 3 = 0.918$ bits $= H(Y)$.
(b) $H(X|Y) = \frac{1}{3}H(X|Y=0) + \frac{2}{3}H(X|Y=1) = 0.667$ bits $= H(Y|X)$. (Here $p(x,y) = p(x|y)p(y)$ and $H(X|Y) = H(X,Y) - H(Y)$.)
(c) $H(X,Y) = 3 \times \frac{1}{3}\log 3 = \log 3 = 1.585$ bits.
(d) $H(Y) - H(Y|X) = 0.918 - 0.667 = 0.251$ bits.
(e) $I(X;Y) = H(Y) - H(Y|X) = 0.251$ bits.
(f) See Fig. 1.

[Fig. 1: Venn diagram relating $H(X)$, $H(Y)$, $H(X|Y)$, $H(Y|X)$, $H(X,Y)$ and $I(X;Y)$.]

2.29 Inequalities. Let $X$, $Y$ and $Z$ be joint random variables. Prove the following inequalities and find conditions for equality.
(a) $H(X,Y|Z) \ge H(X|Z)$
(b) $I(X,Y;Z) \ge I(X;Z)$
(c) $H(X,Y,Z) - H(X,Y) \le H(X,Z) - H(X)$
(d) $I(X;Z|Y) \ge I(Z;Y|X) - I(Z;Y) + I(X;Z)$

Solution:
(a) Using the chain rule for conditional entropy,
$$H(X,Y|Z) = H(X|Z) + H(Y|X,Z) \ge H(X|Z),$$
with equality iff $H(Y|X,Z) = 0$, that is, when $Y$ is a function of $X$ and $Z$.

(b) Using the chain rule for mutual information,
$$I(X,Y;Z) = I(X;Z) + I(Y;Z|X) \ge I(X;Z),$$
with equality iff $I(Y;Z|X) = 0$, that is, when $Y$ and $Z$ are conditionally independent given $X$.

(c) Using first the chain rule for entropy and then the definition of conditional mutual information,
$$H(X,Y,Z) - H(X,Y) = H(Z|X,Y) = H(Z|X) - I(Y;Z|X) \le H(Z|X) = H(X,Z) - H(X),$$
with equality iff $I(Y;Z|X) = 0$, that is, when $Y$ and $Z$ are conditionally independent given $X$.

(d) Using the chain rule for mutual information,
$$I(X;Z|Y) + I(Z;Y) = I(X,Y;Z) = I(Z;Y|X) + I(X;Z),$$
and therefore this inequality is actually an equality in all cases.
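The identity in part (d) can also be checked numerically. The following Python sketch (not part of the original solutions; the alphabet sizes and random seed are arbitrary) draws a random joint pmf for $(X, Y, Z)$ and confirms that $I(X;Z|Y) - I(Z;Y|X) + I(Z;Y) - I(X;Z)$ vanishes up to rounding error.

```python
import numpy as np

rng = np.random.default_rng(0)

def H(p):
    """Entropy in bits of a (possibly multi-dimensional) pmf array."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Random joint pmf over a 2x3x2 alphabet for (X, Y, Z).
p = rng.random((2, 3, 2))
p /= p.sum()

# Marginals obtained by summing out the other variables.
p_xy, p_yz, p_xz = p.sum(2), p.sum(0), p.sum(1)
p_x, p_y, p_z = p_xy.sum(1), p_yz.sum(1), p_yz.sum(0)

# Mutual and conditional mutual informations expressed via joint entropies.
I_xz = H(p_x) + H(p_z) - H(p_xz)
I_yz = H(p_y) + H(p_z) - H(p_yz)
I_xz_given_y = H(p_xy) + H(p_yz) - H(p_y) - H(p)   # I(X;Z|Y)
I_zy_given_x = H(p_xz) + H(p_xy) - H(p_x) - H(p)   # I(Z;Y|X)

# Part (d) holds with equality: the difference below is ~0 up to rounding.
print(I_xz_given_y + I_yz - (I_zy_given_x + I_xz))
```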
4.5 Entropy rates of Markov chains.
(a) Find the entropy rate of the two-state Markov chain with transition matrix
$$P = \begin{bmatrix} 1-p_{01} & p_{01} \\ p_{10} & 1-p_{10} \end{bmatrix}.$$
(b) What values of $p_{01}$, $p_{10}$ maximize the rate of part (a)?
(c) Find the entropy rate of the two-state Markov chain with transition matrix
$$P = \begin{bmatrix} 1-p & p \\ 1 & 0 \end{bmatrix}.$$
(d) Find the maximum value of the entropy rate of the Markov chain of part (c). We expect that the maximizing value of $p$ should be less than $1/2$, since the 0 state permits more information to be generated than the 1 state.

Solution:
(a) The stationary distribution is easily calculated:
$$\pi_0 = \frac{p_{10}}{p_{01}+p_{10}}, \qquad \pi_1 = \frac{p_{01}}{p_{01}+p_{10}}.$$
Therefore the entropy rate is
$$H(X_2|X_1) = \pi_0 H(p_{01}) + \pi_1 H(p_{10}) = \frac{p_{10}H(p_{01}) + p_{01}H(p_{10})}{p_{01}+p_{10}}.$$

(b) The entropy rate is at most 1 bit because the process has only two states. This rate can be achieved if (and only if) $p_{01} = p_{10} = 1/2$, in which case the process is actually i.i.d. with $\Pr(X_i=0) = \Pr(X_i=1) = 1/2$.

(c) As a special case of the general two-state Markov chain, the entropy rate is
$$H(X_2|X_1) = \pi_0 H(p) + \pi_1 H(1) = \frac{H(p)}{p+1}.$$

(d) By straightforward calculus, we find that the maximum of $H(p)/(1+p)$ occurs at $p = (3-\sqrt{5})/2 \approx 0.382$, where $(1-p)^2 = p$. Substituting this into $H(p)$ gives $H(p) = -(1+p)\log(1-p)$, so the maximum value of the entropy rate is
$$\frac{H(p)}{1+p} = \log\frac{1}{1-p} = \log\frac{1+\sqrt{5}}{2} \approx 0.694 \text{ bits}.$$

5.4 Huffman coding. Consider the random variable
$$X = \begin{pmatrix} x_1 & x_2 & x_3 & x_4 & x_5 & x_6 & x_7 \\ 0.49 & 0.26 & 0.12 & 0.04 & 0.04 & 0.03 & 0.02 \end{pmatrix}.$$
(a) Find a binary Huffman code for $X$.
(b) Find the expected codelength for this encoding.
(c) Find a ternary Huffman code for $X$.

Solution:
(a) Building the binary Huffman tree (repeatedly merging the two least probable nodes) gives codeword lengths $(1, 2, 3, 5, 5, 5, 5)$; one valid code is
$x_1 \to 1$, $x_2 \to 00$, $x_3 \to 011$, $x_4 \to 01000$, $x_5 \to 01001$, $x_6 \to 01010$, $x_7 \to 01011$.
(b) The expected length of the codewords for the binary Huffman code is $E[L] = \sum_i p(i)\,l(i) = 2.02$ bits.
(c) The ternary Huffman tree (merging three least probable nodes at a time) gives codeword lengths $(1, 1, 2, 2, 3, 3, 3)$; one valid code is
$x_1 \to 0$, $x_2 \to 1$, $x_3 \to 20$, $x_4 \to 21$, $x_5 \to 220$, $x_6 \to 221$, $x_7 \to 222$,
with expected length 1.34 ternary digits.

5.9 Optimal code lengths that require one bit above entropy. The source coding theorem shows that the optimal code for a random variable $X$ has an expected length less than $H(X)+1$. Give an example of a random variable for which the expected length of the optimal code is close to $H(X)+1$, i.e., for any $\varepsilon > 0$, construct a distribution for which the optimal code has $L > H(X)+1-\varepsilon$.

Solution: There is a trivial example that requires almost 1 bit above its entropy. Let $X$ be a binary random variable with probability of $X=1$ close to 1. Then the entropy of $X$ is close to 0, but the length of its optimal code is 1 bit, which is almost 1 bit above its entropy.

5.25 Shannon code. Consider the following method for generating a code for a random variable $X$ which takes on $m$ values $\{1, 2, \ldots, m\}$ with probabilities $p_1, p_2, \ldots, p_m$. Assume that the probabilities are ordered so that $p_1 \ge p_2 \ge \cdots \ge p_m$. Define
$$F_i = \sum_{k=1}^{i-1} p_k,$$
the sum of the probabilities of all symbols less than $i$. Then the codeword for $i$ is the number $F_i \in [0,1]$ rounded off to $l_i$ bits, where $l_i = \lceil \log \frac{1}{p_i} \rceil$.
(a) Show that the code constructed by this process is prefix-free and the average length satisfies $H(X) \le L < H(X)+1$.
(b) Construct the code for the probability distribution $(0.5, 0.25, 0.125, 0.125)$.

Solution:
(a) Since $l_i = \lceil \log \frac{1}{p_i} \rceil$, we have
$$\log\frac{1}{p_i} \le l_i < \log\frac{1}{p_i} + 1,$$
which implies that
$$H(X) \le L = \sum_i p_i l_i < H(X) + 1.$$
By the choice of $l_i$, we have $2^{-l_i} \le p_i < 2^{-(l_i-1)}$. Thus $F_j$, $j > i$, differs from $F_i$ by at least $2^{-l_i}$, and will therefore differ from $F_i$ in at least one place in the first $l_i$ bits of the binary expansion of $F_i$. Thus the codeword for $j$, $j > i$, which has length $l_j \ge l_i$, differs from the codeword for $i$ at least once in the first $l_i$ places. Thus no codeword is a prefix of any other codeword.

(b) We build the following table:

  i   p_i     F_i     F_i in binary   l_i   codeword
  1   0.5     0.0     0.0              1     0
  2   0.25    0.5     0.10             2     10
  3   0.125   0.75    0.110            3     110
  4   0.125   0.875   0.111            3     111

3.5 AEP. Let $X_1, X_2, \ldots$ be independent identically distributed random variables drawn according to the probability mass function $p(x)$, $x \in \{1, 2, \ldots, m\}$. Thus $p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n p(x_i)$. We know that $-\frac{1}{n}\log p(X_1, X_2, \ldots, X_n) \to H(X)$ in probability. Let $q(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n q(x_i)$, where $q$ is another probability mass function on $\{1, 2, \ldots, m\}$.

(a) Evaluate $\lim -\frac{1}{n}\log q(X_1, X_2, \ldots, X_n)$, where $X_1, X_2, \ldots$ are i.i.d. $\sim p(x)$.

Solution: Since the $X_1, X_2, \ldots, X_n$ are i.i.d., so are $q(X_1), q(X_2), \ldots, q(X_n)$, and hence we can apply the strong law of large numbers to obtain
$$\lim -\frac{1}{n}\log q(X_1, X_2, \ldots, X_n) = \lim -\frac{1}{n}\sum_i \log q(X_i) = -E[\log q(X)] \quad \text{w.p. } 1$$
$$= -\sum_x p(x)\log q(x) = \sum_x p(x)\log\frac{p(x)}{q(x)} - \sum_x p(x)\log p(x) = D(p\|q) + H(p).$$
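The limit in Problem 3.5 can be illustrated by simulation. The sketch below (not part of the original solutions; the pmfs $p$ and $q$ on a three-letter alphabet are made up for illustration) estimates $-\frac{1}{n}\log q(X_1, \ldots, X_n)$ from a long i.i.d. sample drawn from $p$ and compares it with $H(p) + D(p\|q)$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative pmfs p and q on the alphabet {0, 1, 2}.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

H_p = -np.sum(p * np.log2(p))        # H(p) in bits
D_pq = np.sum(p * np.log2(p / q))    # D(p||q) in bits

n = 200_000
x = rng.choice(len(p), size=n, p=p)  # X_1, ..., X_n i.i.d. ~ p
empirical = -np.mean(np.log2(q[x]))  # -(1/n) log q(X_1, ..., X_n)

print(f"-(1/n) log q(X^n) ~ {empirical:.4f}")
print(f"H(p) + D(p||q)    = {H_p + D_pq:.4f}")
```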
8.1 Preprocessing the output. One is given a communication channel with transition probabilities $p(y|x)$ and channel capacity $C = \max_{p(x)} I(X;Y)$. A helpful statistician preprocesses the output by forming $\tilde{Y} = g(Y)$. He claims that this will strictly improve the capacity.
(a) Show that he is wrong.
(b) Under what conditions does he not strictly decrease the capacity?

Solution:
(a) The statistician calculates $\tilde{Y} = g(Y)$. Since $X \to Y \to \tilde{Y}$ forms a Markov chain, we can apply the data processing inequality. Hence for every distribution on $x$,
$$I(X;Y) \ge I(X;\tilde{Y}).$$
Let $\tilde{p}(x)$ be the distribution on $x$ that maximizes $I(X;\tilde{Y})$. Then
$$C = \max_{p(x)} I(X;Y) \ge I(X;Y)\big|_{p(x)=\tilde{p}(x)} \ge I(X;\tilde{Y})\big|_{p(x)=\tilde{p}(x)} = \max_{p(x)} I(X;\tilde{Y}) = \tilde{C}.$$
Thus, the statistician is wrong and processing the output does not increase capacity.

(b) We have equality in the above sequence of inequalities only if we have equality in the data processing inequality, i.e., for the distribution that maximizes $I(X;\tilde{Y})$, we have $X \to \tilde{Y} \to Y$ forming a Markov chain.

8.3 An additive noise channel. Find the channel capacity of the following discrete memoryless channel: $Y = X + Z$, where $\Pr\{Z=0\} = \Pr\{Z=a\} = \frac{1}{2}$. The alphabet for $X$ is $\mathcal{X} = \{0, 1\}$. Assume that $Z$ is independent of $X$. Observe that the channel capacity depends on the value of $a$.

Solution: This is a sum channel: $Y = X + Z$, $X \in \{0,1\}$, $Z \in \{0, a\}$. We have to distinguish various cases depending on the value of $a$.

$a = 0$: In this case $Y = X$, and $\max I(X;Y) = 1$. Hence the capacity is 1 bit per transmission.

$a \ne 0, \pm 1$: In this case $Y$ has four possible values $0, 1, a, 1+a$. Knowing $Y$, we know the $X$ which was sent, and hence $H(X|Y) = 0$. Hence the capacity is also 1 bit per transmission.

$a = 1$: In this case $Y$ has three possible output values, $0, 1, 2$, and the channel is identical to the binary erasure channel with erasure probability $f = \frac{1}{2}$. The capacity of this channel is $1 - f = \frac{1}{2}$ bit per transmission.

$a = -1$: This is similar to the case $a = 1$, and the capacity is also $1/2$ bit per transmission.

8.5 Channel capacity. Consider the discrete memoryless channel $Y = X + Z \pmod{11}$, where
$$Z = \begin{pmatrix} 1 & 2 & 3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}$$
and $X \in \{0, 1, \ldots, 10\}$. Assume that $Z$ is independent of $X$.
(a) Find the capacity.
(b) What is the maximizing $p^*(x)$?

Solution: The capacity of the channel is $C = \max_{p(x)} I(X;Y)$, and
$$I(X;Y) = H(Y) - H(Y|X) = H(Y) - H(Z|X) = H(Y) - H(Z),$$
so
$$I(X;Y) \le \log|\mathcal{Y}| - H(Z) = \log 11 - \log 3 = \log\frac{11}{3} \text{ bits},$$
which is obtained when $Y$ has a uniform distribution, which occurs when $X$ has a uniform distribution.
(a) The capacity of the channel is $\log\frac{11}{3}$ bits per transmission.
(b) The capacity is achieved by a uniform distribution on the inputs:
$$p(X=i) = \frac{1}{11} \quad \text{for } i = 0, 1, \ldots, 10.$$

8.12 Time-varying channels. Consider a time-varying discrete memoryless channel. Let $Y_1, Y_2, \ldots, Y_n$ be conditionally independent given $X_1, X_2, \ldots, X_n$, with conditional distribution given by $p(y|x) = \prod_{i=1}^n p_i(y_i|x_i)$, where at time $i$ the channel is a binary symmetric channel with crossover probability $p_i$. Let $X = (X_1, X_2, \ldots, X_n)$, $Y = (Y_1, Y_2, \ldots, Y_n)$. Find $\max_{p(x)} I(X;Y)$.

Solution:
$$I(X;Y) = H(Y) - H(Y|X) = H(Y) - \sum_{i=1}^n H(Y_i \mid Y_1, \ldots, Y_{i-1}, X) = H(Y) - \sum_{i=1}^n H(Y_i|X_i)$$
$$\le \sum_{i=1}^n H(Y_i) - \sum_{i=1}^n H(Y_i|X_i) \le \sum_{i=1}^n \bigl(1 - h(p_i)\bigr),$$
with equality if $X_1, X_2, \ldots, X_n$ are chosen i.i.d. uniform on $\{0,1\}$. Hence
$$\max_{p(x)} I(X;Y) = \sum_{i=1}^n \bigl(1 - h(p_i)\bigr).$$
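The closed-form capacities found in Problems 8.3, 8.5 and 8.12 can be cross-checked numerically with the Blahut-Arimoto algorithm, which maximizes $I(X;Y)$ over input distributions for any discrete memoryless channel. Below is a minimal sketch (not part of the original solutions), applied to the mod-11 channel of Problem 8.5; it converges to $\log_2\frac{11}{3} \approx 1.87$ bits, achieved by a uniform input distribution.

```python
import numpy as np

def blahut_arimoto(W, iters=500):
    """Capacity (bits) of a DMC with transition matrix W[x, y] = p(y|x)."""
    n_in = W.shape[0]
    r = np.full(n_in, 1.0 / n_in)                # input distribution, start uniform
    for _ in range(iters):
        # Posterior q(x|y) proportional to r(x) p(y|x).
        q = r[:, None] * W
        q /= q.sum(axis=0, keepdims=True)
        # Update r(x) proportional to exp(sum_y p(y|x) log q(x|y)).
        log_r = np.sum(W * np.log(q + 1e-300), axis=1)
        r = np.exp(log_r - log_r.max())
        r /= r.sum()
    # Mutual information I(r, W) in bits for the final input distribution r.
    p_y = r @ W
    ratio = np.where(W > 0, W / np.maximum(p_y, 1e-300), 1.0)
    return float(np.sum(r[:, None] * W * np.log2(ratio)))

# Mod-11 additive channel of Problem 8.5: Z uniform on {1, 2, 3}.
W = np.zeros((11, 11))
for x in range(11):
    for z in (1, 2, 3):
        W[x, (x + z) % 11] = 1.0 / 3.0

print(blahut_arimoto(W))   # ~ 1.874
print(np.log2(11 / 3))     # log2(11/3)
```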
10.2 A channel with two independent looks at Y. Let $Y_1$ and $Y_2$ be conditionally independent and conditionally identically distributed given $X$.
(a) Show that $I(X; Y_1, Y_2) = 2I(X;Y_1) - I(Y_1;Y_2)$.
(b) Conclude that the capacity of the channel $X \to (Y_1, Y_2)$ is less than twice the capacity of the channel $X \to Y_1$.

Solution:
(a)
$$I(X;Y_1,Y_2) = H(Y_1,Y_2) - H(Y_1,Y_2|X)$$
$$= H(Y_1) + H(Y_2) - I(Y_1;Y_2) - H(Y_1|X) - H(Y_2|X)$$
$$= I(X;Y_1) + I(X;Y_2) - I(Y_1;Y_2) = 2I(X;Y_1) - I(Y_1;Y_2),$$
where the second equality uses the conditional independence of $Y_1$ and $Y_2$ given $X$, and the last uses the fact that they are conditionally identically distributed.

(b) The capacity of the single-look channel $X \to Y_1$ is $C_1 = \max_{p(x)} I(X;Y_1)$. The capacity of the channel $X \to (Y_1,Y_2)$ is
$$C_2 = \max_{p(x)} I(X;Y_1,Y_2) = \max_{p(x)} \bigl[\, 2I(X;Y_1) - I(Y_1;Y_2) \,\bigr] \le \max_{p(x)} 2I(X;Y_1) = 2C_1.$$

10.3 The two-look Gaussian channel. Consider the ordinary Shannon Gaussian channel with two correlated looks at $X$, i.e., $Y = (Y_1, Y_2)$, where
$$Y_1 = X + Z_1, \qquad Y_2 = X + Z_2,$$
with a power constraint $P$ on $X$, and $(Z_1, Z_2) \sim \mathcal{N}_2(0, K)$, where
$$K = \begin{bmatrix} N & \rho N \\ \rho N & N \end{bmatrix}.$$
Find the capacity $C$ for (a) $\rho = 1$, (b) $\rho = 0$, (c) $\rho = -1$.

Solution: It is clear that the input distribution that maximizes the capacity is $X \sim \mathcal{N}(0, P)$. Evaluating the mutual information for this distribution,
$$C_2 = \max I(X;Y_1,Y_2) = h(Y_1,Y_2) - h(Y_1,Y_2|X) = h(Y_1,Y_2) - h(Z_1,Z_2|X) = h(Y_1,Y_2) - h(Z_1,Z_2).$$
Now since
$$(Z_1, Z_2) \sim \mathcal{N}\left(0, \begin{bmatrix} N & \rho N \\ \rho N & N \end{bmatrix}\right),$$
we have
$$h(Z_1,Z_2) = \frac{1}{2}\log\bigl((2\pi e)^2 |K_Z|\bigr) = \frac{1}{2}\log\bigl((2\pi e)^2 N^2(1-\rho^2)\bigr).$$
Since $Y_1 = X + Z_1$ and $Y_2 = X + Z_2$, we have
$$(Y_1,Y_2) \sim \mathcal{N}\left(0, \begin{bmatrix} P+N & P+\rho N \\ P+\rho N & P+N \end{bmatrix}\right),$$
and
$$h(Y_1,Y_2) = \frac{1}{2}\log\bigl((2\pi e)^2 |K_Y|\bigr) = \frac{1}{2}\log\Bigl((2\pi e)^2 \bigl(N^2(1-\rho^2) + 2PN(1-\rho)\bigr)\Bigr).$$
Hence
$$C = h(Y_1,Y_2) - h(Z_1,Z_2) = \frac{1}{2}\log\left(1 + \frac{2P}{N(1+\rho)}\right).$$

(a) $\rho = 1$: In this case $C = \frac{1}{2}\log\left(1 + \frac{P}{N}\right)$, which is the capacity of a single-look channel.
(b) $\rho = 0$: In this case $C = \frac{1}{2}\log\left(1 + \frac{2P}{N}\right)$, which corresponds to using twice the power in a single look. The capacity is the same as the capacity of the channel $X \to (Y_1+Y_2)/2$.
(c) $\rho = -1$: In this case $C = \infty$, which is not surprising since if we add $Y_1$ and $Y_2$ we can recover $X$ exactly.

10.4 Parallel channels and waterfilling. Consider a pair of parallel Gaussian channels, i.e.,
$$\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} + \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix}, \qquad \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}\right),$$
and there is a power constraint $E(X_1^2 + X_2^2) \le 2P$. Assume that $\sigma_1^2 > \sigma_2^2$. At what power does the channel stop behaving like a single channel with noise variance $\sigma_2^2$, and begin behaving like a pair of channels?

Solution: We will put all the signal power into the channel with less noise until the total power of noise plus signal in that channel equals the noise power in the other channel. After that, we will split any additional power evenly between the two channels. Thus the combined channel begins to behave like a pair of parallel channels when the signal power is equal to the difference of the two noise powers, i.e., when $2P = \sigma_1^2 - \sigma_2^2$.
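The threshold in Problem 10.4 can also be seen numerically from the water-filling allocation. The sketch below (not part of the original solution; the noise variances $\sigma_1^2 = 4$, $\sigma_2^2 = 1$ are illustrative values) solves for the water level by bisection and shows that channel 1 first receives power once the total power $2P$ exceeds $\sigma_1^2 - \sigma_2^2 = 3$.

```python
import numpy as np

def waterfill(total_power, noise_vars):
    """Water-filling power allocation over parallel Gaussian channels."""
    noise = np.asarray(noise_vars, dtype=float)
    # Bisect for the water level nu such that sum(max(nu - noise, 0)) = total_power.
    lo, hi = noise.min(), noise.max() + total_power
    for _ in range(100):
        nu = 0.5 * (lo + hi)
        if np.maximum(nu - noise, 0.0).sum() > total_power:
            hi = nu
        else:
            lo = nu
    return np.maximum(nu - noise, 0.0)

sigma1_sq, sigma2_sq = 4.0, 1.0                  # illustrative, sigma1^2 > sigma2^2
for P in (0.5, 1.0, 1.5, 2.0, 3.0):              # power constraint E(X1^2 + X2^2) <= 2P
    alloc = waterfill(2 * P, [sigma1_sq, sigma2_sq])
    print(f"2P = {2 * P:.1f}  ->  (P1, P2) = ({alloc[0]:.2f}, {alloc[1]:.2f})")
# Channel 1 first receives power once 2P exceeds sigma1^2 - sigma2^2 = 3.
```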