
Speech Signal Processing: Chinese-English Translation

Appendix: Chinese-English Translation

15 Speech Signal Processing

15.3 Analysis and Synthesis

Jesse W. Fussell

After an acoustic speech signal is converted to an electrical signal by a microphone, it may be desirable to analyze the electrical signal to estimate some time-varying parameters which provide information about a model of the speech production mechanism. Speech analysis is the process of estimating such parameters. Similarly, given some parametric model of speech production and a sequence of parameters for that model, speech synthesis is the process of creating an electrical signal which approximates speech. While analysis and synthesis techniques may be applied either to the continuous signal or to a sampled version of the signal, most modern analysis and synthesis methods are based on digital signal processing.

A typical speech production model is shown in Fig. 15.6. In this model the output of the excitation function is scaled by the gain parameter and then filtered to produce speech. All of these functions are time-varying.

FIGURE 15.6 A general speech production model.

FIGURE 15.7 Waveform of a spoken phoneme /i/ as in beet.

For many models, the parameters are varied at a periodic rate, typically 50 to 100 times per second. Most speech information is contained in the portion of the signal below about 4 kHz.

The excitation is usually modeled as either a mixture or a choice of random noise and periodic waveform. For human speech, voiced excitation occurs when the vocal folds in the larynx vibrate; unvoiced excitation occurs at constrictions in the vocal tract which create turbulent air flow [Flanagan, 1965]. The relative mix of these two types of excitation is termed "voicing." In addition, the periodic excitation is characterized by a fundamental frequency, termed pitch or F0. The excitation is scaled by a factor designed to produce the proper amplitude or level of the speech signal. The scaled excitation function is then filtered to produce the proper spectral characteristics.
While the filter may be nonlinear, it is usually modeled as a linear function.

Analysis of Excitation

In a simplified form, the excitation function may be considered to be purely periodic, for voiced speech, or purely random, for unvoiced speech. These two states correspond to voiced phonetic classes such as vowels and nasals and to unvoiced sounds such as unvoiced fricatives. This binary voicing model is an oversimplification for sounds such as voiced fricatives, which consist of a mixture of periodic and random components. Figure 15.7 is an example of a time waveform of a spoken /i/ phoneme, which is well modeled by only periodic excitation.

Both time domain and frequency domain analysis techniques have been used to estimate the degree of voicing for a short segment or frame of speech. One time domain feature, termed the zero crossing rate, is the number of times the signal changes sign in a short interval. As shown in Fig. 15.7, the zero crossing rate for voiced sounds is relatively low. Since unvoiced speech typically has a larger proportion of high-frequency energy than voiced speech, the ratio of high-frequency to low-frequency energy is a frequency domain technique that provides information on voicing.

Another measure used to estimate the degree of voicing is the autocorrelation function, which is defined for a sampled speech segment, S, at lag τ as

$$\mathrm{ACF}(\tau) = \frac{1}{N}\sum_{n=0}^{N-1-\tau} s(n)\,s(n+\tau)$$

where s(n) is the value of the nth sample within the segment of length N. Since the autocorrelation function of a periodic function is itself periodic, voicing can be estimated from the degree of periodicity of the autocorrelation function. Figure 15.8 is a graph of the nonnegative terms of the autocorrelation function for a 64-ms frame of the waveform of Fig. 15.7.
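The two time domain voicing cues described here, the zero crossing rate and the periodicity of the autocorrelation function, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 8-kHz sampling rate, the synthetic test signals (a 141-Hz sinusoid standing in for voiced speech, uniform noise standing in for unvoiced speech), the 1/N normalization of the autocorrelation, and the helper names are all choices made for the example, not details taken from the text.

```python
import math
import random

FS = 8000   # assumed sampling rate in Hz (not stated in the text)
N = 512     # 64-ms frame at 8 kHz


def zero_crossings(frame):
    """Count how many times consecutive samples differ in sign."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def autocorrelation(frame, max_lag):
    """ACF(tau) = (1/N) * sum_{n=0}^{N-1-tau} s(n) * s(n + tau), tau = 0..max_lag."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag)) / n
        for lag in range(max_lag + 1)
    ]


def periodicity(acf, min_lag):
    """Largest ACF value at or beyond min_lag, relative to the zero-lag energy."""
    return max(acf[min_lag:]) / acf[0]


# Synthetic stand-ins for real speech (hypothetical test data).
voiced = [math.sin(2 * math.pi * 141 * i / FS) for i in range(N)]
rng = random.Random(0)
unvoiced = [rng.uniform(-1.0, 1.0) for _ in range(N)]

zc_voiced = zero_crossings(voiced)
zc_unvoiced = zero_crossings(unvoiced)
p_voiced = periodicity(autocorrelation(voiced, 160), min_lag=20)
p_unvoiced = periodicity(autocorrelation(unvoiced, 160), min_lag=20)

print("zero crossings (voiced, unvoiced):", zc_voiced, zc_unvoiced)
print("ACF periodicity (voiced, unvoiced):", round(p_voiced, 2), round(p_unvoiced, 2))
```

The periodic signal should show far fewer zero crossings and a much stronger secondary autocorrelation peak than the noise. Real speech frames fall between these two extremes, so any decision threshold would have to be tuned on actual data.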
Except for the decrease in amplitude with increasing lag, which results from the rectangular window function that delimits the segment, the autocorrelation function is seen to be quite periodic for this voiced utterance.

FIGURE 15.8 Autocorrelation function of one frame of /i/.

If an analysis of the voicing of the speech signal indicates that a voiced or periodic component is present, another step in the analysis process may be to estimate the frequency (or period) of the voiced component. There are a number of ways in which this may be done. One is to measure the time lapse between peaks in the time domain signal. For example, in Fig. 15.7 the major peaks are separated by about 0.0071 s, for a fundamental frequency of about 141 Hz. Note that it would be quite possible to err in the estimate of fundamental frequency by mistaking the smaller peaks that occur between the major peaks for the major peaks. These smaller peaks are produced by a resonance in the vocal tract which, in this example, happens to be at about twice the excitation frequency. This type of error would result in an estimate of pitch approximately twice the correct frequency.

The distance between major peaks of the autocorrelation function is a closely related feature that is frequently used to estimate the pitch period. In Fig. 15.8, the distance between the major peaks in the autocorrelation function is about 0.0071 s. Estimates of pitch from the autocorrelation function are also susceptible to mistaking the first vocal tract resonance for the glottal excitation frequency.

The absolute magnitude difference function (AMDF), defined for lag τ as

$$\mathrm{AMDF}(\tau) = \frac{1}{N}\sum_{n=0}^{N-1-\tau} \left| s(n) - s(n+\tau) \right|$$

is another function which is often used in estimating the pitch of voiced speech.
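As a rough sketch of AMDF-based pitch estimation, the Python below evaluates the AMDF over a range of lags and takes the lag of the deepest valley as the pitch period. Several things here are assumptions made for illustration: the 8-kHz sampling rate, the 50 to 400 Hz search range, the 1/N normalization, the function names, and the synthetic frame (a 141-Hz fundamental plus a strong second harmonic, loosely imitating a resonance near twice the excitation frequency).

```python
import math

FS = 8000    # assumed sampling rate in Hz
N = 512      # 64-ms frame at 8 kHz
F0 = 141.0   # fundamental of the synthetic test signal, in Hz


def amdf(frame, max_lag):
    """AMDF(tau) = (1/N) * sum_{n=0}^{N-1-tau} |s(n) - s(n + tau)|, tau = 0..max_lag."""
    n = len(frame)
    return [
        sum(abs(frame[i] - frame[i + lag]) for i in range(n - lag)) / n
        for lag in range(max_lag + 1)
    ]


def estimate_pitch(frame, fs, f_min=50.0, f_max=400.0):
    """Return the pitch in Hz from the deepest AMDF valley within the search range."""
    min_lag = int(fs / f_max)
    max_lag = int(fs / f_min)
    d = amdf(frame, max_lag)
    best_lag = min(range(min_lag, max_lag + 1), key=lambda lag: d[lag])
    return fs / best_lag


# Hypothetical voiced frame: fundamental plus a strong second harmonic.
frame = [
    math.sin(2 * math.pi * F0 * i / FS)
    + 0.6 * math.sin(2 * math.pi * 2 * F0 * i / FS + 0.8)
    for i in range(N)
]

pitch_hz = estimate_pitch(frame, FS)
print("estimated pitch: %.1f Hz" % pitch_hz)
```

Only subtractions and absolute values appear in the inner loop, which is why the AMDF was attractive on hardware where multiplications were expensive. Restricting the lag range is one simple guard against picking a multiple of the true period; a more careful estimator would also compare the depths of the candidate valleys.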
An example of the AMDF is shown in Fig. 15.9 for the same 64-ms frame of the /i/ phoneme. However, the minima of the AMDF, rather than the maxima, are used as an indicator of the pitch period. The AMDF has been shown to be a good pitch period indicator [Ross et al., 1974] and does not require multiplications.

FIGURE 15.9 Absolute magnitude difference function of one frame of /i/.

Fourier Analysis

One of the more common processes for estimating the spectrum of a segment of speech is the Fourier transform [Oppenheim and Schafer, 1975]. The Fourier transform of a sequence is mathematically defined as

$$S(e^{j\omega}) = \sum_{n=-\infty}^{\infty} s(n)\,e^{-j\omega n}$$

where s(n) represents the terms of the sequence. The short-time Fourier transform of a sequence is a time dependent function, defined as

$$S_m(e^{j\omega}) = \sum_{n=-\infty}^{\infty} w(m-n)\,s(n)\,e^{-j\omega n}$$

where the window function w(n) is usually zero except for some finite range, and the variable m is used to select the section of the sequence for analysis. The discrete Fourier transform (DFT) is obtained by uniformly sampling the short-time Fourier transform in the frequency dimension. Thus an N-point DFT is computed using Eq. (15.14), where the set of N samples, s(n), may have first been multiplied by a window function. An example of the magnitude of a 512-point DFT of the waveform of the /i/ from Fig. 15.7 is shown in Fig. 15.10. Note that for this figure, the 512 points in the sequence have been multiplied by a Hamming window defined by

$$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$

and zero otherwise.

FIGURE 15.10 Magnitude of 512-point FFT of Hamming windowed /i/.

Since the spectral characteristics of speech may change dramatically in a few milliseconds, the length, type, and location of the window function are important considerations. If the window is too long, changing spectral characteristics may cause a blurred result; if the window is too short, spectral inaccuracies result. A Hamming window of 16 to 32 ms duration is commonly used for speech analysis.

Several characteristics of a speech utterance may be determined by examination of the DFT magnitude.
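The windowed DFT described above can be illustrated with a short, deliberately naive Python sketch. It assumes an 8-kHz sampling rate and a synthetic harmonic signal with a 141-Hz fundamental in place of the recorded /i/; the function names are hypothetical, and the DFT is evaluated directly from its definition rather than with an FFT, which is adequate for a single 512-point frame.

```python
import cmath
import math

FS = 8000    # assumed sampling rate in Hz
N = 512      # 64-ms frame at 8 kHz
F0 = 141.0   # fundamental of the synthetic test signal, in Hz


def hamming(n_points):
    """w(n) = 0.54 - 0.46 * cos(2 * pi * n / (N - 1)) for n = 0..N-1."""
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
        for n in range(n_points)
    ]


def dft_magnitude(frame):
    """|S(k)| for k = 0..N/2, evaluated directly from the DFT sum (O(N^2))."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


# Hypothetical voiced frame: five harmonics of F0 with 1/h amplitudes.
signal = [
    sum(math.sin(2 * math.pi * h * F0 * i / FS) / h for h in range(1, 6))
    for i in range(N)
]
window = hamming(N)
magnitude = dft_magnitude([s * w for s, w in zip(signal, window)])

peak_bin = max(range(len(magnitude)), key=lambda k: magnitude[k])
peak_hz = peak_bin * FS / N
print("strongest bin: %d (about %.1f Hz)" % (peak_bin, peak_hz))
```

With these settings the bin spacing is FS / N = 15.625 Hz, so the strongest bin can only locate the fundamental to within about one bin. In practice an FFT routine would replace `dft_magnitude`.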
In Fig. 15.10, the DFT of a voiced utterance contains a series of sharp peaks in the frequency domain. These peaks, caused by the periodic sampling action of the glottal excitation, are separated by the fundamental frequency, which is about 141 Hz in this example. In addition, broader peaks can be seen, for example at about 300 Hz and at about 2300 Hz. These broad peaks, called formants, result from resonances in the vocal tract.

Linear Predictive Analysis

Given a sampled (discrete-time) signal s(n), a powerful and general parametric model for time series analysis is

$$s(n) = -\sum_{k=1}^{p} a(k)\,s(n-k) + G\sum_{l=0}^{q} b(l)\,u(n-l)$$

where s(n) is the output and u(n) is the input (perhaps unknown). The model parameters are a(k) for k = 1, ..., p, b(l) for l = 1, ..., q, and G. b(0) is assumed to be unity. This model, described as an autoregressive moving average (ARMA) or pole-zero model, forms the foundation for the analysis method termed linear prediction. An autoregressive (AR) or all-pole model, for which all of the "b" coefficients except b(0) are zero, is frequently used for speech analysis [Markel and Gray, 1976].

In the standard AR formulation of linear prediction, the model parameters are selected to minimize the mean-squared error between the model and the speech data. In one of the variants of linear prediction, the autocorrelation method, the minimization is carried out for a windowed segment of data. In the autocorrelation method, minimizing the mean-square error of the time domain samples is equivalent to minimizing the integrated ratio of the signal spectrum to the spectrum of the all-pole model. Thus, linear predictive analysis is a good method for spectral analysis whenever the signal is produced by an all-pole system. Most speech sounds fit this model well.

One key consideration for linear predictive analysis is the order of the model, p. For speech, if the order is too small, the formant structure is not well represented.
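A minimal sketch of the autocorrelation method may help make the all-pole fit concrete. It uses the Levinson-Durbin recursion, a standard solver for the resulting normal equations that the text itself does not name, and it assumes the sign convention s(n) = -sum_k a(k) s(n-k) + G u(n). The second-order test filter, its coefficients, and all function names are hypothetical choices for the example. The test signal is the impulse response of a known all-pole filter, a case in which the method should recover the coefficients almost exactly.

```python
def all_pole_impulse_response(a, length):
    """Impulse response of s(n) = -sum_k a[k-1] * s(n-k) + u(n), u = unit impulse."""
    s = []
    for n in range(length):
        value = 1.0 if n == 0 else 0.0
        for k, coeff in enumerate(a, start=1):
            if n - k >= 0:
                value -= coeff * s[n - k]
        s.append(value)
    return s


def autocorrelation(frame, max_lag):
    """Unnormalized autocorrelation; the overall scale does not change the fit."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for a(1..order).

    Returns (a, err), where err is the remaining prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


TRUE_A = [-1.3, 0.8]                        # hypothetical stable second-order filter
segment = all_pole_impulse_response(TRUE_A, 400)
r = autocorrelation(segment, 2)
lpc, residual_energy = levinson_durbin(r, 2)
print("estimated a(k):", [round(c, 4) for c in lpc])
```

For real speech the frame would first be multiplied by a window such as a Hamming window, and the order would be set to the tenth or twelfth order typical for speech rather than 2.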
If the order is too large, pitch pulses as well as formants begin to be represented. Tenth- or twelfth-order analysis is typical for speech. Figures 15.11 and 15.12 provide examples of the spectrum produced by eighth-order and sixteenth-order linear predictive analysis of the /i/ waveform of Fig. 15.7. Figure 15.11 shows there to be three formants at frequencies of about 300, 2300, and 3200 Hz, which are typical for an /i/.

Homomorphic (Cepstral) Analysis

For the speech model of Fig. 15.6, the excitation and filter impulse response are convolved to produce the speech. One of the problems of speech analysis is to separate or deconvolve the speech into these two components. One such technique is called homomorphic filtering [Oppenheim and Schafer, 1968]. The characteristic system for a system for homomorphic deconvolution converts a convolution operation to an addition operation. The output of such a characteristic system is called the complex cepstrum. The complex cepstrum is defined as the inverse Fourier transform of the complex logarithm of the Fourier transform of the input. If the input sequence is minimum phase (i.e., the z-transform of the input sequence has no poles or zeros outside the unit circle), the sequence can be represented by the real portion of the transforms. Thus, the real cepstrum can be computed by calculating the inverse Fourier transform of the log-spectrum of the input.

FIGURE 15.11 Eighth-order linear predictive analysis of an "i".

FIGURE 15.12 Sixteenth-order linear predictive analysis of an "i".

Figure 15.13 shows an example of the cepstrum for the voiced /i/ utterance from Fig. 15.7. The cepstrum of such a voiced utterance is characterized by relatively large values in the first one or two milliseconds as well as …
