
Opening Report: Design of a Voice-Controlled Display

Graduation Project (Thesis) Material No. 2 (2)
Undergraduate Graduation Project (Thesis) Opening Report
Title: Design of a Voice-Controlled Display
Project type: Design ☑  Experimental research ☐  Thesis ☐
Student name: xxxxxx  Student ID: xxxxxxxxxxxx  Class: xxxxx
School: School of Electrical Engineering  Supervisor: xxxxx
Opening date: March 12, 2016

I. Research Significance, Current Status, and Development Trends (Literature Review)

1.1 Purpose, Background, and Significance

Speech is one of the most efficient means of human-to-human communication. If communication between people and computers could be made equally simple and efficient, it would bring enormous convenience.

Existing display-adjustment schemes rely mainly on manual operation: commands are entered through physical buttons so that the display can be switched on and off, change input signals, and adjust brightness and color according to the end user's needs.

Manual adjustment, however, wastes a considerable amount of the user's time.

This project intends to use a speech-recognition processor and a communication module to design a voice-controlled display that can be adjusted simply, quickly, and effectively, freeing the user's hands, making the product more user-friendly and intelligent, and saving the user's time.

Speech processing is an emerging technology; it covers not only recording and playback but also compression coding and decoding, speech recognition, and other processing techniques.

In the past, designs of this kind generally took one of two routes: extending a general-purpose microcontroller, or relying on a dedicated speech-processing chip.

An ordinary microcontroller usually cannot handle such complex processing and algorithms, and even a barely workable implementation requires many external components.

There are also many dedicated speech-processing chips, such as the ISD series and the PM50 series, but their functions are narrow, and applying them to anything beyond speech is essentially impossible.

The SPCE061A is a 16-bit microcontroller based on the μ'nSP architecture, released by Sunplus Technology.

The chip contains a hardware multiplier and can perform multiplication, inner-product, and other complex operations.

It offers not only strong computing power but also fast processing, with a clock frequency of up to 49 MHz.

The SPCE061A has 32 K words of on-chip Flash program memory and 2 K words of SRAM.

As an SoC it also provides ADC and DAC functions: the MIC_ADC channel includes an automatic gain control (AGC) stage, so speech signals can be captured into the chip with ease, and its two 10-bit current-output DACs need only an external power amplifier to play back sound.

These hardware resources allow the SPCE061A to implement speech processing on a single chip.
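As a rough illustration of that single-chip audio path, the sketch below shows how application code might capture one frame of speech through the MIC_ADC channel and play it back through the DAC. The wrapper functions are hypothetical stand-ins for the routines in the Sunplus SPCE061A SDK, not its actual API; names, sample rate, and frame length are assumptions.

```c
#include <stdint.h>

/* Hypothetical wrappers: replace with the actual SPCE061A SDK routines. */
extern void    spce_mic_adc_init(uint32_t sample_rate_hz); /* MIC_ADC + AGC  */
extern int16_t spce_mic_adc_read(void);                    /* one PCM sample */
extern void    spce_dac_init(void);                        /* 10-bit DAC out */
extern void    spce_dac_write(int16_t sample);             /* to ext. amp    */

#define SAMPLE_RATE 8000   /* placeholder sampling rate */
#define FRAME_LEN   256    /* placeholder frame length  */

/* Capture one frame via the on-chip MIC_ADC and loop it back to the DAC:
 * the minimal record-and-play path the chip supports without an external codec. */
void audio_loopback_frame(void)
{
    int16_t frame[FRAME_LEN];

    spce_mic_adc_init(SAMPLE_RATE);
    spce_dac_init();

    for (int i = 0; i < FRAME_LEN; ++i)
        frame[i] = spce_mic_adc_read();   /* AGC keeps the input level usable */

    for (int i = 0; i < FRAME_LEN; ++i)
        spce_dac_write(frame[i]);         /* external power amp drives the speaker */
}
```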

1.2 Research Status and Development Trends at Home and Abroad

1.2.1 Development of Speech Recognition in China

As early as the 1950s, researchers in China attempted vowel recognition with vacuum-tube circuits; computer-based speech recognition research did not begin until the 1970s, at the Institute of Acoustics of the Chinese Academy of Sciences. From the 1980s onward, many researchers and institutions joined the field, moving from speaker-dependent, small-vocabulary isolated-word recognition toward speaker-independent, large-vocabulary continuous speech recognition. By the late 1980s, research focused on whole-syllable recognition of Mandarin had made considerable progress, and some Chinese speech-input systems were approaching practical use.

In the 1990s, the Sida Technology Development Center and Harbin Institute of Technology jointly released a new product with natural-language understanding capability. With the support of the national "863" Program, Tsinghua University, the Institute of Automation of the Chinese Academy of Sciences, and other institutions carried out fruitful research on prototype Chinese dictation machines. After more than sixty years of development, speech recognition technology has advanced greatly, research has reached a high level, and very good recognition results can be achieved under laboratory conditions.

In practical applications, however, noise and other factors degrade the performance of speech recognition systems substantially, making satisfactory results difficult to achieve.

Research on speech recognition in noisy environments therefore has significant theoretical value and practical importance.

1.2.2 Development of Speech Recognition Abroad

Speech recognition abroad began in 1952 with the speaker-dependent isolated-digit recognition system developed by Davis and colleagues at Bell Labs.

In the 1960s, many Japanese researchers built special-purpose hardware for speech recognition. At RCA Laboratories, Martin and colleagues developed a series of time-normalization methods to address the inconsistent time scales of speech signals, noticeably improving recognition performance.

At about the same time, Vintsyuk in the Soviet Union proposed using dynamic programming to align two utterances in time; this is the basis of the dynamic time warping (DTW) algorithm and an early form of his connected-word recognition algorithm.
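To make the time-alignment idea concrete, here is a minimal illustrative sketch of the classic DTW recurrence in C (added for clarity, not part of the original report); the feature dimension, maximum frame count, and Euclidean frame distance are placeholder choices.

```c
#include <float.h>
#include <math.h>

#define DIM       12    /* feature dimension per frame (placeholder) */
#define MAXFRAMES 128   /* maximum utterance length (placeholder)    */

/* Euclidean distance between two feature frames. */
static double frame_dist(const double *a, const double *b)
{
    double s = 0.0;
    for (int k = 0; k < DIM; ++k) {
        double d = a[k] - b[k];
        s += d * d;
    }
    return sqrt(s);
}

/* Classic DTW: cumulative alignment cost between a stored template of
 * n frames and an input utterance of m frames. */
double dtw_cost(const double tmpl[][DIM], int n,
                const double input[][DIM], int m)
{
    static double D[MAXFRAMES + 1][MAXFRAMES + 1];

    for (int i = 0; i <= n; ++i)
        for (int j = 0; j <= m; ++j)
            D[i][j] = DBL_MAX;
    D[0][0] = 0.0;

    for (int i = 1; i <= n; ++i) {
        for (int j = 1; j <= m; ++j) {
            double best = D[i - 1][j];                          /* insertion */
            if (D[i][j - 1] < best)     best = D[i][j - 1];     /* deletion  */
            if (D[i - 1][j - 1] < best) best = D[i - 1][j - 1]; /* match     */
            D[i][j] = frame_dist(tmpl[i - 1], input[j - 1]) + best;
        }
    }
    return D[n][m];   /* lower cost = better match to this template */
}
```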

In the 1970s, artificial intelligence techniques entered speech recognition research, and breakthroughs followed: linear predictive coding was extended to speech recognition, and DTW became essentially mature.

An important advance in the 1980s was the shift of recognition algorithms from pattern-matching techniques to techniques based on statistical models, seeking to build the best recognition system from an overall statistical perspective.

The hidden Markov model (HMM) is the typical example of such a technique.

Research on HMMs made the development of large-vocabulary continuous speech recognition systems possible.
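As an illustration of what scoring speech against a statistical model involves, below is a minimal sketch of the HMM forward algorithm in C, which computes the probability of an observation sequence under a discrete HMM. It is added only for clarity; the state, frame, and symbol counts are placeholders.

```c
#define N 3   /* number of hidden states (placeholder)        */
#define T 4   /* number of observation frames (placeholder)   */
#define M 2   /* number of observation symbols (placeholder)  */

/* Forward algorithm: P(observation sequence | model) for a discrete HMM
 * with initial probabilities pi, transition matrix a, and emission matrix b. */
double forward(const double pi[N], const double a[N][N],
               const double b[N][M], const int obs[T])
{
    double alpha[T][N];

    for (int i = 0; i < N; ++i)                 /* initialization */
        alpha[0][i] = pi[i] * b[i][obs[0]];

    for (int t = 1; t < T; ++t) {               /* induction */
        for (int j = 0; j < N; ++j) {
            double s = 0.0;
            for (int i = 0; i < N; ++i)
                s += alpha[t - 1][i] * a[i][j];
            alpha[t][j] = s * b[j][obs[t]];
        }
    }

    double p = 0.0;                             /* termination */
    for (int i = 0; i < N; ++i)
        p += alpha[T - 1][i];
    return p;
}
```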

In the 1990s, artificial neural networks (ANNs) were also applied to speech recognition, bringing key progress in model refinement, parameter extraction and optimization, and system adaptation; speech recognition technology matured further and moved toward practical use.

Many developed countries, such as the United States, Japan, and South Korea, as well as well-known companies including IBM, Microsoft, Apple, AT&T, and NTT, have invested heavily in research aimed at making speech recognition systems practical.

Today, methods combining HMMs and ANNs receive wide attention.

New techniques from pattern recognition and machine learning, such as support vector machines (SVM) and evolutionary computation, are also being applied to the speech recognition process.

1.2.3 Development Trends of Speech Recognition Abroad

The global speech technology market currently exceeds USD 3 billion, with annual growth above 25% in recent years. The outlook for the speech recognition market is strong: the telecommunications industry (VoIP, etc.) and mobile applications (phones, learning devices, tablets, in-vehicle systems, and other mobile devices) are expected to show explosive growth.

Several successful speech products and applications in the telecom and mobile fields are listed below.

1. Telecom industry: telephone banking. The telephone banking system is a high-tech service that has grown rapidly abroad in recent years and is a foundation of modern bank operation and management. It links customers to the bank through the telephone, so that users need not visit a branch: anytime and anywhere, by dialing the telephone banking number, they can obtain the services it provides (transaction inquiries, service requests, interest rate inquiries, and so on). After installing such a system, a bank can improve service quality, attract more customers, and achieve better economic returns.

2. Mobile applications: Siri. Siri is a voice-control feature introduced by Apple on the iPhone 4S.

Siri turns the iPhone 4S into an intelligent assistant: it can read text messages aloud, recommend restaurants, answer weather queries, set alarms by voice, and more.

Siri accepts natural-language input, can invoke built-in applications such as weather, calendar, and search, and continuously learns new voices and intonations to provide conversational responses.

3. Daily life: a mobile-phone "tour guide". This is a product conceived by designers at AISpeech (思必驰), intended to hide a "tour guide" inside your phone.

Whenever you arrive at a scenic area, this "guide" first "checks in" at the ticket office; then, as soon as you tell it the name of an attraction, it can narrate the stories behind that attraction at length.

Beyond these industries and representative products, speech recognition can also play a major role in speech translation, voice-controlled gaming, and voice search.

Technology stems from innovation, and speech creates value; in the near future, ever more varied speech applications will appear in our lives and add color to everyday life.

II. Main Design (Research) Content

This project intends to use a speech-recognition processor and a communication module to design a voice-controlled display that can be adjusted simply, quickly, and effectively, freeing the user's hands, making the product more user-friendly and intelligent, and saving the user's time.

Based on the research content, the workflow is determined as follows.

2.1 Speech Recognition

Speech recognition uses the LD3320, a speech-recognition/voice-control chip based on speaker-independent automatic speech recognition (SI-ASR) technology.

It provides a true single-chip speech recognition solution.

Module overview: size 2 × 6.2 cm; headers: 2 × 20 standard DIP40 pin headers.

The LD3320's analog audio pins are brought out through the pin headers after the appropriate capacitors and resistors are connected.

The M-LD3320 module provides two audio jacks that directly expose the MIC input and Speaker output signals.

A headset with a microphone is all a user needs to verify speech recognition and audio playback, which is very convenient.

The M-LD3320 module has no on-board power regulator; the power pins are brought out via the headers, and the developer must supply a 3.3 V input.

The CLK input of the M-LD3320 module can be provided in either of the following ways: (1) feed a crystal-oscillator signal directly into the corresponding LD3320 pin through the header;

(2) alternatively, the user can solder a crystal onto the module, which reserves space and pads for it [3].

The M-LD3320 module has two LEDs connected to pins 29 and 30 of the LD3320. After the LD3320 has been powered up and reset (RSTB*) and is running stably, pins 29 and 30 output a steady low level, so these two LEDs serve as a power-on indicator for the chip.
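For orientation, the following is a minimal sketch of how a host microcontroller might drive the LD3320 for one recognition pass: register a small keyword list, start ASR, and read back the index of the matched command. The register addresses, command values, and low-level helpers are placeholders, to be replaced with the definitions in the LD3320 datasheet and the board's actual bus wiring (parallel or SPI).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical low-level helpers: real implementations depend on how the
 * MCU is wired to the LD3320 -- see the datasheet and board schematic. */
extern void    ld3320_write_reg(uint8_t addr, uint8_t value);
extern uint8_t ld3320_read_reg(uint8_t addr);
extern void    ld3320_reset(void);
extern void    delay_ms(int ms);

/* Placeholder register addresses/commands: substitute datasheet values. */
#define REG_ASR_INDEX   0xC1   /* index of the keyword being registered */
#define REG_ASR_STRING  0xC3   /* keyword characters (Pinyin, ASCII)    */
#define REG_ASR_CTRL    0xB2   /* start/status of an ASR pass           */
#define REG_ASR_RESULT  0xC5   /* index of the best-matching keyword    */
#define CMD_ASR_START   0x01

/* Register one voice command under a numeric index; the LD3320 matches
 * keywords written as Pinyin strings, e.g. "da kai" for "turn on". */
static void asr_add_command(uint8_t index, const char *pinyin)
{
    ld3320_write_reg(REG_ASR_INDEX, index);
    for (size_t i = 0; i < strlen(pinyin); ++i)
        ld3320_write_reg(REG_ASR_STRING, (uint8_t)pinyin[i]);
}

/* One recognition pass: load the command list, start ASR, read the result. */
int recognize_once(void)
{
    ld3320_reset();
    asr_add_command(1, "da kai");     /* "turn on"    */
    asr_add_command(2, "guan bi");    /* "turn off"   */
    asr_add_command(3, "liang du");   /* "brightness" */

    ld3320_write_reg(REG_ASR_CTRL, CMD_ASR_START);
    delay_ms(2000);                   /* production code would wait on the IRQ pin */

    return ld3320_read_reg(REG_ASR_RESULT);   /* index of the matched command */
}
```

The returned index can then be mapped to the corresponding display action (power, input selection, brightness) sent over the communication module.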

2.2 Simulation and Debugging

III. Research Plan and Work Schedule (Key Points, Difficulties, and Proposed Approaches)

3.1 Research Plan

3.1.1 Overall Plan

3.2 Key Points, Difficulties, and Proposed Approaches

First, speech recognition systems are not robust enough and depend too heavily on the environment.

A system trained in one environment suffers a drop in performance when moved to another.

Second, speech recognition is particularly sensitive to external noise.

This is not only because external noise alters the speech signal itself, but also because a speaker's pitch, speaking rate, and volume all change in noisy surroundings, which makes recognition even harder.

Third, speech is highly variable.

Even for the same person, differences in physical and mental state at different times cause the characteristics of the speech to differ.

Finally, the limited current understanding of human hearing, knowledge accumulation, and the mechanisms of the nervous system restricts the progress of speech recognition.

To address these problems, researchers have proposed various methods, such as adaptive training and neural networks.

Although these approaches have achieved some success, much work remains before the performance of speech recognition systems can improve substantially.

At present, large-vocabulary speech recognition systems on the market mostly use PCs as the hardware platform, while embedded small- and medium-vocabulary systems typically rely on high-performance chips such as DSPs or ARM processors, which makes the hardware relatively expensive.

A plain microcontroller is cheap, but its limited computing power, set against the heavy computational load of speech recognition, makes implementing such a system on it very difficult.

How to improve the algorithms so as to reduce the computational load is therefore a major difficulty in getting speech recognition to run on a microcontroller.

IV. Main References

[1] 杨行峻, 迟惠生, 等. 语音信号数字处理[M]. 北京: 电子工业出版社, 1995.
[2] 赵力. 语音信号处理[M]. 北京: 机械工业出版社, 2003.
[3] Gannot S, Burshtein D, Weinstein E. Iterative and sequential Kalman filter-based speech enhancement algorithms [J]. IEEE Trans Speech and Audio Process, 1998, 6(4): 373-385.
[4] Kin J B, Lee K Y, Lee C W. On the applications of the interacting multiple model algorithm for enhancing noisy speech [J]. IEEE Trans Speech and Audio Process, 2000, 8(3): 349-352.
[5] Ephraim Y, Van Trees H L. A signal subspace approach for speech enhancement [J]. IEEE Trans. Speech and Audio Processing, 1995, 3(7): 251-266.
[6] Jabloun F, Champagne B. A multi-microphone signal subspace approach for speech enhancement [A]. In Proc. IEEE ICASSP'01 [C]. 2001: 205-208.
[7] Boll S. Suppression of acoustic noise in speech using spectral subtraction [J]. IEEE Trans on Acoustic Speech and Signal Processing, 1979, 27(2): 113-120.
[8] Fan Ningping. Low distortion speech denoising using an adaptive parametric Wiener filter [A]. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) [C]. 2004, 1.
[9] Ephraim Y, Malah D. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator [J]. IEEE Transactions on Acoustics, Speech and Signal Processing, 1984, 32(6): 1109-1121.
[10] 韩纪庆, 张磊, 郑铁然. 语音信号处理[M]. 北京: 清华大学出版社, 2004.
[11] 高鹰, 谢胜利. 一种变步长LMS自适应滤波算法及分析[J]. 电子学报, 2001, 29(8): 1094-1097.
[12] Jax P, Vary P. Artificial bandwidth extension of speech signals using MMSE estimation based on a hidden Markov model [A]. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) [C]. 2003: 680-683.
[13] Mallat S, Hwang W L. Singularity detection and processing with wavelets [J]. IEEE Trans on Information Theory, 1992, 38(2): 617-643.
[14] Donoho D L, Johnstone I M. Adapting to unknown smoothness via wavelet shrinkage [J]. Journal of the American Statistical Association, 1995, 90: 1200-1224.
[15] Liew Ban Fah, Hussain A, Samad S A. Speech enhancement by noise cancellation using neural network [A]. TENCON 2000 Proceedings [C]. Kuala Lumpur, 2000.
[16] Jiang Xiaoping, Fu Hua, Yao Tianren. A single channel speech enhancement method based on masking properties and minimum statistics [A]. 2002 6th International Conference on Signal Processing [C]. 2002: 460-463.
[17] 裴文江, 刘文波, 于盛林. 基于分形理论的混沌信号与噪声分离方法[J]. 南京航空航天大学学报, 1997, 29(5): 483-487.
[18] Virag N. Single channel speech enhancement based on masking properties of human auditory system [J]. IEEE Trans on Speech Audio Process, 1999, 7(2): 126-137.
[19] Ghoreishi M H, Sheikzadeh H. Hybrid speech enhancement system based on HMM and spectral subtraction [A]. IEEE International Conference on Acoustic, Speech and Signal Processing [C]. 2000, 3: 1855-1858.

Speech Recognition
Victor Zue, Ron Cole, & Wayne Ward
MIT Laboratory for Computer Science, Cambridge, Massachusetts, USA
Oregon Graduate Institute of Science & Technology, Portland, Oregon, USA
Carnegie Mellon University, Pittsburgh, Pennsylvania, USA

1 Defining the Problem

Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final results, as for applications such as commands & control, data entry, and document preparation.
They can also serve as the input to further linguistic processing in order to achieve speech understanding, a subject covered in section.

Speech recognition systems can be characterized by many parameters, some of the more important of which are shown in Figure. An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies, and is much more difficult to recognize than speech read from script. Some systems require speaker enrollment---a user must provide samples of his or her speech before using them, whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or have many similar-sounding words. When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combination of words. The simplest language model can be specified as a finite-state network, where the permissible words following each word are given explicitly. More general language models approximating natural language are specified in terms of a context-sensitive grammar. One popular measure of the difficulty of the task, combining the vocabulary size and the language model, is perplexity, loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied (see section for a discussion of language modeling in general and perplexity in particular). Finally, there are some external parameters that can affect speech recognition system performance, including the characteristics of the environmental noise and the type and the placement of the microphone.

Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal. First, the acoustic realizations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear. These phonetic variabilities are exemplified by the acoustic differences of the phoneme. At word boundaries, contextual variations can be quite dramatic---making gas shortage sound like gash shortage in American English, and devo andare sound like devandare in Italian. Second, acoustic variabilities can result from changes in the environment as well as in the position and characteristics of the transducer. Third, within-speaker variabilities can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variabilities.

Figure shows the major components of a typical speech recognition system. The digitized speech signal is first transformed into a set of useful measurements or features at a fixed rate, typically once every 10--20 msec (see sections and 11.3 for signal representation and digital signal processing, respectively). These measurements are then used to search for the most likely word candidate, making use of constraints imposed by the acoustic, lexical, and language models. Throughout this process, training data are used to determine the values of the model parameters.

Speech recognition systems attempt to model the sources of variability described above in several ways.
At the level of signal representation, researchers have developed representations that emphasize perceptually important speaker-independent features of the signal, and de-emphasize speaker-dependent characteristics. At the acoustic phonetic level, speaker variability is typically modeled using statistical techniques applied to large amounts of data. Speaker adaptation algorithms have also been developed that adapt speaker-independent acoustic models to those of the current speaker during system use (see section). Effects of linguistic context at the acoustic phonetic level are typically handled by training separate models for phonemes in different contexts; this is called context dependent acoustic modeling.

Word level variability can be handled by allowing alternate pronunciations of words in representations known as pronunciation networks. Common alternate pronunciations of words, as well as effects of dialect and accent, are handled by allowing search algorithms to find alternate paths of phonemes through these networks. Statistical language models, based on estimates of the frequency of occurrence of word sequences, are often used to guide the search through the most probable sequence of words.

The dominant recognition paradigm in the past fifteen years is known as hidden Markov models (HMM). An HMM is a doubly stochastic model, in which the generation of the underlying phoneme string and the frame-by-frame, surface acoustic realizations are both represented probabilistically as Markov processes, as discussed in sections and 11.2. Neural networks have also been used to estimate the frame based scores; these scores are then integrated into HMM-based system architectures, in what has come to be known as hybrid systems, as described in section 11.5.

An interesting feature of frame-based HMM systems is that speech segments are identified during the search process, rather than explicitly. An alternate approach is to first identify speech segments, then classify the segments and use the segment scores to recognize words. This approach has produced competitive recognition performance in several tasks.

2 State of the Art

Comments about the state-of-the-art need to be made in the context of specific applications which reflect the constraints on the task. Moreover, different technologies are sometimes appropriate for different tasks. For example, when the vocabulary is small, the entire word can be modeled as a single unit. Such an approach is not practical for large vocabularies, where word models must be built up from subword units.

The past decade has witnessed significant progress in speech recognition technology. Word error rates continue to drop by a factor of 2 every two years. Substantial progress has been made in the basic technology, leading to the lowering of barriers to speaker independence, continuous speech, and large vocabularies. There are several factors that have contributed to this rapid progress. First, there is the coming of age of the HMM. HMM is powerful in that, with the availability of training data, the parameters of the model can be trained automatically to give optimal performance.

Second, much effort has gone into the development of large speech corpora for system development, training, and testing. Some of these corpora are designed for acoustic phonetic research, while others are highly task specific. Nowadays, it is not uncommon to have tens of thousands of sentences available for system training and testing.
These corpora permit researchers to quantify the acoustic cues important for phonetic contrasts and to determine parameters of the recognizers in a statistically meaningful way. While many of these corpora (e.g., TIMIT, RM, ATIS, and WSJ; see section 12.3) were originally collected under the sponsorship of the U.S. Defense Advanced Research Projects Agency (ARPA) to spur human language technology development among its contractors, they have nevertheless gained world-wide acceptance (e.g., in Canada, France, Germany, Japan, and the U.K.) as standards on which to evaluate speech recognition.

Third, progress has been brought about by the establishment of standards for performance evaluation. Only a decade ago, researchers trained and tested their systems using locally collected data, and had not been very careful in delineating training and testing sets. As a result, it was very difficult to compare performance across systems, and a system's performance typically degraded when it was presented with previously unseen data. The recent availability of a large body of data in the public domain, coupled with the specification of evaluation standards, has resulted in uniform documentation of test results, thus contributing to greater reliability in monitoring progress (corpus development activities and evaluation methodologies are summarized in chapters 12 and 13 respectively).

Finally, advances in computer technology have also indirectly influenced our progress. The availability of fast computers with inexpensive mass storage capabilities has enabled researchers to run many large scale experiments in a short amount of time. This means that the elapsed time between an idea and its implementation and evaluation is greatly reduced. In fact, speech recognition systems with reasonable performance can now run in real time using high-end workstations without additional hardware---a feat unimaginable only a few years ago.

One of the most popular, and potentially most useful tasks with low perplexity (PP=11) is the recognition of digits. For American English, speaker-independent recognition of digit strings spoken continuously and restricted to telephone bandwidth can achieve an error rate of 0.3% when the string length is known.

One of the best known moderate-perplexity tasks is the 1,000-word so-called Resource Management (RM) task, in which inquiries can be made concerning various naval vessels in the Pacific Ocean. The best speaker-independent performance on the RM task is less than 4%, using a word-pair language model that constrains the possible words following a given word (PP=60). More recently, researchers have begun to address the issue of recognizing spontaneously generated speech. For example, in the Air Travel Information Service (ATIS) domain, word error rates of less than 3% have been reported for a vocabulary of nearly 2,000 words and a bigram language model with a perplexity of around 15.

High perplexity tasks with a vocabulary of thousands of words are intended primarily for the dictation application. After working on isolated-word, speaker-dependent systems for many years, the community has since 1992 moved towards very-large-vocabulary (20,000 words and more), high-perplexity (PP≈200), speaker-independent, continuous speech recognition.
The best system in 1994 achieved an error rate of 7.2% on read sentences drawn from North American business news.

With the steady improvements in speech recognition performance, systems are now being deployed within telephone and cellular networks in many countries. Within the next few years, speech recognition will be pervasive in telephone networks around the world. There are tremendous forces driving the development of the technology; in many countries, touch tone penetration is low, and voice is the only option for controlling automated services. In voice dialing, for example, users can dial 10--20 telephone numbers by voice (e.g., call home) after having enrolled their voices by saying the words associated with telephone numbers. AT&T, on the other hand, has installed a call routing system using speaker-independent word-spotting technology that can detect a few key phrases (e.g., person to person, calling card) in sentences such as: I want to charge it to my calling card.

At present, several very large vocabulary dictation systems are available for document generation. These systems generally require speakers to pause between words. Their performance can be further enhanced if one can apply constraints of the specific domain such as dictating medical reports.

Even though much progress is being made, machines are a long way from recognizing conversational speech. Word recognition rates on telephone conversations in the Switchboard corpus are around 50%. It will be many years before unlimited vocabulary, speaker-independent continuous dictation capability is realized.
