Title: Hard Core DSP – What it is and how to make it happen
Author: Lynn Patterson
Title: VP Product Development
Date: 6/11/98

OVERVIEW

In recent years Digital Signal Processing technology has been applied to a wide variety of processing applications. Generally these can be classified as non-real-time, soft real-time and hard-core real-time applications.

Non-real-time DSP refers to applications where the huge FLOP capacity of the DSP is put to work on historical data. The data was collected and archived for processing at a later time; it is stored on some type of mass storage media and processed as a job in a compute center. Some examples are seismic evaluation, image enhancement and intelligent signal extraction applications.

Soft real-time DSP refers to applications where data arrives at the system from a “sensor” as it is sampled, an algorithm is applied to that data and results are posted. This process repeats continuously. In a “soft” system, the processing node may not be able to fully process all data without some tuning of the system. This on-the-fly adjustment can be implemented in several ways.

1. The data source can be throttled. That is, there is some handshaking mechanism from the processing back to the source that triggers the source to slow the rate at which it delivers data to the system for processing.
2. The system employs the elasticity in the system buffers to hold the additional data samples over the steady state rate. Essentially, one or several blocks of data are queued up while one block requires longer for processing than the time line allotted. Providing oversized buffering space is typically difficult, and typically the worst case scenario cannot be accounted for.
3. Data is dropped. If the processing node cannot accept a sample or block of samples, it is dropped and is not retrievable.
4. Additional processing nodes are applied to the data stream. This requires the system to have a real-time dynamic architecture.
5. The algorithm applied to the data adjusts to require a reduced processing load under peak conditions. Depending on the nature of the adjustment, this may or may not differentiate an application as soft or hard real-time.

In all of these cases the “performance” of the system may vary over time but the system does not fail. If the system throttles itself, the overall performance drops since the system does not run at full speed. For the case of elastic buffers, dropped data may ultimately result if the processing cannot “catch up”. Therefore, for cases 2 and 3, if data is dropped, the algorithm has a reduced set of data to work on and it should be expected that the quality of the result is decreased. The fourth case is rarely possible in real-time systems. However, if the system did accommodate this, it is reasonable to assume that the additional processing is taken from another system task and the overall performance of the system is reduced as a result. The final case also implies a decrease in the quality of the result since a reduced algorithm was implemented.

Consider the example of an image inspection system. If throttling is implemented, the full algorithm is applied but the rate of inspection is decreased, and hence so is the performance of the product. If data is dropped, fewer frames are averaged and the quality of the image is reduced. I will assume additional processors cannot be employed as this is an embedded system that was built to cost guidelines. A reduced algorithm would result in less accurate analysis of the image, or in processing being skipped on part of an image.

Hard-core DSP refers to applications where there is an absolute guarantee that the processing will keep up with the real-time data flow, even under worst case conditions. That includes the peak data arrival rates and the most calculation-intensive algorithm conditions. The algorithm may have some loading adjustment built in for quick calculation.
This is acceptable if it is part of the system design. Hard-core DSP frequently refers to applications where data flows are fixed by the system requirements and the processing must accommodate them, as opposed to allowing the processing power to define the system performance and adjusting the data. Data is processed in blocks. If the calculations on a previous block are not finished and the results posted within the specified time window, input data on the next incoming block is generally lost. In the hard-core world, dropped data usually puts the validity of the system output at risk. That is, the system is deemed to have failed if it cannot process the entire stream of input data. Lastly, the system typically requires a tight coupling of the data to the processing. That is, a very low latency between sampling and processing is required. It is interesting to note that many soft real-time applications must be treated as hard-core designs if the system performance parameters are set at absolute limits.

The rest of this paper outlines several issues that must be considered when architecting a hard-core DSP system based on the SHARC processor. It then considers how these areas are affected by the specific system approach where the real-time IO arrives at the processor via the SHARC serial ports.

Hard-core Design Issues

When implementing an application with real-time data flow, three main issues must always be addressed. First is the bandwidth on the processor’s data buses, second is the latency associated with distributing data throughout the system, and third is the DSP core loading associated with moving the data. These issues are addressed below relative to the SHARC processor when the IO stream into the processor is via the SHARC serial ports.

Processor data bus

The SHARC processor is designed to be clustered in groups of up to 6. When clustered, all of these processors share a common parallel off-chip data bus.
This data bus is used for inter-processor communications and accesses to off-chip memory. In addition, the real-time I/O data is often read over this bus. The bandwidth on this external cluster bus is therefore a precious commodity when implementing an application.

With the SHARC processor, there is a second data bus that must be evaluated when considering bus loading: the I/O data bus. This is a parallel data bus internal to the SHARC that carries all the data moved via DMAs in the SHARC. Serial data that is sent or received via DMAs is carried over this bus, so its loading is worth evaluating.

Data Distribution Latency

Each SHARC processor has two full duplex serial ports. Each can be programmed to operate either as a standard synchronous serial port or in TDM mode. In TDM mode, data is transmitted in frames with a specific number of time slots. Each slot in every frame contains the data to or from one specific I/O channel. This is repeated every frame. The 1688s I/O module (introduced in the system evaluation below) presents and receives its data as a TDM stream. The SHARC can be programmed to receive any slots on the incoming TDM stream and to output any slots on the outgoing TDM stream. All other slots are ignored. For example, one SHARC processor can be programmed to input slots 1 and 2 from the TDM stream into its internal memory and another SHARC can be programmed to input channels 3 and 4 into its own internal memory. Inside the SHARCs, only the data from the specified channels is packed into an input array in consecutive memory addresses. Therefore, the SHARC application is only presented with the data from the channels it is interested in. Another noteworthy point is that the channels that are input or output can be changed at any time. Therefore, the application can change on the fly the input or output channels on which it operates. There are no restrictions on the number of SHARCs that can input the same digitized data from the 1688s.
That is, any number of SHARCs on the serial chain, from zero to all, can receive the data for any channel. However, only one SHARC should output data for any given output channel, to avoid contention.

Processor Loading

To operate the SHARC processor serial ports in TDM mode, the serial ports are programmed with several key parameters such as frame size and requested input/output channels. The user can also set up chained DMA transfers that continuously move data to and from the SHARC processor via the serial port. Typically, a double buffer scheme is used for the input/output data. An interrupt can notify the core after each buffer is received. These DMAs are set up once and there is no additional code overhead required to keep them functioning.

Evaluation of a System Solution for SHARC and serial port data systems

A powerful system solution, well suited to systems with many channels and lower sample rates, can be created by using the SHARC serial ports. Ixthos’s products integrate various combinations of analog input and output channels and SHARCs in a single system solution. Examples of solutions that can be provided in a single VME slot are up to 32 analog input channels (16 bit, 200 kHz) and 16 SHARCs, or 16 analog input and 16 analog output channels (16 bit, 48 kHz) and 16 SHARCs. The following discussion considers the second case (16 input, 16 output and 16 SHARCs) relative to the above architecture issues. The module that provides the IO is referred to as the IXI1688s and the processor base card is referred to as the IXZ16.

IXZ16/IXI1688s - Processor Data Bus loading evaluation

When using the 1688s on any of the IXZ16 cards, there is no loading on the external cluster bus for any of the SHARCs since the data is delivered to the SHARCs over the serial bus.
This means the full cluster bandwidth is available for inter-processor communications and off-chip memory accesses.

The 1688s does add some minimal loading to the internal I/O data bus. This loading depends on the number of channels the specific SHARC is processing. Even if a specific SHARC processes half of the 16 input and 16 output channels, this loading is less than 3% of the I/O data bus’s capacity. (I/O data bus capacity is 160 MB/sec. 1688s loading is 48k samples/sec/channel * 16 channels * 4 bytes/sample for data + 48k controls/sec/2 channels * 8 channels * 4 bytes/control for control = 3.84 MB/sec. Net loading = 3.84 MB/sec / 160 MB/sec = 2.4%. Note: on the output serial stream one control word must be transmitted for every 2 channels.)

IXZ16/1688s – Data Distribution Latency

In the IXZ family, the customer can configure several clusters or all of the SHARCs on a board to be ganged on one serial chain. There are also methods to extend this serial chain to other IXZ basecards. Therefore, the user can configure the system to have a variable number of processors all inputting and outputting data on the same serial chain. This serial chain is received at all processors at essentially the same time. (Only transmission delays skew this; there is no buffering of the data.)

There is no FIFO that holds data that is output from the 1688s. Each sample is transmitted in the appropriate TDM slot as it is formed by the sigma-delta converter.

The power of this data distribution method is that all processors receive the data essentially simultaneously, no processor needs to be burdened with distributing data to other processors, and multiple processors can receive the same input channel automatically. Together, these facts lead to minimal latency in distributing data to any number of processors in a system.

IXZ16/1688s – Processor Loading

The SHARC processors on the IXZ base card send and receive all data over the serial ports.
Therefore, other than the initial setup of the serial ports and launching of the DMAs, there is NO loading on the DSP core to move the I/O data. The data simply appears in the internal memory of the SHARC and is output from the internal memory of the SHARC. This is a powerful feature!

IXZ/1688 system configuration notes

With all the above configurations it is possible to extend the set of processors that have access to the input digital TDM stream. That is, not only the SHARCs on the basecard populated with the 1688s module, but also SHARCs on other basecards can have direct access to the TDM stream of digitized analog input data values. In other words, the system scales to additional processing nodes as required.

This product offering is available as a commercial-level product and in an 8 SHARC processor configuration for rugged military applications.

Conclusions

When designing a hard-core real-time DSP application, many issues other than counting the FLOPs of the system must be considered. Specifically: how is data going to move in the system, what impact does this data movement have on the valuable system resources, and is the latency associated with this data distribution acceptable for the system requirements? The SHARC is a powerful data moving processor, and using its full capabilities yields the best system solution. Ixthos has a wide variety of product offerings that integrate IO and DSP processing for creating lean hard-core systems.
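
Worked example: I/O data bus loading

As a check on the I/O data bus loading estimate given in the processor data bus loading evaluation, the arithmetic can be sketched in Python. This is only an illustrative sketch, not Ixthos or Analog Devices code. The function and constant names are assumptions introduced here, and the only inputs are the figures quoted in that section: 48k samples/sec per channel, 4 bytes per word, one control word per 2 output channels, and a 160 MB/sec I/O data bus.

```python
# Illustrative sketch of the I/O data bus loading estimate for one SHARC.
# All names are hypothetical; the numeric figures are taken from the paper.

SAMPLE_RATE_HZ = 48_000        # samples/sec per channel on the 1688s
BYTES_PER_WORD = 4             # each data or control word occupies 4 bytes
IO_BUS_CAPACITY_BPS = 160e6    # internal I/O data bus capacity, bytes/sec


def io_bus_load(input_channels, output_channels):
    """Return (bytes_per_sec, fraction_of_capacity) for one SHARC.

    Assumes every input and output channel moves one 4-byte word per
    sample period, plus one control word for every 2 output channels.
    """
    data_channels = input_channels + output_channels
    data_bps = SAMPLE_RATE_HZ * data_channels * BYTES_PER_WORD

    # One control word per pair of output channels (rounded up).
    control_words = (output_channels + 1) // 2
    control_bps = SAMPLE_RATE_HZ * control_words * BYTES_PER_WORD

    total_bps = data_bps + control_bps
    return total_bps, total_bps / IO_BUS_CAPACITY_BPS


# One SHARC handling half of the 16 input and 16 output channels.
total_bps, fraction = io_bus_load(input_channels=8, output_channels=8)
print(f"{total_bps / 1e6:.2f} MB/sec, {fraction:.1%} of I/O bus capacity")
```

For 8 input and 8 output channels this evaluates to roughly 3.8 MB/sec, comfortably under 3% of the bus capacity. Calling the same function with all 16 input and 16 output channels shows how much headroom remains even if a single SHARC handled the entire stream.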