Abstract

Wave pipelining (WP) is used to reduce circuit area without degrading performance, and to improve speed. Conventional pipelining gains throughput at the cost of area, while the critical path remains the same as in the original architecture. For larger circuits, area plays an important role because cost is a major issue. To reduce area, the intermediate latches are removed, which gives rise to the technique of wave pipelining. Wave pipelining optimizes both area and speed compared with conventional pipelined architectures, and it is therefore applied to the FFT architecture to reduce area. The pipelined FFT architecture is split into stages, and each stage is evaluated with an effective structure. The use of an internal clock increases speed and reduces clock load, and wave pipelining reduces the number of gates used in the pipelined structure. The wave pipelining technique is implemented in a 32-bit pipelined FFT architecture, which is suitable for digital signal processing applications such as image processing and conversion between the time and frequency domains. The area and speed of the WP architecture and the pipelined architecture are then compared.

Keywords: DIF, DIT, Multiplier, Pipelining, 32-bit FFT Processor, Wave Pipelining (WP)

1. Introduction

Wave pipelining is a technique adopted in digital signal processing and many other areas for its high throughput and high speed\(^1\). Maximum-rate pipelining provides high performance by implementing pipelining without intermediate registers\(^2\). Wave pipelining allows faster clocks by reducing latency. The delay in wave pipelining is computed from the difference between the maximum and minimum delays in the combinational circuit\(^3\). This difference in delay provides a narrow gap, known as computing cones, which allows the clock rate to be increased. Wave-pipelined blocks make use of internal clocks that are much faster than those of the existing pipelined system. Wave pipelining is a significant method for reducing clock load, area, power and timing without changing the functionality of the circuit. The wave-pipelined technique is applied to an architecture for easy implementation of larger circuits\(^4\).
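
The timing argument above can be illustrated with a minimal sketch. It assumes the commonly used simplified bound in which the clock period of a wave-pipelined block is limited by the spread between its longest and shortest path delays plus sampling overhead. The function names, parameters and example delays are hypothetical and are not taken from this brief.

```python
import math


def conventional_min_period(stage_max_delays_ns, t_setup_ns, t_clk_to_q_ns):
    """Registered pipeline: the slowest stage plus register overhead sets the period."""
    return max(stage_max_delays_ns) + t_setup_ns + t_clk_to_q_ns


def wave_min_period(d_max_ns, d_min_ns, t_setup_ns, t_hold_ns, clk_skew_ns=0.0):
    """Simplified wave-pipelining bound: delay spread plus sampling overhead.

    Assumption: T_clk >= (D_max - D_min) + T_setup + T_hold + 2 * skew.
    """
    if d_min_ns > d_max_ns:
        raise ValueError("d_min_ns must not exceed d_max_ns")
    return (d_max_ns - d_min_ns) + t_setup_ns + t_hold_ns + 2.0 * clk_skew_ns


def waves_in_flight(d_max_ns, period_ns):
    """Number of data waves present in the logic at once for a given period."""
    return math.ceil(d_max_ns / period_ns)


if __name__ == "__main__":
    # Hypothetical delays for one unregistered combinational block.
    d_max, d_min = 10.0, 8.0
    t_conv = conventional_min_period([d_max], t_setup_ns=0.5, t_clk_to_q_ns=0.4)
    t_wave = wave_min_period(d_max, d_min, t_setup_ns=0.5, t_hold_ns=0.3)
    print("conventional period (ns):", t_conv)
    print("wave-pipelined period (ns):", t_wave)
    print("waves in flight:", waves_in_flight(d_max, t_wave))
```

With these illustrative numbers the wave-pipelined period is governed by the 2 ns delay spread rather than the 10 ns longest path, which is why balancing the path delays matters more than shortening them.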

Another important architecture in this brief is a 32-bit pipelined FFT processor\(^5\). The pipelined structure is divided into two major stages: one for complex multiplication and one for complex addition. The 32-bit pipelined FFT processor completes its computation in 37 clock cycles. The FFT is an efficient way to evaluate the DFT with reduced computation\(^6\). It is used extensively in communication and digital signal processing, and it is the basic algorithm in many DSP projects. The FFT is a distinctive computation in which memory is accessed intensively and high parallelism is desired; hence pipelining is used in the FFT. Pipelining increases the speed of the circuit at the cost of area and manufacturing cost, because of the increased number of registers used to save the intermediate values of each stage. Since pipelined circuits use external clocks, the clock load also increases\(^7\). The speed and area of each design are evaluated and compared, and the performance metrics are reported.
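
The saving that the FFT offers over direct DFT evaluation can be sketched by counting complex multiplications. The sketch assumes a radix-2 decomposition and a hypothetical transform length of N = 32 points chosen only for illustration; the brief does not state the transform length, and the FFT count includes trivial multiplications by 1.

```python
def direct_dft_complex_mults(n_points):
    """Direct DFT: every one of N outputs needs N complex multiplications."""
    return n_points * n_points


def radix2_fft_complex_mults(n_points):
    """Radix-2 FFT: log2(N) stages with N/2 butterflies (one multiply each)."""
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two >= 2")
    stages = n_points.bit_length() - 1  # exact log2 for a power of two
    return (n_points // 2) * stages


if __name__ == "__main__":
    n = 32  # hypothetical length, not taken from the brief
    print("direct DFT:", direct_dft_complex_mults(n))
    print("radix-2 FFT:", radix2_fft_complex_mults(n))
```

For N = 32 this gives 1024 multiplications for the direct DFT against 80 for the radix-2 FFT.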

In this brief, Section 1 describes the initial concepts, Section 2 discusses the proposed design, Section 3 presents the results and analysis, and Section 4 gives the conclusion.

2. Existing Technique

This section considers a wave-pipelined counter and a pipelined FFT processor.

2.1 WP Counter

The WP counter is constructed using look-up tables (LUTs); hence its RTL structure does not contain any intermediate registers, which reduces the area of the circuit considerably. The K-map is implemented using the LUTs in SOP form, and the Boolean expressions are obtained by K-map reduction. The clock load is reduced because internal clocks are used.

\[ O = (!I_0 \cdot I_1); \]
\[ O = ((I_0 \cdot !I_1 \cdot I_2) + (I_0 \cdot I_1 \cdot !I_2)); \]
\[ O = ((I_0 \cdot I_1 \cdot !I_2) + (I_0 \cdot !I_1 \cdot I_2 \cdot I_3) + (I_0 \cdot I_1 \cdot !I_3)); \]
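
The three look-up expressions above can be checked with a small truth-table sketch. The names lut_a, lut_b and lut_c are hypothetical labels, since the brief writes every output as O, and the symbol ! is read as logical NOT.

```python
from itertools import product


def lut_a(i0, i1):
    """O = !I0 . I1"""
    return int((not i0) and i1)


def lut_b(i0, i1, i2):
    """O = I0 . !I1 . I2 + I0 . I1 . !I2"""
    return int((i0 and (not i1) and i2) or (i0 and i1 and (not i2)))


def lut_c(i0, i1, i2, i3):
    """O = I0 . I1 . !I2 + I0 . !I1 . I2 . I3 + I0 . I1 . !I3"""
    return int(
        (i0 and i1 and (not i2))
        or (i0 and (not i1) and i2 and i3)
        or (i0 and i1 and (not i3))
    )


def truth_table(func, n_inputs):
    """Evaluate func on every 0/1 input combination, as a LUT would store it."""
    return [(bits, func(*bits)) for bits in product((0, 1), repeat=n_inputs)]


if __name__ == "__main__":
    for name, func, n_inputs in (("lut_a", lut_a, 2), ("lut_b", lut_b, 3), ("lut_c", lut_c, 4)):
        print(name)
        for bits, out in truth_table(func, n_inputs):
            print("  ", bits, "->", out)
```

The second expression is 1 only when I0 is 1 and I1 differs from I2, which is a quick way to sanity-check the K-map reduction.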

Figure 1 shows the schematic of the look-up table. The output is fed back to one of its inputs to avoid data loss in the design and to increase the speed.

Figures 2 and 3 show the look-up table, which has three inputs: the clock, the feedback, and the output of the previous LUT. Figure 4 shows the technology schematic of the WP counter. It includes an input buffer and an output buffer, which illustrate the data transmission in the circuit.

2.2 Pipelined FFT Processor

The pipelined architecture used is a 32-bit pipelined FFT architecture that includes a multiplier section, an adder section and a ROM for storing the twiddle factors. Since the 32-bit architecture is parallelized, the speed increases. The twiddle factors are the same for any architecture; hence storing them in ROM and reading them when necessary reduces the computation time and area considerably.
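
Storing the twiddle factors once and reading them back can be sketched as a precomputed table. The sketch assumes the forward-DFT convention W_N^k = exp(-j2πk/N) and an 8-point transform; both are assumptions made for illustration, as are the function names.

```python
import cmath


def build_twiddle_rom(n_points):
    """Precompute W_N^k for k = 0 .. N/2 - 1, as a ROM would hold them."""
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two >= 2")
    return [cmath.exp(-2j * cmath.pi * k / n_points) for k in range(n_points // 2)]


def split_real_imag(rom, digits=3):
    """Separate the table into real and imaginary parts, rounded for display."""
    real = [round(w.real, digits) for w in rom]
    imag = [round(w.imag, digits) for w in rom]
    return real, imag


if __name__ == "__main__":
    rom = build_twiddle_rom(8)  # hypothetical 8-point case
    real, imag = split_real_imag(rom)
    for k, (re, im) in enumerate(zip(real, imag)):
        print(f"W_8^{k}: real={re}, imag={im}")
```

Under this convention the four entries are approximately 1, 0.707 - j0.707, -j and -0.707 - j0.707; an inverse transform would use the conjugate values.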

The architecture is split as follows: there is a processing element for each stage, radix-2, radix-4 and radix-8. These are represented in the block diagram (Figure 5) given...

3. Proposed Design

In this section, the pipelined FFT architecture is modified by applying the wave pipelining concept. The speed of this design increases because internal clocks are used and because the intermediate registers that buffer the pipelined values are absent. To make the computation fast and parallel, a ROM holding a set of precomputed values is used; here the twiddle factors are stored in a look-up table. For radix-2, the twiddle factor used is 1. For PS2, the twiddle factor is either -j or 1. The PS1 computation uses the complex values -0.707+j0.707, -j and 1. These are separated into real and imaginary parts, and the value is chosen by the select line. DIT and DIF can be performed in the same architecture, which reduces the area of the design.
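
The statement that DIT and DIF can share one architecture can be illustrated in software. The following is a minimal radix-2 sketch in which both orderings read the same twiddle table; it is not the authors' hardware design, and the 8-point length, the forward-DFT sign convention and all function names are assumptions.

```python
import cmath


def twiddle_table(n):
    """Shared look-up table of W_N^k = exp(-j*2*pi*k/N), k = 0 .. N/2 - 1."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]


def bit_reverse_permute(x):
    """Reorder x so that element i moves to the bit-reversed index of i."""
    n = len(x)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        out[int(format(i, f"0{bits}b")[::-1], 2)] = x[i]
    return out


def fft_dif(x):
    """Decimation in frequency: add/subtract first, then multiply by the twiddle."""
    a, n = list(x), len(x)
    w = twiddle_table(n)
    span = n // 2
    while span >= 1:
        step = n // (2 * span)  # stride into the shared twiddle table
        for start in range(0, n, 2 * span):
            for k in range(span):
                top, bot = a[start + k], a[start + k + span]
                a[start + k] = top + bot
                a[start + k + span] = (top - bot) * w[k * step]
        span //= 2
    return bit_reverse_permute(a)  # DIF leaves results in bit-reversed order


def fft_dit(x):
    """Decimation in time: multiply by the twiddle first, then add/subtract."""
    a, n = bit_reverse_permute(x), len(x)
    w = twiddle_table(n)
    span = 1
    while span < n:
        step = n // (2 * span)
        for start in range(0, n, 2 * span):
            for k in range(span):
                top, bot = a[start + k], a[start + k + span] * w[k * step]
                a[start + k] = top + bot
                a[start + k + span] = top - bot
        span *= 2
    return a


def dft(x):
    """Direct DFT used only as a reference for checking the two FFTs."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


if __name__ == "__main__":
    samples = [complex(k, -(k % 3)) for k in range(8)]  # hypothetical input
    ref = dft(samples)
    print(all(abs(a - b) < 1e-9 for a, b in zip(fft_dif(samples), ref)))
    print(all(abs(a - b) < 1e-9 for a, b in zip(fft_dit(samples), ref)))
```

Both routines use the same butterfly operations and the same table, differing only in whether the twiddle multiplication comes before or after the add/subtract, which is what allows a single datapath with a select line to serve both.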

The WP FFT uses the 8-bit multiplier and the 8-bit FFT. It uses internal clocks, which reduces the clock load.

Figure 8 shows the RTL schematic in which the pipelined FFT blocks are synthesized. It represents the pipelined FFT without any latches. This reduces the area and increases the efficiency of the processor without compromising the performance of the circuit. The clock used is internal, and the rise and fall time of the generated internal clock is 9.506 ns. The maximum setup times for a(0), a(1), sel, x(0) and x(1) are 5.935, 6.057, 6.130, 5.945 and 6.357 ns, respectively.
calculates the value based on the twiddle factor stored in the storage unit (RAM). DIT is the inverse of DIF, and the output is denoted by $w_0$.

Figure 10 shows the simulation output after 5.248 ns.

5. Conclusion

A wave-pipelined structure has been applied to a pipelined 32-bit FFT, and an effective reduction in area and an increase in speed are achieved. The maximum computation time is 5.248 ns. The area consumed is 13 out of 16640 slices (1%), and 22 out of 33280 LUTs (1%) are used. The flip-flops used are 3% of the total latches, and the GCLK usage is 4%.

6. References