Home | About us | Editorial board | Search | Ahead of print | Current issue | Archives | Submit article | Instructions | Subscribe | Contacts | Advertise | Login 
  Official website of M M University Mullana      

Table of Contents
Year : 2011  |  Volume : 1  |  Issue : 2  |  Page : 83-87

Performances of Texas Instruments DSP and Xilinx FPGAs for Cooley-Tukey and Grigoryan FFT Algorithms

Department of Electrical Engineering, University of Texas at San Antonio, 6900 North Loop 1604 West, San Antonio, TX 78249-0669, USA

Date of Web Publication24-Oct-2011

Correspondence Address:
Narayanam Ranganadh
Department of Electrical Engineering, University of Texas at San Antonio, 6900 North Loop 1604 West, San Antonio, TX 78249-0669
Login to access the Email id

Source of Support: None, Conflict of Interest: None

DOI: 10.4103/0976-8580.86639

Rights and Permissions

Frequency analysis plays a vital role in the applications like cryptanalysis, steganalysis, system identification, controller tuning, speech recognition, noise filters, etc. Discrete Fourier transform (DFT) is a principal mathematical method for the frequency analysis. The way of splitting the DFT gives out various fast algorithms. In this paper, we present the implementation of two fast algorithms for the DFT for evaluating their performance. One of them is the popular radix-2 Cooley-Tukey fast Fourier transform (FFT) algorithm and the other one is the Grigoryan FFT based on the splitting by the paired transform. We evaluate the performance of these algorithms by implementing them on the TMS320C5416 DSP and also on the Virtex-II FPGAs. Finally, we show that the paired-transform-based algorithm of the FFT is faster than the radix-2 FFT; consequently, it is useful for higher sampling rates. We also discuss the performances of TMS DSP and Xilinx FPGAs and tradeoffs.

Keywords: Discrete Fourier transform, fast algorithms, fast Fourier transform, frequency analysis, paired transforms

How to cite this article:
Ranganadh N, Patel P, Grigoryan AM. Performances of Texas Instruments DSP and Xilinx FPGAs for Cooley-Tukey and Grigoryan FFT Algorithms. J Eng Technol 2011;1:83-7

How to cite this URL:
Ranganadh N, Patel P, Grigoryan AM. Performances of Texas Instruments DSP and Xilinx FPGAs for Cooley-Tukey and Grigoryan FFT Algorithms. J Eng Technol [serial online] 2011 [cited 2020 Sep 19];1:83-7. Available from: http://www.onlinejet.net/text.asp?2011/1/2/83/86639

   1. Introduction Top

In the past decade, fast orthogonal transforms have been widely used in areas of data compression, pattern recognition and image reconstruction, interpolation, linear filtering, and spectral analysis. The suitability of unitary transforms in each of the above-mentioned applications depends on the properties of their basis functions as well as on the existence of fast algorithms, including parallel ones. Since the introduction of the fast Fourier transform (FFT), Fourier analysis has become one of the most frequently used tools in signal/image processing and communication systems. The main problem when calculating the transform relates to the construction of the decomposition, namely, the transition to the short DFTs with minimal computational complexity. The computation of unitary transforms is a complicated and time-consuming process. Since the decomposition of the DFT is not unique, it is natural to ask how to manage splittings and how to obtain the fastest algorithm of the DFT. The difference between the lower bound of arithmetical operations and the complexity of fast transform algorithms shows that it is possible to obtain FFT algorithms of various speeds [1] . One approach is to design efficient manageable split algorithms. Indeed, many algorithms make different assumptions about the transform length. The signal/image processing related to engineering research becomes increasingly dependent on the development and implementation of the algorithms of orthogonal or nonorthogonal transforms and convolution operations in modern computer systems. The increasing importance of processing large vectors and parallel computing in many scientific and engineering applications require new ideas for designing super-efficient algorithms of the transforms and their implementations [2] .

In this paper, we present the implementation techniques and their results for two different fast DFT algorithms. The difference between the algorithm developments lies in the way the two algorithms use the splitting of the DFT. The two fast algorithms considered are radix-2 and paired transform [2] algorithms. The implementation of the algorithms is done both on the TMS320C5416 digital signal processor and on the Xilinx Viretx-II FPGAs. The performance of the two algorithms is compared in terms of their sampling rates and also in terms of their hardware resource utilization.

Section 2 presents the paired transform decomposition used in paired transform in the development of Grigoryan FFT. Section 3 presents the implementation techniques for the radix-2 and paired transform algorithms on FPGAs. Section 4 presents the results. Finally, with Section 5, we conclude the work and put forward some suggestions for further sampling rate improvements.

   2. Decomposition Algorithm of the Fast DFT Using the Paired Transform Top

In this algorithm, the decomposition of the DFT is done by using the paired transform [2] . Let {x(n)}, n = 0:(N−1) be an input signal, N>1. Then the DFT of the input sequence x(n) is

which is in matrix form

where X(k) and x(n) are column vectors, and matrix is a permutation of X.

which shows that the applying transform is decomposed into short transforms FNi , i = 1:k. Let S F be the domain of transform F, the set of sequences f over which F is defined. Let (D; σ) be a class of unitary transforms revealed by a partition σ. For any transform F ∈ (D; σ) the computation is performed by using paired transform in this particular algorithm. To denote this type of transform, we introduce "paired functions" [2] .

Let p, t ∈ period N, and let

Let L be a nontrivial factor of number N, and W L = e2π/L then the complex function

t = 0: (N/L-1), p ∈ ot the period 0: N - 1

is called the L-paired function [2] . Based on these paired functions, the complete system of paired functions can be constructed. The totality of the paired functions in the case of N = 2r is

Now considering the case of N = 2r N1 , where N1 is odd, r is ≥1 for the application of the paired transform:

(a) The totality of the partitions is

(b) The splitting of F N by the partition σ' is

(c) The matrix of the transform can be represented as

where is the diagonal matrix of the twiddle factors. The steps involved in finding the DFT using the paired transform are given below:

  1. Perform the L-paired transform g = χ' N(x) over the input x
  2. Compose r vectors by dividing the set of outputs so that the first L r−1 elements g 1 ,g 2 ,….,g Lr−1 compose the first vector X 1 ; using the next Lr−2 elements g Lr−1 + 1, …., g Lr−1+Lr−2 compose the second vector X 2 , and so on.
  3. Calculate the new vectors Y k , k = 1:(r−1) by multiplying element-wise the vectors X 1) by multiplying element-wise the vectors X k by the corresponding turned factors .
  4. Perform the L r−k -point DFTs over Yk , k = 1:r
  5. Make the permutation of outputs, if needed.

   3. Implementation Techniques Top

We have implemented various architectures for radix-2 and paired transform processors on Virtex-II FPGAs [3] . As there are embedded multipliers [3] and embedded block RAMs [3] available, we can use them without using the distributed logic, which economize some of the CLBs [3] . As most of the transforms are applied on complex data, the arithmetic unit always needs two data points at a time for each operand (real part and complex part); dualport RAMs are very useful in all these implementation techniques.

In the FFT process, the butterfly operation is the main unit on which the speed of the whole process of the FFT depends. So the faster the butterfly operation, the faster the FFT process. The adders and subtractors are implemented using the LUTs (distributed arithmetic). The inputs and outputs of all the arithmetic units can be registered or nonregistered.

Various possible implementations of multipliers one can consider are as follows.

3.1 Embedded Multiplier

  1. with nonregistered inputs and outputs,
  2. with registered inputs or outputs, and
  3. with registered inputs and outputs.

3.2 Distributed Multiplier

Distributed multipliers are implemented using the LUTs in the CLBs. These can also be implemented with the above three possible ways. Various considerations are made to implement the butterfly operation for its speed improvement and resource requirements. The results of these techniques are tabulated in [Table 1].
Table 1: The results of various butterfly implementations

Click here to view

By observing the results, we can say that the butterfly operation with pipelined multipliers (pipelined to maximum extent possible [4] ) distributed arithmetic operations, and registering all the inputs and outputs of the all arithmetic units provides the fastest butterfly operation. The various architectures proposed for implementing radix-2 and paired transform processors are single memory (pair) architecture, dual memory (pair) architecture, and multiple memory (pair) architectures. We applied the following two best butterfly techniques for the implementation of the processors on the Virtex-II FPGAs [3] :

  1. One with distributed multipliers, with fully pipelined stages (best in the case of performance)
  2. One with embedded multipliers and one level pipelining (best in the case of resource utilization)

Single memory (pair) architecture [Figure 1] is suitable for single snapshot applications, where samples are acquired and processed thereafter. The processing time is typically greater than the acquisition time. The main disadvantage in this architecture is that while doing the transform process we cannot load the next coming data. We have to wait until the current data is processed. So we proposed dual memory (pair) architecture for faster sampling rate applications [Figure 2]. In this architecture, there are three main processes for the transformation of the sampled data: Loading the sampled data into the memories, processing the loaded data, and reading out the processed data. As there are two pairs of dual port memories available, one pair can be used for loading the incoming sampled data, while at the same time the other pair can be used for processing the previously loaded sampled data. For further sampling rate improvements, we proposed multiple memory (pair) architecture [Figure 3]. This is the best of all architectures in the case of very high sampling rate applications, but in the case of hardware utilization, it uses a lot more resources than any other architecture. In this model, there is a memory set, one arithmetic unit for each iteration. The advantage of this model over the previous models is that we do not need to wait until the end of all iterations (i.e., whole FFT process), to take the next set of samples to get the FFT process to be started again. We just need to wait until the end of the first iteration and then load the memory with the next set of samples and start the process again. After the first iteration, the processed data is transferred to the next set of RAMs, so the previous set of RAMs can be loaded with the next coming new data samples. This leads to the increased sampling rate.
Figure 1: Single memory (pair) architecturegrave

Click here to view
Figure 2: Dual memory (pair) architecture

Click here to view
Figure 3: Multiple memory (pair) architecture

Click here to view

Coming to the implementation of the paired-transform-based DFT algorithm, there is no complete butterfly operation, as that in the case of radix-2 algorithm. According to the mathematical description given in Section 2, the arithmetic unit is divided into two parts, addition part and multiplication part. This makes the main difference between the two algorithms, which causes the process of the DFT complete earlier than the radix-2 algorithm. The addition part of the algorithm for the eight-point transform is shown in [Figure 4]. The architectures are implemented for the 8-point and 64-point transforms. The radix-2 FFT algorithm is efficient in the case of resource utilization and the paired transform algorithm is very efficient in the case of higher sampling rate applications.
Figure 4: The addition part of the eight-point paired-transform-based DFT

Click here to view

   4. The Implementation Results Top

4.1 Results Obtained on the TMS320C5416 Digital Signal Processor

The software used for implemention on the DSP is the Texas Instruments Code Composer Studio. We used C programming language for implementation. [Table 2] and [Table 3] show the implementation results and the comparison between the two algorithms on the DSP processor.
Table 2: Performance comparison of the two algorithms on the DSP processor

Click here to view
Table 3: The sampling rate of both the algorithms (starting from N = 8 to N = 1024)

Click here to view

[Table 2] shows that the paired algorithm of DFT is much faster than the radix-2 algorithm. Going to higher transform lengths, the paired algorithm gives the higher percentage improvement over the radix-2 algorithm. [Table 3] shows the time required and the sampling rates that the two algorithms can be operated at.

4.2 Results Obtained on Virtex-II FPGAs

The hardware modeling of the algorithms is done by using Xilinx's system generator plug-in software tool running under SIMULINK environment provided under the Mathworks's MATLAB software. The functionality of the model is verified using the SIMULINK

[Table 4] shows the implementation results of the two algorithms on the Virtex-II FPGAs. From [Table 2], [Table 3] and [Table 4], we can see that the paired transform is always faster than the radix-2 algorithm. Thus paired-transform-based algorithm can be used for higher sampling rate applications. In military applications, while doing the process, only some of the DFT coefficients are needed at a time. For this type of applications, the paired transform can be used as it generates some of the coefficients earlier, and also it is very fast.
Table 4: The sampling rates and the resource utilization summaries for both the algorithms, implemented on the Virtex-II FPGAs

Click here to view

   5. Conclusion Top

In this paper, we have shown that both on DSPs and also on FPGAs the paired-transform-based algorithm is faster and can be used at higher sampling rates than the radix-2 FFT at an expence of high resource utilization.

In all implementations on FPGAs, the number of bits used for the data is 16. So, all the multipliers here are used as 16-bit multipliers. The multipliers used were 18-bit multipliers. For instance, if there are some applications using only 8-bit data, then one can use the 40 dedicated multipliers as 80 multipliers, as two multiplications can be implemented by using a single embedded multiplier as long as the sum of the two product bits is less than 36 bits.

In the implementations on the DSP if the MAC engines are used explicitly, then there may be a possibility of a better comparison between the algorithms. Also, one can see some more speed improvement in the DFT processes.

   References Top

1.J. W. Cooley, and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier Series", Math. Comput., Vol. 19, pp. 297-301, 1965.  Back to cited text no. 1
2.A. Grigoryan, and S. S. Agaian, "Split Manageable Efficient Algorithm for Fourier and Hadamard transforms", Signal Processing, IEEE Transactions on, Vol. 48, no. 1, pp. 172-183, 2000.  Back to cited text no. 2
3.Virtex-II platform FPGAs. Available from: http://direct.xilinx.com/bvdocs/publications/ds031-2.pdf. [Last accessed on 2011 Jul 6].  Back to cited text no. 3
4.CC Studio getting started guide, TMS320C54x.  Back to cited text no. 4

   Authors Top

Mr. Narayanam Ranganadh did his Master's in Electrical and Computer Engineering at the University of Texas at San Antonio. He has undergone guidance for his Master's thesis under Dr. Parimal Patel and Dr. Artyom Grigoryan. He has undergone Master's level training and research at University of Ottawa, Canada, on the "brain stem speech evoked potentials" under the guidance of Dr. Hilmi Dajani.


  [Figure 1], [Figure 2], [Figure 3], [Figure 4]

  [Table 1], [Table 2], [Table 3], [Table 4]

This article has been cited by
1 Tensor transform-based quaternion fourier transform algorithm
Artyom M. Grigoryan,Sos S. Agaian
Information Sciences. 2015; 320: 62
[Pubmed] | [DOI]
Narayanam Ranganadh, Nageswara rao Dhanavath
3 A TMS DSP processor based case study of Grigoryan FFT performance over Cooley - Tukey FFT ( TMS320C6748, TMS320C5515)
Narayanam Ranganadh , Rekha Andal Vangala
IOSRJEN. 2013; 3(1): 55-60
[VIEW] | [PDF]
. ;
[Pubmed] | [PDF]


    Similar in PUBMED
   Search Pubmed for
   Search in Google Scholar for
 Related articles
    Access Statistics
    Email Alert *
    Add to My List *
* Registration required (free)  

  In this article
   1. Introduction
    2. Decomposition...
    3. Implementatio...
    4. The Implement...
   5. Conclusion
    Article Figures
    Article Tables

 Article Access Statistics
    PDF Downloaded580    
    Comments [Add]    
    Cited by others 4    

Recommend this journal