HIGH SPEED, LOW COMPLEXITY, FOLDED, POLYMORPHIC WAVELET ARCHITECTURE USING RECONFIGURABLE HARDWARE

R.Lavanya and Saranya B

vlsi signal processing research group(ECE Dept), Amrita school of engineering, Ettimadai, Coimbatore
bsaranya195@gmail.com

Abstract

The main aim of this paper is to design and implement a high speed, low complexity and polymorphic architecture for reconfigurable folded wavelet filters. 5/3 wavelet results are incorporated into the 9/7 data path which reduces the number of adders compared to other solutions and also allows on the fly switching between the filters. The proposed work is to improve the speed of this reconfigurable architecture. This is accomplished by scheduling. A weight based scheduling algorithm has been used in this paper. This is an analysis method to improve inter task communication as well as data dependencies among tasks which will reduce the overall communication overhead and processing time.

Keywords: Polymorphic architecture, Canonical signed digit, Lagrange half band filter, reconfigurable hardware, folding

1. Introduction

The Discrete Wavelet Transform is used for image compression in many signal processing applications. DWT can be implemented using filter bank structures. Polymorphic DWT architecture enables the dynamic control of allocated resources to get high performance output. Polymorphism is the ability of the architecture to adapt its hardware usage to meet the desired dynamic requirements. The resultant hardware has high throughput and high image reconstruction quality, and low power consumption. This is mainly used in real time image and multimedia processing application such as video surveillance and telemedicine. Most implementations approximate the irrational coefficients to rational values, leading to image reconstruction quality tradeoffs. To overcome this Lagrange Half Band Filter [2] which can be used for the computation of rational coefficients. This method of calculation of rational filter coefficients is more accurate.

Biorthogonal filters are used for implementing polymorphic DWT. In biorthogonal filter bank all the filters are finite-impulse response (FIR) filters. Biorthogonal wavelets have symmetric filter coefficients and they have orthogonality property. Symmetric filters have linear phase property which prevents the phase distortion at the filter output. The main difference between traditional filter bank design and wavelet filter design is the
regularity of the wavelet filter design. Regularity can be achieved by preserving zeros at the aliasing frequency \( z = -1 \). The implementation of CDF-9/7 over fixed point arithmetic leads to quantization error and requires large amounts of hardware resources for perfect image reconstruction. Multiplier free implementation results in reduced complexity without reducing the accuracy. The two most common standard filters for DWT are Le Gall’s 5/3 filter and the Daubechies 9/7 filter. Using these two filters poly-DWT filter achieves high image compression and power efficient implementation. Reconfiguring 5/3 and 9/7 filters reduces the hardware cost. A typical application where this reconfigurable architecture finds use is the free JPEG 2000 software codec (Jasper). The intra frame coding is done using 9/7 filter and differential frame coding is done using 5/3 filter. In Distributed arithmetic [7] DWT is computed using adders butterflies. In cascaded implementation[8] uses only adders and shifters for calculating DWT. Lifting scheme [4] require long critical path. Flipping requires multipliers for DWT computation. Many DWT approaches uses multiplier free implementation [7],[8],[6] but all these approaches need more hardware than the proposed work. The improved lifting and reconfigured 5/3 and 9/7 wavelet filters [5], overcome the problem of traditional lifting scheme but it involves more number of multipliers which means more hardware than multiplier free implementation. The direct filter bank approach is used in [3] and [1]. By implementing multipliers in terms of adders and shifters, the complexity of the architecture can be reduced. Multipliers are replaced with adders and shifters by representing coefficients in canonical sign digit form. With polymorphic wavelet architecture further reduction in hardware is achieved by expressing 9/7 filter in terms of 5/3 filter, which reduce the number of adders in the architecture.5/3 filter implementation is common to both 5/3 and 9/7 implementation on fly switch is used to select these filters.

![Figure 1. Filter bank structure of DWT](image)

The 9/7 filter has irrational filter coefficients and requires more hardware. So quantizing the filter coefficients results in minimum hardware but leads to poor performance. In this work approximation of filter coefficients are done using a Lagrange half band filter(LHFB). This is an efficient method that preserves biorthogonality and perfect reconstruction. LHBF convert irrational values to fractional value which can be easily expressed in hardware.

2. The polymorphic architecture

The main aim of this work is to achieve a high speed, accurate, low complexity, reconfigurable hardware. To obtain a low complexity hardware multiplier free implementation is used. Reconfigurable hardware reduces number of adders in the circuit. It is expressing 9/7 filter in terms of 5/3. The number of adders is further reduced by folding the reconfigurable architecture. Overall block diagram in figure 4 is explained below.

2.1. Lagrange Half Band Filter

Lagrange filter is used for generating rational coefficients of 9/7 filter. The rational values
for 9/7 filter are computed by analyzing filter bank structure given in figure 1. The analysis and synthesis low pass filters are denoted by F0 and H0 respectively. The analysis and synthesis high-pass filters are denoted by F1 and H1 respectively and are obtained by quadrature mirroring the low-pass filters. The initial step for calculation of rationalized 9/7 coefficients is the analysis of filter bank structure [2].

\[ F_2(z) = zH_0(-z) \]  \hspace{1cm} (1)

\[ H_4(z) = z^{-1} F_0(-z) \]  \hspace{1cm} (2)

\[ D(Z) = L_{H_{11}}(Z) = Z^\alpha (1 + \frac{z^{-1}}{2})^{2\beta} R_{H_{11}}(Z) \]  \hspace{1cm} (3)

\[ B + 3C + \alpha A + 3B\alpha + 3C\alpha = 0 \]  \hspace{1cm} (4a)

\[ 3A + 3B + C + 3\alpha + 3A\alpha + B\alpha = 0 \]  \hspace{1cm} (4b)

\[ 3 + A + \alpha = 0 \]  \hspace{1cm} (4c)

**Table 1. Low pass coefficients**

<table>
<thead>
<tr>
<th>α</th>
<th>( H_0 )</th>
</tr>
</thead>
<tbody>
<tr>
<td>-3/2</td>
<td>[5,5,-4,29,46,29,-4,-5,5]/96</td>
</tr>
<tr>
<td>-5/3</td>
<td>[9,-6,-24,86,190,86,-24,-6,9]/320</td>
</tr>
<tr>
<td>-8/5</td>
<td>[5,-4,-9,40,80,40,-9,-4,5]/144</td>
</tr>
<tr>
<td>-2</td>
<td>[1,0,-8,16,46,16,-8,0,1]/64</td>
</tr>
</tbody>
</table>

**Table 2. High pass coefficients**

<table>
<thead>
<tr>
<th>α</th>
<th>( H_0 )</th>
</tr>
</thead>
<tbody>
<tr>
<td>-3/2</td>
<td>[5,5,-4,29,46,29,-4,-5,5]/96</td>
</tr>
<tr>
<td>-5/3</td>
<td>[9,-6,-24,86,190,86,-24,-6,9]/320</td>
</tr>
<tr>
<td>-8/5</td>
<td>[5,-4,-9,40,80,40,-9,-4,5]/144</td>
</tr>
<tr>
<td>-2</td>
<td>[1,0,-8,16,46,16,-8,0,1]/64</td>
</tr>
</tbody>
</table>
The variables A, B, C in the simultaneous equations are solved to get the rational 9/7 coefficients. The parameter \( \alpha \) (free factor) can be varied between -1.5 and -2. The parameters A, B and C depend on value of \( \alpha \). Setting \( \alpha = -1.6848 \) will give back the original “9/7” filter pair. When \( \alpha = -2 \), the coefficients of the “9/7” can be expressed as sum of power of two (SPT) terms and the filter can be implemented without any multipliers.

The main technique used is to allow some degree of freedom in choosing the coefficients by freeing some zeros of the LHBF. Table 1 contain low pass filter coefficients and table 2 contain high pass filter coefficients of 9/7 filter.

2.2. Canonical Signed Digit

The rational filter coefficients are expressed as Sum or difference of Powers of Two(SPT) for a multiplier less implementation. This multiplier free implementation reduces the complexity of the hardware. The total number of SPT used must be kept minimum in order to achieve a low complexity hardware without reducing the compression performance. There is a subset of SPT representation called canonical signed digit (CSD) form. In CSD no two adjacent bits are nonzero. It is an optimal representation in which nonzero digits are required to represent a given number. The main advantage of canonical representation is that it uses two third of as many nonzero digits as a binary representation, thus it reduce the number of adders required in hardware.

2.3. Reconfigurable hardware

When 9/7 filter coefficient are expressed in terms of 5/3 filter coefficients number of adders are reduced. The hardware complexity is again reduced.

<table>
<thead>
<tr>
<th>Table 3. 5/3 filter coefficients</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
</tr>
<tr>
<td>2</td>
</tr>
<tr>
<td>1</td>
</tr>
<tr>
<td>0</td>
</tr>
</tbody>
</table>

By comparing table3 with table 2 and 1 9/7 filter coefficients can be expressed in terms of 5/3 filter coefficients, which is given by the following equations.

\[
\text{Low}_{9/7} = \text{low}_{5/3}(i) - w(0)/32 + w(4)/64 \tag{5}
\]

\[
\text{High}_{9/7} = \text{high}_{5/3}(i)/2 - w(1) + w(3)/32 \tag{6}
\]

The 5/3 filter is implemented such that this portion (5/3) is common to 5/3 and 9/7 filter. Switches are used to select between 5/3 and 9/7 filter. In this implementation multipliers are removed by expressing coefficients in SPT form. Another select line is used to select high
pass and low pass filter. The reconfigurable architecture is shown in figure 4. The input signal is an image and it is given serially to the architecture.

3. Proposed work

3.1. Scheduling

Scheduling is used to improve the speed of the circuit. In this proposed work weight based scheduling algorithm is used. The output of the 5/3 filter is given to the input of the 9/7 filter. So a long delay occur at the output. To improve the speed of the circuit, schedule the computational units (adders and shifters) in the circuit. It should be scheduled based on the availability of inputs and weight (delay) of computational units. Scheduled circuit requires less time to compute the output of DWT than the unscheduled circuit. Many well known scheduling algorithms like weight based scheduling, resource scheduling, priority scheduling etc. Among these weight based scheduling is chosen for the work due to many reasons. It is simple, when compared to other scheduling algorithms like priority scheduling algorithm which requires assigning priority to the nodes of the circuit. Weight based scheduling is sufficient for simple circuits such that reconfigurable architecture considered in this work. Thus it eliminates the need the need for prioritizing the node, which is critical only for complicated circuits. Resource based scheduling is oriented towards optimal use of resources and hence will increase the delay. Taking into account the delay constraint and simplicity, weight based scheduling algorithm is optimal and then it has been implemented in this work.

The delay of the various computational elements in the architecture is computed and is assigned as weight of the nodes in the DFG. Then these nodes are sorted based on their weights, and are visited in an increasing order. Once all the nodes in the current level have been visited, the algorithm targets the next level and so on till all nodes are sorted. The edge within the DFG represent the data flow. In this way, the architecture is scheduled and reduce the overall latency.

Algorithm for weight based scheduling

Input: There are n task{1,2,...n} each task has a weight wi and W is the upper bound, a data flow graph is drawn with n nodes.
Output: Set of scheduled nodes S.
Initialize the graph.
Let O denote the set of unscheduled nodes.
While O is empty
do
Find the lightest node in the top level of DFG, let it be x. then
If wx < W then
Remove x from the O set and include it in the S set.
Repeat ,until O is empty
3. Simulation result

The image input is given to the architecture as 8 bit binary values. For 1D DWT first row filtering is done. Then the output of 1D DWT is given as the input for the next stage. The same architecture is used for next stage filtering. In the 2D DWT the image input is fed column wise to the hardware. Verilog simulation of the proposed architecture is carried out in Model sim SE 6.5. The Model sim results are verified with MATLAB output to ensure the functionality of the circuit. The timing information is obtained from Quartus II. Using that delay information C code is developed for scheduling. The first step is to convert the behavioural description into the data flow graph (DFG). The DFG represents the reconfigurable architecture and the arithmetic operations are represented by the nodes of the DFG. The next step is to schedule the operations depending on the data availability. The scheduled graph requires 2 clock cycles less than unscheduled graph. Overall speed of the circuit is increased by scheduling.

![Figure 4. Reconfigurable hardware](image)

5. Experimental results

The proposed architecture requires less number of adders and no multipliers than previous works given in table 4.
Table 4. Comparison of hardware complexity of proposed architecture with existing works

<table>
<thead>
<tr>
<th>Proposed work</th>
<th>[5]</th>
<th>[7]</th>
<th>[8]</th>
<th>[6]</th>
<th>[1]</th>
</tr>
</thead>
<tbody>
<tr>
<td>Adders</td>
<td>12</td>
<td>4</td>
<td>43</td>
<td>32</td>
<td>21</td>
</tr>
<tr>
<td>Multipliers</td>
<td>0</td>
<td>3</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>

Table 5. Comparison of speed of scheduled vs. unscheduled hardware

<table>
<thead>
<tr>
<th></th>
<th>Low pass 5/3</th>
<th>High pass 5/3</th>
<th>High pass 9/7</th>
<th>Low pass 9/7</th>
</tr>
</thead>
<tbody>
<tr>
<td>Scheduled</td>
<td>10μs</td>
<td>13μs</td>
<td>13μs</td>
<td>16 μs</td>
</tr>
<tr>
<td>Unscheduled</td>
<td>15μs</td>
<td>15μs</td>
<td>28μs</td>
<td>28μs</td>
</tr>
</tbody>
</table>

From the table 5 shows the comparison between scheduled and unscheduled architecture. The speed of 9/7 filter has been improved by scheduling.

6. Conclusions

The proposed architecture is a very efficient for fast computation of DWT. The speed of the architecture is improved by scheduling. This high speed circuit has only few adders and shifters thus reducing the complexity to a great extent. The future work includes dynamic power reduction technique for reducing the power consumption of switched modules.

7. References


