Speed and area analysis on hierarchy multiplier

This paper proposes hierarchy multiplier designs that use different 4:2 and 7:3 compressor designs as well as multiple compressor types. The hierarchy multiplier is optimised for speed or area by redesigning the 4:2 compressor units and by introducing a combination of 4:2 and 7:3 compressor units in a Vedic multiplier block. All designs are simulated using Altera Quartus II software. The aim of this paper is to improve speed at the cost of, at most, a moderate increase in area, without considering power consumption. Compared with the design in [1], the proposed designs are 4.5% to 8.3% faster, with area ranging from 0.5% more to 5.8% less.


Introduction
Multiplication is a basic arithmetic operation in processors, computation, digital signal processing, image processing and other applications. Fast multiplication demands high processing speed, and the complexity of VLSI designs increases every 18 months in keeping with Moore's law as VLSI technology advances [2][3][4].
Multiplication has three basic stages: partial product generation, partial product reduction and partial product addition. First, the multiplicand and multiplier are multiplied to generate partial products. Then, 4:2 or 7:3 compressor units are used to reduce the delay of partial product reduction [5][6][7]. Finally, the final addition is done by a Binary-to-Excess-one Converter (BEC). To optimise speed, a Carry Save Adder tree is implemented using 4:2 and 7:3 compressors. The details are discussed in the next section.
A hierarchy multiplier makes use of fine-grained parallelism to reduce the propagation delay of compute-intensive operations. Four Vedic base multipliers in the hierarchy perform multiplication in parallel to speed up the hierarchy multiplier. The compressor blocks are implemented in the four Vedic blocks [1].

Hierarchy multiplier
A 16-bit hierarchy multiplier uses four 8-bit base multiplier units based on the Vedic multiplication algorithm. Vedic multiplication comes from ancient Indian mathematics and is based on the Urdhva Tiryakbhyam algorithm. The Vedic multiplier is time, area and power efficient because its gate delay and area increase slowly with the number of bits compared with other multipliers [8][9][10]. All the base multipliers consist of AND gates, compressors, half adders and full adders. They also contain a Carry Save Adder (CSA), a Carry Select Adder (CslA) and a BEC with multiplexer (MUX) for the final addition of partial products [1]. However, this paper focuses only on how different designs of 4:2 and 7:3 compressors, as well as the use of both compressors in one base multiplier architecture, improve the performance of the hierarchy multiplier.
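The way four 8-bit base multipliers combine into one 16-bit product can be illustrated with a short behavioural sketch in Python. It is not the hardware description used in this work: the function names are hypothetical, each base multiplier is stood in for by an ordinary integer multiply, and the final CSA/CslA/BEC addition is replaced by plain integer addition.

```python
def base_multiply_8bit(a, b):
    """Stand-in for one 8-bit base multiplier block (plain integer multiply)."""
    assert 0 <= a < 256 and 0 <= b < 256
    return a * b


def hierarchy_multiply_16bit(a, b):
    """Behavioural model of a 16-bit multiply built from four 8-bit products."""
    assert 0 <= a < (1 << 16) and 0 <= b < (1 << 16)
    a_lo, a_hi = a & 0xFF, a >> 8
    b_lo, b_hi = b & 0xFF, b >> 8

    # The four products do not depend on each other, so in hardware
    # the four base multipliers can run in parallel.
    p_ll = base_multiply_8bit(a_lo, b_lo)
    p_lh = base_multiply_8bit(a_lo, b_hi)
    p_hl = base_multiply_8bit(a_hi, b_lo)
    p_hh = base_multiply_8bit(a_hi, b_hi)

    # Final addition: each product is shifted to its weight and summed.
    return p_ll + ((p_lh + p_hl) << 8) + (p_hh << 16)
```

The identity behind the sketch is (a_hi·2^8 + a_lo)(b_hi·2^8 + b_lo) = a_hi·b_hi·2^16 + (a_hi·b_lo + a_lo·b_hi)·2^8 + a_lo·b_lo.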

Base multiplier Architecture
A base multiplier is built from AND gates, compressors and adders. The AND gates generate the partial products, and the compressors and adders that follow reduce them. The base multiplier is illustrated in Figure 2.
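As a rough illustration of partial product generation, the following Python sketch ANDs every bit pair and groups the results by column weight. The function names and the default 8-bit width are assumptions made for the example; the reduction step is modelled by simply counting the ones in each column rather than by compressors.

```python
def partial_product_columns(a, b, width=8):
    """AND every bit of a with every bit of b, grouped by column weight i + j."""
    columns = [[] for _ in range(2 * width - 1)]
    for i in range(width):
        for j in range(width):
            bit = ((a >> i) & 1) & ((b >> j) & 1)  # one AND gate
            columns[i + j].append(bit)
    return columns


def sum_columns(columns):
    """Reference reduction: add the bits of each column at its weight."""
    return sum(sum(column) << weight for weight, column in enumerate(columns))
```

The column heights grow from one bit at the edges to `width` bits in the centre, which is why the widest compressors are needed in the middle of the reduction tree.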

A 4:2 Compressor Architecture
A standard 4:2 compressor is implemented by cascading two full adder units, as shown in Figure 3. A 4:2 compressor has four inputs (X1, X2, X3 and X4), a carry input (Cin) and two outputs (Sum and Carry), along with a carry output (Cout) [11][12]. The design in [1], design 1 and design 2 can each be divided into two full adder blocks as shown in
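A behavioural Python sketch of the standard 4:2 compressor built from two cascaded full adders is given here for illustration. The wiring (X1, X2, X3 into the first full adder, its sum together with X4 and Cin into the second) is the usual textbook arrangement and is assumed rather than taken from Figure 3; the function names are hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def compressor_4_2(x1, x2, x3, x4, cin):
    """4:2 compressor as two cascaded full adders: returns (Sum, Carry, Cout)."""
    s1, cout = full_adder(x1, x2, x3)       # first full adder
    total, carry = full_adder(s1, x4, cin)  # second full adder
    return total, carry, cout


def check_compressor_4_2():
    """Check X1+X2+X3+X4+Cin == Sum + 2*(Carry + Cout) for all 32 input cases."""
    for bits in product((0, 1), repeat=5):
        total, carry, cout = compressor_4_2(*bits)
        if sum(bits) != total + 2 * (carry + cout):
            return False
    return True
```

Cout depends only on X1 to X3 and not on Cin, so a chain of such compressors across columns does not ripple the carry from one column to the next.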

Design in journal [1]
The 4:2 compressor in [1] cascades two full adder units, each of which has two XOR gates and a MUX [1]. The 4:2 compressor can be divided into two groups of full adders as shown in Figure 4.
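One common way to build a full adder from two XOR gates and a MUX is sketched below in Python. The exact wiring used in [1] is not reproduced in this text, so the connections shown here are an assumption, and the function names are hypothetical.

```python
def mux(select, when_0, when_1):
    """2:1 multiplexer."""
    return when_1 if select else when_0


def full_adder_xor_mux(a, b, c):
    """Full adder from two XOR gates and one MUX: returns (sum, carry)."""
    p = a ^ b              # first XOR: do a and b differ?
    total = p ^ c          # second XOR: sum bit
    # If a == b (p = 0), the carry equals a; otherwise it equals c.
    carry = mux(p, a, c)
    return total, carry
```

The MUX replaces the AND/OR majority logic of a conventional full adder, which is where the saving in gate count comes from.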

Design 1
Cascading two Mux-Add units is similar to cascading two full adders to implement a 4:2 compressor. The Mux-Add in [13] suffers from a high transistor count, which the modified Mux-Add reduces. An XOR or XNOR gate consists of eight CMOS transistors, an AND or OR gate uses six, a MUX uses four and a NOT gate uses two. The total number of transistors in a full adder block is summarised in Table 1. The larger the design, the more cost is saved by reducing the transistor count. The equations of design 1 are as below:

A New Base Multiplier Architecture
The new base multiplier architecture is similar to that in Figure 2, but it uses a combination of 4:2 and 7:3 compressors for optimisation. The architecture is shown in Figure 7.

A 7:3 Compressor Architecture
A 7:3 compressor consists of four Full Adder (FA) blocks. The first two FAs add the inputs X1 to X6, and their sum and carry outputs are routed to different next-stage FAs. The FA that receives the sum outputs of the first two FAs adds them to X7 to produce the Sum output. The FA that adds the carries from the first two FAs produces the outputs Carry and Cout. However, the carry ( +1 ) is propagated to the next two stages. The 7:3 compressor allows parallel addition because the sum and carry propagate to different next-stage FAs. Used in partial product reduction, it lets seven partial products generated by the AND gates be added at the same time. A standard 7:3 compressor is shown in Figure 8. Placing the 7:3 compressor at the centre of the compressor tree, where seven partial products are summed in parallel, leads to a faster multiplication process. This is because the hidden critical path of the carry signal is shortened by optimising the connections of the compressor topology [14][15][16].
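The four-FA structure described above can be sketched behaviourally in Python. The wiring is inferred from the description (Figure 8 is not reproduced in this text), the carry of the third FA is assumed to feed the fourth FA, and Cout is assumed to carry weight 4; the function names are hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def compressor_7_3(x1, x2, x3, x4, x5, x6, x7):
    """7:3 compressor from four full adders: returns (Sum, Carry, Cout)."""
    s1, c1 = full_adder(x1, x2, x3)      # first-stage FA on X1..X3
    s2, c2 = full_adder(x4, x5, x6)      # first-stage FA on X4..X6
    total, c3 = full_adder(s1, s2, x7)   # sums of stage one plus X7
    carry, cout = full_adder(c1, c2, c3) # carries are added separately
    return total, carry, cout


def check_compressor_7_3():
    """Check sum of inputs == Sum + 2*Carry + 4*Cout for all 128 input cases."""
    for bits in product((0, 1), repeat=7):
        total, carry, cout = compressor_7_3(*bits)
        if sum(bits) != total + 2 * carry + 4 * cout:
            return False
    return True
```

The two first-stage FAs work on disjoint inputs, which is the parallelism the text refers to.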

Design 3
Design 3 takes a single full adder unit from the design in journal [1] (Figure 4) and duplicates it four times. The block diagram of design 3 is shown in Figure 9. Each full adder unit contains two XOR gates and a MUX.

Design 4
Design 4 uses only a full adder block from design 2 and connects the other three full adder units as shown in Figure 10. Each full adder consists of an XOR gate, a NOT gate and two MUXes.
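The per-gate transistor counts quoted under Design 1 allow a rough comparison of the full adder units in design 3 and design 4. The Python sketch below is derived arithmetic only: the totals are not copied from Table 1, the MUX is assumed to cost four transistors, and the 7:3 compressor is assumed to be exactly four identical full adders with no additional logic.

```python
# Per-gate CMOS transistor counts as quoted in the text
# (the MUX figure of four transistors is an assumption).
TRANSISTORS = {"XOR": 8, "XNOR": 8, "AND": 6, "OR": 6, "MUX": 4, "NOT": 2}


def transistor_count(gates):
    """Total transistors for a dict mapping gate name to number of gates."""
    return sum(TRANSISTORS[name] * count for name, count in gates.items())


# Gate make-up of one full adder unit, as described for each design.
FULL_ADDER_DESIGN_3 = {"XOR": 2, "MUX": 1}
FULL_ADDER_DESIGN_4 = {"XOR": 1, "NOT": 1, "MUX": 2}

fa3 = transistor_count(FULL_ADDER_DESIGN_3)  # 8 + 8 + 4 = 20
fa4 = transistor_count(FULL_ADDER_DESIGN_4)  # 8 + 2 + 4 + 4 = 18

# A 7:3 compressor uses four full adder units.
compressor3 = 4 * fa3  # 80
compressor4 = 4 * fa4  # 72
```

Under these assumptions the design 4 full adder saves two transistors per unit, or eight per 7:3 compressor.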

Simulation result and Analysis
All designs are described in Verilog Hardware Description Language and simulated in Altera Quartus II software, targeting the EP2C70F896C6 device from the Cyclone II family. Functional verification of the 16-bit hierarchy multiplier is done with the simulator tools in the software. Performance is analysed after each design has been successfully debugged.

Functional Verification
The function of the multiplier is verified before proceeding to performance analysis. The inputs are swept from the lower to the higher range to ensure that the multiplier produces the correct output and that carries propagate accurately to every component, as shown in Figure 11. Boundary conditions are also considered because four base multipliers are used to perform the multiplication. The purpose is to verify that signals propagate properly between the base multiplier blocks. A set of inputs is used to verify the boundary behaviour, as shown in Figure 12.
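A boundary check of this kind can be sketched in Python as a small harness that compares any multiplier model against the exact product. The input values below are hypothetical examples chosen around the 8-bit split of a 16-bit operand; they are not the vectors shown in Figure 12, and the function names are assumptions.

```python
def boundary_vectors():
    """Hypothetical 16-bit inputs around the boundary between the two 8-bit halves."""
    return [0x0000, 0x0001, 0x00FF, 0x0100, 0x0101, 0xFF00, 0xFFFF]


def verify_multiplier(multiply, values=None):
    """Return the (a, b) pairs for which multiply(a, b) differs from a * b."""
    values = boundary_vectors() if values is None else values
    failures = []
    for a in values:
        for b in values:
            if multiply(a, b) != a * b:
                failures.append((a, b))
    return failures
```

An empty list means every pair matched. A model that drops the upper byte of either operand, for example, would fail as soon as an input such as 0x0100 is applied.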

Performance Analysis
The longest total propagation delay (tpd) and the total number of logic elements (area) are used to analyse the performance of the hierarchy multiplier. The performance is summarised in Table 3. Since the aim of this paper is to optimise the speed and area of the hierarchy multiplier, the percentage improvement is calculated using the formula below. According to Table 3, both design 1 and design 2 improve on the design in journal [1] in terms of speed and area. Design 1 reduces delay by 4.5% and uses 5.8% less area, while design 2 is 5.1% faster and uses 2.7% less area than the design in journal [1]. Design 3 sacrifices 0.5% area and design 4 has no impact on area, but both achieve larger speed improvements, of 7.7% and 8.3% respectively.
Design 1 is suggested for designers who are constrained in size and cannot accept increased latency, whereas design 4 suits designers who have plenty of space. Design 2 is more appropriate when balanced overall performance of the hierarchy multiplier is required.
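The improvement formula itself is not reproduced in this text. A conventional definition, assumed here and not confirmed by the source, is the relative reduction with respect to the design in journal [1], where V stands for either the propagation delay or the number of logic elements:

```latex
\text{Improvement}\,(\%) = \frac{V_{[1]} - V_{\text{proposed}}}{V_{[1]}} \times 100
```

With this sign convention a positive value means a faster or smaller design, and a design that uses more area than [1] gives a negative value, which is consistent with the -0.5% area figure quoted for design 3.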

Conclusion
An efficient hierarchy multiplier is developed by redesigning the 4:2 compressor and by utilising multiple compressors, namely 4:2 and 7:3 compressors, in partial product reduction. Using design 1 to modify the 4:2 compressor architecture consumes 5.8% less area than the previous design in journal [1]. The proposed designs improve the speed of the hierarchy multiplier by 4.5% to 8.3%. The proposed multiple-compressor designs are at a disadvantage in area compared with design 1 and design 2, but they are faster than designs using only 4:2 compressors. Design 2 gives the best balance of speed and area, with delay and area reduced by 5.1% and 2.7% respectively.
We thank Universiti Malaysia Perlis for providing financial support to produce this paper.