Low Area Pipelined Circuits by the Replacement of Registers with Delay Elements*

Bakhtiar Affendi ROSDI† and Atsushi TAKAHASHI†, Members

SUMMARY  A new algorithm is proposed to reduce the area of a pipelined circuit using a combination of multi-clock cycle paths, clock scheduling and delay balancing. The algorithm analyzes the circuit and replaces intermediate registers with delay elements under the condition that the circuit works correctly at given target clock-period range with the smaller area. Experiments with pipelined multipliers verify that the proposed algorithm can reduce the area of a pipelined circuit without degrading performance.

key words: pipelined circuits, multi-clock cycle paths, clock scheduling, delay balancing

1. Introduction

Circuit pipelining is one technique that has been used in order to shrink the clock period. Pipelining is a method in which a circuit is divided into a small number of stages and intermediate registers are inserted between stages to store the intermediate data. With this method, extra circuit area is required to situate the additional intermediate registers and the size of the clock tree is also increased.

Recently, to overcome this problem, several studies have been carried out on wave pipelining [2], which is a method of speeding up the circuit without the insertion of intermediate registers. However, wave pipelining requires tighter timing constraints. In wave pipelining, there may exist a number of ‘waves’ of data in a circuit at any given time. Therefore, to avoid data collisions, delay balancing is required, which increases the circuit area.

In [3], they proposed an algorithm to reduce the number of intermediate registers of a pipelined circuit by using a combination of multi-clock cycle paths and clock scheduling. A multi-clock cycle path is a path from register to register where data transmission takes more than one clock period. Note that in wave pipelining, all paths are multi-clock cycle paths. Introducing a multi-clock cycle path into a pipelined circuit allows some intermediate registers to be removed. Their algorithm removes an intermediate register without delay balancing. Clock scheduling is a technique in which the clock skew of a register is intentionally introduced to improve circuit performance by relaxing the timing constraints. Using clock scheduling, more intermediate registers can be removed, without the need for delay balancing. Note that the reduction on the number of intermediate registers will reduce the complexity of clock tree synthesis.

The algorithm proposed in [3] removes some intermediate registers, while the others remain so that the circuit works correctly at given target clock-period range. However, there are some possibilities that a remaining intermediate register can be replaced with some delay elements whose total area is smaller than the register. In this paper, we propose a new algorithm that replaces intermediate registers with delay elements whose total area is smaller than the register. In our proposed algorithm, all intermediate registers of the pipelined circuit are initially removed as in [3]. Then using the algorithm in [4], the obtained circuit is checked whether it works correctly at given target clock-period range or not. If the circuit does not work correctly at given target clock-period range, some timing constraints are violated. The violated timing constraints can be eliminated by intermediate register insertion or delay element insertion. The algorithm eliminates the violated timing constraints iteratively by inserting registers or buffers with small area cost. As the result, our proposed algorithm replaces an intermediate register with delay elements whose total area is smaller than the register.

Our algorithm is an enhancement of the algorithm proposed in [3], since an intermediate register is removed if no buffer is inserted to the location where the removed intermediate register is located. Delay balancing is implemented as the replacement of intermediate register with some delay elements.

Experiments with multipliers verify that, given a particular target clock-period range, the proposed algorithm can obtain a circuit with smaller area compared with the circuit obtained by the algorithm in [3]. The number of intermediate registers in the circuit obtained by proposed algorithm is smaller compared with the number of intermediate registers.
in the circuit obtained by the algorithm in [3].

2. Preliminaries

We consider a circuit with a single clock consisting of registers linked by combinatorial circuits. The clock timing \( s(v) \) of register \( v \) is the difference in clock signal arrival time between \( v \) and an arbitrarily chosen (perhaps hypothetical) reference register. The set of clock timings is called a clock-schedule.

We make the basic assumption that a circuit works correctly if the following two types of constraint are satisfied for each register pair with signal propagation [5],[6]:

\[
\begin{align*}
\text{Setup Const.:} & \quad s(u) - s(v) \leq \beta_{u,v} T - d_{\text{max}}(u,v) \\
\text{Hold Const.:} & \quad s(v) - s(u) \leq d_{\text{min}}(u,v) - \alpha_{u,v} T
\end{align*}
\]

where \( T \) is the clock period, \( d_{\text{max}}(u,v) \) (\( d_{\text{min}}(u,v) \)) is the maximum (minimum) propagation delay from register \( u \) to register \( v \) along a combinatorial circuit, and \( \beta_{u,v} \) and \( \alpha_{u,v} \) are given integer constants (\( \beta_{u,v} > \alpha_{u,v} \geq 0 \)). Note that for a pair of registers with a single-clock cycle path, \( \beta_{u,v} \) and \( \alpha_{u,v} \) are given by 1 and 0, respectively. This formulation is sufficiently general to deal with multi-clock cycle paths and multi-clocks that have different clock periods.

If \( \alpha_{u,v} \) is 0 for every pair, the feasible clock period has no upper bound, i.e. if the clock period \( T \) is feasible then any \( T' \) (where \( T' \geq T \)) is feasible. However, the feasible clock period is bounded above if \( \alpha_{u,v} \) is not 0 for some pair \((u,v)\).

From the above constraints, when the clock schedule and the signal propagation delay are known, the minimum and maximum feasible clock period, \( T_{\text{min}} \) and \( T_{\text{max}} \), can be determined from the setup and hold constraints, respectively.

If the clock timing is not fixed, then \( T_{\text{min}} \) and \( T_{\text{max}} \) depend on each other. \( T_{\text{min}} \) has to be minimized under the constraint that the circuit works correctly throughout a certain clock-period range, in order for the circuit to tolerate clock jitter and delay variation. The above constraints become:

\[
\begin{align*}
\text{Setup Const.:} & \quad s(u) - s(v) \leq \beta_{u,v} T_{\text{min}} - d_{\text{max}}(u,v) \\
\text{Hold Const.:} & \quad s(v) - s(u) \leq d_{\text{min}}(u,v) - \alpha_{u,v} T_{\text{min}}
\end{align*}
\]

where \( \delta \) is the clock-period range, i.e. \( \delta = T_{\text{max}} - T_{\text{min}} \). Therefore if \( \delta \) is given, then, by using the above constraints, clock timings can be determined so that the circuit works correctly for a clock period between \( T_{\text{min}} \) and \( T_{\text{min}} + \delta \). In the following, our target is to minimize \( T_{\text{min}} \) under the constraint that the circuit is feasible throughout the given clock-period range \( \delta \).

These constraints are represented by the constraint graph \( G(V,E) \) of the circuit, which is defined as follows: a vertex \( v \in V \) corresponds to a register; a directed edge \( (u,v) \in E \) corresponds to either type of constraint; an edge \( (u,v) \) corresponding to the setup (hold) constraint is called a \( Z \)-edge (\( D \)-edge), and the weight \( w(u,v) \) of \( (u,v) \) is \( \beta_{u,v} T - d_{\text{max}}(v,u)(d_{\text{min}}(u,v) - \alpha_{u,v} \delta - \alpha_{u,v} T) \). A \( Z \)-edge (\( D \)-edge) corresponding to a single-clock cycle path, i.e. \( \beta_{u,v} = 1 \) and \( \alpha_{u,v} = 0 \), is called single \( Z \)-edge (single \( D \)-edge). While, a \( Z \)-edge (\( D \)-edge) corresponding to a multi-clock cycle path is called multi \( Z \)-edge (multi \( D \)-edge).

The constraint graph \( G \) corresponding to clock period \( T \) and clock-period range \( \delta \) is denoted by \( G_{\delta}(t) \). We may denote \( G_{\delta}(t) \) as \( G(t) \) if no confusions occur.

Let the weight of a directed cycle in a constraint graph be the sum of weights of edges on the directed cycle. We refer to a directed cycle whose weight is negative as negative cycle. It is known that a circuit can works correctly if and only if there is no negative cycle in the constraint graph of the circuit [7].

3. Delay and Register Insertion Effects on the Constraint Graph

There is a negative cycle in the constraint graph of a circuit if the circuit does not work correctly for any clock schedule. To eliminate the negative cycle, a certain amount of weight needs to be added to edges on the negative cycle by an operation. However, it is important to make sure that the operation will not create other negative cycles. This section discusses about the changes of the weight and topology of the constraint graph when delay element and intermediate register are removed and inserted.

In our proposed algorithm, first, all intermediate registers are removed. When an intermediate register is removed, the weight of \( D \)-edge and \( Z \)-edge of the constraint graph are changed, as well as its topology. As for example, when an intermediate register \( v \) is removed, single-clock cycle paths \((u,v)\) and \((v,w)\) whose total minimum (maximum) delay is \( \sigma' \) (\( \sigma \)) become two-clock cycle path \((u,w)\) whose minimum (maximum) delay is \( \sigma - \omega \) (\( \sigma - \omega' \)), where \( \omega \) (\( \omega' \)) are the minimum and maximum delay of the intermediate register \( v \), respectively. In the constraint graph of the circuit, the single \( D \)-edges \((u,v)\) and \((v,w)\) whose total weight is \( \sigma \) are removed and a multi \( D \)-edge \((u,w)\) whose weight is \( \sigma - \omega - T \) is inserted. Similarly, the single \( Z \)-edges \((u,v)\) and \((v,w)\) whose total weight is \( 2T - \sigma' \) are removed and a multi \( Z \)-edge \((u,w)\) whose weight is \( 2T - (\sigma' - \omega') \) is inserted. The weight of the multi \( D \)-edge is smaller than the corresponding original single \( D \)-edges. While the weight of the multi \( Z \)-edge is larger than the corresponding original single \( Z \)-edges.

When the weight and topology of the constraint graph are changed, there are some possibilities that the constraint graph contains a negative cycle thus makes the circuit infeasible. In order to make the circuit feasible, a negative cycle should be eliminated by increasing the total weight of the cycle.

Total weight of a cycle can be increased by increasing the weight of \( D \)-edge or \( Z \)-edge. The weight of \( D \)-edge can be increased by increasing the minimum delay between registers or by recovering back an intermediate register to the original location. While, the weight of \( Z \)-edge can be increased by reducing the maximum delay between registers.

In this paper, we choose to increase the weight of \( D \)-
edge to increase the total weight of a cycle because it is more easy to do it.

As for example, let consider a multi D-edge \((u, w)\) whose weight is \(\sigma - \omega - T\) corresponding to a multi-clock cycle path \((u, v, w)\). When a delay element whose minimum delay is \(\lambda\) is inserted to the multi-clock cycle path \((u, v, w)\), the weight of the multi D-edge \((u, w)\) increases by \(\lambda\) from \(\sigma - \omega - T\) to \(\sigma - \omega - T + \lambda\). While, when an intermediate register \(v\) whose minimum delay is \(\omega\) is recovered back to the multi-clock cycle path \((u, v, w)\), the total weight of the single D-edges \((u, v)\) and \((v, w)\) is \(\omega + T\) larger than the weight of the corresponding multi D-edge \((u, w)\).

A negative cycle becomes non-negative if a certain amount of weight is added. For a cycle \(C\) in constraint graph, let weight demand \(w_d(C)\) be the weight needed to be added in order to make the total weight \(w(C)\) of \(C\) becomes 0, that is \(w_d(C) = 0\) if \(w(C) \geq 0\), and \(w_d(C) = -w(C)\) if \(w(C) < 0\). As for example, if \(w(C) = -4\), then \(w_d(C) = 4\).

The minimum delay of a multi-clock cycle path can be increased by inserting some delay elements to the path. However, inserting some delay elements may increases the maximum delay of the path thus reduces the weight of corresponding multi Z-edge. Especially, if a negative cycle consists of one multi D-edge \((u, w)\) and one multi Z-edge \((w, u)\) corresponding to the multi-clock cycle path \((u, v, w)\), the cycle remains negative even if \(w_d(C)\) is added to point \(v\). The weight of the cycle remains negative since the reduction of the weight of multi Z-edge \((w, u)\) is larger than or equal to the increase of the weight of multi D-edge \((u, w)\).

Note that, the adding of the weight to multi D-edge may reduce the weight of multi Z-edge thus reduces the weight of other cycles and makes the other cycles become negative, that we do not consider here. In our experiments, it is sufficient to make sure a negative cycle does not consist of one multi D-edge \((u, w)\) and one multi Z-edge \((w, u)\) corresponding to the multi-clock cycle path \((u, v, w)\).

4. Reduction on the Area of Pipelined Circuits

In this paper we consider a problem on how to reduce the area of a pipelined circuit, subject to the minimum feasible clock period is lower than or equal to the original circuit’s clock period \(T_{comp}\) and the obtained pipelined circuit works correctly at given target clock-period range.

4.1 Reduction on the Number of Intermediate Registers

In [3], they have proposed an algorithm that reduces the area of a pipelined circuit by reducing the number of intermediate registers.

In their proposed algorithm, all intermediate registers of the pipelined circuit are initially removed. Then the obtained circuit is checked whether it works correctly at \(T_{comp}\) by the algorithm shown in [4]. When an intermediate register is removed, the obtained circuit may not work correctly at \(T_{comp}\). If the circuit does not work correctly at \(T_{comp}\), a negative cycle is found in the corresponding constraint graph and it contains a multi D-edge. In [3], the found negative cycle is eliminated by repeatedly recovering back an intermediate register to the corresponding multi D-edge until the circuit works correctly at \(T_{comp}\).

4.2 Replacement of Registers with Delay Elements

We propose a new algorithm that reduces the area of a pipelined circuit by the replacement of an intermediate register with delay elements whose area is smaller than the register.

In the proposed algorithm, all intermediate registers of the pipelined circuit are initially removed. Then the obtained circuit is checked whether it works correctly at \(T_{comp}\) by the algorithm shown in [4]. When an intermediate register is removed, the obtained circuit may not work correctly at \(T_{comp}\). If the circuit does not work correctly at \(T_{comp}\), a negative cycle is found in the constraint graph and it contains a multi D-edge. In order to eliminate the found negative cycle, it is necessary to increase the weight of the found negative cycle. The weight of the found negative cycle can be increased by increasing the weight of multi D-edge. The weight of multi D-edge can be increased either by recovering back the intermediate register or inserting some delay elements such as buffers. In [3], the found negative cycle is eliminated by recovering back an intermediate register only. Our algorithm repeatedly recovers back an intermediate register or inserts buffer until the circuit works correctly at given target clock-period range. For each iteration, the algorithm chooses the operation that has small area cost.

In our algorithm, the replacement of intermediate register with buffers is implemented by inserting the buffers to the location where the removed intermediate register is located. We choose only one type of buffer. The limit of the number of buffers that can be inserted is set at a certain number, so that the area of the total inserted buffers is smaller than the area of a register. Let \(l\) be the limit of the number of buffers that can be inserted to each location. If the area of a register is \(m\) times larger than the area of a buffer, then \(l = m - 1\). When a negative cycle \(C\) is found, if \(C\) consists of one multi D-edge and one multi Z-edge corresponding to the same multi-clock cycle path, an intermediate register is inserted. Otherwise, based on the minimum delay of the buffer and \(w_d(C)\), the number of buffers needed to be inserted is computed. Let \(n\) be the number of buffers needed to be inserted. If \(n > l\), an intermediate register is inserted. If \(n \leq l\), \(n\) buffers are inserted.

If there is more than one multi D-edge in the negative cycle, our algorithm chooses a multi D-edge corresponding to a multi-clock cycle path in which the number of allowed clock cycles to complete the data transmission is large since its weight tends to be small and the probability that it is contained in many negative cycles is high. If the chosen multi-D-edge corresponds to a path with three clock cycles or more, then more than one register was removed from the corresponding path. In such cases, the corresponding path has more than one candidate location to recover back a reg-
Inputs: Constraint graph $G^0$ of a pipelined circuit with intermediate registers, the minimum clock period $T_{comp}$ of the original circuit at zero-clock skew framework and clock-period range $\delta$.

Outputs: Constraint graph $G^m$ of the obtained circuit, clock timing $s(u)$ and minimum clock period $T_{min}(G^m)$.

Step 0: Remove all of the intermediate registers. Let $G_0$ be the constraint graph of the obtained circuit.

Step 1: Check whether there are any negative cycles in the constraint graph $G_0(T_{comp})$ or not by the algorithm shown in [4]. If a negative cycle $C$ is found, then based on the area cost, insert an intermediate register or buffers to the multi-clock cycle path which corresponds to a multi D-edge contained in $C$, and update $G_0$. Repeat this step until there are no negative cycles in the constraint graph $G_0(T_{comp})$.

Step 2: Let $G^m$ be the constraint graph of the obtained circuit and output $G^m$, $T_{min}(G^m)$ and the clock timing for all registers.

![Fig. 1](image1)

Fig. 1 An algorithm to reduce the area of pipelined circuits by the replacement of registers with delay elements.

ister or to insert buffers. If a register is recovered back to the middle location among candidates, then timing constraints will be relaxed much since the number of multi-D-edges corresponding to a multi-clock cycle path in which the number of allowed clock cycles to complete the data transmission is large and the number of multi-D-edges are expected to be reduced much. Therefore, our algorithm recovers back registers or inserts buffers to the middle location among candidates. That is, if the path has $n$ locations then $\lceil \frac{n}{2} \rceil$-th location is selected.

Our algorithm is heuristic. A different circuit might be obtained depending on the found negative cycle. Furthermore, if there are more than one multi-D-edge with the same clock cycle path in the negative cycle, our algorithm chooses one of them randomly. Thus, a different circuit might be obtained depending on the chosen multi-D-edge.

The details of our proposed algorithm is shown in Fig. 1. In Fig. 1, $G^m$ is the constraint graph corresponding to the pipelined circuit with all intermediate registers, while $G^m$ is the constraint graph corresponding to the circuit obtained by the algorithm. Note that, in Step 1 of our proposed algorithm, if there is no more multi-D-edge, the algorithm is stopped and outputs the input constraint graph.

4.3 Example

To explain the behavior of the algorithm, we apply the algorithm to the pipelined circuit shown in Fig. 2. In this example, our target clock-period range $\delta$ is 3 and the timing of each register is scheduled. Parameters are set as follows: setup and hold time for registers are 0; the minimum and maximum delay of the intermediate registers are 4 and 8, respectively; the minimum and maximum delay of the buffers are 1 and 2, respectively; the size of an intermediate register is 4 times larger than the size of a buffer, that is, the limit number of buffers that can be inserted is 3. For the original circuit with zero clock-skew, the minimum feasible clock period $T_{comp}$ is 12. The circuit after removing the intermediate registers $v_1$, $v_2$, $v_3$ is shown in Fig. 3. $G_0^m$ is the constraint graph of the obtained circuit.

![Fig. 2](image2)

Fig. 2 Pipelined circuit with intermediate registers and the corresponding constraint graph $G^m$, $T_{comp}$ = 12.

![Fig. 3](image3)

Fig. 3 Pipelined circuit after removing all intermediate registers and the corresponding constraint graph $G_0^m$. Found negative cycle $C^0 = (u_1, w_1, u_1)$ in $G_0^m(12)$.

![Fig. 4](image4)

Fig. 4 Pipelined circuit after inserting the intermediate register $v_1$ and the corresponding constraint graph $G_1^m$. Found negative cycle $C^1 = (u_2, w_2, v_1, u_1)$ in $G_1^m(12)$.

![Fig. 5](image5)

Fig. 5 Pipelined circuit after inserting the intermediate register $v_1$ and 2 buffers at $v_2$, and the corresponding constraint graph $G_2^m$. $T_{min}(G_2^m) = 12$.

Negative cycle $C^0 = (u_1, w_1, u_1)$ where $w(t(C^0)) = -1$ is found in the constraint graph $G_0^m(12)$. Since, the multi D-edge $(u_1, w_1)$ and the multi Z-edge $(u_1, u_1)$ correspond to the multi-clock cycle path $(u_1, v_1, u_1)$, an intermediate register is inserted to point $v_1$. The circuit after inserting the intermediate register to point $v_1$ is shown in Fig. 4. $G_1^m$ is the constraint graph of the obtained circuit. Negative cycle $C^1 = (u_2, w_2, v_1, u_2)$ where $w(t(C^1)) = -2$ is found in the constraint graph $G_1^m(12)$. Since $w(t(C^1)) = 2$, the number of buffers that has to be inserted is 2, which is smaller than the limit number of the buffers that can be inserted. Therefore, 2 buffers are inserted to point $v_2$ between multi-clock cycle path $(u_2, v_2, u_2)$ which corresponds to multi D-edge $(u_2, v_2)$ in $C^1$. The circuit after inserting the buffers is shown in Fig. 5. $G_2^m$ is the constraint graph of the obtained circuit. There are no negative cycles in the constraint graph.
G^2(12), so the algorithm stops and outputs the circuit and clock timings as shown in Fig. 5. Note that in Fig. 5, the clock timings are computed based on the constraint graph G^2_0. The output circuit works correctly between 12 to 15, while the original circuit works correctly between 12 to ∞.

5. Experiments

The proposed algorithm was written in C++ and implemented on a Pentium 4 (CPU 3 GHz, memory 513764 kb). Since there are no benchmark examples of pipelined circuits, four simple examples, briefly described below, were constructed for our experiments.

- 2-stages 6bit_mul: A 2-stages multiplier that multiplies two 6-bit numbers. The first stage uses a carry-save adder with Wallace tree structure [8] and the second stage uses a ripple-carry adder.
- 2-stages 16bit_mul: A 2-stages multiplier that multiplies two 16-bit numbers. The first stage uses a carry-save adder with Wallace tree structure [8] and the second stage uses a carry-look-ahead adder.
- 2-stages 32bit_mul: A 2-stages multiplier that multiplies two 32-bit numbers. The first stage uses a carry-save adder with Wallace tree structure [8] and the second stage uses a brent-kung adder [9].
- 3-stages 32bit_mul: A 3-stages multiplier that multiplies two 32-bit numbers. The first and second stages use a carry-save adder with Wallace tree structure [8] and the third stage uses a brent-kung adder [9].

The statistics of the circuits are shown in Table 1. The ROHM 0.35 µm process library was used for these experiments. In the library, the area of a register is four times larger than the area of a buffer. Therefore, in our experiments the limit number of the buffer that can be inserted at a location where the removed intermediate register was located is 3. The timing of each I/O pin was scheduled as well as the timing for each register.

Table 1: Statistics of multiplier.

<table>
<thead>
<tr>
<th>Circuit delay [ps]</th>
<th>Total</th>
<th>1st stage</th>
<th>2nd stage</th>
<th>3rd stage</th>
</tr>
</thead>
<tbody>
<tr>
<td>circuit</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>2-stages 6bit_mul</td>
<td>42</td>
<td>884</td>
<td>3691</td>
<td>-</td>
</tr>
<tr>
<td>2-stages 16bit_mul</td>
<td>120</td>
<td>757</td>
<td>5075</td>
<td>-</td>
</tr>
<tr>
<td>2-stages 32bit_mul</td>
<td>245</td>
<td>1543</td>
<td>8533</td>
<td>-</td>
</tr>
<tr>
<td>3-stages 32bit_mul</td>
<td>433</td>
<td>1543</td>
<td>6806</td>
<td>408</td>
</tr>
</tbody>
</table>

The experiments were conducted on a few target of clock periods with clock period range 400 [ps]. The graph shows that there is trade off between the minimum feasible clock period and the total area of the buffers and intermediate registers that had been inserted. Note that the area of a FF is 4 times larger than the area of a buffer. “Buff. + FF (%)” is the percentage of the total area of the buffers and intermediate registers compared with the total area of intermediate registers in the original circuit. “Time[s]” is the computation time of the respective algorithm.

From the results shown in Table 2, 19 out of 24 types of experimental conditions, the area of the circuits obtained by our proposed algorithm is reduced compared with the area of the circuits obtained by register reduction algorithm in [3]. It shows the effectiveness of delay balancing on reducing the area of pipelined circuits.

The relation between the minimum feasible clock period and the total area of the buffers and intermediate registers of the 2-stages 16 bit multiplier is shown in Fig. 6. In the graph, the label “Proposed Algorithm” indicates results using the proposed algorithm for insertion of the intermediate registers and buffers, while “Register Reduction Algorithm” label indicates results when the algorithm in [3] is applied. The experiments were conducted on a few target of clock periods with clock period range 400 [ps].

The graph shows that there is trade off between the minimum feasible clock period and the size of the circuit obtained by both algorithms. However, with almost the same minimum feasible clock period, our proposed algorithm obtains a circuit with smaller area.

6. Conclusion

It has been shown that the area of the circuit obtained by our proposed algorithm is smaller than the area of the circuit obtained by the algorithm shown in [3] in most cases. The size of the clock tree will be reduced when the number of intermediate registers is reduced.

For the future work, a clock tree synthesis that can realize our target clock schedule based on the clustering based algorithm proposed in [10] is urgent.

The proposed algorithm inserts delay elements only at
the location where the removed intermediate register is located. We believe that if delay elements can be inserted at any location, the circuit area can be further reduced. This is also a topic for future investigation.

Acknowledgements

This work is supported by VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with Synopsys, Inc., Cadence Design Systems, Inc., and Rohm Corporation.

References


d  

Table 2 Experimental results.

<table>
<thead>
<tr>
<th>Circuit</th>
<th>2-stages</th>
<th>16bit</th>
<th>32bit</th>
<th>3-stages</th>
</tr>
</thead>
<tbody>
<tr>
<td>Buffer</td>
<td>FF</td>
<td>FF</td>
<td>FF</td>
<td>FF</td>
</tr>
<tr>
<td>2-stages</td>
<td>0</td>
<td>18</td>
<td>72</td>
<td>100</td>
</tr>
<tr>
<td>16bit</td>
<td>0</td>
<td>34</td>
<td>9</td>
<td>13</td>
</tr>
<tr>
<td>32bit</td>
<td>0</td>
<td>44</td>
<td>19</td>
<td>20</td>
</tr>
<tr>
<td>3-stages</td>
<td>0</td>
<td>188</td>
<td>1220</td>
<td>100</td>
</tr>
</tbody>
</table>

The proposed Algorithm and Register Reduction Algorithm in [3] are compared in Table 2.


Bakhtiar Affendi Rosdi received his B.E., and M.E. degrees in electrical and electronic engineering from Tokyo Institute of Technology, Tokyo, Japan, in 1999 and 2004, respectively. He is currently a D.E. student of Department of Communications and Integrated Systems in Tokyo Institute of Technology. His research interests are in VLSI design automation and combinational algorithms.

Atsushi Takahashi received his B.E., M.E., and D.E. degrees in electrical and electronic engineering from Tokyo Institute of Technology, Tokyo, Japan, in 1989, 1991, and 1996, respectively. He had been with the Tokyo Institute of Technology as a research associate from 1991 to 1997 and has been an associate professor since 1997. He visited University of California, Los Angeles, U.S.A., as a visiting scholar from 2001 to 2002. He is currently with Department of Communications and Integrated Systems, Graduate School of Science and Engineering, Tokyo Institute of Technology. His research interests are in VLSI layout design and combinational algorithms. He is a member of IEEE and IPSJ.