<table>
<thead>
<tr>
<th>Title</th>
<th>Clock Period Minimization of Semi-Synchronous Circuits by Gate-Level Delay Insertion</th>
</tr>
</thead>
<tbody>
<tr>
<td>Authors</td>
<td>Tomoyuki Yoda, Atsushi Takahashi</td>
</tr>
<tr>
<td>Citation</td>
<td>IEICE Trans. Fundamentals, Vol. E82-A, No. 11, pp. 2383-2389</td>
</tr>
<tr>
<td>Pub. date</td>
<td>1999, 11</td>
</tr>
<tr>
<td>URL</td>
<td><a href="http://search.ieice.org/">http://search.ieice.org/</a></td>
</tr>
<tr>
<td>Copyright</td>
<td>(c) 1999 Institute of Electronics, Information and Communication Engineers</td>
</tr>
</tbody>
</table>
Clock Period Minimization of Semi-Synchronous Circuits by Gate-Level Delay Insertion*

Tomoyuki YODA†, Nonmember and Atsushi TAKAHASHI†, Member

SUMMARY A semi-synchronous circuit is a circuit in which every register is ticked by a clock periodically, but not necessarily simultaneously. In a semi-synchronous circuit, the minimum delay between registers may be critical with respect to the clock period of the circuit, while it does not affect the clock period of an ordinary synchronous circuit. In this paper, we discuss a delay insertion method which makes such a semi-synchronous circuit faster. The maximum delay-to-register ratio over the cycles in the circuit gives a lower bound of the clock period. We show that this bound is achieved in the semi-synchronous framework by the proposing gate-level delay insertion method.

Key words: delay insertion, clock period minimization, semi-synchronous circuit

1. Introduction

Semi-synchronous circuits are expected to achieve a high-performance by removing the constraint of complete-synchronous circuits that every register is ticked by a clock simultaneously. Among various objectives in the synthesis of high-performance circuits, the clock period minimization is the primal subject.

For given signal delays between registers, it is known that the minimum clock period in the semi-synchronous framework is determined in polynomial time [1], [2], [6]. To achieve this clock period, each register should be ticked by a clock at its own due clock input timing. A clock-tree synthesis algorithm that realizes a due clock input timing for each register was proposed in [5]. A clock-driven layout methodology that minimizes both the clock period and the clock-tree length was proposed in [7].

The purpose of these studies are in improvement of the performance of a given circuit in the semi-synchronous framework. Although the performance of a circuit is improved more or less by these methods, a circuit should be synthesized taking account of the effect of the semi-synchronous framework to make the best use of it. In the complete-synchronous framework, any increase of the minimum delay between registers is not considered since it does not lead to the clock period minimization, while it may reduce the clock period in the semi-synchronous framework.

In this paper, we discuss a delay insertion method which makes a semi-synchronous circuit faster. The maximum delay-to-register ratio (let it be $T_B$) over the cycles in the circuit gives a lower bound of the clock period. $T_B$ may change when the delay of an element or the number of registers in a cycle is changed. $T_B$ of a given circuit cannot be reduced unless delays of elements are reduced or the number of registers in a cycle is increased. However it is practical to consider that the delay of each element has already been minimized and the circuit topology optimized in the conventional circuit design. Thus we assume that this $T_B$ is invariant.

Therefore, the circuit is considered to have a possibility to be made faster only when the current clock period is larger than $T_B$. Our problem is to achieve the lower bound in the semi-synchronous framework by gate-level delay insertion. We show that the proposing gate-level delay insertion method achieves the lower bound if the delay of each element in the circuit is unique (that is, max delay = min delay for each element) and the clock input timing of each register is controlled as we design.

2. Preliminaries

In this paper, we consider a circuit with a single clock consisting of registers and gates, and wires connecting them. Both registers and gates are referred to elements. A circuit is represented by the circuit graph $G$ where a vertex $v \in V(G)$ represents an element and a directed edge $(u,v) \in E(G)$ does the signal propagation from the output of element $u$ to the input of element $v$ along the wire. The delay of an edge is the sum of the delay due to the corresponding wire and the delay due to the corresponding end element. We assume that each wire delay and element delay is unique. The circuit graph of the circuit given in Fig. 1 is shown in Fig. 2. The clock distribution network of the circuit is not depicted.

Let $V_r(G) = \{r_1,r_2,\ldots,r_n\} \subset V(G)$ be the set of registers. A register-path from register $r_i$ to register $r_j$ in $G$ is a directed path from $r_i$ to $r_j$ without other registers. Let $E_r(G)$ be the set of ordered register-pairs $(r_i,r_j)$ such that there is a register-path from $r_i$ to $r_j$. The delay of a register-path is unique since the delay of each element is unique. While the delay between registers is not unique in general when there are various paths between these registers. Let $d_{\min}(r_i,r_j)$
A semi-synchronous circuit (minimum clock period = 9).

3. A Lower Bound of the Clock Period

Most of the following facts are known in [1]–[4], [6], but given for completeness.

**Definition 1** (Delay-to-register ratio): The delay-to-register ratio of a directed cycle (or closed-walk) \( L \) in the circuit graph \( G \) is defined as the sum of edge delays over the number of registers in \( L \). Let \( T_B(G) \) be the maximum delay-to-register ratio of those of directed cycles.

Note that \( T_B(G) \) is equal to the maximum delay-to-register ratio of those of directed closed walks. Clearly, the circuit cannot work with the clock period less than the delay-to-register ratio of any cycle in \( G \). So, \( T_B(G) \) gives a “lower Bound” of the clock period which is our target to be realized.

In the following, we discuss a lower bound of the clock period of a circuit in the semi-synchronous framework and that in the complete-synchronous framework.

### 3.1 Semi-Synchronous Circuits

In semi-synchronous circuits, the clock input timing of a register may be different from other registers.

The clock-timing \( s(r_i) \) of register \( r_i \) is defined as the difference in clock arrival time between \( r_i \) and an arbitrary chosen reference register. For example, in Fig. 1, \( r_2 \) is chosen as the reference register, and clock-timings of \( r_1, r_2, r_3, \) and \( r_4 \) are \( 3, 0, 7, \) and \( 6 \), respectively.

We assume the framework that a circuit works correctly with clock period \( T \) if the following two types of constraints are satisfied [2].

**No-Double-Clocking Constraints**

\[
s(r_j) - s(r_i) \leq d_{\text{min}}(r_i, r_j) \quad \forall (r_i, r_j) \in E_r(G)
\]

**No-Zero-Clocking Constraints**

\[
s(r_i) - s(r_j) \leq T - d_{\text{max}}(r_i, r_j) \quad \forall (r_i, r_j) \in E_r(G)
\]

To calculate the minimum clock period in the semi-synchronous framework, we define the constraint graph \( H(G, T) \) as follows. A vertex corresponds to a register in \( V_r(G) \), and an edge corresponds to either type of constraints: an edge from \( r_i \) to \( r_j \) with weight \( d_{\text{min}}(r_i, r_j) \), called the D-edge, corresponds to the no-double clocking constraint on a minimum register-path from \( r_i \) to \( r_j \) in \( G \); an edge from \( r_i \) to \( r_j \) with weight \( T - d_{\text{max}}(r_i, r_j) \), called the Z-edge, corresponds to the no-zero clocking constraint on a maximum register-path from \( r_i \) to \( r_j \) in \( G \). The sets of D-edges and Z-edges are denoted by \( E_d(G) \) and \( E_z(G) \), respectively.

The constraint graph \( H(G, T) \) for the circuit graph given in Fig. 2 is shown in Fig. 3. For example, the D-edge from \( r_2 \) to \( r_4 \) with weight 6 corresponds to the minimum register-path from \( r_2 \) to \( r_4 \), and the Z-edge from \( r_4 \) to \( r_2 \) with weight \( T - 8 \) corresponds to the maximum register-path from \( r_2 \) to \( r_4 \). Notice that the direction of Z-edge is opposite to the direction of the signal propagation.

The following lemma is well known about the clock period of semi-synchronous circuits.

**Lemma 1** ([6]): A circuit \( G \) works with clock period \( T \) in semi-synchronous framework if and only if the constraint graph \( H(G, T) \) contains no negative weight directed cycle.

A negative weight directed cycle is simply called “negative cycle” in the following.

Note that if the circuit works with clock period \( T \), then the circuit works with \( T' \) such that \( T' \geq T \). Let \( T_S(G) \) be the minimum clock period of the circuit \( G \) in the semi-synchronous framework.

**Theorem 1** ([6]): \( T_S(G) \) is the minimum \( T \) such that there is no negative cycle in the constraint graph \( H(G, T) \).
We can determine the minimum clock period $T_S(G)$ and the clock-timing of each register in time polynomial in the number of vertices and edges in $H(G, T)$ [1], [2], [6].

3.2 Complete-Synchronous Circuits

In complete-synchronous circuits, above two types of constraints must be satisfied as well. However no-double-clocking constraints in the complete-synchronous framework could be ignored, as long as a complete-synchronous circuit has the premise that a clock ticks all the registers simultaneously, that is, $s(r_i) = s(r_j) = 0$ and $d_{\min}(r_i, r_j) \geq 0$ are considered to hold.

The maximum delay of any path between registers must be smaller than the clock period. Thus, a lower bound of the clock period of a complete-synchronous circuit is given as $\max_{E(G)} \{ d_{\max}(r_i, r_j) \}$. To reduce this value, an idea of retiming relocates registers of the circuit while preserving its functionality. It does not change the delay-to-register ratio of any cycle. It is shown that the lower bound $T_S(G)$ of the clock period is achieved by retiming if arbitrary amount of retiming is allowed [4]. It is counted as a merit of complete-synchronous circuits. A motivation of this paper is to show that the same performance is achievable by delay insertion in semi-synchronous framework.

The value $T_B(G)$ can be calculated in polynomial time by applying the result in [3] to the constraint graph $H(G, T)$.

**Theorem 2:** $T_B(G)$ is the minimum $T$ such that there is no negative cycle consisting of only Z-edges in the constraint graph $H(G, T)$.

For example, in Fig. 3, $T_S(G) = 9$ and $T_B(G) = 7$. It is easy to see the following corollary from Theorems 1 and 2.

**Corollary 1:** $T_S(G) \geq T_B(G)$

Notice that no circuit $G$ works with clock period less than $T_B(G)$, even if retiming techniques or semi-synchronous techniques are applied, since $T_B(G)$ is invariant in these techniques.

4. Delay Insertion Effect at Gate-Level

The delay insertion corresponds to the increase of the edge delay in the circuit graph $G$. In this section, we will show that a circuit $G'$ which works with clock period $T_B(G)$ can be obtained from $G$ by delay insertion.

In order to speed up the circuit, more precisely to eliminate all the negative cycles in $H(G', T_B(G))$, we should increase the edge weight in the constraint graph. In the constraint graph, there are two types of edges. The increase of the weight of a Z-edge corresponds to the reduction of the maximum delay, which is out of our consideration by the reason mentioned before. The increase of the weight of a D-edge corresponds to the increase of the minimum delay.

For simplicity, the constraint graph $H(G', T_B(G))$ is denoted by $H(G')$ for any $G'$ obtained from $G$ by delay insertion. If $T_S(G') > T_B(G)$, there are negative cycles in $H(G')$ (Fig. 4). Moreover for each negative cycle in $H(G')$ there exists a D-edge in the cycle.

**Lemma 2:** If $T_S(G') > T_B(G)$, every negative cycle in $H(G')$ contains at least one D-edge.

**Proof:** By Theorem 2, there is no negative cycle consisting of only Z-edges in $H(G')$, so all negative cycles in $H(G')$ have at least one D-edge.

Because every negative cycle in $H(G')$ has a D-edge, we may improve the clock period by delay insertion to the circuit corresponding to the D-edge.

To increase the weight of a D-edge in $H(G)$, the delay of an edge in the corresponding minimum register-path in $G$ should be increased. Accordingly, the path delay of all paths through the edge in $G$ increases, and the weight of other D-edges and Z-edges in $H(G)$ may change.

For example, let us consider a part of a circuit shown in Fig. 5. To increase the delay from $r_a$ to $r_c$, the delay either from $r_a$ to $g$ or from $g$ to $r_c$ should be increased. In either case, the delay from $r_a$ to $r_d$ or from $r_b$ to $r_c$ increases accordingly. Thus the delay between registers does not always change independently. To understand the effect of the delay insertion exactly, the circuit must not be modeled at register-level but at
The delay insertion may increase the maximum delay. If so, the weight of a Z-edge decreases accordingly and the maximum delay-to-register ratio $T_B(G')$ of the delay inserted circuit $G'$ may become greater than $T_B(G)$. Because $T_B(G)$ is the lower bound of the clock period, if $T_B(G')$ is greater than $T_B(G)$, our objective cannot be achieved. Thus, the delay insertion must not increase the maximum delay-to-register ratio. $T_B(G')$ remains same if the sum of delays of any cycle $L$ in $G'$ is at most $T_B(G)$ times the number of registers in $L$.

We define the delay-slack for each edge that represents the margin within which the delay insertion keeps the maximum delay-to-register ratio.

**Definition 2 (delay-slack):** For a directed cycle (or closed-walk) $L$ in $G'$, the cycle-slack of $L$ in $G'$ is defined as

$$T_B(G) \times N(L) - D(L)$$

where $N(L)$ is the number of registers in $L$ and $D(L)$ the sum of delays of $L$ in $G'$. The delay-slack of an edge $(v_i, v_j)$ in $G'$ is the minimum cycle-slack over all cycles that contain $(v_i, v_j)$. □

Note that the delay-slack of each edge can be calculated in time polynomial in the number of edges in $G$. Clearly, the delay insertion less than or equal to the delay-slack of $(v_i, v_j)$ keeps the maximum delay-to-register ratio.

## 5. Delay Insertion Algorithm

Our algorithm **Delay-Insertion** is described in Fig. 6. It finds a negative cycle $C^H$ in $H(G')$ and inserts delays calling sub-algorithm **Delay-Into-Cycle** until an optimal circuit is obtained. In the following, we will show that **Delay-Insertion** outputs the optimal circuit, and terminates in polynomial time.

**Delay-Into-Cycle**, described in Fig. 7, inserts delays to edges in a minimum register-path in $G'$ corresponding to the D-edge in $C^H$. The amount of inserted delay to each edge which is calculated just before delay insertion is equal to the delay-slack of the edge. We will show that the delay-slack of at least one edge in $G'$ becomes zero whenever **Delay-Into-Cycle** is applied to $G'$. To explain the behavior of **Delay-Insertion**, we apply **Delay-Insertion** to the circuit graph with $T_B(G) = 7$ in Fig. 2. Since the $T_B(G) \neq T_S(G)$, **Delay-Into-Cycle** is applied to the negative cycle in the constraint graph shown in Fig. 4. D-edge $(r_2, r_4)$ in the negative cycle is selected to insert the delay in **Delay-Into-Cycle**. The minimum register-path from $r_2$ to $r_4$ is $(r_2, a, e, f, r_4)$. Since the weight of cycle $(r_2, a, b, r_1, c, d, f, r_4, g, r_3, h, r_2)$ is 28, and the cycle contains four registers, the cycle-slack of the cycle is zero, and the delay-slacks of $(r_2, a)$ and $(f, r_4)$ are zero. $(a, e)$ and $(e, f)$ are contained by only one cycle $(r_2, a, c, f, r_4, g, r_3, h, r_2)$. Since the cycle-slack of the cycle is 11, the delay-slacks of $(a, e)$ and $(e, f)$ are 11. Then no delay is inserted to $(r_2, a)$, and 11 delays are inserted to $(a, e)$. Here, the delay-slack of $(e, f)$ becomes zero, since the cycle-slack of cycle $(r_2, a, c, f, r_4, g, r_3, h, r_2)$ becomes zero. So, no delay is inserted to $(e, f)$. The output of **Delay-Into-Cycle** is shown in Fig. 8(a). Until the lower bound $T_B(G)$ is achieved, **Delay-Into-Cycle** is applied to a negative cycle in the constraint graph. The output of **Delay-Insertion** is shown in Fig. 8(b). The corresponding circuit is shown in Fig. 9.

**Lemma 3:** Whenever **Delay-Into-Cycle** is applied to $G'$, a delay is inserted to at least one edge in $G'$ so
Proof: Let \( C^H \) be a negative cycle in \( H(G') \), and \( G^* \) be the circuit obtained by Delay-Into-Cycle \((G',C^H)\).

If a delay is inserted to any edge, the delay-slacl of the edge becomes zero, since Delay-Into-Cycle inserts delay with the same amount of the delay-slacl. If a minimum register-path in \( G' \) becomes non-minimum in \( G^* \), it means that a delay was inserted to some edge in the register-path. Thus we assume that every minimum register-path in \( G' \) corresponding to D-edge in \( C^H \) is non-negative.

Let us assume that \( C^H \) consists of only one D-edge \( e^d \) from \( r_i \) to \( r_j \) and Z-edges \( e^z_1, e^z_2, \ldots, e^z_m \) (Fig. 10). Let \( P^G_{\min} = (e^z_1, e^z_2, \ldots, e^z_n) \) be a minimum register-path from \( r_i \) to \( r_j \) in \( G' \) corresponding to \( e^d \), and \( P^G_{\max} \) be a path from \( r_i \) to \( r_j \) that consists of maximum register-paths in \( G' \) corresponding to the Z-edges (Fig. 11).

We show that \( P^G_{\min} \) is a maximum register-path in \( G^* \). Assume to the contrary that there exists a maximum register-path \( P^G_{\max} \) from \( r_i \) to \( r_j \) in \( G^* \) such that the weight of \( P^G_{\max} \) is greater than that of \( P^G_{\min} \). For each edge \( e^z_i \) in \( P^G_{\min} \), there exists a path \( P^G \) such that \( P^G_{\min} \) and \( e^z_i \) form a cycle whose cycle-slacl is zero in \( G^* \).

Let \( W^G_{1} \) be the closed walk \((P^G_{\min}, P^G, P^G_{n-1}, \ldots, P^G_{1})\). The cycle-slacl of \( W^G_{1} \) is zero in \( G^* \) since it consists of cycles whose cycle-slacls are zero. Then the cycle-slacl of the walk \((P^G_{\max}, P^G, P^G_{n-1}, \ldots, P^G_{1})\) in \( G^* \) is negative and contradicts the assumption that the cycle-slacl of every cycle is non-negative in \( G^* \). Thus \( P^G_{\min} \) is a maximum register-path in \( G^* \). Since the delay of each register-path is unique, \( d^z_{\max}(r_i, r_j) = d^z_{\min}(r_i, r_j) \) in \( G^* \). So the weight of \( e^z \) is \( T_B(G) - d^z_{\min}(r_i, r_j) \), and the weight of \( e^d \) is \( d^z_{\min}(r_i, r_j) \) in \( H(G^*) \).

Let \( W^H_{1} \) be the closed walk consisting of Z-edges that corresponds to \( W^G_{1} \) (Fig. 10). The weight of \( W^H_{1} \) in \( H(G^*) \) is zero since the cycle sлаcl of \( W^H_{1} \) is zero. Let \( W^H_{2} \) be the closed walk consisting of Z-edges that corresponds to the walk \((P^G_{\max}, P^G, P^G_{n-1}, \ldots, P^G_{1})\). The weight of \( W^H_{2} \) in \( H(G^*) \) is non-negative since the walk consists only of Z-edges.

The cycle \( C^H \) is obtained from \( W^H_{2} \) by adding \( e^d \) and \( e^z \) and deleting \( W^H_{1} \). Thus the weight of cycle \( C^H \) in \( H(G^*) \) is \( w(W^H_{2}) + w(e^d) + w(e^z) - w(W^H_{1}) = w(W^H_{2}) + d^z_{\min}(r_i, r_j) + (T_B(G) - d^z_{\min}(r_i, r_j)) - 0 = w(W^H_{2}) + T_B(G) \) where \( w(X) \) is the weight of \( X \) in \( H(G^*) \). Since \( w(W^H_{2}) \) is non-negative, the weight of \( C^H \) is positive.

It is easy to see that the weight of \( C^H \) is \( w(W^H_{2}) + T_B(G) \times M \) in general where \( M \) is the number of D-edges in \( C^H \).

Therefore, a delay is inserted to some edge in
a minimum register-path corresponding to D-edges in $C^H$.

If the delay of each element is not unique, the delay of a minimum register-path corresponding to a D-edge has a certain range. In such a case, the weight of $C^H$ is $w(W_2^H) + T_B(G) + d_{\min}(r_i, r_j) - d_{\max}(r_i, r_j)$ when the number of D-edges is one. Thus the weight of $C^H$ is not necessarily non-negative and delay may not be inserted to any edge.

When Delay-Insertion terminates, clearly the clock period of the obtained circuit $G'$ in semi-synchronous framework is $T_B(G)$. So, we prove only that Delay-Insertion terminates in polynomial time.

**Theorem 3:** Algorithm Delay-Insertion terminates in time polynomial in the number of edges in $G$.

**Proof:** It is not difficult to see that each step in Delay-Insertion is in polynomial time. The number of repetitions from Step 2 to Step 6 is at most the number of edges in $G$ since the delay-slash of at least one edge becomes zero at each repetition. □

6. Experimental Results

Algorithm Delay-Insertion is applied to benchmark circuits in LGSynth91. In experiments, we assume that each gate has unit delay, and routing delays are zero. The clock period is reduced in 6 out of 24 circuits. The minimum clock periods before delay insertion of the other 18 circuits in semi-synchronous framework are revealed to be equal to the maximum delay-to-register ratio of each circuit. The results of improved circuits are shown in Table 1. Here, the number of gates is denoted by Gate, the maximum delay-to-register ratio by MD, the clock period of the semi-synchronous circuit before delay insertion by Init., the clock period after delay insertion by Fin., the amount of inserted delay by ID, and the computation time (PentiumII 450 MHz) by Time (sec.).

7. Conclusions

We proved that the minimum clock period of a circuit in semi-synchronous framework achieves the maximum delay-to-register ratio by delay insertion under the assumption that the delay of each edge is unique.

As future works, the reduction of the amount of the inserted delay in Delay-Insertion, more practical delay assumption (the delay of each edge is not unique), delay realization method such as detour of routing wire or delay element insertion, and the combination of retiming and delay insertion technique to minimize the area and clock period, should be investigated.

**Acknowledgments**

The authors are grateful to Professor Yoji Kajitani, Tokyo Institute of Technology, for his helpful discussions. This work is part of a project of CAD21 at Tokyo Institute of Technology. This work is partially supported financially by a Grant-in-Aid for Scientific Research from the Ministry of Education, Science and Culture of Japan.

**References**


**Tomoyuki Yoda** received his B.E. degree in electrical and electronic engineering from the Tokyo Institute of Technology, Tokyo, Japan, in 1998. He is currently a M.E. student of electrical and electronic engineering at the Tokyo Institute of Technology. His research interests are in VLSI design automation and combinational algorithms.
Atsushi Takahashi received his B.E., M.E., and D.E. degrees in electrical and electronic engineering from the Tokyo Institute of Technology, Tokyo, Japan, in 1989, 1991, and 1996, respectively. He had been with the Tokyo Institute of Technology as a research associate from 1991 to 1997 and has been an associate professor since 1997 in the Department of Electrical and Electronic Engineering. His research interests are in VLSI layout design and combinational algorithms. He is a member of the Information Processing Society of Japan and IEEE.