# Power Aware IP Block Mapping Schemes for Interconnect Networks

Xinyu Wang<sup>1+</sup> and Zhigang Yu<sup>2\*</sup>

 <sup>1</sup>College of Management Science and Engineering, Dongbei University of Finance and Economics, Dalian, 116025, China
<sup>2</sup>Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
<sup>+</sup>Distribute\_2008@163.com; \*yuzg@live.com

### Abstract

Interconnect network provides communication service for the intellectual property (IP) blocks. Low-power design is an everlasting topic for interconnect networks. This paper investigates the influence of IP block mapping on power, and finds that power-efficient mapping is crucial to the overall system. Traditional random mapping without optimization technique results in high power consumption, especially when the problem scale increases. This paper presents a novel mathematical model for IP block mapping to minimize total power consumption, and proposes the methods to solve the problem with localsolver optimizer. Standard data sets from real applications are used to test the novel methods for the mapping problem. Experimental results show that the proposed methods are competitive, and outperform random mapping scheme in solving the problem. The proposed method with boolean decision variables yields better solutions compared to the other method with integer decision variables in the aspect of convergence speed and solution quality.

*Keywords*: Interconnect network, intellectual property (IP) blocks, power consumption, localsolver optimizer

## **1. Introduction**

In such an era of big data, the rapid growth of data from Internet has posed new opportunities and challenges to both engineers and researchers. It is tough but exciting to extract information from the vast amount of data within tolerable time [1]. Among the various hardware/software platforms for data processing missions, supercomputer with high performance computing, has promised to accommodate and handle the huge volumes of data, which will exploit data parallelism and increase computing efficiency.

Facing the technology bottleneck of "frequency wall", it is difficult to improve the performance of traditional single-core computers by merely increasing the clock rate of transistors in one chip. It is essential to design high-performance low power specific or general-purpose chips, and network on chip is widely accepted as the fundamental ingredient for communications between different intelligent property blocks.

Designing an interconnect network should give overall consideration to several tradeoffs including topology, routing algorithm, flow control mechanism, microarchitecture design of routers, mapping from IP blocks to routers, *etc.* The latter one deals with assigning each task to a specific IP, and finally connecting it to one router on a suitable location. This paper mainly focuses on this issue to minimize the overall communication traffic, and finally reduce the energy consumption and the communication delay.

Given an application, it can be described by a set of tasks between IP blocks. Figure 1 shows a real application of DSP from [14]. Then, we will dress the problem of optimally

mapping the IP blocks onto the communicating routers with the following objective function: to minimize the total power consumption of the overall network. For the DSP in Figure 1, if we select a 6-node ring as the fundamental topology, there are 6! mappings, and different mappings may result in different energy consumption.



Figure 1. A Real Application of DSP

The main contributions of this paper include: 1) the proposal of low power mapping mathematical model; 2) consideration of the effects under various popular topologies; 3) the proposal of two techniques to design decision variable with localsolver programming language. We evaluate the proposed schemes under real applications, and report that the new methods generate very promising results in energy consumption within the tolerable time.

In the rest of the paper, we present related work in Section 2. Section 3 illustrates the mathematical model for low power IP block mapping problem, and also gives the solving methods with localsolver by using boolean decision variable and integer decision variable. Section 4 gives the performance comparisons with real application test cases. Finally, Section 5 concludes the paper.

# 2. Related Work

A lot of previous literatures dedicate to improving performance of the communicating network as well as reducing power consumption induced by traffics. The methods can be mainly divided into three categories: load balancing routing scheme design, high-radix and complex topology design, and low power micro-architecture design of routers.

Xu *et al.*. proposed the shortest path minimum maximum (SPMM) link congestion and non-shortest path minimum maximum (NSPMM) link congestion routing algorithms aiming to minimize traffic load of bottleneck links in mesh networks [2]. Yu *et al.*. proposed a serial virtual channel balanced routing algorithms (VCBR) for torus networks, which efficiently released the hard constraints on virtual channel utilization as many as possible to enhance system performance [3]. The above methods take load balance into account for routing algorithms, which directly reduces resource competition between different traffic flows, and ultimately translates to better performance as well as lower power consumption.

In order to reduce average distance and diameter of the network, high-radix router based network is becoming a preferred choice. Advances in signaling technology has enabled the high-radix routers, which has inspired a lot of new topologies, such as clos [4], dragonfly [5], spidergon donut [6], exchanged crossed cube [8], king topologies [7]. In order to exploit the advantages of mesh network, some derivative topologies are reported, like concentrated mesh [9], express mesh [10] and multi-mapping mesh [11]. These high-radix networks provide shorter paths and more options for message passing from source to destination, which efficiently improves the communicating performance of the network.

Some researchers have focused on architecture design of routers to reduce latency and power consumption within the pipelines inside routers. For example, Lin proposed the power and latency efficient buffered routers and bufferless routers (BR/BLR) mechanism, and described details from routers deployment to routing algorithm [12].

Recently, some researchers have proposed the core mapping problem, and dealt it with intelligent optimization methods. The main goal is to optimally map the cores onto the platform for a specific application, with power consumption over the network getting minimized. Sepulveda *et al.*. proposed a multi-objective adaptive immune algorithm (M2AIA) based on the evolutionary approach to solve the multi-application mapping problem [13]. Lai *et al.*. proposed a suboptimal genetic algorithm based technique to synthesize application-specific topology with system-level floorplan awareness [14]. Paper [15] presented the application-driven mapping algorithms for mesh-based architectures to reduce power consumption. The algorithms are based on genetic algorithm and simulated annealing algorithm. The above methods either solved the problem with synthesize irregular topology, or aimed only on the mesh network while neglecting the impact of other topologies.

## 3. Problem Definition and Mapping Algorithm

Power consumption is a critical issue in interconnect network design, and it varies greatly with topology selection, IP deployment and application tasks. The total power consumption consists of static power (or leakage power) and dynamic power. A lot of previous power models [16, 18] are plugged into a bit-level accurate simulator so that actual network activity triggers specific capacitance calculations and derives dynamic power estimates.

#### 3.1. Power Model Definition and Motivation

This paper aims to minimize the total power consumed by the communicating resources. As reported in [16], power consumption is proportional to the amounts of bits transporting through the fundamental network. For a specific task T between IP i and IP j with C bits traffic, we can used the well accepted power model given in [16] to calculate the power as follows:

$$E_T^{c_i,c_j} = C \cdot \{ (H+1) \cdot E_r + H \cdot E_l \}$$

Suppose that IP block *i* and IP block *j* are mapped to router *s* and router *d*, then *H* in the equations means the hops of path (from router *s* to router *d*) provided by the routing algorithm. When minimal routing algorithm is adopted, *H* represents the minimal distance between *s* and *d*. Besides,  $E_r$  and  $E_1$  represent the power consumed by one router and one link, respectively. Furthermore,  $E_r$  can be extended to  $E_s + E_w + E_b$ , where  $E_s$ ,  $E_w$  and  $E_b$  represent power consumed by switch, wires inside the router fabrics, and buffers. An application scenario usually consists of a set of tasks (*TS*), the total power consumption of the overall network can be determined as follows:

$$E_{Total} = \sum_{T_{i,j} \in TS} E_T^{c_i, c_j}$$
$$E_{Tideal}^{c_i, c_j} = C \cdot \{2 \cdot E_r + E_l\}$$

The ideal power consumption corresponding to task *T* is given by:

That is, IP block i and IP block j are mapped to two adjacent routers, then H is equal is 1. However, due to the constraints under area, radix of routers and deployment of routers, it is impossible to adopt the complete connected architecture. Therefore, it is especially important to map IPs to the "right" routers to achieve optimal power consumption, which motivates the idea of this paper.

## **3.2. Problem Definition**

For a given application, it can be described by a weighted application characterization graph (WAPCG), with vertices representing IP blocks and arcs referring to tasks. For a selected topology, we can describe it with an architecture characterization graph (ARCG), in which each vertex stands for a router (or a location), and each arc represents a physical link.

**Definition 1.** A WAPCG is a weighted directed graph G(V, E), where each vertex  $v_i \in V$  represents one IP block, and each arc  $(v_i, v_j) \in E$  represents a communicating task from  $v_i$  to  $v_j$ . The weight value associated with each arc stands for the amount of the communication in bits per second.

**Definition 2.** An ARCG is also a graph G(R, L), where each vertex  $r_i \in V$  represents a router at a specific location, and each arc  $(r_i, r_j)$  represents a directed physical link from router  $r_i$  to router  $r_j$ .

In this paper, we do not consider concentration on routers. Thus, given the WAPCG and ARCG, they should satisfy the following constraints:

#### size(WAPCG) $\leq$ size(ARCG)

We mainly focus on the three low radix network topologies including ring, mesh and torus, and three mesh/torus based high radix network topologies including king mesh [7], king torus [7] and express mesh. When ring is adopted, the above formulation should be modified to:

#### size(WAPCG) = size(ARCG).

One router cannot accept more than one IP block, moreover, one IP block should be mapped to exactly one router. Based on WAPCG and ARCG, we aim to find a one-to-one mapping relationship to minimize the total power consumption of the overall network. It may happen that some routers have no IP block connected, these routers help to constitute the regular deployment and transmit traffic for other routers as the intermediate nodes.

### 3.3. Analysis of the Problem

The mapping problem from WAPCG to ARCG is an instance of constrained quadratic assignment problem, which is known to be NP-hard [15]. However, it is of great importance to study it as that mapping heavily affects the communication energy consumption. Hu *et al.* proved that the optimized mapping solution always beat the best random-generated solution in the aspect of power consumption [16]. For small scale problem, power consumption from simulated annealing optimizer saves 50% over the random solution. Moreover, the savings increase as the problem scale grows.

Besides, topology selection also greatly impacts power consumption of the mapping. When complete connected topology is adopted, the ideal total power consumption  $(E_{Total,Ideal})$  can be derived with IP blocks arbitrarily mapped to routers. Here, we use power consumption ratio (PCR) to denote the ratio between optimal mapping solution for a specific topology and  $E_{Total,Ideal}$ .

Figure 2 shows the PCRs with the proposed method M1 (given in subsection 3.4) under different topologies. Note that, the solutions provided by M1 may not be optimal. It is observed that PCRs under different topologies varies greatly, and we can derive that topology is also vital and should be taken into account for the IP block mapping problems.

International Journal of Control and Automation Vol.10, No.3 (2017)



#### Figure 2. Influence to PCR with Various Topologies for VOPD and MPEG-4 Applications

#### 3.4 The Localsolver-Based Function Model

LocalSolver combines different optimization techniques to solve the optimization problem, which provides researchers with high-quality solutions in short running time [19]. The key step is to design the function model, while determining decision variable directly impacts efficiency and accuracy of the final solutions.

In this paper, we investigate two methods of selecting decision variables. In the first method, mapping problem is translated to a 0-1 integer programming problem, in which the decision variables constitute as the following  $m \times n$  matrix.

$$X = \begin{pmatrix} x_{1,1}, & x_{1,2}, & \dots & x_{1,n} \\ x_{2,1}, & x_{2,2}, & \dots & x_{2,n} \\ \vdots & \vdots & \vdots & \vdots \\ x_{m,1}, & x_{m,2}, & \dots & x_{m,n} \end{pmatrix}$$

Each element in the matrix is 0 or 1. For  $x_{i,i}$  equal to 1, it means that IP block i is mapped to router *j*. Besides, the one to one mapping constraints are modeled as follows:

$$\sum_{i \in V} x_{i,j} = 1$$
$$\sum_{j \in R} x_{i,j} \le 1$$

In the second method, the mapping problem is translated to a regular integer programming problem, in which the decision variables constitute to the following vector: Y

$$f = (y_1, y_2, \dots, y_n)$$

Each element in the vector is an integer ranging from 1 to n. For a value of  $y_k$ , it means that IP block k is mapped to router  $y_k$ . Besides, the one to one mapping constraint is modeled as  $y_i \neq y_i$  if  $i \neq j$ .

Unless otherwise specified, we use M1 and M2 to represent the two methods mentioned above. We realize M1 and M2 in localsolver programming language, and present the experimental results in the following section.

| No  | Data set | Node | Edge | Data   |
|-----|----------|------|------|--------|
| T1  | DSP      | 6    | 8    | 2400   |
| T2  | VOPD     | 12   | 14   | 3478   |
| Т3  | MPEG-4   | 12   | 26   | 6932   |
| T4  | H263Enc  | 8    | 11   | 403997 |
| T5  | H263Dec  | 11   | 14   | 206114 |
| T6  | MP3Enc   | 7    | 8    | 70679  |
| T7  | MP3Dec   | 5    | 5    | 42500  |
| T8  | T4+T5    | 19   | 25   | 610111 |
| T9  | T4+T6    | 14   | 19   | 474676 |
| T10 | T4+T7    | 13   |      | 446497 |
|     |          |      | 16   |        |
| T11 | T5+T6    | 18   | 22   | 276793 |
| T12 | T6+T7    | 12   | 13   | 113179 |
| T13 | T4+T12   | 19   | 24   | 517176 |
| T14 | MMS      | 25   | 33   | 680790 |

**Table 1. Benchmark Data Sets** 

| Table 2. Communication Cost Comparisons between M1 and M2 for MMS |
|-------------------------------------------------------------------|
| Application                                                       |

| Topologies   | M1     | M2     |
|--------------|--------|--------|
| ring         | 761579 | 786717 |
| mesh         | 688402 | 696637 |
| torus        | 688048 | 693840 |
| king mesh    | 680790 | 680895 |
| king torus   | 680790 | 680790 |
| express mesh | 680790 | 681236 |

### 4. Evaluation and Discussion

In this section, we present the results obtained by running our formulation with localsolver. The proposed methodology is tested on video applications from the previous literature [14] [17]. The benchmarks are shown in Table I. The fourteen applications are labeled from T1 to T14, while the number of nodes, number of edges and the total amount of communication in the WAPCG are listed in the third, fourth and fifth column of the table.

As stated in [14], the power consumption for input and output directions is estimated to be 328 and 65.5 nW/Mbps, respectively. The link power consumption is estimated to be 119.4 nW/Mbps. In our simulation, we adopt the same power characteristic parameters of the communicating components such as routers and physical links as that in [14]. The localsolver programming language supports multiple threads to speed up finding optimal solutions, and we set the number of threads to four in our experiments. Besides, we set the annealing level as the default value 1. All results are obtained on the PC of Intel(R) Xeon(R) CPU E5-2620 with the main frequency of 2.10G Hz, and 2.0 GB memories.

#### 4.1. Impact of Designing Decision Variables in Localsolver

The well-designed decision variables are extremely important for performance of the final results in localsolver. Here in this subsection, we implement the two methods (M1 and M2) in localsolver programming language and evaluate the influence of different decision variables.

As described in subsection 3.4, the method M1 translates the problem as a 0-1 integer programming problem, while the method M2 translates the problem as a standard integer programming problem. When the problem scale is small, the two methods are on a par with each other in the aspects of speed and accuracy. However, when the problem scale grows, method M1 exhibits advantages over M2.

| Table 3. Communication Traffic Comparisons of the Proposed Method M1 |  |  |  |  |  |  |  |
|----------------------------------------------------------------------|--|--|--|--|--|--|--|
| against Random Mapping under Various Topologies in Different         |  |  |  |  |  |  |  |
| Applications                                                         |  |  |  |  |  |  |  |

|     | ring    |        | mesh    |        | torus   |        | king mesh |        | king torus |        | express mesh |        |
|-----|---------|--------|---------|--------|---------|--------|-----------|--------|------------|--------|--------------|--------|
| No  | RM      | M1     | RM      | M1     | RM      | M1     | RM        | M1     | RM         | M1     | RM           | M1     |
| T1  | 3000    | 3000   | 2400    | 2400   | 2400    | 2400   | 2400      | 2400   | 2400       | 2400   | 2400         | 2400   |
| T2  | 6381    | 4104   | 5740    | 3818   | 4795    | 3818   | 4306      | 3478   | 3778       | 3478   | 4083         | 3478   |
| T3  | 12634   | 9169   | 9984    | 7134   | 8301    | 7134   | 7496      | 6932   | 7053       | 6932   | 7623         | 6933   |
| T4  | 471890  | 471890 | 438042  | 404194 | 404194  | 404194 | 403997    | 403997 | 403997     | 403997 | 404194       | 404194 |
| T5  | 254363  | 210626 | 231715  | 206363 | 222487  | 206363 | 215028    | 206114 | 206311     | 206114 | 217702       | 206114 |
| T6  | 77950   | 77950  | 77740   | 77740  | 70679   | 77740  | 70679     | 70679  | 70679      | 70679  | 70679        | 70679  |
| Τ7  | 42500   | 42500  | 42500   | 42500  | 42500   | 42500  | 42500     | 42500  | 42500      | 42500  | 42500        | 42500  |
| T8  | 1321236 | 683524 | 1130518 | 610452 | 871266  | 610452 | 819064    | 610111 | 701556     | 610111 | 802724       | 610111 |
| T9  | 820665  | 549945 | 692584  | 481934 | 622139  | 481934 | 545761    | 474676 | 474873     | 474676 | 551015       | 474676 |
| T10 | 754485  | 514390 | 627440  | 446694 | 540920  | 446694 | 527039    | 446497 | 453864     | 446497 | 487712       | 446497 |
| T11 | 579978  | 289584 | 432549  | 283998 | 387114  | 283854 | 348842    | 276793 | 298709     | 276793 | 336419       | 276793 |
| T12 | 153505  | 120450 | 136070  | 120240 | 121730  | 120240 | 120454    | 113179 | 113179     | 113179 | 121090       | 113179 |
| T13 | 1280467 | 592445 | 995000  | 524434 | 782393  | 524434 | 726303    | 517176 | 590383     | 517176 | 728823       | 517176 |
| T14 | 2371973 | 761579 | 1431976 | 688402 | 1068953 | 688048 | 957639    | 680790 | 796941     | 680790 | 900773       | 680790 |

Table II compares the total amount of communications between M1 and M2 under various topologies in the application of MMS with the time limit of 200 seconds. From the table we can see that, M1 always supplies a superior solution than M2 in the aspect of solution quality.

### 4.2. Impact of Topology Selection

For comparison purposes, we also implement random mapping (RM) by randomly mapping IP blocks onto routers in different topologies. We produce 1,000 solutions and select the one with minimal power consumption.

The amount of communication volume directly affects the power consumption of the overall system. We compare the communication traffic and PCR between M1 and RM under different topologies in different applications. The results are summarized in Table III and IV, respectively. Ring topology is simple and easy to implement router architectures in it. However, among the 14 test cases, only one (T7) could achieve the ideal traffic by using the proposed method M1. As for mesh, two cases could achieve the ideal traffic, and another ten cases consume more power within 5%. The torus topology achieves similar results. For king topologies, the optimal solutions are always given, with the same amount of power consumption as the ideal case. Though the router radix in king mesh, king torus and express mesh are all 9, while the express mesh could not provide an ideal solution in T3 and T4 using M1. For a specific application T1, we select mesh topology because its PCR is 1 and it is the simplest architecture among the six topologies except ring. As for RM in ring topology, the PCR for T14 is 2.4057, which mean that power consumption is more than 2.4 times as that of the ideal case. However, the M1 scheme could provide a mapping solution with PCR of 1.0672. Overall, the topology and mapping schemes are both important to the power consumption of the system. The proposed methods outperform RM in the aspect of power consumption.

## **5.** Conclusions

Due to shrinking technology sizes and increasing transistor counts in hardwired logic, more and more IP blocks are integrated on a single chip. A well-designed placement of IP blocks directly affects the total power consumption and communication amount in the network. Traditional random mapping technique does not take the characters of the applications into account, and results in high power consumption. This paper investigates the IP block mapping problem, and proposes the mathematical model to minimize power consumption in the problem. Using the localsolver optimizer, this paper shows two methods to define the decision variables, and compares them in the convergence speed and accuracy. Experimental results on real applications show that the proposed methods could give excellent mapping solutions, and beat RM beautifully in the aspect of power consumptions. This paper does not incorporate the routing algorithms into the mapping problem, and we will do research on that in our future work.

| Table 4. PCR Comparisons of the Proposed Method M1 against Random |
|-------------------------------------------------------------------|
| Mapping Under Various Topologies in Different Applications        |

|     | ring   |        | mesh   |        | torus  |        | king mesh |        | king torus |        | express mesh |        |
|-----|--------|--------|--------|--------|--------|--------|-----------|--------|------------|--------|--------------|--------|
| No  | RM     | M1     | RM     | M1     | RM     | M1     | RM        | M1     | RM         | M1     | RM           | M1     |
| T1  | 1.1415 | 1.1415 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000    | 1.0000 | 1.0000     | 1.0000 | 1.0000       | 1.0000 |
| T2  | 1.4723 | 1.1018 | 1.3680 | 1.0553 | 1.2143 | 1.0553 | 1.1347    | 1.0000 | 1.0488     | 1.0000 | 1.0984       | 1.0000 |
| T3  | 1.4655 | 1.1826 | 1.2491 | 1.0165 | 1.1118 | 1.0165 | 1.0460    | 1.0000 | 1.0099     | 1.0000 | 1.0564       | 1.0001 |
| T4  | 1.0951 | 1.0951 | 1.0477 | 1.0003 | 1.0003 | 1.0003 | 1.0000    | 1.0000 | 1.0000     | 1.0000 | 1.0003       | 1.0003 |
| T5  | 1.1325 | 1.0124 | 1.0703 | 1.0007 | 1.0450 | 1.0007 | 1.0245    | 1.0000 | 1.0005     | 1.0000 | 1.0318       | 1.0000 |
| T6  | 1.0582 | 1.0582 | 1.0565 | 1.0565 | 1.0000 | 1.0565 | 1.0000    | 1.0000 | 1.0000     | 1.0000 | 1.0000       | 1.0000 |
| T7  | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000    | 1.0000 | 1.0000     | 1.0000 | 1.0000       | 1.0000 |
| T8  | 1.6596 | 1.0681 | 1.4827 | 1.0003 | 1.2422 | 1.0003 | 1.1938    | 1.0000 | 1.0848     | 1.0000 | 1.1786       | 1.0000 |
| T9  | 1.4125 | 1.0897 | 1.2598 | 1.0087 | 1.1758 | 1.0087 | 1.0847    | 1.0000 | 1.0002     | 1.0000 | 1.0910       | 1.0000 |
| T10 | 1.3903 | 1.0860 | 1.2293 | 1.0002 | 1.1197 | 1.0002 | 1.1021    | 1.0000 | 1.0093     | 1.0000 | 1.0522       | 1.0000 |
| T11 | 1.6198 | 1.0261 | 1.3184 | 1.0147 | 1.2255 | 1.0144 | 1.1473    | 1.0000 | 1.0448     | 1.0000 | 1.1219       | 1.0000 |
| T12 | 1.2016 | 1.0364 | 1.1144 | 1.0353 | 1.0428 | 1.0353 | 1.0364    | 1.0000 | 1.0000     | 1.0000 | 1.0396       | 1.0000 |
| T13 | 1.8352 | 1.0824 | 1.5228 | 1.0079 | 1.2902 | 1.0079 | 1.2288    | 1.0000 | 1.0801     | 1.0000 | 1.2316       | 1.0000 |
| T14 | 2.4057 | 1.0672 | 1.6244 | 1.0063 | 1.3226 | 1.0060 | 1.2301    | 1.0000 | 1.0965     | 1.0000 | 1.1828       | 1.0000 |

## Acknowledgement

This work is supported by the National Nature Science Foundation of China (No. 61402086), Scientific Research Foundation of Liaoning Provincial Education Department (No. L2015165), and DUFE Excellent Talents Project (No. DUFE2015R06).

## References

- [1] D. Singh and C. K. Reddy, "A Survey on Platforms for Big Data analytics", Journal of Big Data, vol. 1, no. 8, (**2014**).
- [2] J. Xu and J. F. Yang and C. C. Guo, Y. H. Lee and D. Lu, "Routing Algorithm of Minimizing Maximum Link Congestion on Grid Networks", Wireless Networking, (**2014**), pp. 1-20.
- [3] Z. G. Yu, D. Xiang and X. Y. Wang, "VCBR: Virtual Channel Balanced Routing in Torus Networks", Proceeding of International Conference on High Performance Computing and Communications, (2013).
- [4] J. Kim, W. J. Dally and D. Abts, "Adaptive Routing in High-radix Clos Network, In International Conference for High Performance Computing", Networking, Storage and Analysis, (2006).
- [5] J. Kim, W. J. Dally and S. Scott, "Technology-driven, Highly scalable Dragonfly Topology", In ACM SIGARCH Computer Architecture News, vol. 36, no. 3, (2008), pp. 77-88.
- [6] F. Sibai, "A Two-dimensional Low-diameter Scalable On-chip Network for Interconnecting Thousands of Cores", IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 2, (**2012**), pp. 193-201.
- [7] E. Stafford, J. Bosque and C. Marthez, "A First Approach to King Topologies for On-chip Networks", Proceedings of International Euro-Par on Parallel Processing, (2010).
- [8] K. Li, Y. Mu, K. Q. Li and G. Min, "Exchanged Crossed Cube: A Novel Interconnection Network for Parallel Computation", IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 11, (2013), pp. 2211-2219.

- [9] J. Balfour and W. J. Dally, "Design Tradeoffs for Tiled CMP On-chip Networks", In International Conference on Supercomputing, (2006).
- [10] B. Grot, J. Hestness and S. Keckler, "Express Cube Topologies for On-chip Interconnects", International Conference on of High Performance Computer Architecture, (2009).
- [11] X. Y. Wang and D. Xiang, "Multi-mapping Meshes: A New Communicating Fabric for Networks-on-Chip", International Conference on Parallel and Distributed Systems, (2010).
- [12] J. Lin and X. Lin, "Power and Latency Efficient Mechanism: a Seamless Bridge Between Buffered and Bufferless Routing in On-chip Network", vol. 61, (2012), pp. 1048-1067.
- [13] M. J. Sepulveda, W. J. Chau, G. Gogniat and M. Strum, "A Multi-objective Adaptive Immune Algorithm for Multi-application NoC Mapping", Journal of Analog Integrated Circuits and Signal Processing, vol. 73, no. 3, (2012), pp. 851-860.
- [14] G. Lai and X. Lin, "Floorplan-aware Application-specific Network-on-Chip Topology Synthesis Using Genetic Algorithm Technique", Journal of Supercomputing, vol. 61, (2012), pp. 419-437.
- [15] S. Tosun, O. Ozturk, E. Ozkan and M. Ozen, "Application Mapping Algorithms for Mesh-based Network-on-Chip Architectures", Journal of Supercomputing, vol. 71, (2015), pp. 995-1017.
- [16] J. Hu and R. Marculescu, "Energy- and Performance-Aware Mapping for Regular NoC Architectures", IEEE Transactions on Computer-aided Design of Intergrated Circuits and Systems, vol. 24, no. 4, (2005), pp. 551-562.
- [17] J. Hu and R. Marculescu, "Energy-aware Mapping for Tile-based NoC Architectures Under Performance Constraints", International Asia and South Pacific Design Automation Conference, (2003).
- [18] S. E. Lee and N. Bagherzadeh, "A High Level Power Model for Networkon-Chip (NoC) router", Computers & Electrical Engineering, vol. 35, no. 6, (2009), pp. 837-845.
- [19] http://www.localsolver.com/.

### Authors

**Xinyu Wang**, Female, born in 1985, she received her doctoral degree from the Department of Computer Science and Technology in Tsinghua in July, 2013, and her research areas include parallel and distributed computing, and application mapping in network on chip. Email address: Distribute\_2008@163.com.

**Zhigang Yu**, Male, born in 1989, he is currently pursuing his doctoral degree in the Department of Computer Science and Technology in Tsinghua University, and his research areas include parallel and distributed computing, and routing design in network on chip. Email address: yuzg@live.com.

International Journal of Control and Automation Vol.10, No.3 (2017)