#### J. MAKINO

Department of Systems Science, College of Arts and Sciences, University of Tokyo, 3-8-1 Komaba, Meguro-ku, Tokyo 153, Japan

Abstract. We give an overview of the GRAPE-6 project, the successor to the teraflops GRAPE-4 project. GRAPE-6 is to be completed in 1999-2000, with a planned peak speed of 200 Tflops. Its architecture will be largely similar to that of GRAPE-4, which is specialized hardware for calculating the gravitational interaction between particles. The improvement in speed will come mainly from advances in silicon semiconductor technology. GRAPE-6 will enable us to directly simulate the evolution of star clusters with up to 1 million stars.

### 1. Introduction

In 1988, we started the development of special-purpose computers for astrophysical N-body problems (GRAPE; Sugimoto et al., 1990). The basic idea was to build simple, small hardware designed specifically to calculate the gravitational interactions between particles. This hardware operates in cooperation with a general-purpose programmable computer, which performs all other calculations, such as time integration and I/O (see figure 1).
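As an illustration of this division of labor, the following minimal Python sketch separates the two roles. It is not the GRAPE interface: the function names `grape_forces` and `host_step`, the Plummer softening `eps`, the choice of units with G = 1, and the leapfrog integrator are all assumptions made for this example.

```python
import math


def grape_forces(pos, mass, eps=0.01):
    """Role of the special-purpose hardware: direct-summation accelerations.

    pos  : sequence of (x, y, z) positions
    mass : sequence of masses
    eps  : Plummer softening length (an assumption of this sketch)
    Units are chosen so that G = 1.
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = dx[0] ** 2 + dx[1] ** 2 + dx[2] ** 2 + eps ** 2
            w = mass[j] / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += w * dx[k]
    return acc


def host_step(pos, vel, mass, dt, force_fn=grape_forces):
    """Role of the host: time integration (a leapfrog kick-drift-kick step,
    used here only as a simple stand-in integrator)."""
    acc = force_fn(pos, mass)
    vel_half = [[v[k] + 0.5 * dt * a[k] for k in range(3)]
                for v, a in zip(vel, acc)]
    new_pos = [[p[k] + dt * vh[k] for k in range(3)]
               for p, vh in zip(pos, vel_half)]
    new_acc = force_fn(new_pos, mass)
    new_vel = [[vh[k] + 0.5 * dt * a[k] for k in range(3)]
               for vh, a in zip(vel_half, new_acc)]
    return new_pos, new_vel
```

The host sees the force calculation only as a function call, so the double loop in `grape_forces`, whose cost grows as N², is the part that would be offloaded to the hardware.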

We believe this approach has so far been highly successful. In 1995, GRAPE-4 (Makino et al., 1997) became the first computer to achieve a peak speed of 1 Tflops and a sustained speed of 500 Gflops. In addition, more than 25 institutes, both within and outside Japan, now have various versions of GRAPE hardware, and many of these systems are in active use. Our own group has of course been using GRAPEs, and many of our results, such as gravothermal oscillation (Makino, 1996), runaway growth of protoplanets (Kokubo and Ida, 1997), and the formation of CDM halos (Fukushige and Makino, 1997), would have been simply impossible without GRAPE-4.

This success led us to investigate a possible successor to GRAPE-4, namely GRAPE-6. Just after the completion of GRAPE-4 in 1995, we started to organize an international collaboration to develop and use a next-generation GRAPE system. In June 1997, the project to develop GRAPE-6 was approved by JSPS (Japan Society for the Promotion of Science) as one of the projects under the "Research for the Future" program. In the following, we give a brief overview of the GRAPE-6 project.

### 2. The Pipeline Processor

The heart of any GRAPE system is the pipelined processor for the force calculation. Its architecture is the most important design decision, since it determines the cost, performance, accuracy, and range of applications, in other words, practically all aspects of the machine. Here, we briefly discuss the differences between the GRAPE-4 processor chip (the HARP chip) and the GRAPE-6 chip.

Figure 1. Basic structure of GRAPE

J. Andersen (ed.), Highlights of Astronomy, Volume 11B, 597-599. © 1998 IAU. Printed in the Netherlands.

Designing and fabricating a chip is a time-consuming process, which we would rather avoid unless absolutely necessary. However, a new chip turned out to be necessary in order to take full advantage of the rapid advance of silicon VLSI technology.

The advance in technology has two consequences. The first is an increase in the number of transistors available on a single chip. The HARP chip is fabricated using $1\,\mu\mathrm{m}$ technology, while $0.25\,\mu\mathrm{m}$ technology will be used for GRAPE-6. Roughly speaking, we can use 16 times more transistors. The second is that the switching delay of a transistor decreases roughly in proportion to its physical size, which we hope will give us around a factor of 4 increase in the clock frequency. Thus, we expect the GRAPE-6 chip to have 64 times more processing power than the GRAPE-4 chip, through a larger number of pipelines and a higher clock speed. The power consumption per chip will be 2-3 times larger. The power consumption of a CMOS VLSI chip is proportional to the number of transistors and to the clock frequency; however, the reduction in the power supply voltage and in the physical size of the transistor reduces the power consumption per transistor by more than a factor of 10.
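The scaling argument can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text. The function name `scaling_estimate` and the simple model (transistor count scaling with the square of the linear shrink, clock frequency scaling linearly) are assumptions of this illustration, not a device-level calculation.

```python
def scaling_estimate(feature_old_um=1.0, feature_new_um=0.25,
                     power_ratio=(2.0, 3.0)):
    """Back-of-the-envelope scaling from the HARP chip to the GRAPE-6 chip."""
    shrink = feature_old_um / feature_new_um         # linear shrink: 4
    transistor_gain = shrink ** 2                    # area scaling: 16
    clock_gain = shrink                              # delay ~ feature size: 4
    throughput_gain = transistor_gain * clock_gain   # 64

    # Power ~ (transistors) x (clock) x (power per transistor per cycle).
    # For the chip power to grow by only 2-3x, the last factor must drop
    # by throughput_gain / power_ratio.
    implied_reduction = tuple(throughput_gain / p for p in power_ratio)
    return {
        "transistor_gain": transistor_gain,
        "clock_gain": clock_gain,
        "throughput_gain": throughput_gain,
        "implied_per_transistor_reduction": implied_reduction,
    }


if __name__ == "__main__":
    for key, value in scaling_estimate().items():
        print(key, value)
```

Under this simple model, a 2-3 times increase in chip power corresponds to a per-transistor reduction of roughly 21 to 32, which is consistent with the statement "more than a factor of 10".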

The most important advantage of the GRAPE architecture is that we can actually use almost all the transistors available on a chip to do arithmetic. This is very different from the design of general-purpose microprocessors. The number of arithmetic units in a microprocessor chip has been increasing only very slowly over the last 5 years. No RISC microprocessor is predicted to have more than 8 arithmetic units before the year 2000. The GRAPE-4 chip already had 20, and the GRAPE-6 chip will have around 400. This difference of nearly a factor of 100 is what we can achieve by sacrificing programmability.

### 3. The overall architecture

The design goal of GRAPE-6 is to achieve reasonable performance for the simulation of globular clusters with $N=5\times 10^5$ in the full-size configuration. We also require that the machine can be divided into smaller pieces without sacrificing the communication bandwidth of each piece. That is, if the whole machine has a communication bandwidth of 2 GB/s, then when we use it as 4 separate pieces, each piece should have the same 2 GB/s of bandwidth.

As in the case of GRAPE-4, the force on a single particle must be calculated as a sum of partial forces computed on many processor chips, in order to reduce the apparent degree of hardware parallelism. To achieve this, the overall connection topology must be some kind of "reduction tree".

GRAPE-4 has a reduction tree with two levels, which was sufficient for its 4000 virtual pipelines. GRAPE-6 will have around $10^5$ (virtual) pipelines, so we will need more levels and therefore a more modular design of the reduction tree. We are currently investigating the advantages and disadvantages of various topologies and technological options.
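The idea of a reduction tree can be sketched in software as follows. Each "chip" holds a share of the particles and computes a partial force on the same target position; adder nodes then sum the partial results level by level. The names `partial_force`, `tree_reduce` and `demo`, the fan-in of 4, the round-robin distribution of particles and the softening `eps` are hypothetical choices for this illustration, since the actual GRAPE-6 topology is still under investigation.

```python
import math
import random


def partial_force(xi, chunk, eps=0.01):
    """Partial acceleration at position xi from one chip's share of particles.

    chunk is a list of (mass, (x, y, z)) pairs; G = 1, Plummer softening eps.
    """
    acc = [0.0, 0.0, 0.0]
    for m, xj in chunk:
        dx = [xj[k] - xi[k] for k in range(3)]
        r2 = sum(d * d for d in dx) + eps * eps
        w = m / (r2 * math.sqrt(r2))
        for k in range(3):
            acc[k] += w * dx[k]
    return acc


def tree_reduce(values, fan_in=4):
    """Sum 3-vectors level by level, with fan_in inputs per adder node.

    Returns the total and the number of tree levels used.
    """
    levels = 0
    while len(values) > 1:
        values = [
            [sum(v[k] for v in values[s:s + fan_in]) for k in range(3)]
            for s in range(0, len(values), fan_in)
        ]
        levels += 1
    return values[0], levels


def demo(n_particles=64, n_chips=16, fan_in=4):
    rng = random.Random(1)
    particles = [(1.0 / n_particles,
                  tuple(rng.uniform(-1.0, 1.0) for _ in range(3)))
                 for _ in range(n_particles)]
    xi = (0.1, 0.2, 0.3)
    chunks = [particles[c::n_chips] for c in range(n_chips)]   # one per chip
    partials = [partial_force(xi, chunk) for chunk in chunks]
    total, levels = tree_reduce(partials, fan_in)
    direct = partial_force(xi, particles)                      # reference sum
    return total, direct, levels
```

With 16 chips and a fan-in of 4, the tree in this sketch has two levels. More leaves or a smaller fan-in require more levels, which is the reason a more modular tree design is needed for a larger machine.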

The total system will consist of 4096 chips, which will be organized into 16 clusters each with 8 processor boards. Each board will carry 32 processor chips. Total power consumption will be around 40-50 kW.
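The configuration numbers can be cross-checked with simple arithmetic. The sketch below uses only the figures quoted in this paper (16 clusters, 8 boards per cluster, 32 chips per board, a planned peak of 200 Tflops, and 40-50 kW in total). The function name `system_budget` is hypothetical, and dividing the total power evenly among the chips is an assumption: the total presumably also covers components other than the processor chips, so the per-chip power is an upper bound on the average chip's share.

```python
def system_budget(clusters=16, boards_per_cluster=8, chips_per_board=32,
                  planned_peak_tflops=200.0, total_power_kw=(40.0, 50.0)):
    """Cross-check of the full-size GRAPE-6 configuration figures."""
    total_chips = clusters * boards_per_cluster * chips_per_board
    # Peak speed each chip must deliver to reach the planned total.
    gflops_per_chip = planned_peak_tflops * 1000.0 / total_chips
    # Total system power divided evenly among the chips (upper bound).
    watts_per_chip = tuple(p * 1000.0 / total_chips for p in total_power_kw)
    return total_chips, gflops_per_chip, watts_per_chip


if __name__ == "__main__":
    chips, gflops, watts = system_budget()
    print("chips:", chips)
    print("Gflops per chip: %.1f" % gflops)
    print("W per chip: %.1f to %.1f" % watts)
```

The board and cluster counts multiply to exactly 4096 chips, and the planned peak speed corresponds to a little under 50 Gflops per chip.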

### 4. Extension

Currently, a number of people are using GRAPE systems for SPH simulations (see, e.g., Steinmetz, 1996). In these simulations, GRAPE is used to calculate gravity and to construct the lists of neighbor particles, while the host handles the SPH interactions between neighbors. The calculation of the SPH interactions consumes a fairly large fraction of the total CPU time. If the SPH interactions could also be handled on specialized hardware like GRAPE, we could achieve a further speedup of SPH calculations.

The speedup we can achieve is not very large, typically around a factor of 10 or less. This is because the calculation cost of SPH interaction is still O(N) and not as large as that of gravity. On the other hand, this fact implies that we need only modestly fast hardware.
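The difference in cost can be made concrete with a small sketch of the host-side SPH work. It evaluates only the SPH density sum over neighbor lists, using the standard 3D cubic spline kernel. The function names, the fixed smoothing length `h`, and the brute-force `neighbor_lists` routine (a stand-in for the lists that GRAPE would return) are assumptions of this example, and a real SPH code also evaluates pressure forces and other terms.

```python
import math


def cubic_spline_kernel(r, h):
    """Standard 3D cubic spline SPH kernel with support radius 2h."""
    q = r / h
    sigma = 1.0 / (math.pi * h ** 3)
    if q <= 1.0:
        return sigma * (1.0 - 1.5 * q * q + 0.75 * q ** 3)
    if q <= 2.0:
        return sigma * 0.25 * (2.0 - q) ** 3
    return 0.0


def distance(a, b):
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in range(3)))


def neighbor_lists(pos, h):
    """Stand-in for the neighbor lists from GRAPE: all j (including i
    itself) closer than 2h to particle i."""
    n = len(pos)
    return [[j for j in range(n) if distance(pos[i], pos[j]) < 2.0 * h]
            for i in range(n)]


def sph_density(pos, mass, h, neighbors):
    """Host-side SPH density: a sum over each particle's neighbors only."""
    return [sum(mass[j] * cubic_spline_kernel(distance(pos[i], pos[j]), h)
                for j in nb)
            for i, nb in enumerate(neighbors)]


def interaction_counts(neighbors):
    """Pairwise terms for gravity, N(N-1), versus SPH, the sum of the
    neighbor list lengths."""
    n = len(neighbors)
    return n * (n - 1), sum(len(nb) for nb in neighbors)
```

For a 4 x 4 x 4 lattice with unit spacing and `h = 0.6`, `interaction_counts` gives 4032 gravitational terms against 352 SPH terms. The SPH count grows in proportion to N for a fixed number of neighbors per particle, while the gravity count grows as N².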

In the GRAPE-6 project, however, we will try a novel approach, so-called "reconfigurable computing" (Buell et al., 1996). An obvious alternative is to develop hardware specialized for SPH (Yokono et al., 1996). However, whether specialized hardware for SPH is worthwhile is still unclear, so we decided to go for more generality. "Reconfigurable computing" is a rather new concept, made possible by advances in "reconfigurable logic", or field-programmable gate arrays (FPGA), which evolved from programmable logic devices (PLD). PLD itself is a rather new technology, which has become practical thanks to advances in silicon VLSI technology.

Figure 2. Extended GRAPE architecture

Readers interested in FPGAs and reconfigurable computing are referred to Buell et al. (1996). The bottom line is that reconfigurable computing may achieve a more flexible pipeline architecture than hardwired GRAPE pipelines and, at the same time, a better price-performance ratio than programmable general-purpose computers. Of course, this also means that reconfigurable computing is not as flexible as a general-purpose computer and not as efficient as GRAPE, so it cannot directly compete with either of them. However, for the part of the computation that is relatively time-consuming, but much less so than the gravitational force calculation, reconfigurable computing would offer an ideal solution.

Thus, with the additional reconfigurable hardware, GRAPE-6 might become a heterogeneous computer with three components rather than two. We may be able to use this part for various applications, such as the calculation of van der Waals forces in molecular dynamics and the evaluation and shifting of spherical harmonics in the fast multipole method.

### 5. Budget and Timetable

The GRAPE-6 project is a five-year project with a total budget of around 500 million yen. The plan is to develop the processor chip by the middle of 1998, a small prototype by 1999, and the full-scale system by the year 2000. We plan to make "small" systems (16-32 processor chips; 1-2 Tflops) available to many institutes. They will serve as the main computing workhorses for large-scale simulations of self-gravitating systems.

### References

Buell D., Arnold J.M. and Kleinfelder W. (1996) Splash 2: FPGAs in a Custom Computing Machine. IEEE Comp. Soc. Press, Los Alamitos, CA.

Fukushige T. and Makino J. (1997) On the origin of cusps in dark matter halos. ApJL 477, L9-12.

Kokubo E. and Ida S. (1997) Oligarchic growth of protoplanets. submitted to Icarus.

Makino J. (1996) Postcollapse evolution of globular clusters. ApJ 471, 796-803.

Makino J., Taiji M., Ebisuzaki T., and Sugimoto D. (1997) GRAPE-4: A massively parallel special-purpose computer for collisional N-body simulations. ApJ 480, 432-446.

Sugimoto D., Chikada Y., Makino J., Ito T., Ebisuzaki T., and Umemura M. (1990) A special-purpose computer for gravitational many-body problems. Nature 345, 33-35.

Steinmetz M. (1996) GRAPESPH: cosmological smoothed particle hydrodynamics simulations with the special-purpose hardware GRAPE. MNRAS 278, 1005-1017.

Yokono Y., Ogasawara R., Takeuchi T., Inutsuka S., Miyama S. M., and Chikada Y. (1996) Development of special-purpose computer for cosmic hydrodynamics by SPH. In Tomisaka K. (ed) Numerical Astrophysics Using Super-computers. National Astronomical Observatory, Japan.