Data locality: the new challenge for HPC
With the increasing number of computing nodes, the increasing number of cores per node, and the new many-core accelerators (e.g. GPU, Xeon Phi), HPC applications developers are struggling to make an efficient usage of all these heterogeneous resources. It is acknowledged that many applications only achieve a very low global efficiency especially because of their poor ability to handle and manage data locality that is topology-aware data management at all levels: in memory, in intra-node inter-node communication, and in I/O.

Indeed, three locality problems need to be addressed on the path to larger-scale machines: data distribution, data exchange, and data storage.

Data distribution
The first locality problem is on the application side. Each application has its own scheme of distributing the data across the compute nodes. Some use available libraries, like Scotch (developed by the COLOC partner Inria), some rely on their own algorithms. The way data are distributed and then accessed during the application execution is key factor for performance. In order to help the developer to make a better usage of the shared resources, tools are needed to leverage new parallelisation techniques and to measure data-flows. The objective is to take advantage of the full topology of the machine to enhance both data locality and synchronization, thus releasing pressure on the distributed structure (network and global I/O).

Data exchange
The second locality problem arises from the way data are actually exchanged between processes. Indeed the transfer time depends both on the affinity of the processes and their location. The increasing complexity and scale of the compute node architectures and inter-node networks has not yet been matched by capabilities of controlling and estimating data movement decision, which will have a raising impact on scalability.

Data storage
The last locality problem relates to data storage. Here, accessing data depends on many factors, in particular the location of the data and the placement of the processes. Across the vertical dimension provided by the storage I/O stack, data locality is currently managed independently by each layer (application, I/O library, I/O forwarding layer, file system), which can result in inefficient management due to data redundancies, unaligned access, unnecessary consistency constraints, etc.

These three data locality problems must be addressed altogether
Unfortunately, a quick overview of the literature and existing solutions show that none of them address the locality problem in all dimensions and in a satisfactory manner even for nowadays parallel computer and certainly not for future post-petascale machines and applications.

Appropriate models are needed
We see that modeling hardware and data movement often leads to complicated models due to the complexity of modern architectures and the wide variety of technologies that are involved. Data movement raises the question of finding a good model when dealing with intra-node and inter-node communication. Again, the complexity of the architectures and memory hierarchies makes such models hard to design. For instance MPI implementations usually rely on basic heuristics for predicting the performance of the available communication strategies. The hardware is now well described in a qualitative way thanks to tools such as Hwloc which builds a tree of hardware resources describing their locality. But this model fails to indicate which levels of the tree actually play a role on the data movement performance (caches usually do while sockets may not).

Recently, for designing placement algorithm, new data movement models have been devised, but most established network models, such as the Hockney Model or the LogP model family do not take the network topology into account.

Concerning the vertical dimension of locality we see that in the software storage I/O stack of current petaflop machines, data locality is managed independently at various levels such as application, middleware, and file system. At the lowest level of the I/O stack, parallel file systems such as GPFS, Orange/PVFS, Lustre manage block-based or object-based storage through parallel servers running on storage nodes. These parallel file systems strive to comply with the POSIX requirements. However, one of the main criticisms of POSIX is that it does not expose data locality, while researchers agree that locality-awareness is a key factor for building Exascale systems

For the horizontal dimension of locality, the increasing complexity and level of parallelism inside the computing nodes raises the question of how to control data locality in such a complex scenario. CPU architectures are based on increasingly complex NUMA architectures with multi-layered caches. Exploiting data locality in these complex architectures requires precise knowledge of the hierarchical structure of the machine. There exist very few tools and standard way to control the way processes are mapped to CPU/cores. For instance the numactl tool allows for precise mapping but is not portable outside of the Linux world, this is the same problem for allocating pages on memory node using the NUMA policy library. The Hwloc tool partly solves this problem by providing a portable interface for exposing the memory hierarchy and controlling the thread and process placement. However, there is no portable tool to expose and control quantitative knowledge about the data transfer between cores and nodes.

We are also missing tools to support data locality

For locality tools, we know that moving data inside a supercomputer with hundreds of thousands of CPUs connected by sparse (on-chip and off-chip) networks is a daunting task. Today’s parallel programming frameworks have little to no explicit features to support locality. The Message Passing Interface has addressed this issue by providing topology management routines that can renumber processes. Practically, MPI implementations seldom provide implementations of these routines able to perform any reordering, except for vendor implementations tailored for a specific hardware. MPI 2.2 offers the more scalable Graph Topology Interface for renumbering processes, and generic implementations now propose topology-aware implementations of such routines. However, it does not expose the system topology to the application, it does not allow any assumptions about the mapping performance resulting from data and processes mapping improvements, and it does not support I/O operations.

Last, the mapping problem consists in placing processes, taking into account their affinity (e.g. volume of exchanged data), onto an architecture, taking into account its topology. There are several mapping techniques in the literature. However, the growing number of CPUs and complexity of networks poses new challenges for these schemes (scalability, I/O, tool cooperation). For example, memory constraints will require efficient parallelization of the current schemes to support post-petascale computers. Another example concerns the resource management systems that mostly implement single objective optimization (fairness) without taking into account locality, I/O or energy. Moreover, there exists only few runtime solutions to cope with dynamic topology-aware locality management and these solutions are bound to specific mechanisms such as migration and load-balancing in CHARM++.

A holistic approach is required
From the above roadblocks and the quick overview provided here, it is clear that without addressing the different locality issues, it will not be possible to execute applications at large scale. Indeed many applications already suffer from inefficient data access and it is mandatory to improve data access performance to exploit hardware full capabilities. To the best of our knowledge, there exist currently no abstractions and mechanisms integrating both vertical and horizontal dimensions of the software system stack, which may facilitate the programmability of data locality control in an integrated way (e.g. by taking into consideration global knowledge about intra-node, inter-node, and storage I/O architectures).