Nathan Bell

personal website and blog

Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods

Citation

Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods
Nathan Bell, Steven Dalton, and Luke Olson
SIAM Journal on Scientific Computing (accepted)

Abstract

Algebraic multigrid methods for large, sparse linear systems are a necessity in many computational simulations, yet parallel algorithms for such solvers are generally decomposed into coarse-grained tasks suitable for distributed computers with traditional processing cores. However, accelerating multigrid on massively parallel throughput-oriented processors, such as the GPU, demands algorithms with abundant fine-grained parallelism. In this paper, we develop a parallel algebraic multigrid method which exposes substantial fine-grained parallelism in both the construction of the multigrid hierarchy and the cycling or solve stage. Our algorithms are expressed in terms of scalable parallel primitives that are efficiently implemented on the GPU. The resulting solver achieves an average speedup of 1.8× in the setup phase and 5.7× in the cycling phase when compared to a representative CPU implementation.
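
To give a flavor of what "scalable parallel primitives" means here, the sketch below (my own illustration, not code from the paper) counts the nonzeros in each row of a row-sorted COO matrix with Thrust's reduce_by_key; the setup phase can be assembled from data-parallel building blocks of exactly this kind.

    // Count nonzeros per row of a row-sorted COO matrix with a
    // segmented reduction; illustrative only, not the paper's code.
    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/iterator/constant_iterator.h>
    #include <cstdio>

    int main()
    {
        // row indices of the nonzeros of a 3-row matrix, sorted by row
        int h_rows[] = {0, 0, 1, 2, 2, 2};
        thrust::device_vector<int> rows(h_rows, h_rows + 6);
        thrust::device_vector<int> row_ids(3), counts(3);

        // one output per run of equal row indices, computed in parallel
        thrust::reduce_by_key(rows.begin(), rows.end(),
                              thrust::constant_iterator<int>(1),
                              row_ids.begin(), counts.begin());

        for (int i = 0; i < 3; ++i)
            std::printf("row %d: %d nonzeros\n", i, (int)counts[i]);
        return 0;
    }

Sort, scan, and reduce patterns like this one are what allow the entire hierarchy construction, including its sparse matrix-matrix products, to stay on the device.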

Downloads

External Links

PyDEC: Software and Algorithms for Discretization of Exterior Calculus

Citation

PyDEC: Software and Algorithms for Discretization of Exterior Calculus
Nathan Bell and Anil Hirani
ACM Transactions on Mathematical Software (accepted)

Abstract

This paper describes the algorithms, features, and implementation of PyDEC, a Python library for computations related to the discretization of exterior calculus. PyDEC facilitates inquiry into both physical problems on manifolds and purely topological problems on abstract complexes. We describe efficient algorithms for constructing the operators and objects that arise in discrete exterior calculus, lowest-order finite element exterior calculus, and related topological problems. Our algorithms are formulated in terms of high-level matrix operations which extend to arbitrary dimension. As a result, our implementations map well to the facilities of numerical libraries such as NumPy and SciPy. The availability of such libraries makes Python suitable for prototyping numerical methods. We demonstrate how PyDEC is used to solve physical and topological problems through several concise examples.
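
A one-line orienting formula (standard discrete exterior calculus notation, not lifted from the paper): writing d_k for the coboundary matrix taking k-forms to (k+1)-forms, i.e. the transpose of the boundary operator incidence matrix, the k-form Laplacian in the purely topological (unit Hodge star) setting is

    \Delta_k \;=\; d_{k-1}\, d_{k-1}^{T} \;+\; d_k^{T}\, d_k

Everything is assembled from sparse incidence matrices, which is why the high-level sparse matrix facilities of NumPy and SciPy are a natural fit.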

Downloads

External Links

Thrust: A Productivity-Oriented Library for CUDA

Citation

Thrust: A Productivity-Oriented Library for CUDA
Nathan Bell and Jared Hoberock
GPU Computing Gems, Jade Edition, Edited by Wen-mei W. Hwu, October 2011

Abstract

This chapter demonstrates how to leverage the Thrust parallel template library to implement high-performance applications with minimal programming effort. Based on the C++ Standard Template Library (STL), Thrust brings a familiar high-level interface to the realm of GPU Computing while remaining fully interoperable with the rest of the CUDA software ecosystem. Applications written with Thrust are concise, readable, and efficient.
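
As a small taste of the interface, here is a standalone example in the spirit of the chapter (not taken from it) that sorts data on the GPU with the same idioms as the STL:

    // Fill a host vector, move it to the GPU, and sort it in parallel.
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>

    int main()
    {
        thrust::host_vector<int> h(1 << 20);
        for (size_t i = 0; i < h.size(); ++i)
            h[i] = std::rand();

        thrust::device_vector<int> d = h;             // host-to-device copy
        thrust::sort(d.begin(), d.end());             // parallel sort on the GPU
        thrust::copy(d.begin(), d.end(), h.begin());  // results back to the host
        return 0;
    }

For built-in types thrust::sort dispatches to a fast parallel radix sort, which is part of how so little code can still perform well.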

Downloads

External Links

Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods

Citation

Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods
Nathan Bell, Steven Dalton, and Luke Olson
NVIDIA Technical Report NVR-2011-002, June 2011

Abstract

Algebraic multigrid methods for large, sparse linear systems are a necessity in many computational simulations, yet parallel algorithms for such solvers are generally decomposed into coarse-grained tasks suitable for distributed computers with traditional processing cores. However, accelerating multigrid on massively parallel throughput-oriented processors, such as the GPU, demands algorithms with abundant fine-grained parallelism. In this paper, we develop a parallel algebraic multigrid method which exposes substantial fine-grained parallelism in both the construction of the multigrid hierarchy and the cycling or solve stage. Our algorithms are expressed in terms of scalable parallel primitives that are efficiently implemented on the GPU. The resulting solver achieves an average speedup of over 2× in the setup phase and around 6× in the cycling phase when compared to a representative CPU implementation.
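
In standard multigrid notation (generic, not specific to this paper's scheme), the hierarchy in question is the Galerkin one: from a prolongator P each coarse operator is the triple product

    A_c \;=\; P^{T} A P

and cycling interleaves smoothing with the coarse-grid correction

    x \;\leftarrow\; x + P A_c^{-1} P^{T} (b - A x)

so the setup phase is dominated by sparse matrix-matrix products, which is largely the work recast here in terms of fine-grained parallel primitives.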

Downloads

External Links

Sparse Matrix-Vector Multiplication on Multicore and Accelerators

Citation

Sparse Matrix-Vector Multiplication on Multicore and Accelerators
Sam Williams, Nathan Bell, Jee Whan Choi, Michael Garland, Leonid Oliker, and Richard Vuduc
Scientific Computing on Multicore and Accelerators, December 2010

Abstract

This chapter summarizes recent work on the development of high-performance multicore and accelerator-based implementations of sparse matrix-vector multiplication (SpMV). As an object of study, SpMV is an interesting computation for at least two reasons. First, it appears widely in applications in scientific and engineering computing, financial and economic modeling, and information retrieval, among others, and is therefore of great practical interest. Second, it is simple to describe yet challenging to implement well, since its performance is limited by a variety of factors, including low computational intensity, potentially highly irregular memory access behavior, and a strong input dependence that may be known only at run time. Thus, we believe SpMV is both practically important and insightful for understanding the algorithmic and implementation principles necessary to make effective use of state-of-the-art systems.
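
The low computational intensity is easy to quantify with a back-of-the-envelope count (mine, ignoring traffic for the source vector and row pointers): in double precision with 4-byte column indices, y ← Ax performs about 2 flops per nonzero while reading at least 12 bytes per nonzero,

    \text{flops} \approx 2\,\mathrm{nnz}, \qquad \text{bytes} \geq 12\,\mathrm{nnz} \quad\Rightarrow\quad \text{intensity} \leq \tfrac{1}{6}\ \text{flop/byte}

which is far below the flop-to-byte balance of any modern processor; SpMV is therefore bandwidth-bound.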

External Links

Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors

Citation

Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors
Nathan Bell and Michael Garland
Proc. Supercomputing ’09, November 2009

Abstract

Sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In contrast to the uniform regularity of dense linear algebra, sparse operations encounter a broad spectrum of matrices ranging from the regular to the highly irregular. Harnessing the tremendous potential of throughput-oriented processors for sparse operations requires that we expose substantial fine-grained parallelism and impose sufficient regularity on execution paths and memory access patterns. We explore SpMV methods that are well-suited to throughput-oriented architectures like the GPU and which exploit several common sparsity classes. The techniques we propose are efficient, successfully utilizing large percentages of peak bandwidth. Furthermore, they deliver excellent total throughput, averaging 16 GFLOP/s and 10 GFLOP/s in double precision for structured grid and unstructured mesh matrices, respectively, on a GeForce GTX 285. This is roughly 2.8 times the throughput previously achieved on Cell BE and more than 10 times that of a quad-core Intel Clovertown system.
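
To make one of these methods concrete: the simplest CSR approach assigns one thread per matrix row. The sketch below follows the spirit of the paper's scalar CSR kernel, though the identifiers and launch configuration are mine:

    // One thread per row over CSR storage (the "scalar" approach).
    __global__ void spmv_csr_scalar(int num_rows,
                                    const int*   row_offsets,  // num_rows + 1 entries
                                    const int*   col_indices,  // nnz entries
                                    const float* values,       // nnz entries
                                    const float* x,
                                    float*       y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < num_rows)
        {
            float sum = 0.0f;
            for (int jj = row_offsets[row]; jj < row_offsets[row + 1]; ++jj)
                sum += values[jj] * x[col_indices[jj]];
            y[row] = sum;
        }
    }
    // launch: spmv_csr_scalar<<<(num_rows + 255) / 256, 256>>>(...);

The paper's vector variant instead assigns a 32-thread warp to each row so that the reads of col_indices and values are contiguous and coalesce, which is what sustains high bandwidth on rows with many nonzeros.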

Downloads

External Links

Efficient Sparse Matrix-Vector Multiplication on CUDA

Citation

Efficient Sparse Matrix-Vector Multiplication on CUDA
Nathan Bell and Michael Garland
NVIDIA Technical Report NVR-2008-004, December 2008

Abstract

The massive parallelism of graphics processing units (GPUs) offers tremendous performance in many high-performance computing applications. While dense linear algebra readily maps to such platforms, harnessing this potential for sparse matrix computations presents additional challenges. Given its role in iterative methods for solving sparse linear systems and eigenvalue problems, sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra.

In this paper we discuss data structures and algorithms for SpMV that are efficiently implemented on the CUDA platform for the fine-grained parallel architecture of the GPU. Given the memory-bound nature of SpMV, we emphasize memory bandwidth efficiency and compact storage formats. We consider a broad spectrum of sparse matrices, from those that are well-structured and regular to highly irregular matrices with large imbalances in the distribution of nonzeros per matrix row. We develop methods to exploit several common forms of matrix structure while offering alternatives which accommodate greater irregularity.
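
As an example of a structure-exploiting format: ELLPACK/ITPACK storage pads every row to a fixed K entries and lays the arrays out column-major, so adjacent threads touch adjacent memory. A minimal kernel sketch (my illustration; the -1 padding sentinel in particular is a choice made here, not a prescription):

    // ELL-style SpMV: column-major, padded storage gives coalesced access.
    __global__ void spmv_ell(int num_rows, int K,
                             const int*   col_indices,  // num_rows * K, column-major
                             const float* values,       // num_rows * K, column-major
                             const float* x,
                             float*       y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < num_rows)
        {
            float sum = 0.0f;
            for (int n = 0; n < K; ++n)
            {
                int col = col_indices[n * num_rows + row];
                if (col >= 0)  // -1 marks a padding entry
                    sum += values[n * num_rows + row] * x[col];
            }
            y[row] = sum;
        }
    }

This is the regular half of the report's hybrid (HYB) format, which stores the typical row in ELL and spills the irregular tail of long rows to COO.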

On structured, grid-based matrices we achieve performance of 36 GFLOP/s in single precision and 16 GFLOP/s in double precision on a GeForce GTX 280 GPU. For unstructured finite-element matrices, we observe performance in excess of 15 GFLOP/s and 10 GFLOP/s in single and double precision respectively. These results compare favorably to prior state-of-the-art studies of SpMV methods on conventional multicore processors. Our double precision SpMV performance is generally two and a half times that of a Cell BE with 8 SPEs and more than ten times greater than that of a quad-core Intel Clovertown system.

Downloads

External Links

Algebraic Multigrid for Discrete Differential Forms

Citation

Algebraic Multigrid for Discrete Differential Forms
Nathan Bell
PhD Thesis (UIUCDCS-R-2008-2986), August 2008

Abstract

Discrete differential forms arise in scientific disciplines ranging from computational electromagnetics to computer graphics. Examples include stable discretizations of the eddy-current problem, topological methods for sensor network coverage, visualization of complex flows, surface parameterization, and the design of vector fields on meshes. In this thesis we describe efficient and scalable numerical solvers for discrete k-form problems. Our approach is based on the principles of algebraic multigrid (AMG), which is designed to solve large-scale linear systems with optimal or near-optimal efficiency. Since the k-form problems to be solved are arbitrarily large, the need for scalable numerical solvers is clear.

Downloads

External Links

Algebraic Multigrid for k-form Laplacians

Citation

Algebraic Multigrid for k-form Laplacians
Nathan Bell and Luke N. Olson
Numerical Linear Algebra with Applications, Volume 15, Issue 2-3, Pages 165-185, February 2008

Abstract

In this paper we describe an aggregation-based algebraic multigrid method for the solution of discrete k-form Laplacians. Our work generalizes Reitzinger and Schöberl’s algorithm to higher-dimensional discrete forms. We provide conditions on the tentative prolongators under which the commutativity of the coarse and fine de Rham complexes is maintained. Further, a practical algorithm that satisfies these conditions is outlined, and smoothed prolongation operators and the associated finite element spaces are highlighted. Numerical evidence of the efficiency and generality of the proposed method is presented in the context of discrete Hodge decompositions.
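
The commutativity requirement can be stated in one line (notation mine, simplified): with D_k and D̃_k the fine- and coarse-level discrete exterior derivatives on k-forms and P_k the tentative prolongator for k-forms, the condition is

    D_k P_k \;=\; P_{k+1} \widetilde{D}_k

that is, prolongating and then differentiating on the fine complex agrees with differentiating on the coarse complex and then prolongating, so the coarse level is again a de Rham complex.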

Downloads

External Links

A Fast Multigrid Algorithm for Mesh Deformation

Citation

A Fast Multigrid Algorithm for Mesh Deformation
Lin Shi, Yizhou Yu, Nathan Bell, and Wei-Wen Feng
SIGGRAPH 2006 (ACM Transactions on Graphics, Vol. 25, No. 3)

Abstract

In this paper, we present a multigrid technique for efficiently deforming large surface and volume meshes. We show that a previous least-squares formulation for distortion minimization reduces to a Laplacian system on a general graph structure for which we derive an analytic expression. We then describe an efficient multigrid algorithm for solving the relevant equations. Here we develop novel prolongation and restriction operators used in the multigrid cycles. Combined with a simple but effective graph coarsening strategy, our algorithm can outperform other multigrid solvers and the factorization stage of direct solvers in both time and memory costs for large meshes. It is demonstrated that our solver can trade off accuracy for speed to achieve greater interactivity, which is attractive for manipulating large meshes. Our multigrid solver is particularly well suited for a mesh editing environment which does not permit extensive precomputation. Experimental evidence of these advantages is provided on a number of meshes with a wide range of sizes. With our mesh deformation solver, we also demonstrate that visually appealing mesh animations can be generated from both motion capture data and a single base mesh even when they are inconsistent.
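
For context, the least-squares reduction mentioned above is the standard one (generic notation, not the paper's): minimizing a quadratic distortion energy ‖Ax − b‖² over the unknown vertex coordinates x leads to the normal equations

    A^{T} A\, x \;=\; A^{T} b

where A^T A is symmetric positive semidefinite with the sparsity of a graph Laplacian on the mesh connectivity; it is this Laplacian structure that makes a multigrid solver effective.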

Downloads

External Links