In Search Of Clusters
Acknowledgements
The following are notes that I took on the book "In Search Of Clusters:
The Ongoing Battle In Lowly Parallel Computing" (2nd Edition, published
by Prentice Hall PTR, Copyright 1998) by Gregory F. Pfister. These notes
are my work, but they are notes on Mr. Pfister's book, and thus should
not be taken to be my ideas (note also that I don't hold copyrights/trademarks
on anything mentioned herein).
Cluster Concepts
A type of parallel or distributed system that:
- Consists of a collection of interconnected whole computers
- Can be used as a single, unified computing resource
Why are clusters becoming more important? See the Standard
Litany for generic reasons, plus recent changes in hardware and software:
- Fat boxes: high-performance microprocessors
- Fat pipes: high-performance networks
- Thick glue: standard tools for distributed computing
Why aren't clusters taking over everything?
- Hard to administer
- Limited exploitation: there isn't much software that uses clusters
Clusters can be (somewhat arbitrarily) categorized along a number of orthogonal axes:
- Intracluster communication: Is the communication through a public network, or through a physically isolated cluster network? Note that ATM switches can be configured to provide the illusion that the cluster has a private network despite the switches carrying 'public' traffic.
- Dedicated compute nodes: Are the nodes in the cluster dedicated to work done by the cluster ("glass house"), or are they users' workstations, used when the users aren't there ("campus-wide")? Campus-wide clusters are more complex, but allow for scavenging of unused CPU cycles.
- Point of connection: Are the nodes of the cluster connected at the I/O subsystem (easier, but slower/less bandwidth because of having to go through the network adaptor), or at the memory subsystem? Additionally, do the nodes share (concurrently use) resources (like disks and memory), or share nothing (which implies message passing)?
- Communication needs: Are the jobs compute-intensive, serial jobs (SPPS) with low I/O requirements, or I/O-intensive jobs (which are more typical of commercial work)?
- Level of software support: Does the clustering software provide low-level support (i.e., fast, low-overhead communication, like SANs or the Virtual Interface Architecture), or higher-level support (locks, data structures, group membership, like IBM's Parallel Sysplex)?
Cluster, SMP, NUMA Hardware
Cluster Hardware
There are a number of technologies used, some of which are SANs, VIA, and
the Coupling Facility of IBM's Parallel Sysplex.
SAN: System Area Network
Originally created by Tandem in their ServerNet product, SANs are specialized
networks that attach directly to the system bus, thus allowing for many
PCI buses (each with a small number of peripherals). This is better than
chaining I/O buses, and it gives much higher bandwidth (since ServerNet
attaches to the system bus and doesn't have to go through an adaptor or
the OS) and a reliable network (meaning less overhead in the protocol).
SANs are often attached more mundanely via a PCI adaptor, and thus aren't
entirely as cool as ServerNet.
Virtual Interface Architecture
U-Net commercialized, without having to pay some university. Used to
eliminate OS overhead by moving as much of the networking stack into user
space as possible, by:
- Zero-copy messaging: scatter/gather lists; asynchronous send, with the message buffer left untouched until the send completes
- No interrupts: processes can poll without doing a context switch (which is good, since the network is often extremely fast); multiple messages can be received at once
- Virtualized user mode: there's no stack necessarily associated with the interface - thus any protocol can be virtualized, so long as the network adaptor is sufficiently advanced.
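To make the "poll instead of taking an interrupt" idea concrete, here is a minimal C sketch. The names (vi_completion, fake_adaptor) are made up for illustration and are not the real VI Architecture API; a second thread stands in for the network adaptor writing a completion entry that the application notices purely by polling, with no syscall or context switch.

    /* Hypothetical sketch of VIA-style user-mode polling (names are invented). */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    struct vi_completion {
        volatile int ready;      /* set by the "adaptor" when a message has landed */
        char         buffer[64]; /* user-space buffer filled in place (zero copy)  */
    };

    static struct vi_completion cq[8];   /* completion queue shared with the adaptor */

    static void *fake_adaptor(void *arg)
    {
        (void)arg;
        strcpy(cq[0].buffer, "hello from the wire");
        cq[0].ready = 1;                  /* the equivalent of the DMA completing */
        return NULL;
    }

    int main(void)
    {
        pthread_t nic;
        pthread_create(&nic, NULL, fake_adaptor, NULL);

        while (!cq[0].ready)
            ;                             /* poll in user space: no interrupt, no context switch */

        printf("received: %s\n", cq[0].buffer);
        pthread_join(nic, NULL);
        return 0;
    }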
DEC Memory Channel
A hardware-supported form of distributed shared memory for a local cluster.
It operates by exploiting a bus's memory-mapping ability (i.e., write to
this memory location, and have the bus shunt that write to a device), and
by providing Memory Channel (MC) adaptors which provide additional memory
mapping. What happens is that a node writes to a memory-mapped location,
and the write gets sent (via the MC adaptor) to a hub, which broadcasts
it to all nodes that have registered (with OS assistance) for this memory
region (including the node that sent the write). Each receiving MC adaptor
writes the data into the local node's memory (if the node didn't register
for this memory region, the write is ignored), thus allowing for global
memory (formerly called "Reflective Memory"). This is a very low-level
facility, which actually isn't recommended for use in a shared-memory style
of programming: DEC itself provides MPI and PVM libraries to actually do
something with Memory Channel.
IBM Parallel Sysplex: Coupling Facility
The Coupling Facility is a high-level facility, which provides commands
to do things like node registration and deregistration (and fencing suspect
nodes), global queues, global locks, cluster-wide caching, as well as
combinations of the above. The CF is layered over "CF Links", which actually
shuffle the commands (and the data needed to carry them out) around.
SMP: Symmetric MultiProcessors
Sometimes referred to as "Tightly Coupled Multiprocessors", though SMP is
a better summary. An SMP is both a hardware and a software model: there
are multiple CPUs, but only one of every other subsystem (there's one I/O
subsystem, one memory subsystem, etc.). Each CPU can do anything any of
the others can, meaning that an architecture with only one CPU able to do
I/O isn't an SMP. Note that this also means that there can't be memory
local to individual CPUs. It doesn't preclude memory caches (since they're
not supposed to alter the view of memory). Note that this architecture
is of somewhat limited scalability, since I/O (and, more importantly, memory)
contention will reduce performance. It does tend to work well for compute-bound
(and better still for "processor bound") processes, though.
Since main memory is so incredibly slow (relative to the CPU), CPUs
use caches to increase the execution speed. Note that there are tricks
that can be used to increase the memory bandwidth (for example, using separate
memory banks so that the memory subsystem can field more requests per unit
time). Note also that in an SMP setting, using a crossbar memory switch
allows any CPU to ask any memory bank for data, simultaneously. SMPs
have to worry about maintaining coherence among these caches so that different
CPUs don't see different views of memory because of their (local) caches.
If a process is migrated from one CPU to another, the memory must look the
same on both CPUs. For parallel processes the shared memory must be consistent.
Lastly, the SMP must be careful not to allow two CPUs to each cache the
same X bytes, each modify different bytes within that chunk, and then have
one accidentally overwrite the other's changes when writeback occurs. There
are a couple of SMP ways to solve this problem: a central directory and a
snoopy bus. Note that we could opt for a non-SMP solution by having CPU
instructions to flush/purge the cache, but this would violate the SMP model
(one consistent memory should be seen, but this makes the caches visible).
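The "two CPUs caching the same chunk but using different bytes" problem shows up in software as false sharing. Here is a small pthreads sketch (my own illustration, not from the book; the 64-byte cache-line size is an assumption) of the usual software-side fix: pad each counter out to its own line so the line doesn't ping-pong between caches.

    /* Two threads increment two different counters.  If the counters shared a
     * cache line, the line would bounce between the CPUs' caches; padding
     * gives each counter its own line. */
    #include <pthread.h>
    #include <stdio.h>

    #define CACHE_LINE 64                 /* assumption: 64-byte cache lines */

    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];  /* keep neighbours in separate lines */
    };

    static struct padded_counter counters[2];

    static void *worker(void *arg)
    {
        struct padded_counter *c = arg;
        for (long i = 0; i < 100000000L; i++)
            c->value++;                   /* touches only its own cache line */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &counters[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("%ld %ld\n", counters[0].value, counters[1].value);
        return 0;
    }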
Central Directory
Also known as a "storage control unit" (in mainframes), this manages
all reads/writes amongst the caches, simultaneously. Can be really fast,
but doesn't scale. Not used in the general case
Snoopy bus
Each CPU broadcasts requests for data on a bus (when a cache miss occurs),
and if another cache has the data, it returns it. Otherwise main memory
does. This can be optimized by reducing the penalty for going to other
caches (e.g., CPUs can also share level 2 caches), and by reducing the
amount of traffic on the bus (for example, if the data came from main memory,
nobody else has it, so don't broadcast writes to it). Can also reduce bus
traffic by using a crossbar memory switch to transfer the actual data,
and the bus just for broadcast updates.
With the above, we've accomplished cache coherency, and thus allowed
access to the memory subsystem at a reasonable rate, with a consistent
view of it. The next thing to address is sequential consistency.
Two processes executing on two separate processors are sequentially consistent
if the instructions from the two programs are interleaved in some way.
The interleaving can be arbitrary, except that the ordering of instructions
within each process can't be changed. I.e., instruction 1 of process 1 must
always execute before instruction 2 of process 1, regardless of when
instructions in process 2 execute. Since CPUs are often pipelined, and allow
for reordering of instructions (for example, by delaying store instructions
in a 'pending store buffer'), this can cause problems. For machines with
special instructions to facilitate locking, these instructions can also be
used to ensure that the CPU itself isn't holding on to any particular state.
Pfister recommends this b/c of the locality of a lock -- unlike a cache, a
lock is isolated. Plus, many machines provide such instructions anyways, so
it won't break the SMP model.
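A small C/pthreads sketch of why this matters (my own illustration, under the assumption that the lock implementation issues whatever ordering/barrier instructions the machine needs): a writer sets some data and then a flag; without ordering, a pending-store buffer could make the flag visible before the data.

    /* Thread A writes data then a flag; thread B waits for the flag and reads
     * the data.  Wrapping both accesses in a lock forces the order the
     * programmer expects, even on a CPU that delays stores. */
    #include <pthread.h>
    #include <stdio.h>

    static int data = 0;
    static int flag = 0;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    static void *writer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&m);   /* the lock plays the role of the "special */
        data = 42;                /* instruction" that orders pending stores  */
        flag = 1;
        pthread_mutex_unlock(&m);
        return NULL;
    }

    static void *reader(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&m);
            int ready = flag, value = data;
            pthread_mutex_unlock(&m);
            if (ready) {          /* with the lock, flag==1 guarantees data==42 */
                printf("read %d\n", value);
                return NULL;
            }
        }
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, writer, NULL);
        pthread_create(&b, NULL, reader, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }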
I/O for SMPs
Note that I/O devices must also participate in cache coherency, especially
any sort of DMA thingee, so they get the right data/update it correctly.
It's not unreasonable (for uniprocessors, anyways) to use flush/purge type
instructions within the device driver to get this to work properly.
NUMA
NUMA stands for "NonUniform Memory Access", and refers to machines that
are coupled at the memory level: any machine can access the memory of any
other machine as if the remote memory were local. The remote memory will
(of course) take longer to access, hence the "nonuniform" part of NUMA.
If this delay is fairly small, then the SMP programming model is preserved.
If this delay becomes significant, then programs have to be structured
to compensate for it in order to achieve reasonable levels of performance.
Note that the memory sharing is usually accomplished by mapping the memory
of each node into a global address space. Thus, memory location X means the
same thing to all nodes: a specific memory location on a specific machine.
In fact, NUMA machines will run a single copy of the operating system across
all of the nodes. Because of this, CC-NUMA systems don't offer the high
availability that clusters do; it's also tough to offer continuous availability
(i.e., upgrading one node while the rest continue to service requests). Next-generation
NUMA systems are trying to offer such high availability by allowing
individual nodes to fail, but this makes them essentially clusters with
a snazzy memory interconnect. NUMA can be subdivided into cache coherent
(CC-NUMA) and noncoherent cache (NC-NUMA), depending on whether the system
keeps the CPUs' views of memory consistent or not.
CC-NUMA
Internally, the (possibly SMP) machine behaves the same. However, on
the memory bus there exists an extra component (called a directory) which
acts as a surrogate in accessing remote memory. As per the snoopy bus algorithm,
when a location is requested and the memory is remote (and the caches don't
have it), main memory won't have it either, but the directory box will notice
that it knows where the data is, and request the data from the appropriate
machine's memory. If that machine (the 'home' machine) has already given
out that data, it may have to go and get it back (unless, for example,
both machines request the data in read-only mode); note that this requires
the home machine to keep a list of machines that have copies of the data.
What all this work accomplishes is keeping the view of memory consistent,
maintaining this coherency even in the face of caches on different machines.
NC-NUMA
Noncoherent cache NUMA machines allow any machine to access any memory,
but don't do any work to maintain a consistent view of memory. For example,
the DEC Memory Channel is (kinda) like this, in that it allows one to share
memory between machines, but does nothing to prevent different machines
from making concurrent changes to their private copies of some global data.
(CC-)NUMA requires changes in not only the hardware, but also the software
that supports it -- for example, the memory allocation routines must be
modified to preferentially allocate memory 'local' to processes. Likewise,
processes can't be migrated to any old CPU, since there's now the issue
of how long it will take to access the memory. Virtual memory systems may
also need to be changed.
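As a concrete illustration of "allocate memory local to the process" (my own example, assuming a Linux box with the libnuma library installed; this is not how the book's systems did it):

    /* NUMA-aware allocation with Linux's libnuma (link with -lnuma). */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "not a NUMA machine (or libnuma unsupported)\n");
            return 1;
        }

        size_t size = 1 << 20;
        /* Allocate on the node the calling thread is currently running on. */
        void *local = numa_alloc_local(size);
        /* For contrast: explicitly allocate on the last (likely remote) node. */
        void *remote = numa_alloc_onnode(size, numa_max_node());

        printf("nodes 0..%d, local=%p remote=%p\n", numa_max_node(), local, remote);

        numa_free(local, size);
        numa_free(remote, size);
        return 0;
    }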
Arguments for NUMA
- Memory speed: NUMA is a way of accessing lots of memory at pretty much the same speed (or at least local memory at one speed, remote memory at a slower speed). This gets around the von Neumann bottleneck by increasing the bandwidth between processors and main memory, especially since CPUs tend to access memory in a 'bursty' fashion.
- Production volumes: As more vendors produce NUMA machines, NUMA will acquire critical mass, thus getting cheaper and more popular.
- Demand for parallelism: There exist people who want hugely parallel systems.
- Wave of the future: NUMA hype from the industry, and people wanting to jump on the 'wave of the future' bandwagon, will create/increase critical mass.
- Overwhelming memory argument: The speed of interconnects, hardware, etc., to support NUMA will increase, thus making remote memory 'closer'. Access speeds of main memory (relative to CPU speeds) will at best stay the same, meaning that the overwhelming cost in NUMA will be accessing the memory (local or remote). Indeed, as memory access speeds get slower (relative to CPU speeds), it won't matter much whether the memory is local or remote, since the memory access time will dominate the whole thing anyway. Which means that NUMA is very useful, and not much slower than regular memory access anyway.
SCI: Scalable Coherent Interconnect
Consists of a number of nodes connected by SCI links (in a ring topology),
with one of those directory boxes attached to the memory bus of each node.
The defining feature (from a CC-NUMA point of view) is that it maintains
a doubly linked list to keep track of who's got remote memory. This linked
list is stored in the caches of the machines that hold the remote data,
so that when a machine receives a cache line from a remote source, it's
also told the next and previous machines that hold the same line. Doing
things like purging a line from all the machines (e.g., when someone wants
to update the line) is thus slow, but this scheme scales darn well.
DASH: Directory Architecture for SHared memory
To keep track of which nodes have a copy of a cache line, DASH uses a bit
vector stored on the home machine of the cache line. Since all the data
is stored in one place, purge requests can be broadcast (which is quick),
but the scalability is limited b/c of the vector length. This was revised
and made into FLASH (FLexible Architecture for SHared memory), which still
keeps the 'checked out' list on the home machine of a cache line, but keeps
it in the form of a linked list, so that it can scale almost arbitrarily.
s-COMA: simple Cache Only Memory Architecture
This is kind of a variation on CC-NUMA, in that it uses main memory
as a cache for remote memory, and ties it to the virtual memory subsystem.
It works to overcome cache misses caused by lack of capacity.
When a memory page is first accessed, the mapper (the analog of the directory
box in CC-NUMA) gets told which machine the page resides on (the 'home
machine'), and the address requested. It then creates a page in memory on
the local machine to hold any data obtained from the remote machine, and
thereafter retrieves cache lines from the remote machine into the local page
via standard CC-NUMA-type algorithms. Thus, s-COMA uses memory to hold
cache lines from pages that it's using, meaning that memory has become
a huge cache for remote memory. The home machine keeps track of who's got
copies so that updates and whatnot can be done. Note that if one machine
uses a remote memory page extensively, the cache lines from that page pile
up in the local page, and so the remote page slowly migrates to the machine
that uses it most. For this reason, the memory is sometimes called 'attractive
memory'. The downside of s-COMA is that it's easy for multiple copies of
pages to proliferate.
Software Issues
Workload Types
(Area / Name: Description)
- Serial / Batch: To increase throughput (i.e., more jobs per unit time). SPPS - requires no modification of software. I/O may be an issue, particularly on SMPs and if the data needs to be shuffled between nodes.
- Serial / Interactive - Applications: E.g., databases and web servers. Appear to be serial, but exploit parallelism.
- Serial / Interactive - Native login: Not much good for highly interactive tasks (like editing), but really good for 'steering' huge tasks -- periodically examining the results, and tweaking/restarting things if they're not going well.
- Serial / Multijob parallel: Lots of separate jobs that depend on each other. For example, a simulation composed of multiple processes, where some steps should be rechecked.
- Parallel (huge) / Grand Challenge: Huge tasks that are of scientific/commercial value that can't be done with anything smaller.
- Parallel (huge) / Research: Parallelism is cool & grad students are cheap.
- Parallel (huge) / Heavy-duty commercial: Stuff like data mining, where such a huge volume of data needs to be treated that parallelism is needed.
- Parallel (small) / Commercial: Stuff like databases, where it's possible to sell it b/c the performance requirements are that demanding. Also stuff like fast ray-tracing for computer graphics (a la Toy Story).
- Parallel (small) / Technical: Parallel compilers, aircraft simulators, and nuclear testing programs.
Amdahl's Law
(total execution time) = (parallel part, which shrinks as processors are added) + (serial part, which doesn't).
Thus, even as the parallel part gets really small, the serial part will stay
the same, and will come to dominate execution time. So why bother with
parallelism, since its improvements are necessarily limited? The point is
that in order to improve system performance, we've got to improve the hardware,
the operating system, the middleware (database, communication, etc.), and
the application. If any one isn't tuned, the whole thing falls flat.
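A quick numeric sketch of the limit (my own example, assuming a job that is 95% parallelizable): even with unlimited CPUs, the speedup flattens out near 1/0.05 = 20x.

    /* Amdahl's Law: total time = serial part + parallel part / N. */
    #include <stdio.h>

    int main(void)
    {
        double serial = 0.05, parallel = 0.95;   /* fractions of the original run time */
        int cpus[] = {1, 2, 4, 8, 16, 64, 1024};

        for (int i = 0; i < 7; i++) {
            double t = serial + parallel / cpus[i];
            printf("N=%4d  time=%.4f  speedup=%.1fx\n", cpus[i], t, 1.0 / t);
        }
        return 0;
    }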
Programming Models
General Concurrency Issues
A lot of this stuff is old hat from my OS class, and thus will be covered in
even less detail than the book did. Race conditions can arise from improper
synchronization. Excessive use of critical sections leads to serialization,
which is bad since we want parallelism, not serialization. Further, improper
use of locks can lead to deadlock, but this can be foiled either by avoidance
(don't let deadlocks occur - for example, don't allow cycles in the resource
requests), or by detection (better for things like databases, where rollback
is provided for free).
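A minimal critical-section example in C with pthreads (my illustration, not from the book): without the mutex, the two threads race on the shared counter and the final value is unpredictable; with it, each increment is atomic with respect to the other thread.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *bump(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);    /* begin critical section */
            counter++;
            pthread_mutex_unlock(&lock);  /* keep it short: long sections serialize */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, NULL);
        pthread_create(&t2, NULL, bump, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expect 2000000)\n", counter);
        return 0;
    }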
SMP
SMPs are tricky because it's really easy to mess up, or to not efficiently
utilize the CPUs. For example, if each CPU plucks jobs out of a pool of
parallel jobs, at some point adding CPUs won't help b/c the extra CPUs
will spend a good deal of time waiting for other CPUs. If all the (identical)
parallel jobs are of length P, and there are N CPUs, each of which has
to do serial work of length S in the course of doing a parallel job (i.e.,
getting locks, etc.), then once
N * S > P
there's no point in adding more CPUs, since the additional CPUs will
be bound by the amount of serial work (i.e., a CPU will finish and then wait).
How to improve? Sometimes the source data can be separated from the target
memory, thus converting an in-place operation into an out-of-place one (e.g.,
reading from the old matrix and writing into a new matrix, though this doubles
the space requirements), by alternating rows of a matrix, or by doing a
red/black checkerboard. Be careful not to let this change the algorithm,
though - Classic Parallel Error #1: parallelizing an algorithm that's different
from the one we started with. CPU performance can also be improved by 'chunking'
(see the sketch below). Note that SMP CPU swapping may assign a CPU to a
completely different job than the one it was working on, thus eliminating
any cache benefit. Most OSs support some form of processor affinity to combat
this.
Thus, to increase speed, reduce the amount of needed locking, and chunk
work together.
Load balancing may be an issue, particularly if the jobs aren't all
the same length. For SMPs, a global queue is often a good idea.
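A sketch of chunking on an SMP (my own C/pthreads example): instead of each CPU grabbing one array element at a time from a shared queue (lots of locking, poor cache locality), each thread claims a whole block of work under one lock acquisition.

    #include <pthread.h>
    #include <stdio.h>

    #define N      1000000
    #define CHUNK  10000
    #define NCPUS  4

    static double data[N];
    static long   next = 0;                 /* index of the next unclaimed element */
    static double total = 0.0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            long start = next;              /* claim a whole chunk under one lock */
            next += CHUNK;
            pthread_mutex_unlock(&lock);
            if (start >= N)
                return NULL;

            double partial = 0.0;
            long end = (start + CHUNK < N) ? start + CHUNK : N;
            for (long i = start; i < end; i++)
                partial += data[i];

            pthread_mutex_lock(&lock);      /* one more lock per chunk, not per element */
            total += partial;
            pthread_mutex_unlock(&lock);
        }
    }

    int main(void)
    {
        for (long i = 0; i < N; i++)
            data[i] = 1.0;

        pthread_t t[NCPUS];
        for (int i = 0; i < NCPUS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NCPUS; i++)
            pthread_join(t[i], NULL);

        printf("sum = %.0f (expect %d)\n", total, N);
        return 0;
    }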
Message Passing
Message passing is constrained by the fact that messages are expensive
(slow communication, lots of overhead in the OS and network). Thus, good
performance depends on one's ability to avoid frequent, small messages.
The overhead of messages often necessitates larger workload granularity.
Broadcast messages can be very useful. Some things get easier (CPUs won't
mess with each others' memory).
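A minimal MPI example in C (my illustration; build with something like mpicc and run with mpirun -np 2) showing the explicit send/receive model described above, where the per-message cost encourages coarse-grained work:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            int work = 42;
            /* Send one (deliberately coarse-grained) unit of work to node 1. */
            MPI_Send(&work, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int work;
            MPI_Recv(&work, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("node %d of %d got work item %d\n", rank, size, work);
        }

        MPI_Finalize();
        return 0;
    }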
Summary:
- SMP: get all the CPUs working, correctly. Doesn't scale really well.
- Message passing: minimize the number of messages.
- cc-NUMA: a combination of SMP and message passing: minimize remote memory accesses, and get the CPUs to work as much as possible.
Other taxonomies
Flynn: SISD (not used), MISD (not used), MIMD, SIMD. Extended by Pfister
to include SPPS and SPMD
Distributed Virtual Memory: software based VM. A specific example is
COMA, except that COMA has hardware support, and no concept of main memory
(just different levels of caches)
Declarative languages: dataflow (do work when data becomes available),
or reduction (do work when results are asked for). Somewhat related is
Linda, which allows for processes to put stuff into a global 'tuple-space'
with put(), and get stuff out via pattern matching with get() (kinda like
a funky distributed SQL).
AMO compilers / APCs : "A Miracle Occurs" / "Automatically Parallelizing
Compiler" - you feed in a serial program, and out comes a parallelized
program. Not ideal, since the really important (ie, algorithmic)
optimizations can't be done automatically, since we don't have a way of
encoding information about that in the source file (yet)
Commercial Programming Models
"Small-N" Techniques
These are applicable to small numbers of processors, the number of which
varies from architecture to architecture.
Threading: SMPs are popular with threads because threads can be stapled
to separate processors. This helps form the basis of the connection between
SMPs and client-server computing. Note that one can use kernel threads
(which can be stapled to CPUs, but consume more kernel resources), or
user-mode threads (which aren't scheduled by the kernel, but are more
lightweight). Note also that processes that can share memory can 'fake'
threads -- create a shared segment, then launch a bunch of processes that
access that shared segment, and voila! Threads (see the sketch below).
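A sketch of the "fake threads" idea in C (my own example, assuming a Unix-like system): processes that share a memory segment (here via mmap with MAP_SHARED | MAP_ANONYMOUS, followed by fork) behave much like threads sharing an address range.

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* One shared integer visible to parent and child after fork(). */
        int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (shared == MAP_FAILED) { perror("mmap"); return 1; }
        *shared = 0;

        pid_t pid = fork();
        if (pid == 0) {               /* "thread" #2: the child process */
            *shared = 123;            /* the write lands in the shared segment */
            _exit(0);
        }

        waitpid(pid, NULL, 0);
        printf("parent sees %d\n", *shared);   /* prints 123 */

        munmap(shared, sizeof(int));
        return 0;
    }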
Glossary
- APC - Automatically Parallelizing Compiler: Sometimes called an AMO ("A Miracle Occurs") compiler.
- Bandwidth: How much of something gets done in a certain amount of time.
- Cache coherence: Keeping multiple caches (in an SMP, for example) consistent, despite updates, etc.
- Cache cross-interrogation: The problem that caches have to ask each other for the most recent state of certain data.
- Cache line: A chunk of data moved from memory into a cache.
- Cache protocol: A snoopy bus, the messages sent on it, and the responses required to maintain cache coherence. Some examples: MESI (Modified, Exclusive, Shared, Invalid), Berkeley, Firefly.
- Chunking: Improving performance by making the granularity of work coarser. Instead of having the CPUs compute individual matrix entries, hand out a block of the matrix and have one CPU compute all of it. Exploits locality for caching.
- Compute-bound process: A process that does far more computation than I/O, and thus is bounded more by the computational power of the CPU than the speed of I/O. Often scientific and engineering jobs.
- Critical section: A section of code that only one process at a time is allowed to execute. Prevents race conditions.
- False sharing: Two caches get the same region of memory, modify different bytes in that region, and then try to put back the whole region; thus one will overwrite the results of the other.
- FCS - Fibre Channel Standard: A high-speed network.
- Fence (IBM Parallel Sysplex specific?): (Verb) To disconnect a suspect node from the rest of the cluster.
- I/O shipping: In a shared-nothing environment, if a node wants to read/write data from/to a disk that it doesn't own, it can 'ship' the I/O request to the node that currently owns the disk.
- Latency: How long something takes.
- MIMD - Multiple Instruction, Multiple Data: Any system that brings multiple CPUs to bear on a problem that has multiple data streams.
- MISD - Multiple Instruction, Single Data: No real examples; not a widely used term.
- MPI - Message Passing Interface: A means of creating parallel/cluster programs. Perhaps used by Beowulf, the Linux cluster thingee.
- MPP - Massively Parallel Processing
- Multitailed disks: A hard disk such that multiple nodes of a cluster can access it.
- NORMA - NO Remote Memory Access: Computers that communicate by message passing, rather than sharing memory. See also NUMA, UMA.
- NUMA - NonUniform Memory Access: Comes in both CC-NUMA (cache coherent) and NC-NUMA (noncoherent cache), and generally refers to computer systems with memories connected in such a way that any CPU can access any location in memory as if it were local. Accessing remote memory takes longer than local memory (thus the "nonuniform"). See also UMA, NORMA. A.k.a. "Scalable Shared Memory Multiprocessing" (S2MP) by SGI.
- Processor affinity: In an SMP, processes can be migrated from one CPU to another, even as a matter of simple scheduling. This decreases the usefulness of the cache. Processor affinity means that CPUs tend to get assigned the jobs they were just working on (where possible).
- "Processor bound" process: A term coined by Gregory Pfister, referring to a process in an SMP setting that not only is compute bound, but also doesn't access memory that much, and thus can pretty much run at full speed from what's in its cache.
- PVM - Parallel Virtual Machine: Created by Oak Ridge National Laboratory and the University of Tennessee as a means of easily creating parallel programs. Similar to MPI.
- Race condition: In concurrent programming, when the behavior of the program depends on the relative speeds of two processes. Shouldn't happen; indicates that the processes aren't synchronized properly.
- SAN - System Area Network: A high-speed (high-bandwidth), low-overhead, possibly high-reliability network for connecting peripherals to CPUs.
- SCI - Scalable Coherent Interconnect: (See the CC-NUMA section above.)
- Sequential consistency: For two programs being executed on two processors, the actual order of execution of the instructions of the two programs can be interleaved arbitrarily, but the order of the instructions within each program/process can't be reordered.
- "Shared nothing": Within a cluster, the nodes/machines never share resources - at any given point in time, only one machine owns (and thus is allowed to use) any given resource. There's no reason why the resource can't be given to another machine later, though.
- SIMD - Single Instruction, Multiple Data: A system built with a central control unit and a bunch of operational units. The central unit issues the same command to all operational units, thus achieving some parallelism. Typically limited to specialized machines.
- SISD - Single Instruction, Single Data: A von Neumann machine. Not a widely used term.
- SPMD - Single Program, Multiple Data: A single program, copies of which execute on multiple processors, each on its own data (listed by Pfister as an extension to Flynn's taxonomy).
- Snoopy bus: A cache coherence scheme in SMPs: when a cache miss occurs, broadcast the request to all the other caches and main memory. If another cache returns the data first, then cancel the main memory request and use it; otherwise use what main memory gives the cache.
- SPPS: Serial Program, Parallel Subsystem.
- Standard Litany, The: The mantra chanted in support of clusters/distributed/parallel computing (better than traditional, monolithic, single computers):
  - Better performance
  - Better price/performance ratio
  - Scaling
  - Incremental growth
  - Higher availability: this is increasingly important, thus clusters
  With the addition of the following, one gets the Enhanced Litany:
  - Scavenging of unused CPU cycles
- UMA - Uniform Memory Access: SMP-type machines, where all memory takes pretty much the same amount of time to access. See also NUMA, NORMA.
- UMA - Unified Memory Architecture: Means that the graphics memory is part of main memory, as opposed to being separate. First popularized by the Apple II. Unrelated to Uniform Memory Access.
- Von Neumann bottleneck: A von Neumann machine separates the CPU from the memory, placing all the data in the memory of the computer. Simply putting in more CPUs won't necessarily boost performance, since the rate at which data can be retrieved from memory becomes the (von Neumann) bottleneck.
Stuff To Look Into:
- MPI, as it's used in a number of different areas for parallel programming.
- Pg. 123: "The program goes into system state with changing address space, something that is possible in OS/390, Windows NT, VMS, and some other operating systems, but not UNIX". This is used to do efficient polling (instead of interrupts).
- Process migration; the Sprite OS:
  Computer, 21(2):23-26, February 1988
  Computing Systems, Fall 1991, 4(4):353-383
- The LOCUS Distributed System Architecture
- Why do cc-NUMA systems purge cache lines from other systems when updates are done, instead of updating them?
- HiveOS: Fault Containment for Shared Memory Multiprocessors, 15th Symposium on Operating Systems Principles, ACM, December 1995
- Almasi & Gottlieb, "Highly Parallel Computing, 2nd Edition", for wait-free shared data structures
Notes written by Michael Panitz. Copyright, All rights reserved, 1998.
You're free to use this page, just not for profit. Not that I'm expecting
anyone to, but I may as well put this here, just in case