In Search Of Clusters
Acknowledgements
The following are notes that I took on the book "In Search Of Clusters:
The Ongoing Battle In Lowly Parallel Computing" (2nd Edition, published
by Prentice Hall PTR, Copyright 1998) by Gregory F. Pfister. These notes
are my work, but they are notes on Mr. Pfister's book, and thus should
not be taken to be my ideas (note also that I don't hold copyrights/trademarks
on anything mentioned herein).
Cluster Concepts
A type of parallel or distributed system that:
- Consists of a collection of interconnected whole computers
- Can be used as a single, unified computing resource
Why are clusters becoming more important? See the Standard
Litany for generic reasons, plus recent changes in hardware and software:
- Fat boxes: high-performance microprocessors
- Fat pipes: high-performance networks
- Thick glue: standard tools for distributed computing
Why aren't clusters taking over everything?
- Hard to administer
- Limited exploitation: there isn't much software that uses clusters
Clusters can be (somewhat arbitrarily) categorized along a number of orthogonal axes:
- Intracluster communication: Is the communication through a public network, or through a physically isolated cluster network? Note that ATM switches can be configured to provide the illusion that the cluster has a private network despite the switches carrying 'public' traffic.
- Dedicated compute nodes: Are the nodes in the cluster dedicated to work done by the cluster ("glass house"), or are they users' workstations, used when the users aren't there ("campus-wide")? Campus-wide clusters are more complex, but allow for scavenging of unused CPU cycles.
- Point of connection: Are the nodes of the cluster connected at the I/O subsystem (easier, but slower/less bandwidth because of having to go through the network adaptor), or at the memory subsystem? Additionally, do the nodes share (concurrently use) resources (like disks and memory), or share nothing (which implies message passing)?
- Communication needs: Are the jobs compute-intensive, serial jobs (SPPS) with low I/O requirements, or I/O-intensive jobs (which are more typical of commercial work)?
- Level of software support: Does the clustering software provide low-level support (i.e., fast, low-overhead communication, like SANs or the Virtual Interface Architecture), or higher-level support (locks, data structures, group membership, like IBM's Parallel Sysplex)?
Cluster, SMP, NUMA Hardware
Cluster Hardware
There are a number of technologies used, some of which are SANs, VIA, and
the Coupling Facility of IBM's Parallel Sysplex.
SAN: System Area Network
Originally created by Tandem in their ServerNet product, SANs are specialized
networks that attach directly to the system bus, thus allowing for many
PCI buses (each with a small number of peripherals). This is better than
chaining I/O buses, and it gives much higher bandwidth (since ServerNet
attaches to the system bus and doesn't have to go through an adaptor or
the OS) and a reliable network (meaning less overhead in the protocol).
SANs are often attached more mundanely via a PCI adaptor, and thus aren't
entirely as cool as ServerNet.
Virtual Interface Architecture
U-Net commercialized, without having to pay some university. Used to
eliminate OS overhead by moving as much of the networking stack into user
space as possible, by:
- Zero-copy messaging: scatter/gather lists; asynchronous send, with the message buffer left untouched until the send completes
- No interrupts: processes can poll without doing a context switch (which is good, since the network is often extremely fast); multiple messages can be received at once
- Virtualized user mode: there's no stack necessarily associated with the interface - thus any protocol can be virtualized, so long as the network adaptor is sufficiently advanced.
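To make the "poll instead of taking an interrupt" idea concrete, here is a minimal C sketch. The names (vi_completion, fake_adaptor) are made up for illustration and are not the real VI Architecture API; a second thread stands in for the network adaptor writing a completion entry that the application notices purely by polling, with no syscall or context switch.

    /* Hypothetical sketch of VIA-style user-mode polling (names are invented). */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    struct vi_completion {
        volatile int ready;      /* set by the "adaptor" when a message has landed */
        char         buffer[64]; /* user-space buffer filled in place (zero copy)  */
    };

    static struct vi_completion cq[8];   /* completion queue shared with the adaptor */

    static void *fake_adaptor(void *arg)
    {
        (void)arg;
        strcpy(cq[0].buffer, "hello from the wire");
        cq[0].ready = 1;                  /* the equivalent of the DMA completing */
        return NULL;
    }

    int main(void)
    {
        pthread_t nic;
        pthread_create(&nic, NULL, fake_adaptor, NULL);

        while (!cq[0].ready)
            ;                             /* poll in user space: no interrupt, no context switch */

        printf("received: %s\n", cq[0].buffer);
        pthread_join(nic, NULL);
        return 0;
    }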
DEC Memory Channel
A hardware-supported form of distributed shared memory for a local cluster.
It operates by exploiting a bus's memory-mapping ability (i.e., write to
this memory location, and have the bus shunt that write to a device), and
by providing Memory Channel (MC) adaptors which provide additional memory
mapping. What happens is that a node writes to a memory-mapped location,
and the write gets sent (via the MC adaptor) to a hub, which broadcasts
it to all nodes that have registered (with OS assistance) for this memory
region (including the node that sent the write). Each receiving MC adaptor
writes the data into the local node's memory (if the node didn't register
for this memory region, the write is ignored), thus allowing for global
memory (formerly called "Reflective Memory"). This is a very low-level
facility, which actually isn't recommended for use in a shared-memory style
of programming: DEC itself provides MPI and PVM libraries to actually do
something with Memory Channel.
IBM Parallel Sysplex: Coupling Facility
The Coupling Facility is a high-level facility, which provides commands
to do things like node registration and deregistration (and fencing suspect
nodes), global queues, global locks, cluster-wide caching, as well as
combinations of the above. The CF is layered over "CF Links", which actually
shuffle the commands (and the data needed to carry them out) around.
SMP: Symmetric MultiProcessors
Sometimes referred to as "Tightly Coupled Multiprocessors", though SMP is
a better summary. An SMP is both a hardware and a software model: there
are multiple CPUs, but only one of every other subsystem (there's one I/O
subsystem, one memory subsystem, etc.). Each CPU can do anything any of
the others can, meaning that an architecture with only one CPU able to do
I/O isn't an SMP. Note that this also means that there can't be memory
local to individual CPUs. It doesn't preclude memory caches (since they're
not supposed to alter the view of memory). Note that this architecture
is of somewhat limited scalability, since I/O (and, more importantly, memory)
contention will reduce performance. It does tend to work well for compute-bound
(and better still for "processor bound") processes, though.
Since main memory is so incredibly slow (relative to the CPU), CPUs
use caches to increase the execution speed. Note that there are tricks
that can be used to increase the memory bandwidth (for example, using separate
memory banks so that the memory subsystem can field more requests per unit
time). Note also that in an SMP setting, using a crossbar memory switch
allows any CPU to ask any memory bank for data, simultaneously. SMPs
have to worry about maintaining coherence among these caches so that different
CPUs don't see different views of memory because of their (local) caches.
If a process is migrated from one CPU to another, the memory must look the
same on both CPUs. For parallel processes the shared memory must be consistent.
Lastly, the SMP must be careful not to allow two CPUs to each cache the
same X bytes, each modify different bytes within that chunk, and then have
one accidentally overwrite the other's changes when writeback occurs. There
are a couple of SMP ways to solve this problem: a central directory and a
snoopy bus. Note that we could opt for a non-SMP solution by having CPU
instructions to flush/purge the cache, but this would violate the SMP model
(one consistent memory should be seen, but this makes the caches visible).
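The "two CPUs caching the same chunk but using different bytes" problem shows up in software as false sharing. Here is a small pthreads sketch (my own illustration, not from the book; the 64-byte cache-line size is an assumption) of the usual software-side fix: pad each counter out to its own line so the line doesn't ping-pong between caches.

    /* Two threads increment two different counters.  If the counters shared a
     * cache line, the line would bounce between the CPUs' caches; padding
     * gives each counter its own line. */
    #include <pthread.h>
    #include <stdio.h>

    #define CACHE_LINE 64                 /* assumption: 64-byte cache lines */

    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];  /* keep neighbours in separate lines */
    };

    static struct padded_counter counters[2];

    static void *worker(void *arg)
    {
        struct padded_counter *c = arg;
        for (long i = 0; i < 100000000L; i++)
            c->value++;                   /* touches only its own cache line */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &counters[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("%ld %ld\n", counters[0].value, counters[1].value);
        return 0;
    }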
Central Directory
Also known as a "storage control unit" (in mainframes), this manages
all reads/writes amongst the caches, simultaneously. Can be really fast,
but doesn't scale. Not used in the general case
Snoopy bus
Each CPU broadcasts requests for data on a bus (when a cache miss occurs),
and if another cache has the data, it returns it. Otherwise main memory
does. This can be optimized by reducing the penalty for going to other
caches (e.g., CPUs can also share level 2 caches), and by reducing the
amount of traffic on the bus (for example, if the data came from main memory,
nobody else has it, so don't broadcast writes to it). Can also reduce bus
traffic by using a crossbar memory switch to transfer the actual data,
and the bus just for broadcast updates.
With the above, we've accomplished cache coherency, and thus allowed
access to the memory subsystem at a reasonable rate, with a consistent
view of it. The next thing to address is sequential consistency.
Two processes executing on two separate processors are sequentially consistent
if the instructions from the two programs are interleaved in some way.
The interleaving can be arbitrary, except that the ordering of instructions
within each process can't be changed. I.e., instruction 1 of process 1 must
always execute before instruction 2 of process 1, regardless of when
instructions in process 2 execute. Since CPUs are often pipelined, and allow
for reordering of instructions (for example, by delaying store instructions
in a 'pending store buffer'), this can cause problems. For machines with
special instructions to facilitate locking, these instructions can also be
used to ensure that the CPU itself isn't holding on to any particular state.
Pfister recommends this b/c of the locality of a lock -- unlike a cache, a
lock is isolated. Plus, many machines provide such instructions anyways, so
it won't break the SMP model.
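A small C/pthreads sketch of why this matters (my own illustration, under the assumption that the lock implementation issues whatever ordering/barrier instructions the machine needs): a writer sets some data and then a flag; without ordering, a pending-store buffer could make the flag visible before the data.

    /* Thread A writes data then a flag; thread B waits for the flag and reads
     * the data.  Wrapping both accesses in a lock forces the order the
     * programmer expects, even on a CPU that delays stores. */
    #include <pthread.h>
    #include <stdio.h>

    static int data = 0;
    static int flag = 0;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    static void *writer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&m);   /* the lock plays the role of the "special */
        data = 42;                /* instruction" that orders pending stores  */
        flag = 1;
        pthread_mutex_unlock(&m);
        return NULL;
    }

    static void *reader(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&m);
            int ready = flag, value = data;
            pthread_mutex_unlock(&m);
            if (ready) {          /* with the lock, flag==1 guarantees data==42 */
                printf("read %d\n", value);
                return NULL;
            }
        }
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, writer, NULL);
        pthread_create(&b, NULL, reader, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }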
I/O for SMPs
Note that I/O devices must also participate in cache coherency, especially
any sort of DMA thingee, so they get the right data/update it correctly.
It's not unreasonable (for uniprocessors, anyways) to use flush/purge type
instructions within the device driver to get this to work properly.
NUMA
NUMA stands for "NonUniform Memory Access", and refers to machines that
are coupled at the memory level: any machine can access the memory of any
other machine as if the remote memory were local. The remote memory will
(of course) take longer to access, hence the "nonuniform" part of NUMA.
If this delay is fairly small, then the SMP programming model is preserved.
If this delay becomes significant, then programs have to be structured
to compensate for it in order to achieve reasonable levels of performance.
Note that the memory sharing is usually accomplished by mapping the memory
of each node into a global address space. Thus, memory location X means the
same thing to all nodes: a specific memory location on a specific machine.
In fact, NUMA machines will run a single copy of the operating system across
all of the nodes. Because of this, CC-NUMA systems don't offer the high
availability that clusters do; it's also tough to offer continuous availability
(i.e., upgrading one node while the rest continue to service requests). Next-generation
NUMA systems are trying to offer such high availability by allowing
individual nodes to fail, but this makes them essentially clusters with
a snazzy memory interconnect. NUMA can be subdivided into cache coherent
(CC-NUMA) and noncoherent cache (NC-NUMA), depending on whether the system
keeps the CPUs' views of memory consistent or not.
CC-NUMA
Internally, the (possibly SMP) machine behaves the same. However, on
the memory bus there exists an extra component (called a directory) which
acts as a surrogate in accessing remote memory. As per the snoopy bus algorithm,
when a location is requested and the memory is remote (and the caches don't
have it), main memory won't have it either, but the directory box will notice
that it knows where the data is, and request the data from the appropriate
machine's memory. If that machine (the 'home' machine) has already given
out that data, it may have to go and get it back (unless, for example,
both machines request the data in read-only mode); note that this requires
the home machine to keep a list of machines that have copies of the data.
What all this work accomplishes is keeping the view of memory consistent,
maintaining this coherency even in the face of caches on different machines.
NC-NUMA
Noncoherent cache NUMA machines allow any machine to access any memory,
but don't do any work to maintain a consistent view of memory. For example,
the DEC Memory Channel is (kinda) like this, in that it allows one to share
memory between machines, but does nothing to prevent different machines
from making concurrent changes to their private copies of some global data.
(CC-)NUMA requires changes in not only the hardware, but also the software
that supports it -- for example, the memory allocation routines must be
modified to preferentially allocate memory 'local' to processes. Likewise,
processes can't be migrated to any old CPU, since there's now the issue
of how long it will take to access the memory. Virtual memory systems may
also need to be changed.
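As a concrete illustration of "allocate memory local to the process" (my own example, assuming a Linux box with the libnuma library installed; this is not how the book's systems did it):

    /* NUMA-aware allocation with Linux's libnuma (link with -lnuma). */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "not a NUMA machine (or libnuma unsupported)\n");
            return 1;
        }

        size_t size = 1 << 20;
        /* Allocate on the node the calling thread is currently running on. */
        void *local = numa_alloc_local(size);
        /* For contrast: explicitly allocate on the last (likely remote) node. */
        void *remote = numa_alloc_onnode(size, numa_max_node());

        printf("nodes 0..%d, local=%p remote=%p\n", numa_max_node(), local, remote);

        numa_free(local, size);
        numa_free(remote, size);
        return 0;
    }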
Arguments for NUMA
- Memory speed: NUMA is a way of accessing lots of memory at pretty much the same speed (or at least local memory at one speed, remote memory at a slower speed). This gets around the von Neumann bottleneck by increasing the bandwidth between processors and main memory, especially since CPUs tend to access memory in a 'bursty' fashion.
- Production volumes: As more vendors produce NUMA machines, NUMA will acquire critical mass, thus getting cheaper and more popular.
- Demand for parallelism: There exist people who want hugely parallel systems.
- Wave of the future: NUMA hype from the industry, and people wanting to jump on the 'wave of the future' bandwagon, will create/increase critical mass.
- Overwhelming memory argument: The speed of interconnects, hardware, etc., to support NUMA will increase, thus making remote memory 'closer'. Access speeds of main memory (relative to CPU speeds) will at best stay the same, meaning that the overwhelming cost in NUMA will be accessing the memory (local or remote). Indeed, as memory access speeds get slower (relative to CPU speeds), it won't matter much whether the memory is local or remote, since the memory access time will dominate the whole thing anyway. Which means that NUMA is very useful, and not much slower than regular memory access anyway.
SCI: Scalable Coherent Interconnect
Consists of a number of nodes connected by SCI links (in a ring topology),
with one of those directory boxes attached to the memory bus of each node.
The defining feature (from a CC-NUMA point of view) is that it maintains
a doubly linked list to keep track of who's got remote memory. This linked
list is stored in the caches of the machines that hold the remote data,
so that when a machine receives a cache line from a remote source, it's
also told the next and previous machines that hold the same line. Doing
things like purging a line from all the machines (e.g., when someone wants
to update the line) is thus slow, but this scheme scales darn well.
DASH: Directory Architecture for SHared memory
To keep track of which nodes have a copy of a cache line, DASH uses a bit
vector stored on the home machine of the cache line. Since all the data
is stored in one place, purge requests can be broadcast (which is quick),
but the scalability is limited b/c of the vector length. This was revised
and made into FLASH (FLexible Architecture for SHared memory), which still
keeps the 'checked out' list on the home machine of a cache line, but keeps
it in the form of a linked list, so that it can scale almost arbitrarily.
s-COMA: simple Cache Only Memory Architecture
This is kind of a variation on CC-NUMA, in that it uses main memory
as a cache for remote memory, and ties it to the virtual memory subsystem.
It works to overcome cache misses caused by lack of capacity.
When a memory page is first accessed, the mapper (the analog of the directory
box in CC-NUMA) gets told which machine the page resides on (the 'home
machine'), and the address requested. It then creates a page in memory on
the local machine to hold any data obtained from the remote machine, and
thereafter retrieves cache lines from the remote machine into the local page
via standard CC-NUMA-type algorithms. Thus, s-COMA uses memory to hold
cache lines from pages that it's using, meaning that memory has become
a huge cache for remote memory. The home machine keeps track of who's got
copies so that updates and whatnot can be done. Note that if one machine
uses a remote memory page extensively, the cache lines from that page pile
up in the local page, and so the remote page slowly migrates to the machine
that uses it most. For this reason, the memory is sometimes called 'attractive
memory'. The downside of s-COMA is that it's easy for multiple copies of
pages to proliferate.
Software Issues
Workload Types
(Area / Name: Description)
- Serial / Batch: To increase throughput (i.e., more jobs per unit time). SPPS - requires no modification of software. I/O may be an issue, particularly on SMPs and if the data needs to be shuffled between nodes.
- Serial / Interactive - Applications: E.g., databases and web servers. Appear to be serial, but exploit parallelism.
- Serial / Interactive - Native login: Not much good for highly interactive tasks (like editing), but really good for 'steering' huge tasks -- periodically examining the results, and tweaking/restarting things if they're not going well.
- Serial / Multijob parallel: Lots of separate jobs that depend on each other. For example, a simulation composed of multiple processes, where some steps should be rechecked.
- Parallel (huge) / Grand Challenge: Huge tasks that are of scientific/commercial value that can't be done with anything smaller.
- Parallel (huge) / Research: Parallelism is cool & grad students are cheap.
- Parallel (huge) / Heavy-duty commercial: Stuff like data mining, where such a huge volume of data needs to be treated that parallelism is needed.
- Parallel (small) / Commercial: Stuff like databases, where it's possible to sell it b/c the performance requirements are that demanding. Also stuff like fast ray-tracing for computer graphics (a la Toy Story).
- Parallel (small) / Technical: Parallel compilers, aircraft simulators, and nuclear testing programs.
Amdahl's Law
(total execution time) = (parallel part, which shrinks as processors are added) + (serial part, which doesn't).
Thus, even as the parallel part gets really small, the serial part will stay
the same, and will come to dominate execution time. So why bother with
parallelism, since its improvements are necessarily limited? The point is
that in order to improve system performance, we've got to improve the hardware,
the operating system, the middleware (database, communication, etc.), and
the application. If any one isn't tuned, the whole thing falls flat.
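A quick numeric sketch of the limit (my own example, assuming a job that is 95% parallelizable): even with unlimited CPUs, the speedup flattens out near 1/0.05 = 20x.

    /* Amdahl's Law: total time = serial part + parallel part / N. */
    #include <stdio.h>

    int main(void)
    {
        double serial = 0.05, parallel = 0.95;   /* fractions of the original run time */
        int cpus[] = {1, 2, 4, 8, 16, 64, 1024};

        for (int i = 0; i < 7; i++) {
            double t = serial + parallel / cpus[i];
            printf("N=%4d  time=%.4f  speedup=%.1fx\n", cpus[i], t, 1.0 / t);
        }
        return 0;
    }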
Programming Models
General Concurrency Issues
A lot of this stuff is old hat from my OS class, and thus will be covered in
even less detail than the book did. Race conditions can arise from improper
synchronization. Excessive use of critical sections leads to serialization,
which is bad since we want parallelism, not serialization. Further, improper
use of locks can lead to deadlock, but this can be foiled either by avoidance
(don't let deadlocks occur - for example, don't allow cycles in the resource
requests), or by detection (better for things like databases, where rollback
is provided for free).
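A minimal critical-section example in C with pthreads (my illustration, not from the book): without the mutex, the two threads race on the shared counter and the final value is unpredictable; with it, each increment is atomic with respect to the other thread.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *bump(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);    /* begin critical section */
            counter++;
            pthread_mutex_unlock(&lock);  /* keep it short: long sections serialize */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, NULL);
        pthread_create(&t2, NULL, bump, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expect 2000000)\n", counter);
        return 0;
    }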
SMP
SMPs are tricky because it's really easy to mess up, or to not efficiently
utilize the CPUs. For example, if each CPU plucks jobs out of a pool of
parallel jobs, at some point adding CPUs won't help b/c the extra CPUs
will spend a good deal of time waiting for other CPUs. If all the (identical)
parallel jobs are of length P, and there are N CPUs, each of which has
to do serial work of length S in the course of doing a parallel job (i.e.,
getting locks, etc.), then once
N * S > P
there's no point in adding more CPUs, since the additional CPUs will
be bound by the amount of serial work (i.e., a CPU will finish and then wait).
How to improve? Sometimes the source data can be separated from the target
memory, thus converting an in-place operation into an out-of-place one (e.g.,
reading from the old matrix and writing into a new matrix, though this doubles
the space requirements), by alternating rows of a matrix, or by doing a
red/black checkerboard. Be careful not to let this change the algorithm,
though - Classic Parallel Error #1: parallelizing an algorithm that's different
from the one we started with. CPU performance can also be improved by 'chunking'
(see the sketch below). Note that SMP CPU swapping may assign a CPU to a
completely different job than the one it was working on, thus eliminating
any cache benefit. Most OSs support some form of processor affinity to combat
this.
Thus, to increase speed, reduce the amount of needed locking, and chunk
work together.
Load balancing may be an issue, particularly if the jobs aren't all
the same length. For SMPs, a global queue is often a good idea.
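A sketch of chunking on an SMP (my own C/pthreads example): instead of each CPU grabbing one array element at a time from a shared queue (lots of locking, poor cache locality), each thread claims a whole block of work under one lock acquisition.

    #include <pthread.h>
    #include <stdio.h>

    #define N      1000000
    #define CHUNK  10000
    #define NCPUS  4

    static double data[N];
    static long   next = 0;                 /* index of the next unclaimed element */
    static double total = 0.0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            long start = next;              /* claim a whole chunk under one lock */
            next += CHUNK;
            pthread_mutex_unlock(&lock);
            if (start >= N)
                return NULL;

            double partial = 0.0;
            long end = (start + CHUNK < N) ? start + CHUNK : N;
            for (long i = start; i < end; i++)
                partial += data[i];

            pthread_mutex_lock(&lock);      /* one more lock per chunk, not per element */
            total += partial;
            pthread_mutex_unlock(&lock);
        }
    }

    int main(void)
    {
        for (long i = 0; i < N; i++)
            data[i] = 1.0;

        pthread_t t[NCPUS];
        for (int i = 0; i < NCPUS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NCPUS; i++)
            pthread_join(t[i], NULL);

        printf("sum = %.0f (expect %d)\n", total, N);
        return 0;
    }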
Message Passing
Message passing is constrained by the fact that messages are expensive
(slow communication, lots of overhead in the OS and network). Thus, good
performance depends on one's ability to avoid frequent, small messages.
The overhead of messages often necessitates larger workload granularity.
Broadcast messages can be very useful. Some things get easier (CPUs won't
mess with each others' memory).
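A minimal MPI example in C (my illustration; build with something like mpicc and run with mpirun -np 2) showing the explicit send/receive model described above, where the per-message cost encourages coarse-grained work:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            int work = 42;
            /* Send one (deliberately coarse-grained) unit of work to node 1. */
            MPI_Send(&work, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int work;
            MPI_Recv(&work, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("node %d of %d got work item %d\n", rank, size, work);
        }

        MPI_Finalize();
        return 0;
    }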
Summary:
- SMP: get all the CPUs working, correctly. Doesn't scale really well.
- Message passing: minimize the number of messages.
- cc-NUMA: a combination of SMP and message passing: minimize remote memory accesses, and get the CPUs to work as much as possible.
Other taxonomies
Flynn: SISD (not used), MISD (not used), MIMD, SIMD. Extended by Pfister
to include SPPS and SPMD
Distributed Virtual Memory: software based VM. A specific example is
COMA, except that COMA has hardware support, and no concept of main memory
(just different levels of caches)
Declarative languages: dataflow (do work when data becomes available),
or reduction (do work when results are asked for). Somewhat related is
Linda, which allows for processes to put stuff into a global 'tuple-space'
with put(), and get stuff out via pattern matching with get() (kinda like
a funky distributed SQL).
AMO compilers / APCs : "A Miracle Occurs" / "Automatically Parallelizing
Compiler" - you feed in a serial program, and out comes a parallelized
program. Not ideal, since the really important (ie, algorithmic)
optimizations can't be done automatically, since we don't have a way of
encoding information about that in the source file (yet)
Commercial Programming Models
"Small-N" Techniques
These are applicable to small numbers of processors, the number of which
varies from architecture to architecture.
Threading: SMPs are popular with threads because threads can be stapled
to separate processors. This helps form the basis of the connection between
SMPs and client-server computing. Note that one can use kernel threads
(which can be stapled to CPUs, but consume more kernel resources), or
user-mode threads (which aren't scheduled by the kernel, but are more
lightweight). Note also that processes that can share memory can 'fake'
threads -- create a shared segment, then launch a bunch of processes that
access that shared segment, and voila! Threads (see the sketch below).
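A sketch of the "fake threads" idea in C (my own example, assuming a Unix-like system): processes that share a memory segment (here via mmap with MAP_SHARED | MAP_ANONYMOUS, followed by fork) behave much like threads sharing an address range.

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* One shared integer visible to parent and child after fork(). */
        int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (shared == MAP_FAILED) { perror("mmap"); return 1; }
        *shared = 0;

        pid_t pid = fork();
        if (pid == 0) {               /* "thread" #2: the child process */
            *shared = 123;            /* the write lands in the shared segment */
            _exit(0);
        }

        waitpid(pid, NULL, 0);
        printf("parent sees %d\n", *shared);   /* prints 123 */

        munmap(shared, sizeof(int));
        return 0;
    }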
Glossary
- APC - Automatically Parallelizing Compiler: Sometimes called an AMO ("A Miracle Occurs") compiler.
- Bandwidth: How much of something gets done in a certain amount of time.
- Cache coherence: Keeping multiple caches (in an SMP, for example) consistent, despite updates, etc.
- Cache cross-interrogation: The problem that caches have to ask each other for the most recent state of certain data.
- Cache line: A chunk of data moved from memory into a cache.
- Cache protocol: A snoopy bus, the messages sent on it, and the responses required to maintain cache coherence. Some examples: MESI (Modified, Exclusive, Shared, Invalid), Berkeley, Firefly.
- Chunking: Improving performance by making the granularity of work coarser. Instead of having the CPUs compute individual matrix entries, hand out a block of the matrix and have one CPU compute all of it. Exploits locality for caching.
- Compute-bound process: A process that does far more computation than I/O, and thus is bounded more by the computational power of the CPU than the speed of I/O. Often scientific and engineering jobs.
- Critical section: A section of code that only one process at a time is allowed to execute. Prevents race conditions.
- False sharing: Two caches get the same region of memory, modify different bytes in that region, and then try to put back the whole region; thus one will overwrite the results of the other.
- FCS - Fibre Channel Standard: A high-speed network.
- Fence (IBM Parallel Sysplex specific?): (Verb) To disconnect a suspect node from the rest of the cluster.
- I/O shipping: In a shared-nothing environment, if a node wants to read/write data from/to a disk that it doesn't own, it can 'ship' the I/O request to the node that currently owns the disk.
- Latency: How long something takes.
- MIMD - Multiple Instruction, Multiple Data: Any system that brings multiple CPUs to bear on a problem that has multiple data streams.
- MISD - Multiple Instruction, Single Data: No real examples; not a widely used term.
- MPI - Message Passing Interface: A means of creating parallel/cluster programs. Perhaps used by Beowulf, the Linux cluster thingee.
- MPP - Massively Parallel Processing
- Multitailed disks: A hard disk such that multiple nodes of a cluster can access it.
- NORMA - NO Remote Memory Access: Computers that communicate by message passing, rather than sharing memory. See also NUMA, UMA.
- NUMA - NonUniform Memory Access: Comes in both CC-NUMA (cache coherent) and NC-NUMA (noncoherent cache), and generally refers to computer systems with memories connected in such a way that any CPU can access any location in memory as if it were local. Accessing remote memory takes longer than local memory (thus the "nonuniform"). See also UMA, NORMA. A.k.a. "Scalable Shared Memory Multiprocessing" (S2MP) by SGI.
- Processor affinity: In an SMP, processes can be migrated from one CPU to another, even as a matter of simple scheduling. This decreases the usefulness of the cache. Processor affinity means that CPUs tend to get assigned the jobs they were just working on (where possible).
- "Processor bound" process: A term coined by Gregory Pfister, referring to a process in an SMP setting that not only is compute bound, but also doesn't access memory that much, and thus can pretty much run at full speed from what's in its cache.
- PVM - Parallel Virtual Machine: Created by Oak Ridge National Laboratory and the University of Tennessee as a means of easily creating parallel programs. Similar to MPI.
- Race condition: In concurrent programming, when the behavior of the program depends on the relative speeds of two processes. Shouldn't happen; indicates that the processes aren't synchronized properly.
- SAN - System Area Network: A high-speed (high-bandwidth), low-overhead, possibly high-reliability network for connecting peripherals to CPUs.
- SCI - Scalable Coherent Interconnect: (See the CC-NUMA section above.)
- Sequential consistency: For two programs being executed on two processors, the actual order of execution of the instructions of the two programs can be interleaved arbitrarily, but the order of the instructions within each program/process can't be reordered.
- "Shared nothing": Within a cluster, the nodes/machines never share resources - at any given point in time, only one machine owns (and thus is allowed to use) any given resource. There's no reason why the resource can't be given to another machine later, though.
- SIMD - Single Instruction, Multiple Data: A system built with a central control unit and a bunch of operational units. The central unit issues the same command to all operational units, thus achieving some parallelism. Typically limited to specialized machines.
- SISD - Single Instruction, Single Data: A von Neumann machine. Not a widely used term.
- SPMD - Single Program, Multiple Data: A single program, copies of which execute on multiple processors, each on its own data (listed by Pfister as an extension to Flynn's taxonomy).
- Snoopy bus: A cache coherence scheme in SMPs: when a cache miss occurs, broadcast the request to all the other caches and main memory. If another cache returns the data first, then cancel the main memory request and use it; otherwise use what main memory gives the cache.
- SPPS: Serial Program, Parallel Subsystem.
- Standard Litany, The: The mantra chanted in support of clusters/distributed/parallel computing (better than traditional, monolithic, single computers):
  - Better performance
  - Better price/performance ratio
  - Scaling
  - Incremental growth
  - Higher availability: this is increasingly important, thus clusters
  With the addition of the following, one gets the Enhanced Litany:
  - Scavenging of unused CPU cycles
- UMA - Uniform Memory Access: SMP-type machines, where all memory takes pretty much the same amount of time to access. See also NUMA, NORMA.
- UMA - Unified Memory Architecture: Means that the graphics memory is part of main memory, as opposed to being separate. First popularized by the Apple II. Unrelated to Uniform Memory Access.
- Von Neumann bottleneck: A von Neumann machine separates the CPU from the memory, placing all the data in the memory of the computer. Simply putting in more CPUs won't necessarily boost performance, since the rate at which data can be retrieved from memory becomes the (von Neumann) bottleneck.
Stuff To Look Into:
- MPI, as it's used in a number of different areas for parallel programming.
- Pg. 123: "The program goes into system state with changing address space, something that is possible in OS/390, Windows NT, VMS, and some other operating systems, but not UNIX". This is used to do efficient polling (instead of interrupts).
- Process migration; the Sprite OS:
  Computer, 21(2):23-26, February 1988
  Computing Systems, Fall 1991, 4(4):353-383
- The LOCUS Distributed System Architecture
- Why do cc-NUMA systems purge cache lines from other systems when updates are done, instead of updating them?
- HiveOS: Fault Containment for Shared Memory Multiprocessors, 15th Symposium on Operating Systems Principles, ACM, December 1995
- Almasi & Gottlieb, "Highly Parallel Computing, 2nd Edition", for wait-free shared data structures
Notes written by Michael Panitz. Copyright, All rights reserved, 1998.
You're free to use this page, just not for profit. Not that I'm expecting
anyone to, but I may as well put this here, just in case