KHB: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

February 20, 2008

This article was contributed by Valerie Aurora

When I was but a wee computer science student at New Mexico Tech, a graduate student in OS handed me an inch-thick print-out and told me that if I was really interested in operating systems, I had to read this. It was something about a completely lock-free operating system optimized using run-time code generation, written from scratch in assembly running on a homemade two-CPU SMP with a two-word compare-and-swap instruction - you know, nothing fancy. The print-out I was holding was Alexia Massalin's PhD thesis, Synthesis: An Efficient Implementation of Fundamental Operating Systems Services (html version here). Dutifully, I read the entire 158 pages. At the end, I realized that I understood not a word of it, right up to and including the cartoon of a koala saying "QUA!" at the end. Okay, I exaggerate - lock-free algorithms had been a hobby of mine for the previous few months - but the main point I came away with was that there was a lot of cool stuff in operating systems that I had yet to learn.

Every year or two after that, I'd pick up my now bedraggled copy of "Synthesis" and reread it, and every time I would understand a little bit more. First came the lock-free algorithms, then the run-time code generation, then quajects. The individual techniques were not always new in and of themselves, but in Synthesis they were developed, elaborated, and implemented throughout a fully functioning UNIX-style operating system. I still don't understand all of Synthesis, but I understand enough now to realize that my grad student friend was right: anyone really interested in operating systems should read this thesis.

Run-time code generation

The name "Synthesis" comes from run-time code generation - code synthesis - used to optimize and re-optimize kernel routines in response to changing conditions. The concept of optimizing code during run-time is by now familiar to many programmers in part from Transmeta's processor-level code optimization, used to lower power consumption (and many programmers are familiar with Transmeta as the one-time employer of Linus Torvalds.)

Run-time code generation in Synthesis begins with some level of compile-time optimization, optimizations that will be efficient regardless of the run-time environment. The result can thought of as a template for the final code, with "holes" where the run-time data will go. The run-time code generation then takes advantage of data-dependent optimizations. For example, if the code evaluates A * B, and at run-time we discover that B is always 1, then we can generate more efficient code that skips the multiplication step and run that code instead of the original. Fully optimized versions of the code pre-computed for common data values can be simply swapped in without any further run-time computation. Another example from the thesis:

[...] Suppose that the compiler knows, either through static control-flow analysis, or simply by the programmer telling it through some directives, that the function f(p1, ...) = 4 * p1 + ... will be specialized at run-time for constant p1. The compiler can deduce that the expression 4 * p1 will reduce to a constant, but it does not know what particular value that constant will have. It can capture this knowledge in a custom code generator for f that computes the value 4 * p1 when p1 becomes known and stores it in the correct spot in the machine code of the specialized function f, bypassing the need for analysis at run-time.

Run-time code generation in Synthesis is a fusion of compile-time and run-time optimizations in which useful code templates are created at compile time that can later be optimized simply and cleanly at run time.

Quajects

Understanding run-time code generation is a prerequisite for understanding quajects, the basic unit out of which the Synthesis kernel is constructed. Quajects are almost but not quite entirely unlike objects. Like objects, quajects come in types - queue quaject, thread quaject, buffer quaject - and encapsulate all the data associated with the quaject. Unlike objects, which contain pointers to functions implementing their methods, quajects contain the code implementing their methods directly. That's right - the actual executable instructions are stored inside the data structure of the quaject, with the code nestled up against the data it will operate on. In cases where the code is too large to fit in the quaject, the code jumps out to the rest of the method located elsewhere in memory. The code implementing the methods is created by filling in pre-compiled templates and can be self-modifying as well.

Quajects interact with other quajects via a direct and simple system of cross-quaject calls: callentries, callouts, and callbacks. The user of quaject invokes callentries in the quaject, which implement that quaject's methods. Usually the callentry returns back to the caller as normal, but in exceptional situations the quaject will invoke a method in the caller's quaject - a callback. Callouts are places where a quaject invokes some other quaject's callentries.

Synthesis implements a basic set of quajects - thread, queue, buffer, clock, etc. - and builds higher-level structures by combining lower-level quajects. For example, a UNIX process is constructed out of a thread quaject, a memory quaject, and some I/O quajects.

As an example, let's look at the queue quaject's interface. A queue has two callentries, queue_put and queue_get. These are invoked by another quaject wanting to add or remove entries to and from the queue. The queue quaject also has four callbacks into the caller's quaject, queue_full, queue_full-1, queue_empty, and queue_empty-1. When a caller invokes the queue_put method and the queue is full, the queue quaject invokes the queue_full callback in the caller's quaject. From the thesis:

The idea is: instead of returning a condition code for interpretation by the invoker, the queue quaject directly calls the appropriate handling routines supplied by the invoker, speeding execution by eliminating the interpretation of return status codes.

The queue_full-1 method is executed when a queue has transitioned from full to not full, queue_empty when the queue doesn't contain anything, and queue_empty-1 when the queue transitions from empty to not empty. With these six callentries and callbacks, a queue is implemented in a generic, extensible, yet incredibly efficient manner.

Pretty cool stuff, huh? But wait, there's more!

Optimistic lock-free synchronization

Most modern operating systems use a combination of interrupt disabling and locks to synchronize access to shared data structures and guarantee single-threaded execution of critical sections in general. The most popular synchronization primitive in Linux is the spinlock, implemented with the nearly universal test-and-set-bit atomic operation. When one thread attempts to acquire the spinlock guarding some critical section, it busy-waits, repeatedly trying to acquire the spinlock until it succeeds.

Synchronization based on locks works well enough but it has several problems: contention, deadlock, and priority inversion. Each of these problems can be (and is) worked around by following strict rules: keep the critical section short, always acquire locks in the same order, and implement various more-or-less complex methods of priority inheritance. Defining, implementing, and following these rules is non-trivial and a source of a lot of the pain involved in writing code for modern operating systems.

To address these problems, Maurice Herlihy proposed a system of lock-free synchronization using an atomic compare-and-swap instruction. Compare-and-swap takes the address of a word, the previous value of the word, and the desired new value of the word. It swaps the previous and new values of the word if and only if the previous value is the same as the current value. The bare compare-and-swap instruction allows atomic updates of single pointers. To atomically switch between larger data structures, a new copy of the data structure is created, updated with the changes, and the addresses of the two data structures swapped. If the compare-and-swap fails because some other thread has updated the value, the operation is retried until it succeeds.

Lock-free synchronization eliminates deadlocks, the need for strict lock ordering rules, and priority inversion (contention on the compare-and-swap instruction itself is still a concern, but rarely observed in the wild). The main drawback of Herlihy's algorithms is that they require a lot of data copying for anything more complex than swapping two addresses, making the total cost of the operation greater than the cost of locking algorithms in many cases. Massalin took advantage of the two-word compare-and-swap instruction available in the Motorola 68030 and expanded on Herlihy's work to implement lock-free and copy-free synchronization of queues, stacks, and linked lists. She then took a novel approach: Rather than choose a general synchronization technique (like spinlocks) and apply it to arbitrary data structures and operations, instead build the operating system out of data structures simple enough to be updated in an efficient lock-free manner.

Synthesis is actually even cooler than lock-free: Given the system of quajects, code synthesis, and callbacks, operations on data structures can be completely synchronization-free in common situations. For example, a single-producer, single-consumer queue can be updated concurrently without any kind of synchronization as long as the queue is non-empty, since each thread operates on only one end of the queue. When the callback for queue empty happens, the code to operate on the queue is switched to use the lock-free synchronization code. When the quaject's queue-not-empty callback is invoked, the quajects switch back to the synchronization-free code. (This specific algorithm is not, to my knowledge, described in detail in the thesis, but was imparted to me some months ago by Dr. Massalin herself at one of those wild late-night kernel programmer parties, so take my description with a grain of salt.)

The approach to synchronization in Synthesis is summarized in the following quote:

Avoid synchronization whenever possible.
Encode shared data into one or two machine words.
Express the operation in terms of one or more fast lock-free data structure operations.
Partition the work into two parts: a part that can be done lock-free, and a part that can be postponed to a time when there can be no interference.
Use a server thread to serialize the operation. Communications with the server happens using concurrent, lock-free queues.

The last two points will sound familiar if you're aware of Paul McKenney's read-copy-update (RCU) algorithm. In Synthesis, thread structures to be deleted or removed from the run queue are marked as such, and then actually deleted or removed by the scheduler thread during normal traversal of the run queue. In RCU, the reference to a list entry is removed from the linked list while holding the list lock, but the removed list entry is not actually freed until it can be guaranteed that no reader is accessing that entry. In both cases, reads are synchronization-free, but deletes are separated into two phases, one that begins the operation in an efficient low-contention manner, and a second, deferred, synchronization-free phase to complete the operation. The two techniques are by no means the same, but share a similar philosophy.

Synthesis: Operating system of the future?

The design principles of Synthesis, while powerful and generic, still have some major drawbacks. The algorithms are difficult to understand and implement for regular human beings (or kernel programmers, for that matter). As Linux has demonstrated, making kernel development simple enough that a wide variety of people can contribute has some significant payoffs. Another drawback is that two-word compare-and-swap is, shall we say, not a common feature of modern processors. Lock-free synchronization can be achieved without this instruction, but it is far less efficient. In my opinion, reading this paper is valuable more for retraining the way your brain thinks about synchronization than for copying the exact algorithms. This thesis is especially valuable reading for people interested in low-latency or real-time response, since one of the explicit goals of Synthesis is support for real-time sound processing.

Finally, I want to note that Synthesis contains many more elegant ideas that I couldn't cover in even the most superficial detail - quaject-based user/kernel interface, per-process exception tables, scheduling based on I/O rates, etc., etc. And while the exact implementation details are fascinating, the thesis is also peppered with delightful koan-like statements about design patterns for operating systems. Any time you're feeling bored with operating systems, sit down and read a chapter of this thesis.

[ Valerie Henson is a Linux file systems consultant and proud recipient of a piggy-back ride from Dr. Alexia Massalin. ]

Index entries for this article
Kernel	Kernel Hacker's Bookshelf
GuestArticles	Aurora (Henson), Valerie

The Kernel Hacker's Bookshelf: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

Posted Feb 21, 2008 11:09 UTC (Thu) by leighbb (subscriber, #1205) [Link] (4 responses)

Ow, my brain hurts.  I'll stick to the occasional kernel recompile.

Mine too..

Posted Feb 21, 2008 11:39 UTC (Thu) by deleteme (guest, #49633) [Link] (2 responses)

But I like those quaject callbacks I think it's more beautiful than doing exception throwing
and signal passing. But again maybe that's just my ignorance showing.. ;-)

Mine too..

Posted Feb 21, 2008 13:51 UTC (Thu) by nix (subscriber, #2304) [Link]

It's definitely more elegant. It's more like the Emacs hooks-everywhere scheme than exception
handling, though.

(I've seen one program that approaches things in the same way as this does with its quaject
callbacks, and that's the ERC IRC client, which has a fairly conventional server core, but all
the actual processing is done by invoking hooks with names constructed from the type of the
server response; said hooks can then send messages back again, or whatever. Of course ERC
doesn't have all the *other* nifty stuff in Synthesis.)

(btw, isn't the cartoon at the end a picture of a quala, not a koala? --- sorry.)

Callbacks and quajects

Posted Feb 26, 2008 16:55 UTC (Tue) by pdc (guest, #1353) [Link]

I think you would need to come up with some new programming language syntax that encapsulates
the special conventions of quajects and the locking primitives (sort of how Occam encapsulates
the CSP-style synchronization primitives of the Transputer), doing all the bookkeeping and
self-modifying code invisibly for you. Without a higher-level language to write programs in
you presumably are left with hand-edited and fragile assembly-language code...

The Kernel Hacker's Bookshelf: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

Posted Feb 24, 2008 13:42 UTC (Sun) by nix (subscriber, #2304) [Link]

Ah, but with Synthesis, the kernel can recompile iself! :)

The Kernel Hacker's Bookshelf: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

Posted Feb 21, 2008 13:54 UTC (Thu) by nix (subscriber, #2304) [Link]

I'm not sure whether to thank you or curse you for posting this fascinating article (and
linking to the thesis). I've been infected now. I'm not going to be able to write hardly
anything in future without borrowing ideas from this.

I shall have to spread the pain by telling everyone I have ever met about it. :)

The Kernel Hacker's Bookshelf: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

Posted Feb 21, 2008 18:06 UTC (Thu) by ikm (guest, #493) [Link] (4 responses)

The approaches described here seem to completely ignore the fact that the increased amount of
code would result in increased cache misses, possibly penalizing the program more than
benefiting it. Instead of using the special-case function f(p) for p == 1, it might be faster
to just call the generic one, just because it's in the cache already. Cache is everything
nowadays, so the algorithms should be tailored for cache efficiency. Having more code hardly
helps here.

Furthermore, it seems to be unclear when to optimize and for what data  the time spent
creating new code, the memory consumed to hold it, and, once again, the cache penalties
resulting from having it executed once in a while should all be worth it. The whole approach
feels like it can be simplified to the idea of 'having a compiler inside of the running
program used to recompile its parts in runtime', but somehow I fail to see too many benefits
in that. To my understanding, if one wants to optimize, the better way to do is to think more
about caches/cpus interactions. The "network channels" apporach is one nice example of that.

The Kernel Hacker's Bookshelf: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

Posted Feb 21, 2008 19:10 UTC (Thu) by bronson (subscriber, #4806) [Link] (2 responses)

More code does not imply slower.  Think bubble sort vs. quicksort.

Besides, it sounds like the recompilations only happen at exceptional times.  I highly doubt
that they result in the sort of cache thrashing that you imply.

The Kernel Hacker's Bookshelf: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

Posted Feb 21, 2008 19:33 UTC (Thu) by ikm (guest, #493) [Link] (1 responses)

The bubble sort vs. quicksort would be an incorrect analogy. The idea here is to inject
several optimized versions of the same code. My arguing was that one generic version might
perform better than a multitude of specialized copies of the same code.

The Kernel Hacker's Bookshelf: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

Posted Feb 22, 2008 10:46 UTC (Fri) by mingo (guest, #31122) [Link]

Yes. Especially as today's CPUs move towards annotating cache lines and doing certain
optimizations of the generic functions by observing their usage and annotating specific
instructions. Creating _more_ code goes into the opposite direction.

Code size and cache thrash

Posted Feb 21, 2008 21:18 UTC (Thu) by vaurora (guest, #38407) [Link]

The issue of code size and instruction cache usage is a very important one - one I didn't have space to talk about in this article, but which is addressed in the original paper/dissertation/book. The executive summary is that if you only do inline code generation when it makes sense and share code otherwise, the code size is fine. In fact, for certain types of functions, the code size is reduced by inlining the code directly instead of setting up all the arguments and calling the shared function code. (gcc can do this statically at compile time, Synthesis can do it at run-time too.)

For more detail, see section 3.4.1. I feel like I read another section addressing this but I can't find it right now...

The Kernel Hacker's Bookshelf: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

Posted Feb 21, 2008 19:03 UTC (Thu) by bronson (subscriber, #4806) [Link] (2 responses)

I played around with lock-free algorithms a year ago and came away with a rather dim view of
them.  It's easy to write a provably correct lock-free queue when you assume infinite memory
(something that most papers seem to do).  It's a lot messier when you need to ensure the safe
disposal of list items when you're done with them.

You mention he's using an atomic CAS.  Does he explicitly solve CAS's ABBA problem?  It's
easily solved if you assume infinite memory of course.  If you're on a real-world system, if
you're not careful, you can easily open yourself up to situations where CAS assures you that a
value is unchanged when in reality you lost out a long time ago and the ring has just wrapped.
(yes, it's not hard to solve if you're watching for it, but always having to worry about this
sort of subtlety is what makes working lock-free such a pain)

After realizing they were anything but maintainable, I chucked lock-free algorithms on my
shelf of stuff that is academically fascinating but unrealistic for the real world.  Should I
take them back down and re-examine them?

> When the callback for queue empty happens, the code to operate on the queue is switched to
use the lock-free synchronization code. When the quaject's queue-not-empty callback is
invoked, the quajects switch back to the synchronization-free code.

Whoa...  that ran shivers down my spine.  What an outrageously cool idea.

The Kernel Hacker's Bookshelf: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

Posted Feb 21, 2008 21:42 UTC (Thu) by vaurora (guest, #38407) [Link]

> After realizing they were anything but maintainable, I chucked lock-free algorithms on my
> shelf of stuff that is academically fascinating but unrealistic for the real world.  Should
I
> take them back down and re-examine them?

My personal opinion is that lock-free algorithms are not a good generic synchronization
technique, and are definitely very very complex and difficult to understand.  However, in
certain specific cases, lock-free can be simple, elegant, and a huge performance advantage
over traditional approaches.  It's much like RCU - you wouldn't want to use RCU for every
synchronization problem, but when it comes to highly-shared read-mostly data in the hot path
(e.g., the dcache), it's worth the trouble.

It's kind of like my advice on choosing a file system: Use ext3 unless it's not
fast/big/whatever enough for you, in which case use XFS.  My recommendation is use locks
unless they have too much contention/complexity/whatever, in which case look into lock-free.

> > When the callback for queue empty happens, the code to operate on the queue is switched to
> > use the lock-free synchronization code. When the quaject's queue-not-empty callback is
> > invoked, the quajects switch back to the synchronization-free code.
> 
> Whoa...  that ran shivers down my spine.  What an outrageously cool idea.

I know... The whole damn dissertation is like that.

The Kernel Hacker's Bookshelf: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

Posted Feb 28, 2008 19:11 UTC (Thu) by renox (guest, #23785) [Link]

>> When the callback for queue empty happens, the code to operate on the queue is switched to
use the lock-free synchronization code. When the quaject's queue-not-empty callback is
invoked, the quajects switch back to the synchronization-free code.
>
>Whoa...  that ran shivers down my spine.  What an outrageously cool idea.

Uh? I don't think that it's a particulary original idea..
But implementing this efficiently on the other hand, this seems quite hard to do!
I have some difficulties downloading the PDF, so I cannot read the paper yet..

Advanced reading

Posted Feb 21, 2008 21:26 UTC (Thu) by vaurora (guest, #38407) [Link]

For people who want to get into detailed analysis of the applicability of the ideas in Synthesis, a good place to start is Section 7.5, "Objections." It's organized around the following questions:

"How much of the performance improvement is due to my ideas, and how much is due to writing in assembler, and tuning the hell out of the thing?"
"Self-modifying data structures are troublesome on pipelined machines, and code generation has problems with machines that don't allow finegrained control of the instruction cache. In other words, Synthesis techniques are dependent on hardware features that aren't present in all machines, and, worse, are becoming increasingly scarce."
"Does this matter? Hardware is getting faster, and anything that is slow today will probably be fast enough in two years." [*snicker* - Ed.]
"Why is Synthesis written in assembler? How much of the reason is that you wanted no extraneous instructions? How much of the reason is that code synthesis requires assembler? How much of Synthesis could be re-written in a high-level language?"

The Kernel Hacker's Bookshelf: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

Posted Feb 22, 2008 2:30 UTC (Fri) by tcoppi (guest, #44423) [Link]

Thanks for this very insightful writeup, now I have yet another thing to read and become
immersed in instead of doing real work :P

Great writeup!

Posted Feb 22, 2008 18:06 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (2 responses)

Having this write-up, to say nothing of an online copy of Massalin's dissertation, would have
saved me much time back when I was doing my own dissertation.  ;-)  Good to see both now
available!!!

One of the key strengths of Massalin's work is the focus on determining what techniques work
well in given situations, and then matching up techniques with the corresponding situations.
"Use the right tool for the job!"

Although lock-free techniques can be quite valuable for situations requiring real-time
response, as can other non-blocking-synchronization (NBS) techniques, these techniques are not
a panacea.  NBS algorithms rely heavily on hardware arbitration, which is usually unaware of
process priorities.  This can result in priority-inversion-like effects when the hardware
gives the contended cache line to the low-priority process.

Great writeup!

Posted Mar 20, 2008 19:24 UTC (Thu) by olecom (guest, #42886) [Link] (1 responses)

> NBS algorithms rely heavily on hardware arbitration, which is usually
> unaware of process priorities.

There are no priorities and scheduler is I/O rate-based, PLL-managed.
On pdf.page 139 (123) find some ideas for hardware and open yor eyes
(pdf.page 142 (126).

> This can result in priority-inversion-like effects when the hardware
> gives the contended cache line to the low-priority process.

Oh, read whole dissertation, please.
_____

Great writeup!

Posted Apr 3, 2008 19:50 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

Let's see, page 139:123...

Fully coherent instruction caches do exist on some systems. Replicated registers for fast
interrupt handling were present in some 8080-based CPUs back in the late 70s and early 80s,
but have fallen out of favor, though one might get the same effect on modern multithreaded
hardware by reserving one thread to process interrupts on behalf of the other, but not sure
that this would be worthwhile. Double-compare-and-swap is indeed useful, but doesn't exist on
the systems I have access to, despite its appearing in the 68K line quite some time ago.
Large caches, large cacheline sizes, and very high memory bandwidths have reduced the urgency
of partial-register saving, and again, in some cases you could assign processes to hardware
threads -- but perhaps this will become more compelling in the future. Some CPUs have vector
operations that do some of the multi-byte operations (e.g., SSE, 3DNow, Altivec, ...), and
perhaps compilers will become more aggressive about implementing them. Some of the fancier
bitwise operations are indeed useful in some contexts, so who knows.

But I do apologize for my extreme bigotry in favor of hardware features that I can use today
across a wide variety of CPU types. ;-)

The operating systems I work with -do- have priorities, so I don't get to ignore them.
Perhaps more sophisticated scheduling policies will come into vogue. There are certainly
people working on this. In any case, stock Linux running on commodity microprocessors does
context switches orders of magnitude faster than the one-millisecond figure called out on page
142:126 -- old-style Moore's law at work.

Hardware cache-coherence protocols coupled with high software memory contention really can
result in priority-inversion-like effects. I have seen this, logic analyzer and everything.
Would you care to outline your objections in more detail?

KHB: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

Posted Feb 27, 2008 17:30 UTC (Wed) by pphaneuf (guest, #23480) [Link] (3 responses)

The trick of changing between two algorithms is pretty awesome, and something more or less like that could easily be done with many pieces of code in the kernel. I did similar things in user-space C++ code, by using callbacks for some things, where I basically encoded part of the state in the function pointer itself (in state A, point at do_state_A, in state B, point at do_state_B, something like that), which was kind of ugly in C++, as it basically was hand-coded virtual methods, so that I could change them at runtime (and per object).

But in the kernel, we're already implementing virtual methods by hand, and on top of it, we don't use per-class virtual method tables, but rather embed them in each objects already, which is exactly what we need! It's just that they are usually initialized once at object creation, we'd just need to use some atomic CAS to update a function pointer when some algorithm changes.

The trick part is when more than one method needs to change. Damn, I actually expect this would be common, too. Argh.

Oh well, it's pretty cool!

KHB: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

Posted Feb 27, 2008 21:58 UTC (Wed) by zlynx (guest, #2285) [Link] (2 responses)

Your solution is not as ugly as another I've seen in C++ where the programmer used placement
new to reconstruct the object with a different derived type (but the same base type) in-place.

Now, that was scary.

KHB: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

Posted Feb 27, 2008 22:51 UTC (Wed) by pphaneuf (guest, #23480) [Link] (1 responses)

My way is ugly. That way is insane.

Still, I wish it was possible to manipulate vptrs and vtbls more in C++. Things like having two different implementations of a virtual method with the same signature inherited from two parents would make sense, for example (say, the "draw" methods you get while deriving from both Drawable and Cowboy).

KHB: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

Posted Feb 28, 2008 20:59 UTC (Thu) by nix (subscriber, #2304) [Link]

Insane?

Oh.

So, er, I should be ashamed that I did almost exactly that (via a custom 
allocator layered atop big mmap()ed hunks) in C about a year ago? (Well, 
it was more like a twisted sort of RCU: I'd construct a new instance that 
was a copy of the old one with a zero reference count, then changing 
pointers in the new instance's VMT-analogue, translating the object's data 
into a new representation at the same time. The old instance got 
reference-zapped when nobody was executing methods via those pointers 
anymore.)

(yes, there were pressing reasons. It was a bundle of fun, as well. :) )

Lock-free operations

Posted Feb 28, 2008 11:48 UTC (Thu) by forthy (guest, #1525) [Link]

There are a number of ways to do efficient lock-free operations even without resorting to a locked operation in some circumstances. Take e.g. the single-writer, single-reader queue with a finite size: All you need is a circular buffer and two pointers. The writer only writes to the end of the queue and updates only the high pointer after writing. The reader reads from the start of the queue and updates the low pointer after reading.

Or allocation/freeing of objects in a multi-processor system: Each core frees only those objects it allocated itself. Other cores can post "free this object" into a single-writer, single-reader queue, or mark objects as free and have them scanned while idle. You can even use a linked list if you keep at least one deleted object alive to avoid the empty-linked-list queue condition.

Another point is certainly serialization and deferring actions. One thing that comes in mind here is the user-kernel communication. In Linux, the user invokes a context switch into the kernel for every system call, and to make this context switch light-weight, the system call uses the user process, so in fact there are many kernel threads, which will require synchronization (in early Linux days, there was the BKL, so the kernel was serialized). If you change the syscall interface to a command queue (think of the X server), you can probably afford a more heavy-weight context switch, and less kernel threads. This however significantly changes the interface paradigm to the kernel - Unix-like syscalls do not fit well. Even the X server has too many functions which require round-trips. The upside of this change is that a network transparent kernel interface becomes feasible - and performance in clusters would improve significantly.

re: Synthesis: Operating system of the future?

Posted May 24, 2008 3:23 UTC (Sat) by olecom (guest, #42886) [Link]

As a matter of fact eight-year developer of PowerPC Linux port Cort Dougan has citations of
Massalin's work in his research and application publications.

Bottom line, as far as i can see, can be read here:

Cort Dougan: Good Programmers are not Lazy. unpublished draft.
http://hq.fsmlabs.com/~cort/papers/lazy/lazy.nohead.html

A must read.

Ouch, fsmlabs and software patents? Microsoft has patent on double-click, so what? Now Novel,
RH and others have all that protecting-patents dance. Thus, sw patents in this case are
irrelevant. About sw freedom and money, please read the article.

KHB: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

Posted Jul 26, 2009 16:28 UTC (Sun) by asdlfiui788b (guest, #58839) [Link] (1 responses)

Perhaps this is trolling but... I used to think that lock free algorithms based on CAS were a good idea, until I saw how the Alpha handled this. Load-Linked / Store-Conditional is more natural for hardware and also a damn sight more useful as it prevents the ABA problem (I've played with some quiescence algorithms for memory allocators to get round this but it's still ugly and difficult to implement). If you're interested in this sort of thing I'd strongly recommend looking up LL/SC and this kind of problem. While you're at it, look at Hans Boehm's paper on implementing LL/SC with 1 bit and a trivial amount of circuitry; it's so much easier that CAS and pretty much trivial to implement.

KHB: Synthesis: An Efficient Implementation of Fundamental Operating Systems Services

Posted Jul 27, 2009 8:27 UTC (Mon) by xoddam (guest, #2322) [Link]

Any concurrency primitive for shared-memory multiprocessors on commodity hardware must for performance reasons be implemented at the cache-coherency level, using message passing (rather than having to hit the underlying slow, shared RAM on every synchronous operation) and is therefore several orders of magnitude more complex in silicon than anything that can be done "with 1 bit".

See http://lwn.net/Articles/252125/ (scroll down to section 3.3.4, Multi-processor support).

CAS and the spicier double-word CAS are obviously 'trickier' in some sense than LL/SC but both must be implemented by grabbing exclusive access over a cache line (maybe two in the case of double-word, I suppose). The underlying mechanics are the same.