generated html version of blog/an-unorthodox-object-system.txt HOME

Let's talk about an unorthodox object system I like to use in my projects in C,
Zig, and assembly. Since this object system is language neutral it is defined
in terms of an ABI, and I will be explaining it without the use of actual
source code in any language where possible.

At a low level, polymorphism in C++'s object system, Rust's, Microsoft's
Win32 COM objects, and many other systems boils down to an array of function
pointers with a fixed layout, called a "vtable", which is used to locate any
particular object's implementation of any given method.

For example, one might have an abstract interface representing a file, with a
vtable that looks like this:

╭────────────╮
│  0  close  │
│  1  read   │
│  2  write  │
╰────────────╯

While this is very fast, it means that the layout of the vtable for a given
type becomes part of the type's low-level binary interface. Therefore, if
methods are added to a type, the code that uses that type has to be
recompiled whenever the vtable layout has changed such that an old vtable
index number now refers to a different method. For example, if the numbers
were assigned alphabetically, and one added a "seek" method to the file
interface above:

╭────────────╮
│  0  close  │
│  1  read   │
│  2  seek   │
│  3  write  │
╰────────────╯

Then old code that called "write" would now incorrectly call the "seek" method.
This can be avoided if, as with Windows COM, the vtable indices are directly
controlled by the programmer, but this is not the case for C++.
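
To make the hazard concrete, here is a minimal sketch in C. Every name in it
(method_fn, vtable_v1, vtable_v2, old_caller_write and so on) is invented for
illustration, and the methods take no arguments so that each can simply report
its own name. The caller stands in for user code compiled against the old
layout, with index 2 baked in for "write".

```c
#include <assert.h>
#include <string.h>

/* A method slot. Real methods would take arguments; here each one just
   returns its own name so we can see which one actually ran. */
typedef const char *(*method_fn)(void);

static const char *do_close(void) { return "close"; }
static const char *do_read(void)  { return "read"; }
static const char *do_seek(void)  { return "seek"; }
static const char *do_write(void) { return "write"; }

/* Old layout: 0 close, 1 read, 2 write. */
static const method_fn vtable_v1[] = { do_close, do_read, do_write };

/* New layout after adding "seek" alphabetically: write moves to slot 3. */
static const method_fn vtable_v2[] = { do_close, do_read, do_seek, do_write };

/* The index that was baked into already-compiled user code. */
enum { OLD_WRITE_INDEX = 2 };

/* Stands in for user code that was compiled against the old layout. */
static const char *old_caller_write(const method_fn *vtable)
{
    return vtable[OLD_WRITE_INDEX]();
}

void vtable_demo(void)
{
    /* Against the old vtable the call reaches "write" as intended. */
    assert(strcmp(old_caller_write(vtable_v1), "write") == 0);
    /* Against the new vtable the very same call silently reaches "seek". */
    assert(strcmp(old_caller_write(vtable_v2), "seek") == 0);
}
```

Nothing fails at link time or at load time here; the wrong method is simply
called.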

One approach to fix this is to simply use a dictionary that maps method names
(strings) to methods. Many languages do this, but it has obvious performance
costs. (Not as large as one would initially suspect, however!)

I have invented a different approach, which enlists the existing linker to
solve part of the problem. For each method of an abstract type or interface,
one creates a global symbol with a unique address, such that the output of nm
on the binary that defines the interface would read something like:

╭─────────────────╮
│  T file__close  │
│  T file__read   │
│  T file__write  │
╰─────────────────╯

These symbols have very few requirements, but their addresses must be unique.
In this system they are called the "method symbols". Any file that uses the
interface refers to them as external (undefined) symbols:

╭─────────────────╮
│  U file__close  │
│  U file__read   │
│  U file__write  │
╰─────────────────╯

To call a method on an object, much as in the vtable approach, one relies on a
small assumption about the layout of the object, but that assumption is only
that there is a single function pointer located at the start of the object's
data. This function is called the "method locator" and it maps the unique
addresses of method symbols to the actual addresses of method implementations.

The method locator is not required to use any particular approach to implement
this mapping-- possible implementations include an if-then chain, or a sorted
table with a binary search. What matters is that the returned address can then
be tested for null (to see if the object in question supports the method
in question), and then directly called.
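
A minimal sketch of this in C, using an if-then chain as the locator. The
names any_fn, method_locator, struct object, struct memfile and object_write,
and the signature chosen for "write", are all assumptions made for this
sketch. The method symbols are plain one-byte objects here because only their
addresses matter; nm would list these as data rather than as the T symbols
shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Method symbols: only their addresses are ever used. */
char file__close, file__read, file__write, file__seek;

/* Generic function pointer; callers cast it back to the real signature. */
typedef void (*any_fn)(void);

/* Maps a method symbol's address to an implementation, or returns NULL
   when the object does not support that method. */
typedef any_fn (*method_locator)(const void *method_symbol);

/* The only layout assumption: the locator is the object's first member. */
struct object { method_locator locate; };

/* Hypothetical signature for "write", chosen for this sketch. */
typedef size_t (*file_write_fn)(void *self, const void *buf, size_t n);

/* A hypothetical in-memory file that supports only write and close. */
struct memfile {
    struct object base;            /* must come first */
    char data[64];
    size_t len;
};

static size_t memfile_write(void *self, const void *buf, size_t n)
{
    struct memfile *f = self;
    size_t room = sizeof f->data - f->len;
    if (n > room) n = room;        /* truncate instead of overflowing */
    memcpy(f->data + f->len, buf, n);
    f->len += n;
    return n;
}

static int memfile_close(void *self) { (void)self; return 0; }

/* The method locator, written as an if-then chain. */
static any_fn memfile_locate(const void *sym)
{
    if (sym == &file__write) return (any_fn)memfile_write;
    if (sym == &file__close) return (any_fn)memfile_close;
    return NULL;                   /* read and seek are not supported */
}

/* Caller side: locate, test for null, then call. Returns -1 when the
   object has no "write" method. */
long object_write(void *obj, const void *buf, size_t n)
{
    struct object *o = obj;
    file_write_fn w = (file_write_fn)o->locate(&file__write);
    if (w == NULL) return -1;
    return (long)w(obj, buf, n);
}

void memfile_demo(void)
{
    struct memfile f = { { memfile_locate }, {0}, 0 };
    assert(object_write(&f, "hi", 2) == 2);
    assert(f.base.locate(&file__seek) == NULL);
}
```

Note that object_write knows nothing about struct memfile: it touches only the
first word of the object and the address of file__write.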

For reasons that I will explain later, this allows extreme decoupling between
the binary modules that use an interface, the modules that implement it, and
the module that defines it. These modules can be changed independently
without having to modify or recompile the other modules.

But first, a few extra implementation details. For convenience, it is possible
to place at the method symbol address itself a function which passes its own
address to the object's method locator and then tail-calls the resulting
method. This costs a few additional clock cycles compared to having callers
invoke the method locator directly, but it does make those callers smaller.
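
A sketch of such a trampoline in C, under a few assumptions: the locator takes
a function pointer as its key (since the method symbol is now a function), the
signature of "write" is made up, and the trampoline aborts when the method is
missing, which is a choice of this sketch and not something the scheme
prescribes. C does not guarantee tail calls, although optimizing compilers
will commonly turn the final call into a jump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void (*any_fn)(void);
typedef any_fn (*method_locator)(any_fn method_symbol);
struct object { method_locator locate; };   /* locator is the first member */

/* Hypothetical signature for "write", chosen for this sketch. */
typedef size_t (*file_write_fn)(void *self, const void *buf, size_t n);

/* The method symbol is itself a callable trampoline: it looks up its own
   address in the object's locator and forwards the call. */
size_t file__write(void *self, const void *buf, size_t n)
{
    struct object *o = self;
    file_write_fn impl = (file_write_fn)o->locate((any_fn)file__write);
    if (impl == NULL) abort();       /* assumption: unsupported is fatal */
    return impl(self, buf, n);       /* candidate for a tail call */
}

/* A hypothetical object that merely counts the bytes written to it. */
struct counter { struct object base; size_t total; };

static size_t counter_write(void *self, const void *buf, size_t n)
{
    struct counter *c = self;
    (void)buf;
    c->total += n;
    return n;
}

static any_fn counter_locate(any_fn sym)
{
    if (sym == (any_fn)file__write) return (any_fn)counter_write;
    return NULL;
}

size_t trampoline_demo(void)
{
    struct counter c = { { counter_locate }, 0 };
    file__write(&c, "abc", 3);       /* reads like an ordinary function call */
    file__write(&c, "de", 2);
    assert(c.total == 5);
    return c.total;
}
```

From the caller's point of view file__write is just an external function, so
user code needs no knowledge of the locator at all.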

Furthermore, variations are possible. The method locator could be defined so
that you call it with the address of the object you are calling on, in
addition to the method symbol's address. In this case, the method locator
could thereby supply optimal method implementations according to the object's
internal state.
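
A sketch of this variation in C. The stream type, its "buffered" flag and the
two write implementations are hypothetical; the point is only that the locator
receives the object and may choose an implementation based on its state.

```c
#include <assert.h>
#include <stddef.h>

char stream__write;                  /* method symbol (address only) */

typedef void (*any_fn)(void);

/* Variation: the locator also receives the object being called on. */
typedef any_fn (*method_locator)(void *self, const void *method_symbol);
struct object { method_locator locate; };

/* Hypothetical signature for "write", chosen for this sketch. */
typedef int (*stream_write_fn)(void *self, int byte);

struct stream {
    struct object base;              /* must come first */
    int buffered;                    /* state consulted by the locator */
    int pending;                     /* bytes held back in a buffer */
    int emitted;                     /* bytes passed straight through */
};

static int write_buffered(void *self, int byte)
{
    struct stream *s = self;
    (void)byte;
    s->pending++;
    return 1;
}

static int write_direct(void *self, int byte)
{
    struct stream *s = self;
    (void)byte;
    s->emitted++;
    return 1;
}

/* The implementation is chosen once, at lookup time, so the method body
   itself needs no per-call branch on the object's state. */
static any_fn stream_locate(void *self, const void *sym)
{
    struct stream *s = self;
    if (sym == &stream__write)
        return s->buffered ? (any_fn)write_buffered : (any_fn)write_direct;
    return NULL;
}

/* Caller side: returns -1 when the object has no "write" method. */
int stream_put(void *obj, int byte)
{
    struct object *o = obj;
    stream_write_fn w = (stream_write_fn)o->locate(obj, &stream__write);
    return w ? w(obj, byte) : -1;
}

int stream_demo(void)
{
    struct stream s = { { stream_locate }, 1, 0, 0 };
    stream_put(&s, 'a');             /* buffered: pending becomes 1 */
    s.buffered = 0;
    stream_put(&s, 'b');             /* direct: emitted becomes 1 */
    assert(s.pending == 1 && s.emitted == 1);
    return s.pending + s.emitted;
}
```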

Another variation, possible on systems with a stack-based calling convention,
would be for the method locator to accept all the method's arguments directly,
and tail call it.

So, how does this help decouple modules? Let's look at users first. Since a
module that uses an interface uses undefined linker symbols to call the methods
on an object, rather than compile-time vtable indices, methods added to an
interface do not invalidate existing code using the old methods, any more than
extra functions added to a library would invalidate code using the other
functions. Conversely, changing the name of a method in an interface *does*
invalidate existing callers; and if the symbol names are mangled with type
information, changing the type of a method will invalidate them as well. From
the linker's perspective, there will simply be a missing symbol.

The implementation of an interface is decoupled as well.
An object can define its method locator to implement as many methods from
as many interfaces as it pleases, without needing to be recompiled when
an interface adds a new method. And of course none of the users of an object
need to be recompiled when an object changes which methods it implements,
how it locates them, or how it implements them.
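
As one illustration of an object changing how it locates methods without its
users noticing, here is a sketch of a locator backed by a sorted table and
binary search. The table is sorted at first use because the relative order of
symbol addresses is only known after linking. All names are hypothetical, and
every method shares one made-up signature to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

char file__close, file__read, file__write, file__seek;  /* method symbols */

typedef void (*any_fn)(void);
typedef any_fn (*method_locator)(const void *method_symbol);
struct object { method_locator locate; };

/* Assumption: all methods here have this one simple signature. */
typedef int (*file_op_fn)(void *self);

static int impl_close(void *self) { (void)self; return 0; }
static int impl_read(void *self)  { (void)self; return 1; }
static int impl_write(void *self) { (void)self; return 2; }

/* Keys are symbol addresses stored as integers so they can be ordered. */
struct entry { uintptr_t key; any_fn impl; };

static struct entry table[3];
static int table_ready;

static int by_key(const void *a, const void *b)
{
    uintptr_t x = ((const struct entry *)a)->key;
    uintptr_t y = ((const struct entry *)b)->key;
    return (x > y) - (x < y);
}

static void build_table(void)
{
    table[0] = (struct entry){ (uintptr_t)&file__close, (any_fn)impl_close };
    table[1] = (struct entry){ (uintptr_t)&file__read,  (any_fn)impl_read  };
    table[2] = (struct entry){ (uintptr_t)&file__write, (any_fn)impl_write };
    qsort(table, 3, sizeof table[0], by_key);
    table_ready = 1;
}

/* The method locator: binary search over the sorted table. */
static any_fn sorted_locate(const void *sym)
{
    struct entry probe = { (uintptr_t)sym, NULL };
    const struct entry *hit;
    if (!table_ready) build_table();
    hit = bsearch(&probe, table, 3, sizeof table[0], by_key);
    return hit ? hit->impl : NULL;
}

/* Caller side is the same as with any other locator. */
int call_method(void *obj, const void *sym)
{
    struct object *o = obj;
    file_op_fn f = (file_op_fn)o->locate(sym);
    return f ? f(obj) : -1;
}

void sorted_demo(void)
{
    struct object o = { sorted_locate };
    assert(call_method(&o, &file__write) == 2);
    assert(call_method(&o, &file__seek) == -1);   /* not implemented */
}
```

The lazy initialization above is not thread-safe; it is kept that way only to
keep the sketch small.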

As a result of the above, when a strong typing system is in play, one can
change the dependency graph for recompilation from

╭─────────────────────────────────────────╮
│  interface ──▶ implementation ──▶ user  │
╰─────────────────────────────────────────╯

to something more like

╭─────────────────────────────────────────╮
│  interface ─┬─▶ implementation           │
│             ╰─▶ user                    │
╰─────────────────────────────────────────╯

thus reducing the number of recompilations.

This is a huge improvement on the system *typically* employed by C++
programmers, where due to the presence of object implementation details in
the interface, the user code has to be recompiled even when only the private
methods are changed. There is a "design pattern" or "idiom" called "PIMPL"
which attempts to repair this flaw in the C++ object system but it is at
best imperfect, since the vtable layout is *still* part of the ABI.

There are a few more variations in implementing this system. The method
locator can be extended to a member locator, returning a location for a
member, but this puts the system in an awkward position when one would want
the given member to be stored in a packed struct on systems that require
aligned writes, or in some other place that isn't a simple memory location.

The method locator of an object that extends another object could call the
extended object's method locator. But this idea of private inheritance among
implementations should not be merged or confused with the idea of inheritance
among interfaces.
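
A sketch of such delegation in C, with hypothetical names throughout and one
made-up signature shared by all methods. The derived locator answers for the
methods it adds or overrides and forwards everything else to the base locator.

```c
#include <assert.h>
#include <stddef.h>

char file__close, file__read, file__write;   /* method symbols */

typedef void (*any_fn)(void);
typedef any_fn (*method_locator)(const void *method_symbol);
struct object { method_locator locate; };

/* Assumption: all methods here have this one simple signature. */
typedef int (*file_op_fn)(void *self);

static int base_read(void *self)     { (void)self; return 10; }
static int base_close(void *self)    { (void)self; return 11; }
static int derived_write(void *self) { (void)self; return 20; }
static int derived_close(void *self) { (void)self; return 21; }

/* Locator of the object being extended: read and close only. */
static any_fn base_locate(const void *sym)
{
    if (sym == &file__read)  return (any_fn)base_read;
    if (sym == &file__close) return (any_fn)base_close;
    return NULL;
}

/* Locator of the extending object: adds write, overrides close, and
   falls back to the base locator for anything else. */
static any_fn derived_locate(const void *sym)
{
    if (sym == &file__write) return (any_fn)derived_write;
    if (sym == &file__close) return (any_fn)derived_close;
    return base_locate(sym);
}

int call_method(void *obj, const void *sym)
{
    struct object *o = obj;
    file_op_fn f = (file_op_fn)o->locate(sym);
    return f ? f(obj) : -1;
}

void delegation_demo(void)
{
    struct object d = { derived_locate };
    assert(call_method(&d, &file__write) == 20);  /* added by derived */
    assert(call_method(&d, &file__close) == 21);  /* overridden */
    assert(call_method(&d, &file__read)  == 10);  /* inherited from base */
}
```

Callers cannot tell that any delegation took place, which is what keeps this
a private matter between the two implementations.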

The method locator could also be used to expose runtime type information;
and intrinsically, it can be used to test whether a given object implements
a given interface.
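
The interface test can be sketched in C as a loop over the interface's method
symbols, asking the locator about each one. The helper name "implements", the
file_iface array and the two example objects are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

char file__close, file__read, file__write;   /* method symbols */

typedef void (*any_fn)(void);
typedef any_fn (*method_locator)(const void *method_symbol);
struct object { method_locator locate; };

/* An interface, for this purpose, is just the list of its method symbols. */
static const void *const file_iface[] = {
    &file__close, &file__read, &file__write
};
enum { FILE_IFACE_LEN = 3 };

static int dummy(void *self) { (void)self; return 0; }

/* Supports every method of the file interface. */
static any_fn full_locate(const void *sym)
{
    if (sym == &file__close || sym == &file__read || sym == &file__write)
        return (any_fn)dummy;
    return NULL;
}

/* Supports close and read, but not write. */
static any_fn readonly_locate(const void *sym)
{
    if (sym == &file__close || sym == &file__read)
        return (any_fn)dummy;
    return NULL;
}

/* Returns 1 if the object's locator yields a non-null implementation for
   every listed method symbol, and 0 otherwise. */
int implements(const struct object *o, const void *const *symbols, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (o->locate(symbols[i]) == NULL)
            return 0;
    return 1;
}

void implements_demo(void)
{
    struct object full = { full_locate };
    struct object ro   = { readonly_locate };
    assert(implements(&full, file_iface, FILE_IFACE_LEN) == 1);
    assert(implements(&ro,   file_iface, FILE_IFACE_LEN) == 0);
}
```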

This object system is not a fully-defined ABI; it's a collection of
possibilities based on a central idea, which I've used in different ways.
I'm still experimenting with it, and no doubt better minds than mine could
create better variations.

A language with suitable extensions to the linker could also make the
limitations this system addresses irrelevant, but that has so far not happened.
Instead, many modern languages and systems opt to essentially compile the
entire program in one fell swoop, and hope that the compiler and the computer
it runs on are so fast that it does not matter.

------
Oren Watson
