
Let's talk about an unorthodox object system I like to use in my projects in C,
Zig, and assembly. Since this object system is language neutral it is defined 
in terms of an ABI, and I will be explaining it without the use of actual 
source code in any language where possible.


At a low level, the polymorphism in C++'s object system, Rust's, Microsoft's
win32 COM objects, and many other systems boils down to an array of function
pointers with a fixed layout, called a "vtable", which is used to locate any
particular object's implementation of any given method.

For example, one might have an abstract interface representing a file, with a 
vtable that looks like this:

╭────────────╮
│  0 close   │
│  1 read    │
│  2 write   │
╰────────────╯

While this is very fast, it means that the layout of the vtable for a given
type becomes part of the type's low-level binary interface. Therefore, if
methods are added to a type and the vtable layout changes such that an old
vtable index now refers to a different method, the code that uses that type
has to be recompiled. For example, if the numbers were assigned
alphabetically, and one added a "seek" method to the file interface above:

╭────────────╮
│  0 close   │
│  1 read    │
│  2 seek    │
│  3 write   │
╰────────────╯

Then old code that called "write" would now incorrectly call the "seek" method.
This can be avoided if, as with Windows COM, the vtable indices are directly
controlled by the programmer, but this is not the case for C++, where the
compiler and its ABI choose the layout.
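
The breakage described above can be sketched in a few lines of C. This is only
an illustration: the names (slot_fn, file_vtable_v1, file_vtable_v2,
old_user_calls_write) are hypothetical, and the methods merely return a tag
saying which one ran.

```c
#include <stddef.h>

/* Hypothetical slot type: each method takes the object and returns a tag
   identifying which method actually ran. */
typedef int (*slot_fn)(void *self);

enum { RAN_CLOSE = 1, RAN_READ, RAN_SEEK, RAN_WRITE };

static int do_close(void *self) { (void)self; return RAN_CLOSE; }
static int do_read(void *self)  { (void)self; return RAN_READ;  }
static int do_seek(void *self)  { (void)self; return RAN_SEEK;  }
static int do_write(void *self) { (void)self; return RAN_WRITE; }

/* Old layout: close, read, write. */
const slot_fn file_vtable_v1[] = { do_close, do_read, do_write };

/* New layout with "seek" inserted alphabetically: write moves to slot 3. */
const slot_fn file_vtable_v2[] = { do_close, do_read, do_seek, do_write };

/* Stands in for user code compiled against the old layout, where the
   index of "write" was baked in as the constant 2. */
int old_user_calls_write(const slot_fn *vtable)
{
    return vtable[2](NULL);
}
```

Handed the old table, the call reaches the write method; handed the new table,
the very same compiled call lands on seek.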

One approach to fix this is to simply use a dictionary that maps method names,
as strings, to methods. This is the case in many languages, but has obvious
performance problems. (Not as severe as one would initially suspect, however!)


I have invented a different approach, which enlists the existing linker to
solve part of the problem. For each method of an abstract type or interface,
one creates global symbols with unique addresses, such that the output of nm
on the binary that defines the interface would read something like:

╭─────────────────╮
│  T file__close  │
│  T file__read   │
│  T file__write  │
╰─────────────────╯

These symbols have very few requirements, but their addresses must be unique.
In this system they are called the "method symbols". Any file that uses the
interface refers to them as external symbols:

╭─────────────────╮
│  U file__close  │
│  U file__read   │
│  U file__write  │
╰─────────────────╯

To call a method on an object, much as in the vtable approach, one relies on a
small assumption about the layout of the object, but that assumption is only
that there is a single function pointer located at the start of the object's
data.
This function is called the "method locator" and it maps the unique addresses
of method symbols to the actual addresses of method implementations.
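
As a minimal sketch of the call sequence in C, under some assumptions of mine:
the method symbols are plain one-byte objects (so nm would list them as data
rather than "T"), the locator returns a generic function pointer that the
caller casts back, and struct memfile, memfile_locate and call_read are
hypothetical names. The locator here is a plain chain of comparisons.

```c
#include <stddef.h>
#include <string.h>

/* Method symbols. Only their addresses matter; making them one-byte
   objects is an assumption of this sketch. */
char file__close, file__read, file__write;

/* Generic function pointer; the caller casts it back to the real type. */
typedef void (*method_fn)(void);

/* The method locator: method symbol address in, implementation out. */
typedef method_fn (*locator_fn)(const void *method_symbol);

/* The only layout assumption: the locator comes first in every object. */
struct object { locator_fn locate; };

/* Hypothetical read-only, in-memory file. */
struct memfile {
    struct object base;
    const char *data;
    size_t len, pos;
};

static size_t memfile_read(struct object *self, char *buf, size_t n)
{
    struct memfile *f = (struct memfile *)self;
    size_t left = f->len - f->pos;
    size_t take = n < left ? n : left;
    memcpy(buf, f->data + f->pos, take);
    f->pos += take;
    return take;
}

static int memfile_close(struct object *self)
{
    struct memfile *f = (struct memfile *)self;
    f->len = 0;
    f->pos = 0;
    return 0;
}

/* Locator as a chain of comparisons. file__write is deliberately
   unsupported, so it resolves to NULL. */
method_fn memfile_locate(const void *method_symbol)
{
    if (method_symbol == &file__close) return (method_fn)memfile_close;
    if (method_symbol == &file__read)  return (method_fn)memfile_read;
    return NULL;
}

/* User code: locate, cast to the method's type, call. Returning 0 for
   an unsupported method is an assumption of this sketch. */
size_t call_read(struct object *obj, char *buf, size_t n)
{
    typedef size_t (*read_fn)(struct object *, char *, size_t);
    read_fn rd = (read_fn)obj->locate(&file__read);
    return rd ? rd(obj, buf, n) : 0;
}
```

Note that call_read knows nothing about struct memfile: it only relies on the
locator being the first thing in the object.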

The method locator is not required to use any particular approach to implement 
this mapping-- possible implementations include an if-then chain, or a sorted 
table with a binary search. What matters is that the returned address can then
be tested for null (to see if the object in question supports the method 
in question), and then directly called.
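
The sorted-table variant might look like the following sketch. Since the
relative order of symbol addresses is only known after linking, this version
sorts its table on first use; that lazy step, and the names entry, by_address
and sorted_locate, are assumptions of mine rather than part of the system.

```c
#include <stdint.h>
#include <stdlib.h>

char file__close, file__read, file__write;   /* method symbols */

typedef void (*method_fn)(void);

/* Hypothetical implementations that just report which one ran. */
static int impl_close(void) { return 1; }
static int impl_read(void)  { return 2; }

struct entry { const void *symbol; method_fn impl; };

/* Unsorted at build time; file__write is not implemented. */
static struct entry table[] = {
    { &file__read,  (method_fn)impl_read  },
    { &file__close, (method_fn)impl_close },
};
enum { TABLE_LEN = sizeof table / sizeof table[0] };
static int table_sorted;

/* Order entries by the numeric value of the method symbol's address. */
static int by_address(const void *a, const void *b)
{
    uintptr_t x = (uintptr_t)((const struct entry *)a)->symbol;
    uintptr_t y = (uintptr_t)((const struct entry *)b)->symbol;
    return (x > y) - (x < y);
}

method_fn sorted_locate(const void *method_symbol)
{
    struct entry probe = { method_symbol, NULL };
    const struct entry *hit;

    if (!table_sorted) {             /* lazy and not thread-safe */
        qsort(table, TABLE_LEN, sizeof table[0], by_address);
        table_sorted = 1;
    }
    hit = bsearch(&probe, table, TABLE_LEN, sizeof table[0], by_address);
    return hit ? hit->impl : NULL;   /* NULL: method not supported */
}
```

For a handful of methods the chain of comparisons is probably just as fast;
the table starts to pay off when an object implements many methods.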

For reasons that I will explain later, this allows extreme decoupling between
three kinds of binary modules: those that use an interface, those that
implement it, and the one that defines it. These modules can be changed
independently without having to modify or recompile the other modules.

But first, a few extra implementation details. For convenience, it is possible
to place at the method symbol address itself a function which calls the
object's method locator with its own address and then tail-calls the resulting
method. This costs a few additional clock cycles compared to having callers
invoke the method locator directly, but it does make the calling functions
smaller.
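
A rough C rendering of such a convenience function follows. The locator here
takes a generic function pointer instead of a data pointer, because the method
symbol is now a function; that choice, the -1 result for a missing method, and
the struct closable example object are all assumptions of this sketch.

```c
#include <stddef.h>

typedef void (*method_fn)(void);
typedef method_fn (*locator_fn)(method_fn method_symbol);
struct object { locator_fn locate; };

/* The method symbol is now a real function. It asks the object's locator
   for the implementation registered under its own address, then calls it
   in tail position so an optimizing compiler can emit a plain jump. */
int file__close(struct object *self)
{
    typedef int (*close_fn)(struct object *);
    close_fn impl = (close_fn)self->locate((method_fn)file__close);
    if (!impl)
        return -1;   /* behaviour for a missing method is an assumption */
    return impl(self);
}

/* Hypothetical object that counts how often it was closed. */
struct closable { struct object base; int times_closed; };

static int closable_close(struct object *self)
{
    return ++((struct closable *)self)->times_closed;
}

method_fn closable_locate(method_fn method_symbol)
{
    if (method_symbol == (method_fn)file__close)
        return (method_fn)closable_close;
    return NULL;
}
```

User code can then write file__close(obj) as if it were an ordinary function,
and the lookup happens behind the symbol.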

Furthermore, variations are possible. The method locator could be defined so 
that you call it with the address of the object you are calling on, in 
addition to the method symbol's address. In this case, the method locator
could thereby supply optimal method implementations according to the object's
internal state.
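
To illustrate this variation, here is a sketch in which the locator also
receives the object and picks a cheaper "read" once the file is exhausted. The
in-memory file and the names read_copy and read_at_eof are hypothetical.

```c
#include <stddef.h>
#include <string.h>

char file__read;   /* method symbol */

typedef void (*method_fn)(void);
struct object;
typedef method_fn (*locator_fn)(struct object *self,
                                const void *method_symbol);
struct object { locator_fn locate; };

struct memfile { struct object base; const char *data; size_t len, pos; };

/* General case: copy up to n bytes and advance the position. */
static size_t read_copy(struct object *self, char *buf, size_t n)
{
    struct memfile *f = (struct memfile *)self;
    size_t left = f->len - f->pos;
    size_t take = n < left ? n : left;
    memcpy(buf, f->data + f->pos, take);
    f->pos += take;
    return take;
}

/* Once everything has been consumed there is nothing left to copy. */
static size_t read_at_eof(struct object *self, char *buf, size_t n)
{
    (void)self; (void)buf; (void)n;
    return 0;
}

/* The locator inspects the object's state before answering. */
method_fn memfile_locate(struct object *self, const void *method_symbol)
{
    struct memfile *f = (struct memfile *)self;
    if (method_symbol == &file__read)
        return (method_fn)(f->pos == f->len ? read_at_eof : read_copy);
    return NULL;
}

size_t call_read(struct object *obj, char *buf, size_t n)
{
    typedef size_t (*read_fn)(struct object *, char *, size_t);
    read_fn rd = (read_fn)obj->locate(obj, &file__read);
    return rd ? rd(obj, buf, n) : 0;
}
```

One consequence of this design is that a caller should not cache the located
address across calls that may change the object's state.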

Another variation, possible on systems with a stack-based calling convention,
would be for the method locator to accept all the method's arguments directly,
and tail-call the located method with them.


So, how does this help decouple modules? Let's look at users first. Since a 
module that uses an interface uses undefined linker symbols to call the methods
on an object, rather than compile-time vtable indices, methods added to an 
interface do not invalidate existing code using the old methods, any more than
extra functions added to a library would invalidate code using the other 
functions. Conversely, changing the name of a method in an interface *does*
invalidate existing code; and if the symbol names are mangled with type
information, changing the type of the method will also invalidate it. From the
linker's perspective, there will be a missing symbol.

The implementation of an interface is decoupled as well.
An object can define its method locator to implement as many methods from
as many interfaces as it pleases, without needing to be recompiled when 
an interface adds a new method. And of course none of the users of an object
need to be recompiled when an object changes which methods it implements,
how it locates them, or how it implements them.

As a result of the above, when a strong typing system is in play, one can 
change the dependency graph for recompilation from
╭─────────────────────────────────────────╮
│  interface ──▶ implementation ──▶ user  │   to something more like
├─────────────────────────────────────────┤
│  interface ─┬─▶ implementation          │   thus reducing the amount of
│             ╰─▶ user                    │   recompilations.
╰─────────────────────────────────────────╯

This is a huge improvement on the system *typically* employed by C++ 
programmers, where due to the presence of object implementation details in
the interface, the user code has to be recompiled even when only the private 
methods are changed. There is a "design pattern" or "idiom" called "PIMPL" 
which attempts to repair this flaw in the C++ object system but it is at
best imperfect, since the vtable layout is *still* part of the ABI.


There are a few more variations in implementing this system. The method locator
can be extended to a member locator, returning a location for a member, but
this puts
the system in an awkward position when one would want the given member to be 
stored in a packed struct on systems that require aligned writes, or in some 
other place that isn't a simple memory location.

The method locator of an object that extends another object could call the
extended object's method locator. But this idea of private inheritance among 
implementations should not be merged or confused with the idea of inheritance
among interfaces. 
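
Such delegation between locators could be sketched as follows. The base and
extending implementations, their tag return values, and the call helper are
hypothetical, and the methods ignore their object for brevity.

```c
#include <stddef.h>

char file__close, file__read, file__write;   /* method symbols */

typedef void (*method_fn)(void);
typedef method_fn (*locator_fn)(const void *method_symbol);
struct object { locator_fn locate; };

/* Hypothetical base implementation: supports close and read. */
static int base_close(struct object *self) { (void)self; return 10; }
static int base_read(struct object *self)  { (void)self; return 11; }

method_fn base_locate(const void *method_symbol)
{
    if (method_symbol == &file__close) return (method_fn)base_close;
    if (method_symbol == &file__read)  return (method_fn)base_read;
    return NULL;
}

/* Extending implementation: overrides read, adds write, and falls back
   to the base locator for everything else. */
static int ext_read(struct object *self)  { (void)self; return 21; }
static int ext_write(struct object *self) { (void)self; return 22; }

method_fn ext_locate(const void *method_symbol)
{
    if (method_symbol == &file__read)  return (method_fn)ext_read;
    if (method_symbol == &file__write) return (method_fn)ext_write;
    return base_locate(method_symbol);
}

/* Helper for trying it out: call any method of type int(struct object *).
   Returning -1 for a missing method is an assumption. */
int call(struct object *obj, const void *method_symbol)
{
    typedef int (*fn)(struct object *);
    fn f = (fn)obj->locate(method_symbol);
    return f ? f(obj) : -1;
}
```

In real code the inherited methods receive the extending object, so the two
implementations would also have to agree on how the base's data is laid out
inside it.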

The method locator could also be used to expose runtime type information; 
and intrinsically, it can be used to test whether a given object implements
a given interface.
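
The interface test, for instance, can be a loop over the interface's method
symbols. In this sketch the interface is described by a hypothetical array
file_interface, and the two locators exist only to have something to test.

```c
#include <stddef.h>

char file__close, file__read, file__write;   /* method symbols */

typedef void (*method_fn)(void);
typedef method_fn (*locator_fn)(const void *method_symbol);
struct object { locator_fn locate; };

/* The "file" interface, described as the list of its method symbols. */
static const void *const file_interface[] = {
    &file__close, &file__read, &file__write,
};

/* Returns 1 if the object's locator resolves every symbol in the list. */
int implements(const struct object *obj,
               const void *const *symbols, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++)
        if (obj->locate(symbols[i]) == NULL)
            return 0;
    return 1;
}

int implements_file(const struct object *obj)
{
    return implements(obj, file_interface,
                      sizeof file_interface / sizeof file_interface[0]);
}

/* Two hypothetical locators; the stub stands in for real methods. */
static void stub(void) {}

method_fn full_locate(const void *s)
{
    if (s == &file__close || s == &file__read || s == &file__write)
        return stub;
    return NULL;
}

method_fn readonly_locate(const void *s)
{
    if (s == &file__close || s == &file__read)
        return stub;
    return NULL;
}
```

An object using readonly_locate fails the test because "write" resolves to
null, while it can still be used by code that only reads and closes.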


This object system is not a fully-defined ABI- it's a collection of 
possibilities based on a central idea, which I've used in different ways.
I'm still experimenting with it, and no doubt better minds than mine could
create better variations. 

A language with extensions to the linker involved could also make the
limitations this system addresses irrelevant, but that has so far not happened.
Instead, many modern languages/systems opt to essentially compile the entire
program in one fell swoop, and hope that the compiler and the computer it runs
on are so fast that it does not matter.


------
Oren Watson