The New Jorviker

What is a Data Structure?

Long time no see. I’ve been slacking on my blog as of late. I think my problem is a certain amount of perfectionism. I’m still working on my simulation project, Ravenna. Since the last post, I’ve mostly been working on a high-performance, thread-safe computational solid geometry library for procedural generation to integrate into it. I wanted to wait until I was finished with that to make a post, but clearly I’ve let time drag on for far too long. With what little time is left in the cracks between work, chores, exercise, and personal life, it’s been two years since my last post. There will be more on that library (which I call Carthage) in the not-too-distant future, but for the moment, I just wanted to type out some ideas I had in my head.

These are much more general than what I’ve posted on this blog in the past. I really just want to come up with a decent answer to the question “What is a data structure?”

I’ve asked at least one person for their take on this and got a response that involved words like “stack”, “heap”, and “algorithm”. But I’d like to answer this question mathematically. I desire a certain amount of rigor that allows me to say “Yes, in fact, that is a data structure.” or “No, that is not.” Let’s start off with some examples and non-examples of a data structure.

  • Linked List
  • Circular Array
  • Hash Map

What about this series of bytes?

Address  Value
0        10
1        5
2        10
3        0

Hm… Difficult to interpret. Anyone could have left these bytes here. Maybe it’s a collection of one-byte key-value pairs?

Key  Value
10   5
10   0

Maybe it’s an array representing the sequence $10, 5, 10, 0$. Either interpretation is valid. This series of bytes could be an instance of either of those. But an actual data structure, it is not.
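Just to drive the ambiguity home, here’s a small C sketch (entirely my own illustration) that reads the same four bytes under both interpretations:

#include <stdio.h>

int main(void) {
  unsigned char bytes[] = {10, 5, 10, 0};

  /* Interpretation 1: two one-byte key-value pairs. */
  for (int i = 0; i < 4; i += 2) {
    printf("key %d -> value %d\n", bytes[i], bytes[i + 1]);
  }

  /* Interpretation 2: an array representing the sequence 10, 5, 10, 0. */
  for (int i = 0; i < 4; ++i) {
    printf("element %d = %d\n", i, bytes[i]);
  }
  return 0;
}

The bytes don’t care which loop you run; the interpretation lives entirely in the code.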

Okay, so what is a data structure? Put briefly, I think it’s a way to map a logical object onto a linear collection of bytes. Consider this example. ${ 1, 2, 3, 4 }$ is a sequence of integers – a “logical object”. In this case, by “logical object”, we mean “set theoretic object,” but any axiomatization (e.g. topoi) will work here. Then the (length-prefixed) array data structure will prescribe this serialization:

Address  Value  Meaning
0        4      Length
1        1      Element 0
2        2      Element 1
3        3      Element 2
4        4      Element 3

And the linked list data structure will prescribe (something like) this serialization:

Address  Value  Node  Field
0        4      0     Next Pointer
1        1            Value
2        6      2     Next Pointer
3        3            Value
4        2      1     Next Pointer
5        2            Value
6        0      3     Next Pointer
7        4            Value

But this is still a little hand-wavey. Let’s inject some theory here. For the sake of simplicity, we’ll be talking about a computer operating on arbitrary integers, not bytes or quad words like the computer I’m typing this on surely does. We’ll model its state sparsely, as a collection of write operations. We’ll view each write operation as a 2-tuple of address and value. So we’ll say that

$$C_N = \left\{ c \subseteq [0, N) \times \mathbf{N} \;:\; (a, v) \in c \wedge (a, w) \in c \implies v = w \right\}$$

In other words, we view the state of a computer with $N$ addressable integers as a collection of $N$ or fewer write operations on unique addresses. For example, suppose we have a computer containing just the length-prefixed array example above. Then its logical state will be

$$\{ (0, 4), (1, 1), (2, 2), (3, 3), (4, 4) \}$$

We choose this sparse representation rather than simply modelling the state as $\mathbf{N}^N$ so that transformations on the data have the same representation as the data itself.
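To illustrate what I mean (the notation here is my own, not anything standard), a transformation $t$ is itself just a collection of write operations, and applying it to a state $c$ amounts to an overwriting union:

$$t(c) = t \cup \{ (a, v) \in c \mid \nexists w : (a, w) \in t \}$$

Writes in $t$ take precedence, and any address untouched by $t$ carries its old value over.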

Now, if you spend your time working on GPGPU applications, this formulation for a “computer” might seem hopelessly narrow. Indeed, this entire post is going to be written with an eye toward something like an x86 CPU. If you’d like, though, you can salvage this indexing idea for any other sort of computer by replacing the index set $\mathbf{N}$ with whatever makes the most sense for your system. Perhaps $\mathbf{N}^2$ for a GPU?

So now that we have a model for the state of a computer, we’ll define a data structure on $C_N$ modelling a set $S$ as an injective function

$$D : S \to C_N$$

Pretty simple, right? Simple enough that I think this idea has been glossed over in all of the texts that I’ve read.

Now, remember that this is a mapping, not an algorithm. An algorithm (the way I use the term here) requires a machine / instruction set on which to run. A mapping is simply a logical construct, again modelled in Zermelo-Fraenkel set theory.

So, as an example, the length-prefixed array data structure would be the following:

Let $S$ be $\mathbf{N}^\ast$, the set of lists of integers of any length (including zero). We then define $D_{ARRAY} : \mathbf{N}^\ast \to C_N$ as follows.

$$D_{ARRAY}(s) = \{ (0, \mathbf{card}(s)) \} \cup \{ (i + 1, s_i) \;:\; 0 \leq i < \mathbf{card}(s) \}$$

It’s a relatively simple formulation. Let’s take a look at the example from above annotated using this notation:

Address  Value  Symbolic Value      Meaning
0        4      $\mathbf{card}(s)$  Length
1        1      $s_0$               Element 0
2        2      $s_1$               Element 1
3        3      $s_2$               Element 2
4        4      $s_3$               Element 3
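If it helps to see this as code, here’s a minimal C sketch of $D_{ARRAY}$ acting as a serializer. The memory buffer standing in for the computer’s state is my own scaffolding, not part of the definition:

#include <stddef.h>

/* A sketch of D_ARRAY: serialize the list s (of length n) into
 * `memory` as a length-prefixed array. Address 0 holds the length
 * and address i + 1 holds element i. */
void d_array(const unsigned char* s, size_t n, unsigned char* memory) {
  memory[0] = (unsigned char)n;   /* The write (0, card(s)). */
  for (size_t i = 0; i < n; ++i) {
    memory[i + 1] = s[i];         /* The write (i + 1, s_i). */
  }
}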

Linked lists are a little bit harder. This is because linked lists generally require an allocator. Since we’re building mathematical functions here, we would like them to be deterministic, which malloc certainly isn’t. Let’s say we want to make a bunch of allocations of two bytes each. malloc could give us 0, then 2, then 4, then 6, then 8, but it could just as easily give us 24, then 12, then 16, then 4. And in every case, we’ve called malloc(2) and it’s given us a different result each time.

Call       Iteration  Return Value
malloc(2)  0          24
malloc(2)  1          12
malloc(2)  2          16
malloc(2)  3          4

We could just gloss over this fact and pretend that linked lists always put their elements one after the other, but it would be more honest of us to build a mathematical abstraction that encapsulates the behavior of an allocator. Let’s view an allocator as a function taking two arguments: the traditional size in bytes that we’re familiar with, as well as a unique integer representing the number of times it has been called already. And we’ll take a function like this as a mathematical parameter to our linked list mapping.

Function   Iteration  Return Value
$A(2, 0)$  0          24
$A(2, 1)$  1          12
$A(2, 2)$  2          16
$A(2, 3)$  3          4

Unlike malloc, this function $A$ is deterministic in its arguments. So, to formalize this notion, we’ll define an allocator as $A : \mathbf{N} \times \mathbf{N} \to \mathbf{N}$, a function taking a size in bytes of the space to be allocated, and a second integer corresponding to the number of times the function has been invoked already.
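As a concrete (and entirely made-up) example, a deterministic allocator with this signature could be as simple as a bump allocator:

/* A toy allocator A : N x N -> N. The bump policy is my own choice
 * for illustration; any function deterministic in (size, iteration)
 * would do. It only works because every request in our linked list
 * has the same size. */
unsigned alloc(unsigned size, unsigned iteration) {
  /* Reserve address 0 as the null pointer, then hand out
   * consecutive blocks: A(size, i) = 1 + i * size. */
  return 1 + iteration * size;
}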

So, with our allocator parameter in hand, we can write $D_{LL}$. Let $\alpha_i = A(2,i), \forall i \in [ 0, |s| - 1 ]$ and let $\alpha_{|s|} = 0$. Then

$$D_{LL}(s) = \bigcup_{i=0}^{|s| - 1} \left\{ (\alpha_i, \alpha_{i+1}), \; (\alpha_i + 1, s_i) \right\}$$

A little bit more complicated. Let’s look at our linked list example from above symbolically annotated:

Address  Symbolic Address  Value  Symbolic Value  Node  Field
0        $\alpha_0$        4      $\alpha_1$      0     Next Pointer
1        $\alpha_0 + 1$    1      $s_0$                 Value
2        $\alpha_2$        6      $\alpha_3$      2     Next Pointer
3        $\alpha_2 + 1$    3      $s_2$                 Value
4        $\alpha_1$        2      $\alpha_2$      1     Next Pointer
5        $\alpha_1 + 1$    2      $s_1$                 Value
6        $\alpha_3$        0      $0$             3     Next Pointer
7        $\alpha_3 + 1$    4      $s_3$                 Value

So this function takes a set-theoretic list (and an allocator) and maps to a collection of bytes. Now, the set of all lists is certainly an object of mathematical interest. And the set of all valid linked lists $D_{LL}(\mathbf{N}^\ast)$ should be too.
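Here’s a hedged C sketch of $D_{LL}$ in the same style as the array serializer above, reusing the toy alloc from the previous sketch; again, the memory buffer is my own stand-in for the machine state:

#include <stddef.h>

unsigned alloc(unsigned size, unsigned iteration);  /* The toy allocator above. */

/* A sketch of D_LL: serialize the list s (of length n) into `memory`
 * as a linked list. Node i lives at alpha_i = alloc(2, i): its next
 * pointer at alpha_i and its value at alpha_i + 1. Alpha_n is 0. */
void d_ll(const unsigned char* s, size_t n, unsigned char* memory) {
  for (size_t i = 0; i < n; ++i) {
    unsigned alpha_i = alloc(2, (unsigned)i);
    unsigned alpha_next = (i + 1 < n) ? alloc(2, (unsigned)(i + 1)) : 0;
    memory[alpha_i] = (unsigned char)alpha_next;  /* The write (alpha_i, alpha_{i+1}). */
    memory[alpha_i + 1] = s[i];                   /* The write (alpha_i + 1, s_i). */
  }
}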

When I brought this idea up to a coworker, he mentioned that this whole definition of data structures in terms of mappings might be meaningless if the mapping resulted in all possible byte configurations being valid instances of the data structure.

Well, for $D_{ARRAY}(\mathbf{N}^\ast)$, that is indeed the case. Every byte configuration, regardless of the intent of the author, is going to have an interpretation as a list. But, in general, people use linear container data structures like arrays and linked lists for things a little bit more restrictive than “the set of all lists of integers”. Suppose instead that we were storing elements of the Fibonacci sequence $\mathcal{F} \subset \mathbf{N}^\ast$. Well, then $D_{ARRAY}(\mathcal{F})$ is certainly much more restrictive than “all byte configurations”.

I think one of the central conceptual leaps I made as a programmer was when I started holding both the logical object (like $\mathbf{N}^\ast$) and the byte configuration ($D_{LL}(\mathbf{N}^\ast)$) in my head at the same time. Programming languages are tools that are supposed to bridge the gap from one to the other, but oftentimes, it feels like they don’t reach full expressivity in either domain. The job of the programmer is to come up with a logical model that accomplishes a task, and then to come up with a byte configuration and collection of operations on it that also conform to that model and have attractive operational properties like low latency or low memory usage.

Getting any given programming language to bend to your mapping might require you to seek out dusty corners of the language, or even to extend the language itself with compiler plugins, C extensions, or full formal feature proposals.

Now, I often hear from people who work in languages that run in interpreters or virtual machines that this sort of thing doesn’t apply to them – they don’t operate at the level of bytes. They operate at the level of Java objects or of Python PyObjects.

But these languages still run on CPUs. There is still a mapping from the logical construct you have in your head to a physical configuration of bytes in memory; there are just a few more layers of abstraction between what’s in your head and the bytes in the machine. Unfortunately for you, that means that someone else is in control of the lower layers of the mapping.

In some cases, it’s important for you to understand what’s happening at the byte level when working in Java. In others, it’s much less important.

In some cases, you’ll have to go beyond the level of abstraction that we’re using in this blog post and think about things like cache thrashing that occur on NUMA-based machines. The point is that, regardless of the language that you’re working in, the logic in this post should still hold.

Methods

One critical aspect of data structures is mutation – the ability to take in an instance of a data structure and change it to something slightly different, but potentially with a very different interpretation. Back in our logical space $S$, we can think of a mutation as a function $M : S \to S$. For example, in our space of integer sequences $\mathbf{N}^\ast$, one family of mutations would be $APPEND_x : \mathbf{N}^\ast \to \mathbf{N}^\ast$, where $APPEND_x$ is a function appending the integer $x$ to the end of the input sequence. In other words:

$$APPEND_x((s_0, s_1, \ldots, s_{n-1})) = (s_0, s_1, \ldots, s_{n-1}, x)$$

Our data structure definitions above allow us to talk about homomorphic mutators on our bytewise representation of our list, that is, about

The function that takes in a bytewise linked list representation of a list of integers and appends a new integer to the end of that list.

We can say that the logical mutator $APPEND_x$ induces the bytewise mutator $D_{ARRAY} \circ APPEND_x \circ D_{ARRAY}^{-1}$. For the sake of simplicity, let’s call this $\mu(D_{ARRAY}, APPEND_x)$ from here on.

An important takeaway from this is that, once you’ve defined your data structure – the mapping from a set theoretic list to a collection of bytes – you’ve already defined what write operations need to happen in order to implement any given method. There may be faster and slower ways of achieving those write operations depending on how you implement the method on your particular architecture/machine, but the general form is already determined for you.

To be more explicit about it, first define $\Delta : C_N \times \mathbf{N} \to \mathbf{N}$, a dereferencing function which looks up the value at a given address. Then

$$\mu(D_{ARRAY}, APPEND_x)(c) = \left( c \setminus \{ (0, \Delta(c, 0)) \} \right) \cup \left\{ (0, \Delta(c, 0) + 1), \; (\Delta(c, 0) + 1, x) \right\}$$

It’s worth noting that $\mu(D_{ARRAY}, APPEND_x)$ is still just a mapping, not an algorithm. While there are many algorithms that could implement the logic of the mapping, the mapping does not uniquely determine one. However, we can take a stab at it. The following assembly would implement the above in x86. I’m using 8-bit assembly here and trying to keep everything in registers to simplify a bit. This is not the sort of assembly you’ll see coming out of your compiler.

; assume value x to be appended in al
; assume array starts at address in sil
append:
  mov (%sil), %cl       ; Get the length n in %cl.
  add $1, %cl           ; Calculate n + 1, the position of the new element.
  mov %al, (%cl,%sil)   ; Put x at position n + 1.
  mov %cl, (%sil)       ; Put the new length in position 0.

Or, in (mostly) equivalent C:

void append(register unsigned char* array,
            register unsigned char value)
{
  array[array[0] + 1] = value;
  ++array[0];
}

Linked lists are a little more complicated. An append operation would look something like the following:

#include <stdlib.h>

typedef struct node_t {
  struct node_t* next;
  unsigned char value;
} node_t;

void append(register node_t** node, unsigned char to_append) {
  while (*node != NULL) {
    node = &((*node)->next);
  }
  *node = (node_t*)malloc(sizeof(node_t));
  **node = (node_t){NULL, to_append};
}

I’ll spare you the assembly version of this, since it’s pretty long, complicated, and dependent on the malloc ABI.

Constraints

The above linked list append operation is a good target for discussion about constraints/invariants. Invariants of linked lists are facts (or “theorems”) about $D_{LL}(\mathbf{N}^\ast)$, our bytewise object. For example, from our definition, it is true that, $\forall 0 \leq i < |s| - 1$,

$$\exists (x, y) \in D_{LL}(s) \;:\; x = \alpha_i \;\wedge\; (y + 1, s_{i+1}) \in D_{LL}(s)$$

Or, in other words, if an entry for $s_i$ exists at $\alpha_i$ and $i$ is not the final index of the list, then there is a subsequent entry at the address $y$, which is recorded at address $x$, holding the value for $s_{i+1}$. This condition is critical. If this were not so, then we would have no way of iterating over the data structure to retrieve all of the elements and reconstruct the list.

Statements like these are necessary (but not sufficient) conditions for the correctness of our data structures and, by extension, our algorithms.

Here’s another one. The final entry in the linked list has a “next” entry of 0 – that is, $(\alpha_{|s| - 1}, 0) \in D_{LL}(s)$ – or nullptr if you’d like to refer to it that way.

Seems almost too trivial to note down, but it’s key to our ability to write a proper append algorithm.
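As a sketch of how you might check these two invariants mechanically, here’s a C function using the node_t from the append example above. The step bound is my own addition, there to guard against cycles, which would violate the invariants anyway:

#include <stdbool.h>
#include <stddef.h>

/* Walk the list from head, verifying that following next pointers
 * reaches NULL (our address 0) within max_nodes steps. A list that
 * violates the termination invariant either cycles or wanders off
 * into unrelated memory. */
bool is_null_terminated(const node_t* head, size_t max_nodes) {
  for (size_t i = 0; i <= max_nodes; ++i) {
    if (head == NULL) {
      return true;   /* Reached the null terminator. */
    }
    head = head->next;
  }
  return false;      /* Too many hops: the invariant is violated. */
}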

So, our goal with this append function is, given a pointer to a pointer to the first byte of a linked list representing the list $s$, mutate the linked list so that it now represents $s + x$, where $x$ is a new integer to append to the end of the list. This, of course, means that just before the function begins, we will have a valid linked list and just after it ends, we will have a (bigger) valid linked list. We make no guarantees while the function is executing. This is why multithreading is so difficult and why lock-free programming is even harder.

Let’s think about the mutations that need to be made in order to carry out this append operation. Since we already have $D_{LL}(s)$ in memory, we don’t need to redo those write operations. So we only have to perform the delta. That is:

$$D_{LL}(s + x) - D_{LL}(s) = \left\{ (\alpha_{|s| - 1}, \alpha_{|s|}), \; (\alpha_{|s|}, 0), \; (\alpha_{|s|} + 1, x) \right\}$$

where now $\alpha_{|s|} = A(2, |s|)$ is the address of the newly allocated node.

Please excuse the sloppiness of notation with my “subtraction” above. What I mean by this is to keep any write operations on the left-hand side that are not on the right-hand side, as well as any write operations appearing on both sides where the two differ on the value, but to remove any entries that are exactly the same between the two.

Now the statement above only holds if $s$ is non-empty. If it’s empty, then the first write – $(\alpha_{|s| - 1}, \alpha_{|s|})$ – is not necessary, and indeed, impossible. In my implementation, we avoid the need for a conditional by hijacking this write to update the caller’s parameter pointing to the head of the list. One could even consider this extra pointer a part of the data structure, but we’re glossing over that bit in this construction.

Take a look at the definition of $D_{LL}(s)$ above to convince yourself of the delta formula we’ve just written down.

So we need to perform exactly three write operations:

  • Write the value of the new entry $x$.
  • Write a NULL pointer to the end of the list.
  • Update the previous “last” entry to point to the new entry.

Now, we can use the constraints I called out above to show the validity of our algorithm (or at least to show that we haven’t violated certain of its invariants).

Now, since we only have a pointer to the beginning of the list, we need to do a little bit of work to get the value $\alpha_{|s| - 1}$ in memory:

void append(register node_t** node, unsigned char to_append) {

At this point, *node contains the value $\alpha_0$ (or NULL, in the case of an empty list), while node points to a parameter on the stack. Next, we have a loop.

while (*node != NULL) {
    node = &((*node)->next);
}

Now, based on our conditions above, at the end of the i’th iteration (starting from 0) of the loop, *node contains $\alpha_{i+1}$ and node contains $\alpha_{i}$. And since $\alpha_i = 0$ if and only if $i = |s|$, we know that when the loop exits, *node will contain 0 and node will contain $\alpha_{|s| - 1}$. So, now that we have $\alpha_{|s| - 1}$ in memory, we can perform the first of the three write operations, namely $(\alpha_{|s| - 1}, \alpha_{|s|})$:

  *node = (node_t*)malloc(sizeof(node_t));

Remember, from our definition above, $\alpha_i$ corresponds to the i’th call to the allocator, and this line is exactly the $|s|$’th such call. Now *node contains $\alpha_{|s|}$, so the final line makes the writes $(\alpha_{|s|}, 0)$ and $(\alpha_{|s|} + 1, x)$:

  **node = (node_t){NULL, to_append};

I’ve often heard that good programmers “think in terms of constraints”. I think this is exactly the sort of reasoning those comments are referring to.

Injectivity

So we can’t allow just any old mapping to be a data structure. It must be injective. In other words, each byte configuration must correspond to at most one logical configuration. If this were not so, we could say that a mapping sending every input list to all zero bytes constituted a “data structure”. But this goes against intuition. A single collection of bytes can’t be used to represent all possible values.

This property allows us to do something interesting. Let $S$ again be any logical object (say the set of all lists of integers) and let $D_1$ and $D_2$ be two different data structure representations of lists of integers (like the array and the linked list). Since injective functions admit an inverse on their image, we can talk about

$$D_2 \circ D_1^{-1} : D_1(S) \to D_2(S)$$

This function is a mapping from an element of $D_1(S)$ to an equivalent element of $D_2(S)$. So, in other words, you’re able to map from an instance of any integer list data structure to an equivalent instance of any other integer list data structure.

Even better, since both the domain and codomain of this function are byte-based, we know that an algorithm implementing the translation must exist.
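For instance, here’s a hedged sketch of $D_{LL} \circ D_{ARRAY}^{-1}$ in C, again reusing the node_t from the append example: it inverts the length-prefixed representation back to the logical list and re-serializes it as a linked list (this time with a real malloc rather than our idealized allocator).

#include <stdlib.h>

/* Translate a length-prefixed array instance into an equivalent
 * linked list instance. Conceptually: invert D_ARRAY to recover
 * the logical list, then apply D_LL. Returns the new list's head. */
node_t* array_to_list(const unsigned char* array) {
  node_t* head = NULL;
  node_t** tail = &head;
  unsigned char length = array[0];           /* Invert the length prefix. */
  for (unsigned char i = 0; i < length; ++i) {
    node_t* node = (node_t*)malloc(sizeof(node_t));
    *node = (node_t){NULL, array[i + 1]};    /* Element i lives at address i + 1. */
    *tail = node;                            /* Link it onto the end of the list. */
    tail = &node->next;
  }
  return head;
}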

P.S. Cloud Run is Kind of Cool

I’ve quietly moved this blog from Digital Ocean to GCP’s Cloud Run. I’m quite liking it so far. Scaling to zero means I get some good savings on CPU costs, and the combination of auto-scaling and a CDN means I’ll be able to withstand it if I’m ever the victim of the Hacker News hug of death. Plus, I get TLS for free. Who wants to deal with those certs on their own?