The continuing emphasis on pointer-heavy data structures in computer science curricula is doing a massive disservice to students and to the employers who go on to hire them. Pointer-heavy data structures are TERRIBLE on modern hardware. They were great back in the era when pointers were small (≤32 bits) and unpredictable memory access was roughly the same speed as CPU operations. We left that era behind long ago. Today pointers are a whopping 64 bits, and looking up your next node through a pointer is so slow that it's the CPU equivalent of taking a ten-year sabbatical.
For example: at what size does it become better to use a list rather than a vector, when doing lots of insertions and deletions that are O(1) for lists and O(n) for vectors? NEVER. It is always better to use a vector. Don't take it from me; this is according to Bjarne Stroustrup: https://www.youtube.com/watch?v=YQs6IC-vgmo
Pointer-heavy data structures like linked lists are so bad for memory locality and storage overhead on modern hardware that it is impossible for them to catch up with contiguous memory vectors.
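If you want to check that claim yourself, here is a rough sketch of the kind of experiment Stroustrup describes (my reconstruction, not his code): keep a sequence sorted while inserting random integers, finding each insertion point by linear search from the front. On paper the list wins the insertion step; in practice the pointer-chasing traversal dominates and the vector wins by a wide margin.

    #include <chrono>
    #include <cstdio>
    #include <list>
    #include <random>
    #include <vector>

    // Insert n random ints into a sequence, keeping it sorted; the
    // insertion point is found by linear search from the front.
    template <typename Seq>
    long long insert_sorted_ms(int n) {
        std::mt19937 rng(42);
        Seq seq;
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < n; ++i) {
            int value = static_cast<int>(rng());
            auto pos = seq.begin();
            while (pos != seq.end() && *pos < value) ++pos;  // linear search
            seq.insert(pos, value);  // O(n) for vector, O(1) for list
        }
        auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    }

    int main() {
        const int n = 20000;
        std::printf("vector: %lld ms\n", insert_sorted_ms<std::vector<int>>(n));
        std::printf("list:   %lld ms\n", insert_sorted_ms<std::list<int>>(n));
    }

With optimizations on, the gap typically grows with n: the vector's search streams through contiguous memory while the list's chases a cache-missing pointer per element.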
There are still situations where you need a tree but if you do, try very hard to use a representation like a binary array heap where the child links are implicit. Or if you must use "pointers" consider allocating all the values in a contiguous block and storing the link structure separately as 32-bit or smaller integer indexes into the data array. This at least improves the memory overhead and locality.
</rant>
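To make the index-based suggestion concrete, here is a minimal sketch of that layout (the names Node32 and IndexTree are mine, for illustration):

    #include <cstdint>
    #include <vector>

    // All nodes live contiguously in one std::vector; links are 32-bit
    // indexes into it, halving the link overhead vs 64-bit pointers and
    // keeping neighbouring nodes close together in memory.
    struct Node32 {
        int32_t key;
        int32_t left  = -1;   // index of left child, -1 means "none"
        int32_t right = -1;   // index of right child
    };

    struct IndexTree {
        std::vector<Node32> nodes;   // one contiguous allocation
        int32_t root = -1;

        void insert(int32_t key) {
            nodes.push_back({key, -1, -1});
            int32_t idx = static_cast<int32_t>(nodes.size()) - 1;
            if (root == -1) { root = idx; return; }
            int32_t cur = root;
            for (;;) {
                int32_t& next = key < nodes[cur].key ? nodes[cur].left
                                                     : nodes[cur].right;
                if (next == -1) { next = idx; return; }
                cur = next;
            }
        }
    };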
> There are still situations where you need a tree but if you do, try very hard to use a representation like a binary array heap where the child links are implicit. Or if you must use "pointers" consider allocating all the values in a contiguous block and storing the link structure separately as 32-bit or smaller integer indexes into the data array. This at least improves the memory overhead and locality.
There is an ironic side to that advice. If the node in the tree is bigger than half a cache line (i.e. two of them don't fit into one 64-byte line), then pointers vs vector doesn't get you a win, as either way you hit a different cache line when you access the next element. Stroustrup's example is a nice "gotcha", but the reality is that most of us don't store integers in a vector or tree, we store bigger things. That boils down to two cases: the thing is so big that storing a couple of pointers alongside it doesn't matter, or you store a pointer to the bigger thing in your tree, which means you are using a pointer-heavy data structure anyway.
OK, OK, I've cheated. I get it, what you are saying is that using a linked list to store those pointers to the big objects is always worse than storing them in an array. But there is a third option which is better than both: embed the pointers into the big object. If that pointer ends up in the same cache line as whatever you are using to distinguish the objects, the pointer-heavy method will be faster than the non-pointer method, and that is an irony.
This pattern has been followed by hoary old C programmers like myself for decades. It's all over the Linux kernel for instance. Granted it ain't so easy to do in languages other than C. (Even Rust lacks the necessary "I have a pointer to a linked list node that I know is embedded in an object of type T, give me the pointer to the T object" operation, because it's so hard to make type safe.) But just because most languages don't support it doesn't mean pointer-heavy data structures are bad.
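For anyone who hasn't seen the kernel idiom, here is a rough sketch of its shape, written in C++ for consistency with the other snippets here (the kernel's own version is C, built on list_head and the container_of macro):

    #include <cstddef>

    struct ListHead {            // the embedded link words
        ListHead* next;
        ListHead* prev;
    };

    struct Task {                // the "big object"
        int id;
        char name[32];
        ListHead queue;          // the links live inside the object itself
    };

    // Recover the Task from a pointer to its embedded ListHead by
    // subtracting the link's offset within the struct. This is the
    // operation most languages can't express safely; offsetof is
    // well-defined here because Task is a standard-layout type.
    inline Task* task_of(ListHead* node) {
        return reinterpret_cast<Task*>(
            reinterpret_cast<char*>(node) - offsetof(Task, queue));
    }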
> but the reality is that most of us don't store integers in a vector or tree, we store bigger things.
I work in numerical computing where storing 64/32-bit floats is by far the most common use case. In this world, storing a lot of small, immutable values is the norm and using pointer-heavy data structures would be a disaster.
That's a good trick; I'll have to think about good ways to expose it safely in a higher-level language.
Do all programmers program on 64-bit x86 machines? I am not an embedded or systems programmer, so I don't have the answer to this question. But does your statement hold true for all of the other chips used in the embedded space? If not, then teaching those pointer-heavy structures is useful for programmers that go on to use ARM or other non-x86 chips.
I'm not sure why he made a big deal about pointer size. That is not the problem as far as I have seen. Memory latency (in combination with TLB lookups, which can end up also hitting memory latency) is the problem.
Everything else he says I completely agree with, but I think pointer size muddies the water.
Doubling the pointer size doubles the memory overhead (where overhead is defined as memory used that isn't your data). A doubly linked list node holding a 4-byte int, for example, carries 16 bytes of links on a 64-bit machine instead of 8. Sure, it's not the worst part of pointer-heavy data structures, but if the overhead is doubled, that's potentially twice as many unnecessary cache misses.
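To put numbers on that, a back-of-the-envelope illustration (sizes assume a typical 64-bit platform with 8-byte, 8-byte-aligned pointers):

    #include <cstdint>

    struct PtrNode { int32_t key; PtrNode* left; PtrNode* right; };
    struct IdxNode { int32_t key; int32_t left; int32_t right; };

    // 4 bytes of data, 16 bytes of links, 4 bytes of padding = 24 bytes.
    static_assert(sizeof(PtrNode) == 24, "assumes an LP64-style platform");
    // 4 bytes of data, 8 bytes of links, no padding = 12 bytes.
    static_assert(sizeof(IdxNode) == 12, "assumes an LP64-style platform");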
"Austin" provides several container data structures in a single instance. With a function call it can be instructed to change the representation from any one of them to any other. E.g. a red-black tree can be told to become an AVL tree or sorted list and so on.
When I wrote this and announced it, a gentleman from Austria named Carl Van Tast contributed the AVL implementation. I've not encountered him since; I remember telling him it was Van-Tast-ic. That could be why I haven't encountered him since.
The API is in "udict.h" (universal dictionary). The algorithms are enumerated thus:
Let me explain to everyone else how your function works. It's turning a tree into a list. The tree is rooted at “rest”, and the list starts at “pseudoRoot” and ends at “tail”.
If the root of the tree has no child on the left, it pops that node and adds it to the end of the list. If it does have a left child, then it does a right rotation on the tree, which makes the left side smaller. (See https://en.wikipedia.org/wiki/Tree_rotation .)
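In code, the loop described above looks roughly like this (a sketch reconstructed from the description, using the same names; the actual function being discussed may differ in details):

    struct Node { int value; Node* left; Node* right; };

    // Flatten a BST into a right-leaning "vine": a sorted list threaded
    // through the right pointers, running from pseudoRoot to tail, with
    // the remaining tree rooted at rest.
    Node* tree_to_vine(Node* root) {
        Node pseudoRoot{0, nullptr, root};
        Node* tail = &pseudoRoot;   // end of the list built so far
        Node* rest = root;          // root of the part still to flatten
        while (rest != nullptr) {
            if (rest->left == nullptr) {
                tail = rest;                // no left child: pop it onto the list
                rest = rest->right;
            } else {
                Node* pivot = rest->left;   // right rotation, making
                rest->left = pivot->right;  // the left side smaller
                pivot->right = rest;
                rest = pivot;
                tail->right = pivot;        // keep the list attached
            }
        }
        return pseudoRoot.right;    // head of the sorted list
    }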
I was wondering if this really runs in linear time. Call a node in the tree good if it lies on the right-most part of the tree. In other words, a node is good if the path from the root to that node only goes right at each step. The root is always good. A node is bad if it is not good. Each right rotation turns a bad node into a good one, and each pruning removes one good node. Since a rotation never makes a good node bad, there are at most n rotations and exactly n prunings. So it is really O(n) time.
Yes, this is all exactly right. To convince everyone else that I'm not a wizard - my function is just a slightly modified version of the tree-to-vine function from the Day-Stout-Warren algorithm. https://en.wikipedia.org/wiki/Day%E2%80%93Stout%E2%80%93Warr...
There is an alternative solution where you basically do it using an in-order traversal. It uses more stack space because it holds one node pointer in global state and two node pointers between the left and right recursion, while the build-a-list solution holds just a single node pointer during the left and right recursion.
Let's say you want to convert a tree to a singly linked list, in order. You can just do an in-order traversal of the tree, keep track of the previous node, and assign previous.left = current_node. This is fine because when you are at the successor of a node, the whole of the previous node's left subtree has already been visited (everything in it is smaller), so you will never need to follow that left pointer again:
    previous = nil
    # in_order is assumed to yield each node of the tree in sorted order.
    in_order(tree) do |node|
      # Thread the list through the left pointers, which are no longer needed.
      previous.left = node if previous
      previous = node
    end
However, you can't do node.right = previous until you have visited the right subtree of the current node, so you can't use exactly the same code for a doubly linked list. If you do the fix-up after you have finished processing the current node, though, it works fine. So if you have an iteration that gives you both an in-order and a post-order callback, plus a way of sharing state between them, you can just do the fix-up in the post-order callback, where it is safe:
    previous = nil
    head = nil
    # multi_order is assumed to visit each node twice, once :inorder and
    # once :postorder, handing the post-order callback whatever the
    # in-order callback for the same node returned.
    multi_order(tree) do |node, order, state|
      if order == :inorder
        if previous
          fixup_state = [node, previous]
        else
          head = node
        end
        previous = node
        fixup_state # passed back into the iterator; returned in the post-order callback
      else
        if state
          fixup_node, fixup_previous = state
          fixup_previous.left = fixup_node  # left acts as the "next" pointer
          fixup_node.right = fixup_previous # right acts as the "prev" pointer
        end
      end
    end
    # Close the list into a ring, with left as "next" and right as "prev".
    previous.left = head
    head.right = previous
    head
Why wouldn't you do a pre-order traversal and connect the appropriate nodes? When they say "recursion is key", that's actually misleading and this problem doesn't seem hard at all.
I think if you place a limit of O(log(n)) on allocated space and guarantee the height of the tree is at most O(log(n)), then that breaks the scratch solution and makes the in-place solution, where you use scratch space on the stack, the only option. The in-place solution can probably be fixed to use O(log(n)) space even if the tree is unbalanced, because you don't need to allocate stack space unless a node has both a left and a right pointer.
There is a follow-up to this question. Given two (balanced) binary search trees, where the first BST has N nodes and the second has M nodes, how would you merge both trees in linear time and constant space?
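One textbook answer, sketched below with no claim that it is the intended one: flatten both trees into sorted vines with a tree-to-vine pass like the one sketched earlier, merge the vines, then rebalance using the vine-to-tree half of Day-Stout-Warren. Each pass is linear in N+M and needs only a constant number of temporaries.

    // Builds on the Node struct and tree_to_vine sketch above.

    // Merge two sorted right-threaded vines into one.
    Node* merge_vines(Node* a, Node* b) {
        Node head{0, nullptr, nullptr};
        Node* tail = &head;
        while (a != nullptr && b != nullptr) {
            Node*& smaller = (a->value <= b->value) ? a : b;
            tail->right = smaller;
            tail = smaller;
            smaller = smaller->right;  // advance whichever vine we took from
        }
        tail->right = (a != nullptr) ? a : b;
        return head.right;
    }

    // One DSW "compress" pass: `count` left rotations down the right spine.
    void compress(Node* pseudoRoot, int count) {
        Node* scanner = pseudoRoot;
        for (int i = 0; i < count; ++i) {
            Node* child = scanner->right;
            scanner->right = child->right;
            scanner = scanner->right;
            child->right = scanner->left;
            scanner->left = child;
        }
    }

    // Rebalance a vine of `size` nodes into a balanced BST.
    Node* vine_to_tree(Node* vine, int size) {
        Node pseudoRoot{0, nullptr, vine};
        int pow2 = size + 1;                  // round size+1 down to a power of two
        while (pow2 & (pow2 - 1)) pow2 &= pow2 - 1;
        compress(&pseudoRoot, size + 1 - pow2);  // fill the bottom level
        for (int rem = pow2 - 1; rem > 1; rem /= 2)
            compress(&pseudoRoot, rem / 2);
        return pseudoRoot.right;
    }

    // Usage: Node* merged =
    //     vine_to_tree(merge_vines(tree_to_vine(t1), tree_to_vine(t2)), n + m);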