.. _bikeshed:
phkmalloc
=========
Jason Evans laid
`jemalloc `_
to rest yesterday, and gave a kind shoutout to my malloc, aka.
"phkmalloc", and it occured to me, that I should write that story down.
I wrote a little bit about it in my article for the
`30 year aniversary issue of the FreeBSD Journal `_
but there is more to it than that.
Why
---
In FreeBSD we inherited Chris Kingsley's malloc implementation from
BSD, it worked, and we had much larger fish to fry, so we paid it
no attention.
During 1994 and 1995, RAM prices went through the roof, and therefore,
conciously or subconciously, efficient use of RAM became a concern.
In my case it became a huge concern, because my machine only had
4MB RAM and being release engineer, I ran GCC a lot.
My system paged more than I thought it should, and in particular
I noticed a very distinct burst of disk-activity whenever GCC
terminated.
The "death-rattle" was even more prominent with Tcl and Tk programs.
That made no sense to me, why would things page when the program
was freeing things prior to termination ?
The "canonical" malloc implementation, the malloc implementation to
start all malloc implementations, is in chapter 8.7 in The C
Programming Language, and as all code in that book it is simple and
elegant: A linked list holds the free blocks of memory, and that
is pretty much it.
If you are up for a challenge: Go read chapter 8.7 now, and see if
you can predict what comes next :-)
The problem
-----------
K&R's malloc was written on a swapping system, probably a PDP-11,
where either the entire process is in memory or the process does
not run.
In the meantime we had gotten Virtual Memory, where pages may or
may not be in RAM when you need them, but the kernel handles that,
so from a programmes point of view, the only difference is that
sometimes a memory access causes disk activity.
Chris Kingsley's malloc starts with this comment:
.. code-block:: none
/*
* malloc.c (Caltech) 2/21/82
* Chris Kingsley, kingsley@cit-20.
*
* This is a very fast storage allocator. It allocates blocks of a small
* number of different sizes, and keeps free lists of each size. Blocks that
* don't exactly fit are passed up to the next larger size. In this
* implementation, the available sizes are 2^n-4 (or 2^n-10) bytes long.
* This is designed for use in a virtual memory environment.
*/
So clearly he had thought about the swap/paging thing, and yet,
performance felt less than stellar and there was that death-rattle
phenomena.
The first things I did was to hack a version of the malloc, to log
all calls to malloc(3), free(3) and realloc(3) to a file and tried
to make sense of the, for the time, vast amount of data that produced.
It was pretty obvious from my PostScript plots, that performance
was far from stellar, and even more obvious that the death-rattle
was associated with freeing a lot of memory.
Why would *freeing* memory cause the kernel to page things in and
out ?
To ``free(3)`` a block of memory, you have to walk the free-list to
find the right place to insert it, and the linked list is stored
in the first couple of words of all the free chunks.
Worst case, to free a chunk of memory, we have to read the first
couple of words of all the memory not currently use. The death-rattle
was the kernel paging all the memory the process explicitly did not
use, into RAM, so the process could mark even more memory as
explicitly not used.
I may have exclaimed "Duh!" at this point.
The fix(es)
-----------
I went to work fixing this, and my first hack was brutal: Whenever
a chunk was to be freed, I would chop a small struct from front of
the first free chunk on the free list:
.. code-block:: none
struct free_chunk {
struct free_chunk *next;
struct free_chunk *prev;
void *chunk;
size_t length;
};
And put that structure on the free list, so that I would never have
to touch the actual free memory again, unless it was reallocated.
That nearly eliminated the death-rattle, and things ran markedly faster.
Calling ``free(3)`` with only a pointer, and no size, means malloc
implementations need some "trick" to remember how big that chunk
was when it was allocated.
K&R malloc, and every other malloc, hides that in a small struct
right before the chunks:
.. code-block:: none
free(ap) /* put block ap in free list */
char *ap:
{
register HEADER *p, *q;
p = (HEADER *)ap -1; /* point to header */
[…]
That bugged me, because it meant that a chunk of memory which had
been unused for a long time (in VM-traffic terms) before the process
calls ``free(3)`` still get paged into RAM just to mark it unused.
That reminded me of a trick I had seen on a very old and very strange
computer, which needed to keep track of stuff on the drum memory.
Biting the bullet
-----------------
So I blew away all of Chris' code and started from scratch.
Along the way I collected a lot of data and plotted it on
my inkjet printer, and one of those plots still have a
place in my heart and on my wall, I call it "Towers of
Barad-dûr":
.. image:: barad_dur.png
:width: 60 %
I will not recount the resulting architecture, you can find
that in the paper I wrote for the "FreeNix" track of the
USEnix Annual Technical Conference in 1998 in New Orleans:
`Malloc(3) revisited `_
Messing up with ``malloc(3)/free(3)/realloc(3)`` was a very well known
problem, and there were several malloc implementations were written
explicitly to catch that sort of trouble, but clearly not enough
people used them, usually because all the checking made them slow.
Because I kept the "metadata" away from the chunks themselves, and
because I used a binary "buddy" layout for sub-page-sized allocations,
I could detect some of the most common mistakes.
First I thought "We're not having any of that" and made phkmalloc
``abort(2)`` on any wrong usage. Next time I rebooted my laptop
``fsck(8)`` aborted, and left me in single user mode until I could
fix things with a floppy disk.
So I changed the logic to detect the problems, spit out a
warning but do nothing when ``free(3)`` was passed bad pointers.
But I wanted to be able to control that at run time, but how
do you run time configure ``malloc(3)`` ?
An environment variable was obvious, but how do you set the system
wide default ?
It may also sound obvious to have a configuration file, but calling
``open(2)/read(2)/close(2)`` and doing a bunch of string parsing
in all programs seemed excessive, and I was not even sure that it
would be possible, because, by definition, ``malloc(3)`` was not
working yet.
A symbolic link does not *have* to point to something
in the filesystem, it is essentially just a very tiny text-file:
.. code-block:: none
ln -sf A /etc/malloc.conf
tells phkmalloc to call ``abort(2)`` at the first sign of trouble.
Later more flags were added: ``J`` for 'fill with junk', ``Z``
for "fill with zeros" and so on.
Reasonable people who's opinions I respect, have called this
hack anything from "brilliant" to "an afront to all morals".
I think it is OK.
Rubber - Road, Road - Rubber
----------------------------
When I got to a point where I felt performance was sufficiently
improved, not only over Chris' malloc, but also over the
GNU-malloc which we used as a bandaid in some corners, I
committed "phkmalloc" in the middle of september 1995:
.. code-block:: none
``phkmalloc''
Performance is comparable to gnumalloc if you have sufficient RAM, and
it screams around it if you don't.
Compiled with "EXTRA_SANITY" until further notice.
see malloc.3 for more details.
Lots of positive feedback on the speedup, but also trouble reports.
I want to credit the band at this point: Instead of blaming phkmalloc,
demanding a backout, people went to work, even if things could be
pretty obscure.
For instance John Dyson:
.. code-block:: none
These changes fix a bug in the clustering code that I made worse when adding
support for EXT2FS. Note that the Sig-11 problems appear to be caused by
this, but there is still probably an underlying VM problem that let this
clustering bug cause vnode objects to appear to be corrupted.
The direct manifestation of this bug would have been severely mis-read
files. It is possible that processes would Sig-11 on very damaged
input files and might explain the mysterious differences in system
behaviour when phk's malloc is being used.
Or later the same day Bill ("Wild Bill Ethernet") Paul:
.. code-block:: none
phkmalloc strikes!
#ifdef out a number of calls to free() left over from the original
GNU ypserv implementation. As near as I can tell, the Berkeley DB
package does its own garbage collection, hence the caller doesn't
have to worry about free()ing the memory returned in the DBT
structures during lookups (I'm still not 1005 sure about this:
the DB code is very hard to follow. I must use dynamically
allocated memory since you can retreive arbitrarily large records
from a database, but I'm not sure where it ends up letting go
of it). This was not true with GDBM; you had
to do your own garbage collection.
The general rule is that if you allocate memory inside an RPC
service routine, you have to free() it the next time the routine is
called since the underlying XDR routines won't do it for you.
But if the DB package does this itself, then we don't need to do
it in the main program.
Note that with the original malloc(), there were never any errors
flagged. phkmalloc complained quite loudly.
I would be wrong to say that "all Hell broke loose", but maybe Heck
had a cat-flap, and we would need to get to the bottom of this,
before our next release, which lead to one of my all-time favorite
commit message punch-lines:
.. code-block:: none
phkmalloc/2
"zero' and 'junk' options to help find and diagnose malloc abuse.
EXTRA_SANITY defaults "junk" to on.
[…]
SANITY is not an option anymore. (!!)
And then it just went on, and on and on:
.. code-block:: none
Avoid bogus free() of a junk pointer.
/bin/sh corruption caused by non-zeroed malloc() in libedit
Another use of un-cleared returns from malloc squashed...
Two uninitialised variables were causing a phkmalloc warning
Phkmalloc strikes again.
Remove one bogus free(result) (from _havemaster()) that slipped by me.
I suppose this means we can chalk up another victory for phkmalloc. :)
Fix some long-standing malloc bugs in the package handling code
"zero' and 'junk' options to help find and diagnose malloc abuse.
[…]
I gradually made phkmalloc more and more picky, until we got to the
point where it was downright hostile in -current, at a small
performance cost.
Cybersomething-or-other
-----------------------
Because the metadata about chunks were not right next to the chunks,
phkmalloc was very resistent to trivial variations of buffer-overflow
attacks, resulting in a number of "linux, solaris, ... are vulnerable,
but FreeBSD is not." incidents, which caused some people to mistakenly
think that phkmalloc could *never* be compromised.
Of course phkmalloc could be comproposed!
If the program can be brought to color outside the lines, the program
is unsafe. Period!
But it can be easier or harder to figure out *how* to exploit those
mistakes.
In January 2003 Stefan Esser found a `double-free in CVS
`_, which was trivial to exploit
on pretty much all platforms, but the advisory does not mention any
of the three BSD's, which had all adopted phkmalloc by then,
because the proof-of-concept exploit did not work.
Four months later, somebody known to the world only as "BBP" spent
55 slides at blackhat,
`and a wall of text `_
to show how it was in fact possible to craft
an exploit to get through the CVS-hole, also with phkmalloc.
Good for him and hat-tip from me: At least now two people knew how
phkmalloc worked :-)
But gaining people months to patch CVS, rather than mere hours is
a huge win in my book, and I still count phkmalloc is my major
contribution to improving cybersecurty, even if it was not yet
called that.
Kirk's mustache & Old man river
-------------------------------
Presenting phkmalloc at USEnix ATC 1998 was an adventure in itself,
with most of my worries focused on slide 20 of my presentation,
where I basically told many of the unix.gods, which unaccountably
attended my talk, that their code sucked:
.. image:: slide20.png
I had absolutely *no* idea how that would go down, but I could see
no other way to drive home the point, that ``malloc(3)`` usage bugs were
real, and a real problem. But being in the good old days where slides
were real slides etc, I figured I could always skip that slide during
the presentation, if that seemed prudent.
My brain does neither faces nor names very well, and I had never
met most of the people in the packed audience before, but there
were no way to miss Kirk Mckusic's Mustache™: Center front
row, or to mangle metaphors terminally: Right in my face.
I had his ``fsck(_ffs)`` at the top of the list, because it was literally
the "first kill" for phkmalloc, so I decided that I would skip that
slide, and only, reluctantly, put it up, if somebody asked a relevant
question.
But when I got to that slide in the talk, I was running high on
100% adrenalin, slapped the slide on the projector, and started
saying something about how everybody was making malloc mistakes,
but I dont think anybody heard it for the laughter in the room.
I did not exactly expect to be peltered with rotten tomatoes, but
I did absolutely not expect unix.gods be laughing and teasing each
other for having their programs on my list. My well-meaning
moralizing was 100% surplus to requirements: Everybody
in that room had been there, done that, and knew the malloc(3) API
was dangerous.
In retrospect: Why else would the have been there ?
I cannot remember any of the questions or who asked them: I was
too shell-shocked, and after the final applause died down, I was
totally flat. I left the bustle of the conference venue, walked
out in the blazing sun, followed Canal Street down to the river,
found a small park and for the first (and only) time in my life, I
stood at the banks of the Mississippi river.
Ten years previously, I had done sound for an amateur production
of Show Boat in hometown (Slagelse, Denmark), and my first real
girlfriend was one of the girls from the chorus, so I couldn't
help but sing "Old man river" to myself:
.. code-block:: none
Man sku' være gamle Mississippi,
drive langsomt i det grønne land,
drive under alle årets himle,
le a' kiv og kævl og tidens tand.
Floden driver, til alle tider,
[…]
While I stood there singing, a posh elderly black gentleman
edged up to me, and asked in a slightly worried tone: "I can hear
what you're singing, but I cant understand *any* of the words ?"
I told him that I was singing in Danish, and hastily disclaimed
that I had not even thought about if that song would be considered
an insult it modern society.
"Naah, that's OK" he answered, "I was just worried about my hearing
because I couldn't understand the words." We chatted for a bit,
and I went back to the conference, sunburned and drenched in sweat.
In retrospect, it is obvious that Kirk must have known what was comming:
He had reviewed the fix to fsck himself.
Having gotten to know him afterwards, it would not even surprise
me, if he had gently pulled USEnix strings, to put me to beard-to-beard
with him that day: He was at least on the governing board of USEnix,
possibly even its chairman.
On the last day of the conference somebody, it may have been Eric,
half apologizing told me that if I had been on the main track, I
would have received the "Best Paper" award, but for political reasons
the FreeNix track was off limits.
No need to apologize: I did not even know the award existed, and
it would never meant as much to me, as spending an hour over breakfast,
chatting with Dennis Ritchie about device nodes and timekeeping
in early UNIX kernels, and receiving his blessing for my DEVFS.
Your membership, should you accept it…
--------------------------------------
The USEnix presentation kind of marked the apogee of phkmalloc.
By that time we had found and fixed who knows how many malloc
mistakes it had detected, not only in FreeBSD but also in the ports
collection - I had long since stopped keeping track.
It had also been adopted by a lot of other projects, and I
regularly received emails along the lines of "somebody told
me to use phkmalloc and it immediately spotted the bug we
have been chasing for ${long_time}" etc.
But multi-threading and multi-CPU systems were rapidly becoming a
thing, and the tightly knit data structures of phkmalloc left no
other option than to slap one big mutex around all of it, which
were OK one or two CPUs, but became a performance problem from 4
cores and upwards.
RAM had also become a LOT cheaper, so some of the "bit-pinching"
where also not warranted any more.
Jason tackled that in jemalloc, and I was happy to hand the job of
malloc maintainer in FreeBSD over to him, and I am just as happy
to induct him into the secret society of retired malloc maintainers
now.
(Your laminated membership card will arrive in the mail in 12 to 18 weeks.)
`phk`