Don’t optimize prematurely, BUT

But please do, at some point. If you can (I know all too well that we are rarely given the time and budget to do things cleanly in the industry).

Cleanups, factorization, better APIs and/or libraries, better algorithms: it all adds up to lower resource usage.

With the incredibly vast amounts of resources we have now, it’s easy to think design choices and implementation don’t have that much impact on how much is required to perform a given task, but when nobody cares, you end up with this kind of sentiment:

But get yourself hard limits and it soon becomes obvious how much “not caring” costs, and how much more you can do with almost no overhead if you spend some time on it.

As an example, when I first released Mastodon for Apple II, it did fit within all three of the Apple II’s hard limits:

  • it did fit on the floppy. Apple II floppies are 140kB, remove 17.5kB for ProDOS and there is the limit: 126kB available.
  • it did fit into memory. A single Apple II binary can be, at most, 49kB.
  • there was enough free memory left to actually work. (A 30kB binary would leave 19kB usable for data).

This first release consisted of five binaries, because not all features could fit in a single one. These binaries totalled 104kB, thereby fitting comfortably on the floppy, and were 34kB (main program), 20kB (composer), 23kB (image viewer), 20kB (login handler), and 8kB (configurator). (The drawback of this is that all common code is duplicated in each binary on the floppy. Splitting programs without duplicating code is possible but a huge headache inducer, so I’m not there yet.)

Since that first release, I’ve added a number of features to that application, like content warnings, blocking, following, masking, bookmarks, image saving, polls, virtual drive, account metadata, and audio-video streaming.

I have been stopped in my tracks quite a few times when I couldn’t build the floppy image because I was one block short. Fitting audio-video streaming was challenging. I had to go back to existing code over and over again.

The binaries now occupy 96.5kB on the floppy. That is about 7.5kB less than the first release, to be precise. You can easily guess that all these features don’t fit in -7.5kB of code. They don’t. If I had kept churning out features with no regard for optimisations, I’d have been stuck very soon. So I iterated: each time I got close to a limit, I stopped adding features and focused on saving size. Here is a timeline of the new features and how they relate to the binary sizes.

As you can see, quite often a new feature arrives and the programs’ size decreases.

Some of these optimizations were hard and/or counter-intuitive to obtain, requiring rewriting C in assembly; I even optimized the compiler’s runtime (cc65) a few times. But others were surprisingly easy and rewarding, like changing a CFLAG to minimize stack usage, changing the start address to avoid keeping a very useless “hole” over the graphics page stored on disk, or switching from stdio (fopen/fread/…) to fcntl I/O (the simpler open/read/…).
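To make the stdio-to-fcntl switch concrete, here is a minimal sketch of reading and writing a file with open/read/write/close instead of fopen/fread/fwrite. It is written for a POSIX host, not copied from the Mastodon client: the helper names save_bytes and load_bytes are hypothetical, and types such as ssize_t may need adapting for cc65. The saving presumably comes from no longer linking the buffered FILE layer, though that is an assumption.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes of data to path, replacing any
 * existing file. Returns 0 on success, -1 on error. */
int save_bytes(const char *path, const void *data, size_t len)
{
    int fd;
    int ok;

    assert(path != NULL && data != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        return -1;
    }
    ok = (write(fd, data, len) == (ssize_t)len);
    if (close(fd) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}

/* Hypothetical helper: read up to cap bytes from path into buf.
 * Returns the number of bytes read, or -1 on error. */
long load_bytes(const char *path, void *buf, size_t cap)
{
    int fd;
    long total = 0;
    ssize_t n = 0;

    assert(path != NULL && buf != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    /* read() may return fewer bytes than requested, so loop until
     * end of file, an error, or a full buffer. */
    while ((size_t)total < cap &&
           (n = read(fd, (char *)buf + total, cap - (size_t)total)) > 0) {
        total += n;
    }
    close(fd);
    return (n < 0) ? -1 : total;
}
```

Calling save_bytes(path, data, len) followed by load_bytes(path, buf, sizeof buf) round-trips the data without any FILE structure involved.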

This one was fun: I replaced all the checks like these:

if (pointer) {
  do_something();
}
if (!other_pointer) {
  do_something();
}

with these little macros:

#define IS_NOT_NULL(ptr) (((int)(ptr) & 0xFF00) >> 8)
#define IS_NULL(ptr)     (!IS_NOT_NULL(ptr))

if (IS_NOT_NULL(pointer)) {
  do_something();
}
if (IS_NULL(other_pointer)) {
  do_something();
}

And saved about 500 bytes all over the place (yes, there are other things in this commit, I should separate my commits, yada yada). This is because pointers in cc65 are 16-bit integers, from 0x0000 to 0xFFFF, so the compiler checks both bytes to determine whether a pointer is 0x0000 or not. But these pointers never point to the zero page (addresses 0x0000 to 0x00FF), as the zero page is special on the Apple II and certainly not where one stores data. So it is enough to check whether the high byte is 0x00.
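The reasoning above can be checked exhaustively on any machine. The sketch below models a cc65 pointer as a plain 16-bit address (the addr16 type and the function names are hypothetical, for illustration only) and counts where the full check and the high-byte-only check disagree. They agree for every address from 0x0100 upwards and differ only for the 255 non-null zero-page addresses, which is exactly the range the macro assumes is never used for data.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a cc65 pointer: a plain 16-bit address. */
typedef uint16_t addr16;

/* Full check: the address is null only if both bytes are zero. */
int is_null_full(addr16 a)
{
    return a == 0;
}

/* High-byte-only check, mirroring what the IS_NULL macro tests. */
int is_null_high_byte(addr16 a)
{
    return (a >> 8) == 0;
}

/* Count the addresses in [first, last] where the two checks disagree. */
unsigned count_mismatches(addr16 first, addr16 last)
{
    unsigned n = 0;
    uint32_t a; /* wider than 16 bits so the loop can terminate at 0xFFFF */

    for (a = first; a <= last; ++a) {
        if (is_null_full((addr16)a) != is_null_high_byte((addr16)a)) {
            ++n;
        }
    }
    return n;
}

/* The shortcut is safe as long as no data lives in the zero page. */
void check_high_byte_shortcut(void)
{
    assert(count_mismatches(0x0100, 0xFFFF) == 0);   /* outside the zero page */
    assert(count_mismatches(0x0000, 0x00FF) == 255); /* 0x0001 to 0x00FF differ */
}
```

If a pointer could ever hold an address between 0x0001 and 0x00FF, the shortcut would wrongly report it as null, so the trick only holds under the no-data-in-zero-page assumption stated above.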

Once assembled, a check for pointer nullity goes from, in the best case:

  lda pointer
  ora pointer+1
  beq :+
  jsr _do_something
: ...

to simply:

  lda pointer+1
  beq :+
  jsr _do_something
: ...

This saves three bytes per if (the ora instruction and its two-byte address), and it adds up.