Notes on the design of the saugns software (and SAU language implementation), as it has evolved, and the main ideas involved. (The SAU language has evolved in parallel with the software implementing it.)
This is mainly for people interested in the design of audio/music programming/scripting software systems, at the level of ideas. It covers the basic ideas needed to make "something like this" (leaving out everything which is general audio programming knowledge). While somewhat structured according to the history of development, newer design largely builds on the old, and the early program shows main concepts that endure.
The program was developed from scratch, as a hobby experiment beginning in early 2011; the language started out as the most straightforward way I could get a trivial program to generate sounds and handle timing flexibly according to a script (some early examples are shown on the history page). While details and further features of the language have been added and changed since, its core remains the same. Consider the line below (which also runs with modern versions).
Wsin f440 t1.5
If a single big letter (W here) is used to start a wave oscillator of some type, and a small letter followed by a number assigns a value for frequency (f), amplitude (a), or time duration (t), then parsing is trivial. No abstractions for lexing or syntax trees, etc., are necessary. For each W the parser can simply add an oscillator node to a list, and go on to set values in the current node according to any parameter assignments (t, f, etc.) which follow while that oscillator is still being dealt with. Parameters can come in any order when name rather than placement indicates which one is given a value – and it's also easy to allow skipping parameters, by simply giving them all default values.
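As a concrete illustration, such parsing can be sketched in C along the following lines. This is a hypothetical toy version, not the actual saugns code; the Node type, the parameter set, and the default values here (except the 1-second time) are invented for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node type; the real program's structures differ. */
typedef struct Node {
    double freq, amp, time; /* parameter values, with defaults */
} Node;

/* Parse a "Wsin f440 t1.5"-style string into nodes. Returns node count. */
static int parse(const char *s, Node *nodes, int max) {
    int n = 0;
    Node *cur = NULL;
    while (*s) {
        if (isspace((unsigned char)*s)) { ++s; continue; }
        if (*s == 'W') {            /* big letter: add an oscillator node */
            ++s;
            while (isalpha((unsigned char)*s)) ++s; /* skip wave type name */
            if (n == max) break;
            cur = &nodes[n++];
            cur->freq = 100.0; cur->amp = 1.0; cur->time = 1.0; /* defaults */
        } else if (cur && strchr("fat", *s)) { /* small letter: set a value */
            char p = *s++;
            char *end;
            double v = strtod(s, &end); /* value follows the name directly */
            s = end;
            if (p == 'f') cur->freq = v;
            else if (p == 'a') cur->amp = v;
            else cur->time = v;
        } else {
            ++s; /* ignore anything unrecognized in this toy version */
        }
    }
    return n;
}
```

A W adds a node filled with defaults; each f, a, or t which follows overwrites one value in the current node, in any order, and any of them may be skipped.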
Several parts creating new nodes may be written in a script. Additional syntax can tell the parser that the next node should be given a time offset value, so that it's not interpreted as something to run/mix at the same time position (in parallel) with the previously added node when generating audio. After parsing is done, the resulting list of nodes specifies what to run or do to generate the audio.
When a series of things are read from a script, the resulting list of nodes produced can be viewed as a sequential list of "steps" to take, "events" to handle, or "instructions" to follow, for setting up or changing audio generation. For a sequence of nodes with no time offsets between them, the steps taken are a series of configuration changes made one immediately after the other, to set up how audio generation should run thereafter. Audio generation generates/uses signals, timed so it pauses/interrupts this when configuration changes arrive in time. (This can look like number-crunching loops running until a time, then leaving them to handle events, then re-entering them to continue.) The list of nodes can first be examined once prior to audio generation, to figure out what data structures need to be allocated and prepared (how many oscillators, etc.) in advance, and a second time when actually using their instances for running the internal "program" the script has been translated into.
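The interleaving of event handling and audio generation can be sketched roughly as below, in hypothetical C; the Event type and run() function are made up for the example, and "generation" is reduced to advancing a sample counter.

```c
#include <assert.h>

/* Hypothetical event: wait this many samples, then apply a change. */
typedef struct Event { int wait; double new_freq; } Event;

/* Alternate between handling due events and generating samples.
 * Returns the total number of samples "generated" (here just counted). */
static long run(const Event *ev, int nev, long total_samples, double *freq) {
    long t = 0;
    int i = 0;
    long next = (nev > 0) ? ev[0].wait : total_samples;
    while (t < total_samples) {
        if (t == next && i < nev) {     /* event due: pause generation */
            *freq = ev[i].new_freq;     /* apply configuration change */
            ++i;
            next = (i < nev) ? next + ev[i].wait : total_samples;
            continue;
        }
        long until = (next < total_samples) ? next : total_samples;
        t += until - t;                 /* number-crunch up to next event */
    }
    return t;
}
```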
To allow implementing modulation techniques (FM, PM, AM, etc.), a way to connect oscillators to oscillators is needed. Support for lists within which modulating oscillators could be placed was then added to the parser and audio generation code, including support for nesting lists within lists (for modulation chains longer than 2 in depth). Such a list of oscillators is assigned in connection with a parameter for frequency, phase, amplitude, etc. for an oscillator – naturally such script parts are parsed with recursion. This was a fairly simple extension for the language, with time proceeding along a linear list of steps to take like before, though the data used in connection with each main node and time position may branch out like a tree – recursion also then entering in how audio generation code handles what's configured to run, mirroring what's specified in the scripts.
It was when trying to add more flexible syntax for timing, on top of that, that complexity first began to grow unmanageable for my past self. In the early project I had difficulty keeping it all working – I had not developed the methodology I use nowadays. (I first tried to solve this, a little later, by instead reworking the design, moving in the direction of adding another pass/stage of script processing between parsing and audio generation. Much later, I saw how to undo that and resimplify.) The design didn't include much of the classic structure of compilers and interpreters, and I didn't have the experience to grow my own design well and make it maintainable, as complexity grew non-linearly. The language also looked quite different from typical well-known and well-described paradigms, and when I came up with ideas and explored literature, I didn't see anything approaching the shape of things I had in mind. So with some timing features added, early features hit a plateau...
Early choices in how the language looks were made for brevity and ease of writing at the small scale, including not requiring any symbol between a name and a value for assignment. Instead, immediate name-value pairs form a large part of the script contents, along with whitespace, different types of brackets for grouping, and some extra symbols. The lack of an assignment symbol imposes some limits on the names used as the left-hand part of an expression, because such names still need to be distinguishable from what follows them – from the values read, which may be numerical, use alphabetical characters, and/or other things. The simplest thing to do is to use one-character names as prefixes, each followed by the value assigned. (Regardless of the length of the prefix-names, having it the same for every prefix-name makes it clear where it ends and the value after it begins.) The pattern of such name-value pairs can be seen in many places in the language, but it doesn't always repeat when breaking down larger subexpressions into parts; e.g. for numerical expressions (later extended further), I settled for the ease of conventional infix syntax for arithmetic in values, rather than elaborating some (in my view clunkier) alternative to it.
The easy and terse solution of using one-letter names for the left-hand part of name-value pairs (in some cases with a special character as the name), works well enough as long as there's not too many things to name. It limits possible additions to the language somewhat, but works well for smaller, fixed sets of named things. The limitation is loosened somewhat by allowing several values of different types for the same name (e.g. a number and/or a modulator list). Beyond that, subnames nested under names can be added for extra related parameters (as done systematically in later versions). As used, the one-character names very loosely mirror ordinary written language, in that context for smaller things is set by a mixture of capital letter names (e.g. adding a new wave oscillator) and special symbols. Lowercase one-character names denote smaller things which are accessed and used specific to the context; lowercase names can be a little like function parameters, record fields, or even function calls (toggling something by merely being used).
For user-defined label/variable names, longer and more flexible names are allowed by using a special symbol as the left-hand part of a pair, with the name string as the right-hand part. This is how a label for an object (e.g. an oscillator), written after it, looks: 'name . (Until 2022, variables were only used to label objects like that, so that they could be referred back to by name. The 2022 numerical variables feature uses a = symbol as the left-hand part of a second name-value pair, in an unusual imitation of conventional assignment syntax. Eventually it was made to look like $name = number, whitespace optional.)
The basic design for how time works is very simple. Time for a script begins at 0 ms. A script is translated into a list of timed update instruction nodes, or "steps", each new step taking place after the previous, with or without time (ultimately translated into a number of sample values to generate) passing between any two steps. Each step configures the system, e.g. telling it to start generating something or changing parameter values for some generator object.
The running of a script primarily advances through time, and secondarily through the timed update steps, which are laid out in a list like a timeline of events. Time advances as output sample values are written, while update events are merely placed by time increments and do not themselves advance it. The handling of such updates takes priority over output generation, pausing it until the updates at that time position have been handled.
Each thing which generates output, such as a wave oscillator, has a time duration of use configured, beginning at one time position and ending at another. A script has ended when both the upcoming events have run out and all things configured by the old events have had their durations run out. In other words, the duration of a script is equal to the total sum length of times to wait before each new update step, plus the longest remaining duration of play after the last update step for still-active "things" (e.g. oscillators).
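This duration rule amounts to a small calculation, sketched below in C; the Step type and script_duration() are invented for the example. Taking the maximum end position over all steps gives the same result as the sum-of-waits-plus-longest-remainder phrasing.

```c
#include <assert.h>

/* Hypothetical step: the wait before it, and the duration it sets. */
typedef struct Step { double wait, duration; } Step;

/* Script duration: the latest end position reached by any step's
 * configured play duration, measured from that step's arrival time. */
static double script_duration(const Step *s, int n) {
    double t = 0.0, end = 0.0;
    for (int i = 0; i < n; ++i) {
        t += s[i].wait;
        double e = t + s[i].duration; /* when this step's sound ends */
        if (e > end) end = e;
    }
    return end;
}
```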
A main example of a limitation not dealt with early on is the nature of the nested list, i.e. tree structures, as the form of what can be specified in a script. Early on, the capabilities of old FM synthesizer systems had been an inspiration, but they also support connecting oscillators in arrangements other than the tree structures of carriers and modulators provided for by nested lists; e.g. several carriers may share a modulator, and in general the oscillator connection schema is a DAG (directed acyclic graph) in Yamaha's old FM "algorithms". (Technically, self-modulation could however be viewed as adding self-loops to an otherwise acyclic graph. Possibilities for going beyond acyclic graphs by supporting feedback loops more generally also exist, and are done in some synthesizer systems.)
But most conspicuously missing from the early language are features like defining and using functions with audio generation code in scripts, looping and other control flow constructs, etc. I skipped all that at first because I wanted to explore other things rather than inventing yet another "typical" language. The absence of such things is half of what defines the old design. Often, it isn't audio generation features which suggest the greatest departures in design, but conventional programming language ideas or their absence.
Relative to the early language, some kinds of extensions for it would mainly require reworking and complicating the design closer to the parsing end of the program – maybe using another layer of early data structures and processing of them to preprocess script contents into something with a form closer to the old parser output. Other ideas would mainly require reworking the other end of the program, which in the simple design does audio rendering and can be viewed as an interpreter that only follows a flat line of instructions. (When considering creating a more powerful interpreter, whether at the parsing end or at the rendering end, it's also worth noting that some basic big limitations in features are necessary for e.g. time durations for scripts to remain pre-calculable, as they ended up being. A Turing-complete language would not allow it, meaning that features like function calls and loops must forever remain limited.)
Eschewing numerical variables and such in the early project, instead I added a very simply parsed and used mechanism for setting script options with S, a pseudo-type name used like an oscillator W with such parameters, but with the effect of changing default values and other settings at parse time instead of adding an object. This has remained, been tweaked (making it lexically scoped in 2023, imitating how generator objects are scoped in lists), and extended whenever convenient.
Other features also tie into default values. Something which was both designed more intricately than needed, and which ended up buggier and trickier to get right than much else in the language, is the flexible default values for time durations. Wanting the most concise language possible, I put some thought into how the time duration t parameter for an audio generator should be filled in if nothing is written, and the intuitive "make it fit other durations as used in the script at the current time position" behavior turned out to be harder to get right than it looked.
The old default time logic was debugged and preserved when the project was revived, and as of 2026 can also be found in the latest versions. In the following script, two audio generators are inserted at the same time, but a time is set for only one of them. Yet both are given the same time value, as the default for the left-out value is increased to match the other one. (A default value may be increased, but never decreased, from the short default value, which is 1 second unless changed with the S t option in a script.)
Wsin f220 t2 Wsin f440
But what if audio generators are inserted at different times, not at the same placement in seconds? In such cases, the default time logic counts down by subtracting time advancement in the script from the current longest time at the current position. Values at a later time position however also count when setting values at an earlier position; for consistency, default values must count up going backward. If the above script had been Wsin f220 /1 Wsin f440, the result is the same – 2 seconds for the 1st generator, 1 second for the 2nd, as the 2nd is given a 1 second (short) default time, and the backward counting-up adds the time shift in order to arrive at the longer default time for the 1st generator.
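Ignoring | groups and other details, the gist of this logic can be sketched as below, in hypothetical C; the wait and t arrays, the sentinel for "no time written", and the two-pass structure are simplifications invented for the example.

```c
#include <assert.h>

#define SHORT_DEFAULT 1.0 /* seconds; changeable with S t in real scripts */

/* Hypothetical sketch of the default time logic. wait[i] is the time
 * shift before generator i; t[i] < 0 means no time was written. */
static void fill_default_times(const double *wait, double *t, int n) {
    double pos = 0.0, longest_end = 0.0;
    /* First find the latest end position, using the short default
     * for generators with no time written. */
    for (int i = 0; i < n; ++i) {
        pos += wait[i];
        double end = pos + (t[i] >= 0 ? t[i] : SHORT_DEFAULT);
        if (end > longest_end) longest_end = end;
    }
    /* Then raise (never lower) each default to reach that end; going
     * backward in script time, this is the counting-up of defaults. */
    pos = 0.0;
    for (int i = 0; i < n; ++i) {
        pos += wait[i];
        if (t[i] < 0) {
            double d = longest_end - pos;
            t[i] = (d > SHORT_DEFAULT) ? d : SHORT_DEFAULT;
        }
    }
}
```

With wait values {0, 1} and no times written, this reproduces the /1 example: 2 seconds for the 1st generator, 1 second for the 2nd.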
The use of the time separator | separates duration groups in scripts – basically the time scopes under consideration – so that what's after a | never alters default values for what's before and vice-versa.
How to combine some other features with the default time logic is tricky, and a matter of taste. The "compound steps" feature, for arranging a series of sub-steps for one generator as if one step in a script, can either receive longer default times for the first sub-step (later sub-steps have the time of the prior one as their default), or give a longer default time from the combined duration of all sub-steps, but not both (as forcing consistent behavior would then demand an infinite loop of ever-lengthening times, the alternative being inconsistent and confusing behavior). I opted for the second option: to give a longer default time using the sum duration, while using an unextended (short) default time for the first sub-step.
Experimenting onward in 2011 and beyond, and then looking for potentially useful ideas for programming languages and compilers in 2012–2014 (in part while taking a few basic courses in related subjects), led to a series of old notes; they contain a list of thoughts on a possible new language, and ideas for possible design elaborations. Back then, while studying, I discounted my own early design and language as a starting point for something better, after both learning some theory and having gotten stuck with the old project. In part, that was because basic standard concepts are usually connected to different-looking syntaxes, designs, and implementations, and I couldn't see how what I'd already come up with might correspond to those concepts. I vaguely dreamed of different things, and put it all aside for years, until, on my own, I gradually bridged that gap in thought roughly a decade later, arriving at a way to work out more in practice.
Programming-wise the old project ended in April 2012, though I went on considering various ideas during 2012–2014. After cleaning up the old program from 2018 on, some smaller old ideas (alongside new ones) have been explored from time to time and made it into the program, but I maintain a conservative approach towards adding "typical" programming language features.
The April 2012 program had grown a parser producing a flat main list of time-ordered events (the main nodes), combined with tree structures attached to those events (data nodes for the things added or changed in a step, which may involve nested syntax elements). This corresponds fairly simply to the language: time-ordering (with time placement syntax) is one dimension of structure, and nesting, as in e.g. setting up oscillators for modulation in lists, is another. But! Some of the semantics had begun to be handled after parsing but before interpretation, in a middle layer finishing some details of timing which seemed too messy to attempt during parsing. (It also counted and allocated needed voices for audio generation prior to running it.)
One kind of nesting in the SAU language however applies to time, the ; "compound steps" feature. This nesting, allowing time placement to branch out, is flattened away by the post-parse semantics code. It was (buggily) added early on to make script text neater, allowing writing a series of timed changes for one object together, without advancing the timing for other objects. That way, timed updates can be grouped per object, rather than mainly according to a global flow of time. To implement this, during parsing the event nodes for follow-on compound steps are placed in side-timeline lists attached to the main event node made before the first ;. Then the timelines are merged into one. The same design was tweaked a little for the 2022 "gapshift" feature, which uses the same mechanism to replace an earlier oscillator parameter (s) for padding time durations with leading silent time.
Strictly speaking, there's been even more than one extra pass of script processing between the two main stages, the parser and the final interpreter (long usually called the "generator" module, as it runs all audio generation). Several loops working through all event nodes have been used to finalize the data before it's fully ready to use. For a period (2019–2021), the design was temporarily complicated by adding another extra module and set of data structures, with another conversion pass. Then I began to figure out how to do more with fewer loops, and over time (2023–2025) simplified away all the extra layers and loops until only one parser pass and one audio rendering pass remained.
While it was clear that not everything can be done event-by-event straight away while parsing produces nodes, it turns out it's enough to accommodate the ; compound step (smaller grouping) and the time separator | (larger grouping) in the SAU language. The latter says that all which follows it is separated in time from all which goes before it (and is tied to the old (2011) feature of flexible default time logic). Contents delimited by | became the units for the mini-pass that replaced a full separate pass for extra semantics just after parsing. (If no | is ever used in a script, it does in practice still become a full pass.)
Voices are simultaneous sounds, though technically more like "main audio generators", or outputs which behave as such. One thing done first more as a puzzle, later tied to features, is counting and allocating the voices before running audio generation, while keeping the number down. Code to count voices had been in my program since early on, but only after resuming the project did I suddenly realize the pre-counted value could be used to auto-adjust amplitude per script (like a global volume control, see S a.m).
But how to count that value, the voice number? For each voice, in short, a current time duration and a main audio generator are tracked. There's an incoming list of events from the script with a "wait for a time" value attached to each (meaning: subtract the wait time from the duration of all current voices), and some audio generators configured by events play the main role and are given a voice each, with updates for them often setting a new time duration which translates into a new or extended voice duration. When faced with a new audio generator needing a voice, either an expired voice can be reused or another one needs to be added.
The early implementation avoided the simplest, greedy algorithm which just reuses a voice as soon as something stops running and something is added in the script, because I didn't think to allow arbitrary renumbering of voices. What if an old voice, now expired, comes of use later again? An audio generator may have a label allowing it to be given a new time later. If free renumbering of such voices is allowed (a later change), then the crudest greedy algorithm is also the optimal solution; it is the approach which avoids excess voice count using minimal computation. It can also be implemented without requiring any separate full semantics pass/loop, unlike avoiding the greedy renumbering approach with minimal inflation of voice count.
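The greedy approach can be sketched as follows, in hypothetical C; here every event simply starts a new sound needing a voice, leaving out the updates which extend existing voices, and the voice cap is arbitrary.

```c
#include <assert.h>

/* Hypothetical event for voice counting: wait before it, then a new
 * sound with the given play duration needs a voice. */
typedef struct VEvent { double wait, duration; } VEvent;

/* Greedy allocation with free renumbering: reuse any expired voice,
 * else add a new one. Returns the peak voice count. */
static int count_voices(const VEvent *ev, int n) {
    double left[64]; /* remaining durations of allocated voices */
    int used = 0;
    for (int i = 0; i < n; ++i) {
        for (int v = 0; v < used; ++v) { /* advance time for all voices */
            left[v] -= ev[i].wait;
            if (left[v] < 0) left[v] = 0;
        }
        int v = 0;
        while (v < used && left[v] > 0) ++v; /* find an expired voice */
        if (v == used && used < 64) ++used;  /* or add a new one */
        left[v] = ev[i].duration;
    }
    return used;
}
```

Three fully overlapping sounds need three voices; three strictly sequential sounds all reuse voice 0.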
But what about counting and allocating a minimal number of audio generator objects? I put it off until 2025, but ultimately settled for an approach without renumbering of non-dead objects, which avoids reusing objects given labels in scripts, after experimenting. (To allow free renumbering for a minimal count has a complication here – audio generators refer to each other by their IDs, and dependable IDs are internally needed for modulator lists and such. Thus you then need another set of IDs which is not freely renumbered, in part defeating the point, unless solving the issue in an even more complicated manner.)
The semantics code which also counts audio generators does a depth-first traversal of linked modulators from a carrier for several purposes. One purpose is to mark audio generator objects nested under dead ones also dead, so ID reuse can then happen. Traversal is done per voice, with the voice and its "root generator object" as starting point, with timing and tracking for voices and audio generators managed in tandem. When marking an audio generator as expired, two things need to be avoided: it can't be reachable from a label (directly or nested beneath an object which is), and it can't have follow-on events inside the current "duration group" (tied to ; and | timing syntax). Following those two conditions for exclusion, reuse becomes safe. Compared to a "free renumbering" approach, the only waste is the exclusion of labeled objects from reuse; other objects, once the t times set for them run out, are available as soon as no follow-on events exist for them, given the constraints of the SAU language.
Very early on, numerical infix expressions were added and handled in the parser with one dedicated recursive parsing function for such, handling all subexpressions with recursive calls. The parser parsed and calculated in-place, nested calls combining and reducing numerical expressions to their results. As long as numerical expressions don't contain both side-effects (modifying of state in a script), and re-evaluation of script contents (like in loops or script-defined functions), no more is needed. Only if both are added would the parser need a redesign, in order to allow a numerical expression to be evaluated separately from the initial crunching of the text – and thus possibly several times afterwards.
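Such parse-and-calculate-in-place handling is classic recursive descent; below is a minimal sketch in C, far smaller than real numerical expression support, handling only +, -, *, / and parentheses (the function names and structure are invented for the example).

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Minimal in-place infix evaluator sketch: one recursive function per
 * precedence level, each call reducing its subexpression to a number. */
static double eval_expr(const char **s);

static double eval_factor(const char **s) {
    while (isspace((unsigned char)**s)) ++*s;
    if (**s == '(') {                 /* parenthesized subexpression */
        ++*s;
        double v = eval_expr(s);
        if (**s == ')') ++*s;
        return v;
    }
    char *end;
    double v = strtod(*s, &end);      /* plain number */
    *s = end;
    return v;
}

static double eval_term(const char **s) {
    double v = eval_factor(s);
    for (;;) {
        while (isspace((unsigned char)**s)) ++*s;
        if (**s == '*') { ++*s; v *= eval_factor(s); }
        else if (**s == '/') { ++*s; v /= eval_factor(s); }
        else return v;
    }
}

static double eval_expr(const char **s) {
    double v = eval_term(s);
    for (;;) {
        while (isspace((unsigned char)**s)) ++*s;
        if (**s == '+') { ++*s; v += eval_term(s); }
        else if (**s == '-') { ++*s; v -= eval_term(s); }
        else return v;
    }
}
```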
In 2022 one of the two crucial things – statefulness, in the form of mathematical functions like rand() – was added to numerical expressions.
Adding more types of audio generators than the initial W oscillator (like the R oscillator, a major side-project) only happened beginning in 2023 in main versions of the software. Doing so is, anyway, mainly about audio generation code, and so requires little design change. The R and W oscillators share most parameters and most parsing code, for example. Line types, used by R similarly to how W uses wave types, already existed for a different purpose, added in 2011 for use with parameter sweeps – though more variations on the value-filling functions implementing each line were needed to fit the new needs.
The feature of sweeps for parameters is the earliest (2011) example of attaching extra logic and audio rendering features to parameters, beyond support for modulators in lists. Such features have been expanded over the years. ADSR envelopes, added later (2025), take what sweeps do to a next level. Such features, as done, add forms of subparameters and/or alternative value types to parameters in the language; thus they can be implemented by mainly some extra parsing code and some extra audio rendering code. Sweeps and envelopes are both little state machines used as subcomponents for the larger audio generator types.
With the old design, the growth of features, from types of modulation, to sweeps, to envelopes, more audio generators, etc., complicated the branching logic of what exactly to do during audio rendering. An idea going back to 2012 but only truly done in 2026 is to move most of those decisions away from the audio generation code, pre-scheduling a series of instructions using pre-assigned audio buffer IDs. The old traversing of nodes, modulator lists, etc., is moved to the late parser semantics code, simplifying the audio rendering pass.
This opens the door to better optimization of audio rendering. Complicating the logic of selecting what to do during audio rendering isn't worth it inside the audio rendering itself, beyond a certain (though platform-dependent) point, as it causes various stalling "misses" in the CPU. (This is more so for simpler, often older, CPUs.) But there's basically no issue if extending the logic before audio rendering instead. Plus, the bookkeeping is simpler since actual audio rendering stuff doesn't need to be juggled; this also goes for implementing various further features, not only optimization.
The new main audio rendering loop replaces looping over a set of voices (each with a root or top-level carrier audio generator) with looping over a set of instructions, which mainly each have buffer inputs and outputs and do stuff using them. The set of instructions can be swapped out during each timed update event, much like the contents of a voice could be changed per update event with the previous design. (Why have all instructions swapped out per event, not only those previously corresponding to one voice? It's not only simpler, but opens the door to later features having re-use of audio/signals between voices.)
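The shape of such an instruction-driven rendering loop can be sketched as below, in hypothetical C; the Ins type, the two toy opcodes, and the buffer layout are invented for the example.

```c
#include <assert.h>

/* Hypothetical pre-scheduled instruction: read input buffers, write an
 * output buffer; buffer IDs were assigned before rendering starts. */
enum { BUFS = 8, LEN = 4 };
typedef enum { OP_FILL, OP_ADD } Op;
typedef struct Ins { Op op; int in_a, in_b, out; double value; } Ins;

/* Render by walking the flat instruction list; no tree traversal or
 * branching on node types happens here anymore. */
static void render(const Ins *ins, int n, double buf[BUFS][LEN]) {
    for (int i = 0; i < n; ++i) {
        const Ins *p = &ins[i];
        for (int j = 0; j < LEN; ++j) {
            switch (p->op) {
            case OP_FILL: buf[p->out][j] = p->value; break;
            case OP_ADD:  buf[p->out][j] = buf[p->in_a][j] + buf[p->in_b][j];
                          break;
            }
        }
    }
}
```

Per timed update event, the whole instruction list is swapped for a new one; between events, rendering is just this branch-light loop.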
While not planned initially, it turned out that the parameter sweep component can be handled almost fully at parse time, moving up all of its "state machine" logic except the actual rendering/array filling instruction. Everything except time position tracking is supplied as read-only data with each sweep rendering instruction. (While making this change, it was easiest, and also seemed more desirable, to make sweeps statically timed rather than behaving more like modulators as before, though the opposite is also possible.)
Some general ideas for cleaner code have evolved since reviving the project in November 2017. One little discovery is that staggered region/arena add-only allocator mempools are a perfect fit for most dynamic memory allocation in a program like this. Most of the rest can be handled using a generic dynamic array module (which can be done pretty neatly in C).
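A minimal version of such an add-only arena can be sketched as follows; this is hypothetical C, not the saugns mempool module, and the block size and alignment choices here are arbitrary.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal add-only region/arena allocator sketch: allocate from large
 * blocks, free everything at once when the whole pass is done. */
typedef struct Region {
    struct Region *prev;
    size_t used, size;
    /* payload bytes follow the header */
} Region;

typedef struct Arena { Region *top; } Arena;

static void *arena_alloc(Arena *a, size_t n) {
    n = (n + 15) & ~(size_t)15; /* round sizes up for spacing/alignment */
    if (!a->top || a->top->used + n > a->top->size) {
        size_t size = (n > 4096) ? n : 4096; /* add a new region */
        Region *r = malloc(sizeof *r + size);
        if (!r) return NULL;
        r->prev = a->top; r->used = 0; r->size = size;
        a->top = r;
    }
    void *p = (char *)(a->top + 1) + a->top->used;
    a->top->used += n;
    return p;
}

static void arena_free_all(Arena *a) {
    for (Region *r = a->top, *prev; r; r = prev) {
        prev = r->prev;
        free(r);
    }
    a->top = NULL;
}
```

The fit is good because nearly all the nodes and lists built during parsing share one lifetime: they're needed until the script has been rendered, then all can go at once.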