Note: This document is still unfinished and is expected to be completed on 12/10/21.
MIDI is an acronym that stands for Musical Instrument Digital Interface. It is a technology that was first introduced in August 1983 and has been a widely popular way to communicate efficiently between devices that produce and control sound, such as synthesizers, mixers and MIDI interfaces. A MIDI file works the same way: you load it into a software program such as FL Studio, and it contains the data on how a particular instrument will be played. It is nothing but a concise set of instructions that represents performance information.
The purpose of MIDI files is to provide a way to store time-stamped events for every MIDI stream. Song, sequence and track structures, instrument names, tempo and copyright notices are all supported. A file can contain multiple tracks and sequences, so it can be imported into another program.
One of the major goals of this file format is a compact representation of the data. It uses several space-saving techniques that can be a little challenging to implement, such as storing the data in binary, packing two values into the nibbles (4 bits each) of a single byte, and encoding numbers 7 bits at a time.
In this document, we will walk through every part of a MIDI file in terms as close to ELI5 as possible, from how basic information is stored up to the point where you will be able to write your very own program that parses MIDI files.
A MIDI file is composed of two types of chunks: header and track. A chunk is simply a group of related bytes. The file starts with a header chunk, which holds information that pertains to the overall MIDI song. It is important to mention that, unlike track chunks, the header has a fixed amount of content: exactly 6 bytes, excluding the chunk ID and the "bytes to read" field that every chunk has in common. The other type is the track chunk, which stores an independent stream of events such as note on and note off.
Here is an example of a header chunk, written in hexadecimal, that we will analyze.
4d54 6864 0000 0006 0001 0003 0060 -- -- --
Hex | Description | Size (bytes) | Byte position | Data Type | Value |
---|---|---|---|---|---|
4d546864 | Chunk ID | 4 | 0x00 | String | "MThd" |
00000006 | Bytes to read | 4 | 0x04 | Unsigned Int | 6 |
0001 | Format Type | 2 | 0x08 | Unsigned Short | 0, 1 or 2 |
0003 | Track Count | 2 | 0x0A | Unsigned Short | 0 to 65535 |
0060 | Time Division | 2 | 0x0C | Unsigned Short | Various |
A header chunk always begins with the ID “MThd”. It sits at the very first byte of the file and can be used to check whether the file you are reading is genuinely a MIDI file, regardless of its file extension (.smf/.mid).
Every chunk of a MIDI file must contain a size, the number of bytes to read. It comes right after the chunk ID, at byte 0x04 relative to the start of the chunk. Since it’s a 4-byte unsigned integer, it ranges from 0 to 4,294,967,295. It tells you how many bytes of content follow the size field. In this case you have to read exactly 6 more bytes to get past the header chunk. You never really need to skip the header, but the same field lets you skip over any track chunk you are not interested in.
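A minimal Python sketch of reading one chunk this way could look like the following. The function name `read_chunk` is hypothetical, and the sketch assumes the length is stored most significant byte first (big-endian), as the Standard MIDI File format specifies.

```python
import io
import struct

def read_chunk(stream):
    """Read one chunk from a binary stream and return (chunk_id, content)."""
    prefix = stream.read(8)
    if len(prefix) < 8:
        raise ValueError("truncated chunk: ID and length need 8 bytes")
    # ">4sI" = big-endian, 4-byte ID followed by a 4-byte unsigned length
    chunk_id, length = struct.unpack(">4sI", prefix)
    content = stream.read(length)
    if len(content) < length:
        raise ValueError("truncated chunk: content shorter than its length")
    return chunk_id, content

# The sample header chunk from this document.
sample = bytes.fromhex("4d54 6864 0000 0006 0001 0003 0060")
chunk_id, content = read_chunk(io.BytesIO(sample))
print(chunk_id, len(content))  # b'MThd' 6
```

Comparing `chunk_id` against `b"MThd"` on the first chunk is a cheap way to reject files that are not MIDI files.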
The format type can have a value of 0, 1 or 2. A value of 0 means the MIDI song contains a single track that holds all the events: the meta events as well as the musical events that play a sound. A value of 1 means that it contains multiple tracks that are played together, with separate information in each track. For example, the first track may contain all the meta events and the other tracks the musical events such as note on and note off. A value of 2 is similar to format 1: it also contains multiple tracks, but they may or may not be played synchronously, and each track may have its own meta and music events.
Properties | Type 0 | Type 1 | Type 2 |
---|---|---|---|
Number of tracks | Single | Two or more | Two or more |
Flow of track(s) | Synchronous | Synchronous | Synchronous or asynchronous |
The track count is pretty self-explanatory: it is the number of tracks the MIDI song contains. It is stored in the header chunk so we don’t have to read the whole MIDI file just to count how many tracks it has.
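As a rough illustration, the three header fields can be unpacked from the 6 content bytes of the sample header chunk with Python's `struct` module. The names `HeaderInfo` and `parse_header_body` are made up for this sketch, and big-endian byte order is assumed.

```python
import struct
from typing import NamedTuple

class HeaderInfo(NamedTuple):
    format_type: int
    track_count: int
    time_division: int

def parse_header_body(body):
    """Decode the 6 content bytes that follow "MThd" and the length field."""
    if len(body) < 6:
        raise ValueError("header content must be at least 6 bytes")
    # ">HHH" = three big-endian unsigned 16-bit integers
    format_type, track_count, time_division = struct.unpack(">HHH", body[:6])
    if format_type not in (0, 1, 2):
        raise ValueError(f"unknown format type {format_type}")
    return HeaderInfo(format_type, track_count, time_division)

info = parse_header_body(bytes.fromhex("0001 0003 0060"))
print(info)  # HeaderInfo(format_type=1, track_count=3, time_division=96)
```

So the sample describes a format 1 file with 3 tracks and a time division value of 0x0060, which is 96 in decimal.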
The time division, or resolution, of a MIDI file indicates how MIDI ticks should be translated into real time. There are two possible formats. When the MSB (most significant bit) of the 16 bits is 0, the remaining 15 bits give the “ticks per beat”, that is, how many pulses there are per quarter note. When the MSB is 1, the value is based on “frames per second” instead: the upper byte holds the negative of the SMPTE frame rate and the lower byte holds the number of ticks per frame. The steps for calculating the absolute time in microseconds will be explained later, as that is a whole topic of its own.
A track chunk is similar to the header chunk; however, it does not have the fixed 6 bytes of information right after the "bytes to read" field. A track chunk contains three basic pieces of information: a "chunk ID", a "length", and a stream of MIDI events, each preceded by a "delta time" value, which I'll explain later on. It holds information such as notes and playback sound.
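The MSB check can be sketched in a few lines of Python. The function name `decode_division` and the tuple it returns are assumptions made for this example. The SMPTE layout used here (upper byte is the negative frame rate in two's complement, lower byte is ticks per frame) follows the Standard MIDI File format.

```python
def decode_division(division):
    """Interpret the 16-bit time division field from the header chunk."""
    if division & 0x8000 == 0:
        # MSB is 0: the low 15 bits are ticks per quarter note
        return ("ticks_per_quarter_note", division & 0x7FFF)
    # MSB is 1: the upper byte is a negative frame rate in two's complement
    upper = (division >> 8) & 0xFF
    frames_per_second = 256 - upper   # e.g. 0xE7 is -25, so 25 frames
    ticks_per_frame = division & 0xFF
    return ("smpte", frames_per_second, ticks_per_frame)

print(decode_division(0x0060))  # ('ticks_per_quarter_note', 96)
print(decode_division(0xE728))  # ('smpte', 25, 40)
```

The first call uses the value from the sample header chunk. The second value, 0xE728, is a made-up input that only serves to exercise the frames-per-second branch.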
Here's the syntax for a track chunk:
<chunk ID> <length> <stream of events>
As mentioned before, every chunk has an ID and a length. The chunk ID sits at the very first byte of the chunk, and the length comes right after it. A track chunk's ID is always "MTrk", so we can use it to verify that we are actually reading a track chunk in case something has gone wrong during parsing or the file is somehow corrupted. The length denotes the number of bytes you have to read before the track ends, that is, the size of the chunk's content. It is useful for skipping tracks: if you do not need a track, you can jump straight past it to the track you want without having to read every event in between.
MIDI events are the data that holds the instructions on how to play a specific instrument (what note? what velocity? what channel?). Not only that, they also include non-musical events, such as those that specify the time signature, copyright notice, track name and so on. Every event has its own "delta time", which specifies the time at which the event shall be triggered. It is expressed in "ticks".
The syntax for a MIDI event is as follows:
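To illustrate skipping, here is a small Python sketch that walks over every chunk in a file and records where each one lives without decoding any events. The function name `list_chunks` is hypothetical, and the two tiny track chunks are invented test data: each holds only the 4 bytes `00 FF 2F 00`, a delta time of 0 followed by an End of Track meta event.

```python
import struct

def list_chunks(data):
    """Return (chunk_id, content_offset, length) for every chunk in data."""
    chunks = []
    pos = 0
    while pos + 8 <= len(data):
        # Big-endian 4-byte ID followed by a 4-byte unsigned length
        chunk_id, length = struct.unpack_from(">4sI", data, pos)
        chunks.append((chunk_id, pos + 8, length))
        pos += 8 + length  # jump over the content without reading it
    return chunks

header = bytes.fromhex("4d54 6864 0000 0006 0001 0002 0060")
track = b"MTrk" + struct.pack(">I", 4) + bytes.fromhex("00ff2f00")
for entry in list_chunks(header + track + track):
    print(entry)
# (b'MThd', 8, 6)
# (b'MTrk', 22, 4)
# (b'MTrk', 34, 4)
```

With this index in hand, a parser can seek directly to the content offset of the one track it cares about.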
<delta time> <status byte> <stream of data bytes>
The delta time is the amount of time, in ticks, that has passed since the previous event in the same track. The length of a tick is defined by the "time division", in ticks per beat or pulses per quarter note (PPQ), stored in the header chunk mentioned earlier. It is important to note that this delta time value is expressed as a "variable-length quantity", which I'll talk more about later.
The status byte is the data that comes after the bytes that convey the delta time value. It is a single byte that defines the type of event to be sent and the channel on which it should be registered. The question is, how does this status byte hold two types of information (event type and channel) despite being only one byte in size? To answer that question, we first have to discuss what a nibble is. A nibble, also referred to as a nybble or nyble, is a four-bit aggregation, or half the total bits of a byte. The first four bits (xxxx ----) of the byte refer to the event type, and the other four bits (---- xxxx) refer to the channel on which the event should be registered.
The data bytes come right after the status byte. They give additional information about what is about to be done. For instance, if we receive a note-on message, it tells us that a certain key will be pressed, but which key, and how hard (at what velocity)? This is where the data bytes come in handy: they contain the information needed to carry out the event. Take note that the number of data bytes varies according to the event type.
In this example, the delta time is set to 0 for the sake of simplicity, because we have not yet discussed what a "variable-length quantity" is. The status byte, which is 0x94 in hexadecimal, represents the type of event and the channel. The first nibble of the status byte is 0x9, which translates to a note-on event. The second nibble represents the channel, which in this case is channel 5. You are probably confused about how it can be channel 5 when the second nibble is 0x4, which is 4 in decimal. This is because machines like to count starting from 0, so we increment the value by 1 to get the channel number that makes sense in real-world applications. This event is followed by two data bytes, the key number and the velocity, because, as mentioned before, a status byte whose first nibble is 0x9 represents a note-on event, and a note-on event carries two data bytes of additional information. The first data byte, 0x24, tells us the key number and the second data byte tells us the velocity.
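Because each delta time is relative to the previous event, the absolute position of an event is the running sum of all delta times up to and including its own. A short Python sketch, using a hypothetical function name `absolute_ticks` and invented delta values:

```python
from itertools import accumulate

def absolute_ticks(delta_times):
    """Turn per-event delta times into tick positions from the track start."""
    return list(accumulate(delta_times))

# Invented delta times for four consecutive events in one track.
deltas = [0, 96, 48, 48]
print(absolute_ticks(deltas))  # [0, 96, 144, 192]
```

With a time division of 96 ticks per quarter note, as in the sample header chunk, the second event in this made-up list would land exactly one quarter note after the first.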
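Splitting a status byte into its two nibbles takes one shift and one mask. The sketch below uses a hypothetical function name `split_status` and an arbitrary input value. It only accepts channel events (0x80 to 0xEF), since for status bytes starting with 0xF the lower nibble is not a channel.

```python
def split_status(status):
    """Split a channel-event status byte into (event_type, channel)."""
    if not 0x80 <= status <= 0xEF:
        raise ValueError(f"0x{status:02X} is not a channel-event status byte")
    event_type = status >> 4   # upper nibble: xxxx ----
    channel = status & 0x0F    # lower nibble: ---- xxxx, counted from 0
    return event_type, channel

event_type, channel = split_status(0xC3)
print(hex(event_type), channel)  # 0xc 3
```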
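The walkthrough above can be sketched in Python as follows. The function name `decode_note_on` is hypothetical, and since the velocity value is not given in the text, the 0x64 used here is an assumed placeholder.

```python
def decode_note_on(event):
    """Decode a 3-byte note-on event: status byte, key number, velocity."""
    status, key, velocity = event
    if status >> 4 != 0x9:
        raise ValueError("not a note-on event")
    return {
        "channel": (status & 0x0F) + 1,  # stored 0-15, shown as 1-16
        "key": key,
        "velocity": velocity,
    }

# 0x94 and 0x24 come from the example; 0x64 is an assumed velocity.
print(decode_note_on(bytes([0x94, 0x24, 0x64])))
# {'channel': 5, 'key': 36, 'velocity': 100}
```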
Event Type | Hex | Binary | Data Size | Description |
---|---|---|---|---|
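Note Off | 0x8n | 1000 nnnn | 2 | Key number, release velocity |
Note On | 0x9n | 1001 nnnn | 2 | Key number, velocity |
After Touch | 0xAn | 1010 nnnn | 2 | Key number, pressure |
Control Change | 0xBn | 1011 nnnn | 2 | Controller number, value |
Program Change | 0xCn | 1100 nnnn | 1 | Program number |
Channel Pressure | 0xDn | 1101 nnnn | 1 | Pressure value |
Pitch Bend | 0xEn | 1110 nnnn | 2 | Bend value, low 7 bits then high 7 bits |
System Exclusive | 0xFn | 1111 nnnn | varies | System message; the low nibble is not a channel |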
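A parser needs to know how many data bytes to read after each status byte, and a lookup keyed by the upper nibble is one simple way to do that. The names `DATA_BYTE_COUNT` and `data_byte_count` are made up for this sketch. The sizes follow the MIDI 1.0 channel message definitions, and status bytes starting with 0xF are rejected because their length is not fixed.

```python
# Number of data bytes for each channel event, keyed by the upper nibble.
DATA_BYTE_COUNT = {
    0x8: 2,  # Note Off: key number, velocity
    0x9: 2,  # Note On: key number, velocity
    0xA: 2,  # After Touch: key number, pressure
    0xB: 2,  # Control Change: controller number, value
    0xC: 1,  # Program Change: program number
    0xD: 1,  # Channel Pressure: pressure value
    0xE: 2,  # Pitch Bend: low 7 bits, high 7 bits
}

def data_byte_count(status):
    """Return how many data bytes follow a channel-event status byte."""
    event_type = status >> 4
    if event_type not in DATA_BYTE_COUNT:
        raise ValueError(f"0x{status:02X} has no fixed data size")
    return DATA_BYTE_COUNT[event_type]

print(data_byte_count(0x94))  # 2
print(data_byte_count(0xC0))  # 1
```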