# ML Presentation Script
## Version 1
### Bullet Points
(This is like a summary of the Script)
* Introduction → Both
* What we learned about the state of the art
* Transformers are the shit rn N
* So that's what we wanna do N
* Data encoding
* what data from where L
* What is midi L
* First Idea matrix encoding L
* But transformers need different encoding (strings of words) N
* What does transformer encoding look like? N
* Show failed attempts because they're entertaining
* ~~simplify midi data~~ Further ideas for Preprocessing L
* Training and music generation L
* Ideas for what we want to do with our paper
* Test different methods and models and compare performance N
* generate music from different genres L
* how to compare the quality of the generated music L
### Script
(Everything I could think of that we could talk about)
* introduce ourselves
* Our project is “midi net”; the goal is for our network to generate original music in the MIDI format
---
* What we've learned about the state of the art
* RNNs and LSTMs used to be state of the art until around 3 years ago
* But then Transformers came along and revolutionized everything
* Transformers were first introduced in the “Attention Is All You Need” paper (2017) and were designed to translate text between different languages
* but they’ve been used to achieve state of the art results in a number of other applications.
* E.g. OpenAI's GPT-2
* Has been revolutionary for natural text generation
* ((They didn’t wanna publish it / there were big concerns)) because it is so good at generating fake text that it might help create and spread fake news
* ~~Eg that cool text to image net from the 2 minute paper video~~
* ~~Describe what it does / show short video clip~~
* (go into detail about the transformer architecture?)
* So of course we want to try using transformers too!
---
* Data encoding
* Explain midi
* midi data is our dataset: huge amount of data online
* 1d
* everything is in events
* There are these types of events: ...
* show midi representation and musical notation
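* As a quick illustration (a minimal sketch, not necessarily the library we use in our pipeline), this is what the event stream of a MIDI file looks like when read with the Python mido library; the file name is a placeholder:

```python
# Minimal sketch: print the event stream of a MIDI file using mido.
# 'example.mid' is a placeholder file name.
import mido

mid = mido.MidiFile('example.mid')
for msg in mid:  # iterating a MidiFile yields messages with time given in seconds
    if msg.type in ('note_on', 'note_off', 'program_change'):
        print(msg)

# Output looks roughly like:
#   program_change channel=0 program=0 time=0
#   note_on channel=0 note=60 velocity=80 time=0.5
#   note_off channel=0 note=60 velocity=64 time=0.25
```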
* Basic idea for our model: predict next time step based on previous time steps
* Originally we did not think we’d be using a language model (Transformer) which requires a 1d string of words as input
* So we thought it would be nice to use a rasterized matrix encoding
* Show picture
* A similar representation is also commonly used in professional audio software (show picture)
* There seems to be a strong connection between how we perceive music and this representation
* Benefits we hoped for: it's obvious from the data which notes are ringing at the same time (which is key to how humans perceive music), and timing is rasterized to a musical grid, making it easy to spot rhythmic patterns.
* We hoped this encoding would make it easier for the network to learn the aspects of music which matter to us humans
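* As a rough illustration of the matrix ("piano roll") idea, here is a minimal sketch using the pretty_midi library (just one way to build such a matrix; the file name and time resolution are placeholder choices):

```python
# Minimal sketch of the rasterized matrix ("piano roll") encoding via pretty_midi.
import pretty_midi

pm = pretty_midi.PrettyMIDI('example.mid')  # placeholder file name
# Rows = 128 MIDI pitches, columns = time steps (fs=16 -> 16 columns per second);
# a nonzero entry means that pitch is sounding at that time step.
roll = pm.get_piano_roll(fs=16)
print(roll.shape)  # (128, number_of_time_steps)
```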
* But because Transformers promise the best results we decided to try that (at least at first)
* What does encoding for Transformer look like?
* Transformer is language model so input is 1d string of words
* We’ve chosen these words ...
* Talk about compression:
* This encoding compresses the data somewhat without losing a lot of perceived quality and it also allows the network maximum expressiveness
* Wait words
* Time between events is quantized to 5 ms, which is right at the threshold of human perception.
* This is a significant compression: in the original MIDI the time between events can be any floating point number, while our encoding uses only about 180 different wait words and still captures all the timing we can perceive. (See the encoding sketch below.)
* Loudness words
* 128 -> 32
* Factor of 4 compression
* Not noticeable
* Other papers do this too
* Instrument words
* 129 -> 17
* Factor of almost 8 compression
* There are 8 different pianos, 8 different woodwinds, etc.
* Does not sound worse at all in our testing
* ~~We've adopted the for encoding to a large extent from this paper but~~ Don't focus so much on it just mention it's inspired by
* ~~We've added the ability for the encoding to capture different instruments playing~~
* ~~which allows us to generate not just piano music but also music from other genres~~
* ~~And we've found that for generating modern music, which is usually set to a fixed time grid, it greatly helps perceived quality if we make the time quantization a little finer~~
* ~~For this type of music it would actually be more useful to quantize to something like 16th or 32nd notes~~
* ~~But this more fine-grained quantization can theoretically work for all types of music, both music set to a fixed time grid and music with varying tempo. On the other hand, quantization to some larger, musical time unit like 16ths may make it easier for the model to discover rhythmic patterns, but it will only work for music set to a fixed time grid.~~
* ~~So we're trying the more flexible approach of fine-grained quantization for now~~
* The words encoding is heavily inspired by the one suggested in: This Time with Feeling: Learning Expressive Musical Performance (2018)
* Which has also been used in other papers
* ~~Maybe touch on the compression and stuff (Only use 17 instead of 129 instruments, etc)~~ We can touch on this when showing the failed results
* ~~Maybe show code~~
* ~~It might be good and impressive because it's already a lot and looks pretty complicated~~
* ~~Maybe touch on the instrument / channel mapping stuff and how we solved that cause it’s a big part of what makes the code look complicated~~ Can touch on that when showing failed results
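* To make the word vocabulary a bit more concrete, here is a minimal sketch of how a single note-on event could be turned into words under the quantization described above (constants and names are illustrative placeholders, not our real code):

```python
# Minimal sketch of the "words" idea with the quantization described above:
# 5 ms wait steps, 128 -> 32 loudness levels, 129 -> 17 instrument groups.
# All names and the exact word spelling are illustrative placeholders.
WAIT_STEP_MS = 5            # time resolution of the wait words
NUM_LOUDNESS_BINS = 32      # 128 MIDI velocities compressed by a factor of 4
NUM_INSTRUMENT_GROUPS = 17  # 129 MIDI programs grouped by category (8 pianos, ...)

def event_to_words(delta_ms, program, pitch, velocity):
    """Turn one note-on event into a short list of word tokens."""
    words = []
    # Wait words: quantize the time since the previous event to 5 ms steps.
    # Very long waits would be split into several wait words
    # (about 180 distinct wait words in total, as mentioned above).
    steps = round(delta_ms / WAIT_STEP_MS)
    if steps > 0:
        words.append(f'WAIT_{steps}')
    # Instrument word: map the MIDI program (0..128, incl. drums) to a group.
    words.append(f'INSTRUMENT_{program * NUM_INSTRUMENT_GROUPS // 129}')
    # Loudness word: 128 velocity values -> 32 bins.
    words.append(f'LOUDNESS_{velocity * NUM_LOUDNESS_BINS // 128}')
    # Note word: the pitch itself stays at full resolution.
    words.append(f'NOTE_ON_{pitch}')
    return words

print(event_to_words(delta_ms=12, program=0, pitch=60, velocity=100))
# -> ['WAIT_2', 'INSTRUMENT_0', 'LOUDNESS_25', 'NOTE_ON_60']
```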
* Show some failed results from the recording (they sound funny) and talk about how we fixed those issues
* "Okay so we've got a good understanding of how the encoding should work now, let's hear how it works on What is love" ((explain what song that is or show it or sth?))
* *Show WIL fail 1.mid*
* "Ouh that seems to be a little fast how about we work on the timing little"
* *Show WIL fail 2.mid*
* (It's ridiculously fast)
* "Whoops that's the wrong direction let's try that again"
* *Show WIL fail 3.mid*
* "Okay that's better but if you listen closely the notes just seem to never end"
* Explain why
* MIDI has 16 channels on which notes can be played, but the MIDI standard has 129 instruments
* To use all instruments "program change" events occur in the middle of songs to change which instruments the channels stand for (or equivalently which instruments are available to play).
* Our "words" encoding for the transformer only stores info about instruments being played not about this channel abstraction
* When reconstructing midi files from our words encoding we have to choose which channel is mapped to which instrument ourselves.
* First we took a naive approach but unfortunately it seems that when you switch a channel away from an instrument which currently has notes playing it's impossible to terminate those notes later, even when you switch a channel back to that instrument and send a "stop this note event"
* So we have to forcefully end all notes playing on the instrument that is being swapped out, otherwise those notes keep playing forever.
* To minimize the effect of that on playback quality, we keep track of all notes that are currently ringing on all instruments, find the instrument with the fewest (hopefully zero) ringing notes, and choose that one to be swapped out for the new instrument we want to map a channel to. This works perfectly in practice, as there is always an instrument with zero notes playing (otherwise the original MIDI file would also have this problem of forever-ringing notes). (See the sketch below.)
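* A minimal sketch of that swap heuristic (simplified placeholder data structures, not our actual implementation):

```python
# Minimal sketch of the channel/instrument swap heuristic described above.
# `channel_instrument` maps channel -> instrument currently assigned to it,
# `ringing_notes` maps instrument -> set of pitches currently sounding.
# All names are illustrative; our real code tracks more state than this.

def assign_channel(new_instrument, channel_instrument, ringing_notes, events_out):
    """Pick a channel for `new_instrument`, ending as few ringing notes as possible."""
    # Channel 9 is reserved for drums by the MIDI standard, so skip it here.
    candidates = [ch for ch in channel_instrument if ch != 9]
    # Choose the channel whose current instrument has the fewest ringing notes.
    best = min(candidates,
               key=lambda ch: len(ringing_notes.get(channel_instrument[ch], set())))
    old_instrument = channel_instrument[best]
    # Forcefully end any notes still ringing on the instrument being swapped out,
    # otherwise they would keep sounding forever after the program change.
    for pitch in sorted(ringing_notes.get(old_instrument, set())):
        events_out.append(('note_off', best, pitch))
    ringing_notes[old_instrument] = set()
    # Re-map the channel to the new instrument via a program change event.
    events_out.append(('program_change', best, new_instrument))
    channel_instrument[best] = new_instrument
    return best
```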
* "Okay great so after we figured that out everything should work now"
* *Show WIL win.mid*
* "Nice as you can hear everything sounds fine now"
* Maybe touch on issues that still persist and quirks you might notice
* Timing is slightly off due to rounding errors, but that's not a noticeable problem with most other MIDI files.
* We also have a time resolution hyperparameter which can help us alleviate these problems
* "So if you compare some reconstructed midi file to the original"
* *Compare first 7 seconds of 1chance.mid and 1chance-recon.mid*
* "You'll notice that the instrument sounds are a little different. That is because we compressed the original 129 instruments of the midi standard into 17 different instruments based on category. So the flute sound is a little different, but it still sounds nice right?"
* "But if you listen to the beginning you'll notice the reconstructed file plays a piano where the original plays a hihat"
* That's because there are some quirks in the MIDI format, namely the drums don't have their own instrument number which you can map a channel to; instead they are hard-mapped to channel 9 for some reason. There was a bug where the instrument number -1 was treated as the 'empty' instrument in some parts of the program but as 'drums' in others.
* There were a few more small bugs to be ironed out but now everything works
* We've done a lot of implementation work on this "words" encoding
* We got about 137000 songs encoded and ready to be fed to a transformer model
* Efforts to speed up processing
* Maybe touch on our efforts to speed up processing and keep the data storage footprint small
* ((Talk about how my Google Drive ran out and I couldn't receive any emails and I had to buy Google storage for precious 2€))
* Simplifying data:
* transpose into different keys, to generate more data,
* we decided not to use this because of our already huge amount of data
* or transpose everything into the same key to simplify learning, e.g. C major and A minor
* we might still try it, planning to experiment with it if we still have time (see the sketch below)
* might be hard to transpose because the data might not be labeled with the key of the song
* will not have the correct effect if songs change their key or use notes outside the typical C major scale
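* A minimal sketch of what transposing looks like at the note level (the hard part, detecting the key, is not shown, since the data may not be labeled with it):

```python
# Minimal sketch of transposition at the MIDI pitch level; key detection itself
# is the hard part and is not shown (keys are often not labeled in the data).
def transpose(pitches, semitones):
    """Shift every MIDI pitch by `semitones`, clamping to the valid 0..127 range."""
    return [min(127, max(0, p + semitones)) for p in pitches]

# Example: a C major triad (C4, E4, G4) moved up a whole tone to D major.
print(transpose([60, 64, 67], 2))   # -> [62, 66, 69]
```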
---
* training and generation of music:
* we will use the Python trax library from Google, which provides different transformer models
* transformers can be used to generate words (like GPT generates natural text)
* will use a transformer language model (only the decoder-style half of the architecture, not a full encoder-decoder), because we want to predict the next word, not translate sequence to sequence
* idea to take a part of a sequence and have it predict the next word
* for training, compare the result to the ground truth (the next word) and adjust the network accordingly (in the sense of a classification problem)
* the longer the sequences the better, the more context the network has to predict the next note
* but increases computation time
* use sequences of different lengths, and use padding tokens to give them the same length (see the training sketch below)
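* A minimal training sketch with trax (not our final training code; VOCAB_SIZE, MAX_LEN and train_batches() are placeholders for our real vocabulary and data pipeline):

```python
# Minimal sketch of training a TransformerLM over our word vocabulary with trax.
# VOCAB_SIZE, MAX_LEN and train_batches() are placeholder assumptions.
import trax
from trax import layers as tl
from trax.supervised import training

VOCAB_SIZE = 512    # assumption: number of distinct "words" in our encoding
MAX_LEN = 2048      # assumption: longest (padded) sequence we feed the model

model = trax.models.TransformerLM(
    vocab_size=VOCAB_SIZE, d_model=512, d_ff=2048,
    n_layers=6, n_heads=8, max_len=MAX_LEN, mode='train')

# train_batches() should yield (inputs, targets, weights) batches of word ids;
# weights of 0 mask out the padding tokens mentioned above.
train_task = training.TrainTask(
    labeled_data=train_batches(),
    loss_layer=tl.CrossEntropyLoss(),
    optimizer=trax.optimizers.Adam(0.0005),
    n_steps_per_checkpoint=500)

loop = training.Loop(model, train_task, output_dir='model_output')
loop.run(n_steps=100_000)
```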
* for word generation: take a start sequence, have the network predict the next word, and recursively apply the network to the extended sequence
* in our case the words encode music notes and chords
* after the generation transform the words back to midi data
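* A minimal sketch of that recursive generation idea (framework-agnostic; `model_logprobs` is a placeholder for a call to the trained model that returns log-probabilities over the vocabulary; trax also ships decoding helpers in trax.supervised.decoding):

```python
# Minimal sketch of autoregressive word generation by repeatedly sampling the
# next word and appending it to the sequence. `model_logprobs` is a placeholder
# for the trained language model.
import numpy as np

def generate(model_logprobs, start_words, n_new_words, temperature=1.0):
    words = list(start_words)                    # seed sequence of word ids
    for _ in range(n_new_words):
        logp = model_logprobs(np.array([words]))  # shape (1, len(words), vocab)
        logits = logp[0, -1] / temperature        # distribution for the next word
        probs = np.exp(logits - logits.max())     # softmax over the vocabulary
        probs /= probs.sum()
        next_word = np.random.choice(len(probs), p=probs)
        words.append(int(next_word))              # recursively extend the sequence
    return words

# The generated word ids are then decoded back into MIDI events with our
# reverse mapping.
```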
---
* Ideas for what we want to do with our paper
* Test different methods and compare the results, for example compare combinations of:
* Transformer / LSTM / maybe Temporal CNN
* Rasterized time encoding (based on musical units of time like 16ths) or free time encoding (based on milliseconds)
* Non-rasterized timing allows for dynamic speed changes by the network and is often used in older music. It can make the music much more expressive and natural
* But rasterized timing reduces the complexity of the information the network is fed and should make it easier for the network to learn about rhythm (it encodes the musical content of rhythm much more clearly)
* But it’s less flexible
* Also it requires the training-set MIDI files to be good citizens and contain accurate information about their tempo (they’re not required to, as the truly necessary timing info is not rasterized either)
* Compare matrix based encoding with 1d (Does it help the network learn about polyphony and chords like we predicted?)
* train on different genres:
* either train the model with different datasets for different genres, or with one huge one (which is unlikely to have good results)
* maybe depends on computing power
* might be interesting to see if model can learn specifics of the genre
* evaluate the results:
* difficult because music is different for everybody
* possibilities:
* ask people to rate generated music, from 10 (very plausible) to 1 (not plausible)
* ask people who are "experts" (e.g. friends who study music) to evaluate aspects like melody, chords, motifs etc. This is also genre specific: how well the generated piece fits into the genre, a deeper analysis of the music piece