Leo's Home page -- Github Page -- License: CC BY-SA 4.0

MIDI encoding analysis

Leonardo M. Rocha

Contact Me

This document contains a complete step-by-step description of the process and ideas behind the proposed encodings, so it might be long.

The ideas here were inspired by the work I was doing to help with Few Shot Music Generation project. My fork is located here

TLDR:

Instead of being a single huge one-hot vector (dimension 4708), the encoding can be separated into several inputs, each of them a one-hot vector of lower dimensionality, plus a time encoding given by a sum of sinusoids with different periods.

The source code to perform the encoding automatically is provided as a Python package built on NumPy.

Introduction

This analysis looks for an alternative encoding to try in conjunction with the previous one.

The current encoding, based on the Magenta project, has the following properties:

The idea of the current analysis is to discuss the following points:

The current encoding treats a TIME_SHIFT event as a separate event. In my opinion the time of an event is an intrinsic property of it, so separating it might make it more difficult for the network to learn what the time means.

The other issue is that similar structures (in music as well as in other kinds of languages) might repeat at different time scales. The fact that the time event is treated as a separate event with its own duration (actually 100 different types of TIME_SHIFT event) might make it harder for the network to detect patterns. Added to the high number of input features (4708), this makes me think that the training time and the number of samples needed for training might be large.

In this study I propose another encoding that I would like to try, reducing the number of features by 1 to 2 orders of magnitude and providing a Fourier-series-inspired approximation of time and velocity.

I know this is not a traditional encoding, but I think it might be worth trying (especially as I don't have many computational resources).

Basic Idea Description

The basic idea is that a feature vector will be composed of several different features (7), each encoded in a different way.

From a midi file extract:

Event = [instrument, octave, note(s), action (on|off), velocity, time]
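As a minimal sketch of building such an event tuple, the helper below splits the 0-127 MIDI note number into an octave and a pitch class; the function name and the convention `octave = note_number // 12` are illustrative assumptions, not part of a fixed spec:

```python
# Sketch of the proposed event layout:
# Event = [instrument, octave, note, action (on|off), velocity, time]

def midi_note_to_event(instrument, note_number, action, velocity, time_ms):
    """Split the 0-127 MIDI note number into octave and pitch class."""
    octave = note_number // 12   # 0..10 for note numbers 0..127
    note = note_number % 12      # 0..11 (C, C#, ..., B)
    return [instrument, octave, note, action, velocity, time_ms]

# Middle C (MIDI note 60) pressed on instrument 0 at t = 125 ms:
event = midi_note_to_event(0, 60, "on", 80, 125)
```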

Codes:

Encoding Problems & Questions

As there might be more than one note, octave or instrument event at the same time, this could be dealt with by:

  1. doing an actual one-hot encoding and separating all the events, but writing the same event time -> can this confuse the network?
  2. doing a multi-hot encoding and trying to find a way out of this during the decoding stage (something like auto-correlation or a sort of dynamic routing?). This would also mix the signals of the velocity of each note, which is problematic for the purpose of the exercise.

Features candidates for One-hot Encoding

As MIDI events are independent (each note on|off event of a single instrument is a different event), for the purpose of this encoding proposal the solution is to do an actual one-hot encoding and separate all the events. A single event will correspond to a single instrument and note event with the corresponding velocity and time encoding.

The basic encoding of the event will contain a minimum of 41 dimensions, to which the velocity and time encodings must be added.
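A sketch of that 41-dimensional event part follows. The split 16 instruments + 11 octaves + 12 notes + 2 actions = 41 is an assumption chosen only to match the stated minimum; the real sub-vector sizes may differ.

```python
import numpy as np

# Hypothetical sub-vector sizes summing to the 41 dimensions mentioned above.
SIZES = {"instrument": 16, "octave": 11, "note": 12, "action": 2}

def one_hot(index, size):
    """Return a one-hot vector of the given size."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def encode_event(instrument, octave, note, action):
    """Concatenate the one-hot encodings of each event field."""
    parts = [
        one_hot(instrument, SIZES["instrument"]),
        one_hot(octave, SIZES["octave"]),
        one_hot(note, SIZES["note"]),
        one_hot(action, SIZES["action"]),  # 0 = off, 1 = on
    ]
    return np.concatenate(parts)

vec = encode_event(instrument=0, octave=5, note=0, action=1)
```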

Velocity encoding

Both solutions are presented here: one-hot encoding and Fourier-inspired encoding.

One-Hot encoding of Velocity

In this case the same number of velocities will be used as in the current approach, which means that the velocity vector will have a dimension of 32.

Sinusoidal (Fourier Inspired) Encoding of Velocity

The target resolution is to be able to differentiate at least 32 different velocities.

As the time encoding analysis later on will show, we can deal with a resolution of 256 velocities with $8$ sinusoids (since $256 = 2^8$), which would take the dimensionality of the input to $41 + 8 = 49$.

For a resolution of 32 velocities we only need $5$ sinusoids (since $32 = 2^5$), taking the dimensionality of the input to $41 + 5 = 46$.
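A minimal sketch of such an encoding: 5 sinusoids with doubling periods for the 32 velocity levels. The specific periods (4, 8, 16, 32, 64) are an assumption; the document only fixes the count, but with this choice every integer velocity in [0, 32) gets a distinct 5-dimensional code.

```python
import numpy as np

N_LEVELS = 32                  # velocity resolution
PERIODS = [4, 8, 16, 32, 64]   # doubling periods, one per sinusoid (assumed)

def encode_velocity(v):
    """Map an integer velocity in [0, 32) to a 5-dimensional sinusoidal code."""
    return np.array([np.sin(2 * np.pi * v / T) for T in PERIODS])

# One code vector per velocity level, stacked into a (32, 5) matrix.
codes = np.stack([encode_velocity(v) for v in range(N_LEVELS)])
```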

Time encoding

Why not linear encoding?

Let's consider a simple linear function: this cannot work, as it diverges (time grows without bound).

Let's imagine an encoding where the time is cut into 3 different parts:

The functions will be linear within a range, but will have undefined elements at the borders; the derivatives are undefined there, and NNs are known not to behave nicely with these kinds of functions.

Then we can think of a continuous function; sinusoidal ones are really good for this, well studied and implemented in every standard library.

Time resolution Analysis

Now let's think for a moment from the point of view of listening to music (instead of time passing). For this we'd like a time resolution:

Note: there are studies on real-time latency issues, but that is not our concern here.

For a perfect reconstruction the sampling frequency to discretize a function is given by the Nyquist theorem:

If a function x(t) contains no frequencies higher than B hertz, 
it is completely determined by giving its ordinates at a series 
of points spaced 1/(2B) seconds apart.

This means that perfect reconstruction at a given sample rate $f_s$ is guaranteed for a band limit $B < f_s/2$.

Let's imagine that we want to approximate the time for a minimum and maximum time interval, we could use sum of sinusoids to approximate it.

In our case $B = 100\,Hz$ for the maximum frequency (events of 10 ms, i.e. 100 per second), hence the Nyquist sample spacing is 5 ms. Taking into account that there will also be reconstruction errors and other issues, we could argue for a spacing below 5 ms, let's say 4 ms to be sure.

And for the 10 minute case this gives us $10 \times 60 = 600\,s$, so the lowest frequency is $1/600 \approx 0.00167\,Hz$, which by the theorem above requires samples spaced at most $1/(2B) = 300\,s$ apart.
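A quick numeric check of these two extremes, using the Nyquist spacing $1/(2B)$ from the theorem quoted above (values in seconds):

```python
def nyquist_spacing(bandlimit_hz):
    """Maximum sample spacing (seconds) for perfect reconstruction."""
    return 1.0 / (2.0 * bandlimit_hz)

fast = nyquist_spacing(100.0)      # 10 ms events  -> B = 100 Hz
slow = nyquist_spacing(1.0 / 600)  # 10 min span   -> B ~ 0.00167 Hz
```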

As a reference, the upper limit of human hearing is about $20\,kHz$ and the CD sampling rate is $44.1\,kHz$.

With a sampling in each resolution level we should be able to reconstruct the patterns at different time-scales.


Time encoding details

Intuition

We are using time as a reference; not all of time, but just a part of it, a range big enough to accommodate our work. In the case of music a few minutes (~10?) would suffice. Lyrics are even shorter, if we count letters instead of the milliseconds in a music piece.

Periodic functions should make it easier for a NN to learn periodic elements (languages have periodic elements) than a linear one.

The Fourier Series can represent periodic functions with sums of sinusoidal functions.

To represent a linear function with a Fourier series, we can think of a sawtooth wave and take only one period.

(TODO graph here)
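The intuition can be checked numerically. Below is a sketch of the partial sums of the standard sawtooth Fourier series, $t/T = 1/2 - \frac{1}{\pi}\sum_{n\geq 1} \frac{\sin(2\pi n t/T)}{n}$, which approximate the linear ramp within one period (away from the jump at the period boundary):

```python
import numpy as np

def sawtooth_partial_sum(t, T, n_terms):
    """Truncated Fourier series of the sawtooth t/T on [0, T)."""
    n = np.arange(1, n_terms + 1)
    # broadcast: one row per time sample, one column per harmonic
    harmonics = np.sin(2 * np.pi * np.outer(t, n) / T) / n
    return 0.5 - harmonics.sum(axis=1) / np.pi

T = 1.0
t = np.linspace(0.1, 0.9, 50)  # stay away from the discontinuity at t = T
approx = sawtooth_partial_sum(t, T, 500)
max_err = np.abs(approx - t / T).max()
```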

Reference Formulas

Euler's Formula \begin{equation} e^{ix} = \cos(x) + i\sin(x) \end{equation}

\begin{equation} \cos(x) = \frac{e^{ix} + e^{-ix}}{2} \end{equation}\begin{equation} \sin(x) = \frac{e^{ix} - e^{-ix}}{2i} \end{equation}

Fourier Series

\begin{equation} f(x) = \sum_{n=-\infty}^\infty(c_n(f) e^{i2\pi {n/T} x} ) \end{equation}

where \begin{equation} c_n(f) = \frac{1}{T} \int_{-T/2}^{T/2} f(t)\, e^{-i2\pi {n/T} t}\, dt \end{equation}

In our case there is no need to go as far as using the full Fourier Series, but it gives an inspiration for how to encode the values. Also, we don't want (or even need):

We could use either the complex-domain values or go for the Discrete Cosine Transform, but we'll do something simpler: choosing sinusoids.

Let's come back to time and think about how we can encode it; time is a linear function that just grows.

For this analysis we constrain the time to a segment between $0\,s$ and a max value of $M\,s$ (in seconds).

To be able to recreate the original function we would need a series of functions that can represent elements that go from the maximum to the minimum resolution.

Let the first sinusoid represent values up to the period $T = 5\,ms = 0.005\,s$ and the last one up to $M\,s$.

Now let's work directly in milliseconds and forget about the unit; for the purpose of the experiment this does not affect the result, as long as we can reconstruct the input correctly.


Hypothesis (not proved yet, work in progress...)

The main idea:

To mix the Fourier Series with Shannon sampling, trying to reconstruct not only the input signal patterns at each time scale, but also the time signal itself (a linear function) restricted to a period, interpreting it as a sawtooth wave. This means we should be able not only to give different time scales to the network, but also to reconstruct the (locally) linear signal in an easy way, which is needed to re-encode a MIDI or text file from the output of the network.

We can encode any value in $[0,5]\,ms$; for the next resolution level we duplicate the period $T$. Advancing like this, we need at least $n$ values such that $2^n \geq M$, and for the minimum resolution we need at most $n$ values such that $2^n < 5$.

Finding the exponents can be quickly achieved with a bit shifting trick:
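A minimal sketch of that trick, finding the smallest $n$ with $2^n \geq M$ by shifting (the helper name is illustrative):

```python
def exponent_for(M):
    """Smallest n such that 2**n >= M, for a positive integer M."""
    n = 0
    while (1 << n) < M:  # shift left until 2**n reaches M
        n += 1
    return n

# 10 minutes in milliseconds -> 600000 ms needs n = 20 (2**20 = 1048576):
n_max = exponent_for(600_000)
```

Equivalently, Python's `(M - 1).bit_length()` gives the same exponent for `M >= 1` without an explicit loop.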

WARNING

I've found an error in the reasoning (to verify): I need to do one approximation PER SCALE, not one for all scales together...

Which means that we can encode 10 minutes' worth of music with a $20 - 2 = 18$ dimensional vector, or $36$ if we count a complex-valued vector.

For a total of $ 46 + 36 = 82 $ dimensions instead of $ 4708 $

Let's think of the encoding as the $\sin$ and $\cos$ values, with the network working directly in this domain and giving only $\sin$ and $\cos$ values as results; this allows the network to work with sinusoidal periodic functions instead of the time domain.

The decoding will take place later, to be able to do a reconstruction.
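As a sketch of that decoding step, assuming the network emits a $\sin$/$\cos$ pair per scale (the period value below is illustrative), the time modulo the period can be recovered with `arctan2`:

```python
import numpy as np

def decode_time(s, c, T):
    """Recover t mod T from s = sin(2*pi*t/T) and c = cos(2*pi*t/T)."""
    phase = np.arctan2(s, c)                   # phase angle in (-pi, pi]
    return (phase % (2 * np.pi)) * T / (2 * np.pi)

T = 640.0  # ms, one of the coarser (assumed) scales
t = 123.0
recovered = decode_time(np.sin(2 * np.pi * t / T),
                        np.cos(2 * np.pi * t / T), T)
```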

The series needed is then one representing a sawtooth wave, but restricted to the time period we need. This allows us to represent periodic elements inside a linear function.

The other advantage of this function is that we can approximate it with a real-valued function only, avoiding the complex domain and restricting the number of coefficients (and so the dimensions) needed to the real-valued ones.

This gives us a total of $ 46 + 18 = 64 $ dimensions for a complete representation of the input vector.

What is also great about this encoding is that if we create hierarchical representations, where each layer in the hierarchy corresponds to a different base resolution (ms, s, minutes, ...), then with a limited number of extra dimensions we could encode increasingly longer time periods.


WARNING: there is still something wrong with the scale of the reconstruction. Also, there is an issue that I have not managed to reproduce: the deformation that occurred before with the samples... I have to find out what is going on there.

Other tools (python)

There are already libraries available that can compute the Fourier series approximation of a function, which makes this easier to compute.

Here are some tests with sympy to compute it.
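For instance, sympy's `fourier_series` can produce the series of a sawtooth-like ramp symbolically; the identity function on $(-\pi, \pi)$ is used here as an illustrative stand-in for one period of the sawtooth:

```python
from sympy import fourier_series, symbols, pi

x = symbols('x')

# Fourier series of f(x) = x on (-pi, pi): 2*sin(x) - sin(2x) + (2/3)*sin(3x) - ...
s = fourier_series(x, (x, -pi, pi))
first_term = s.truncate(1)   # leading term of the series
```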