Learning is the acquisition of knowledge by study or instruction. As such, we see it as something additive: by repeated exposure to facts and concepts we “add” knowledge to our understanding. When we forget some of that knowledge, it is seen as unintentional, a mere side effect of the limited capacity of the brain. This idea also seems to be present in contemporary models in neuroscience, where the Hebbian learning rule describes learning as adding associations between neurons: what fires together wires together. The flip side of this rule is seldom explored: the association between neurons can also become weaker as others gain importance.
One of the defining ideas of Artificial Neural Networks is applying the Hebbian learning rule to a machine learning system. A network of connected “neurons” is repeatedly exposed to training data to teach it some concept. In some cases this can be a simple classification task: Does this image contain a cat? In other cases, the function might be more generative: Given the stock price of the last few weeks, predict the price tomorrow. Given the correct output, the error of the network on the training data can be computed. Using this error, the connections between the responsible neurons are either strengthened or weakened.
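The Hebbian intuition described above, including its flip side, can be illustrated with a small sketch. It is not taken from the essay: the function name `hebbian_step`, the learning rate, and in particular the renormalization of the weight vector are assumptions chosen for illustration. Renormalization is one simple way to make connections compete, so that strengthening some associations weakens the others.

```python
import math

def hebbian_step(weights, inputs, rate=0.1):
    """One Hebbian update followed by renormalization (illustrative assumption)."""
    # Output activity of a single linear "neuron".
    activity = sum(w * x for w, x in zip(weights, inputs))
    # "Fires together, wires together": each weight grows with input * output.
    updated = [w + rate * x * activity for w, x in zip(weights, inputs)]
    # Keep the weight vector at unit length, so growth of some weights
    # comes at the expense of the others.
    norm = math.sqrt(sum(w * w for w in updated))
    return [w / norm for w in updated]

# Four equally strong connections; only the first two inputs are ever active.
weights = [0.5, 0.5, 0.5, 0.5]
pattern = [1.0, 1.0, 0.0, 0.0]
for _ in range(20):
    weights = hebbian_step(weights, pattern)
print([round(w, 3) for w in weights])
```

In this toy setting the connections to the two active inputs become stronger, while the connections to the silent inputs fade, although nothing ever "unlearns" them explicitly.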
In this work we want to explore how we can use the learning process itself to generate interesting glitches and distortions. For this purpose we incrementally train “Wavenet”, a neural network specialized for audio generation, with musical samples. We start with a single sample and train the network until it has memorized it as well as it can. We then incrementally add new samples to our training set, to encourage the network to learn the essential parts of our samples. By capturing the reproduction after each additional training step, we gain insight into the learning process of a neural network. We want to highlight some interesting results we achieved by using a training set that consists of songs from the iconic hip hop group “Public Enemy”. Our primary training sample was their song “Don’t Believe the Hype”, consisting of 220 seconds of audio. The Wavenet network is never able to reproduce the piece perfectly, converging after a hundred thousand iterations.
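The incremental procedure described above can be summarized as a short loop. This is only a sketch of the schedule, not the code used for the work: `incremental_training`, `train`, and `reproduce` are hypothetical names, and the two callables stand in for whatever Wavenet training and generation routines are actually used.

```python
def incremental_training(samples, steps_per_stage, train, reproduce):
    """Grow the training set one sample at a time and record a snapshot
    of the reproduction of the first sample after every stage.

    `train(training_set, steps)` and `reproduce(sample)` are hypothetical
    callables supplied by the caller; they are placeholders for the real
    model code.
    """
    training_set = []
    snapshots = []
    for sample, steps in zip(samples, steps_per_stage):
        training_set.append(sample)            # add one new song
        train(list(training_set), steps)       # continue training on the larger set
        snapshots.append(reproduce(samples[0]))  # how does the original sound now?
    return snapshots

# Usage with stand-in functions that only log what would happen.
log = []
snapshots = incremental_training(
    ["Don't Believe the Hype", "Can't Truss It"],
    [100_000, 20_000],
    train=lambda songs, steps: log.append((len(songs), steps)),
    reproduce=lambda song: f"reproduction of {song!r} after stage {len(log)}",
)
```

Listening to the snapshots in order is what exposes the learning process: each entry is the same song as reproduced by a network that has meanwhile been pushed to accommodate more material.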
The reproduction at a high temperature like 0.9 sounds like a very tortured version of the rap sections, with two audible features of the song: a bass drum that sets in twice, and the high-pitched scratch noise characteristic of the song as a continuous tone in the background. Sampling at lower temperatures reveals a more human-sounding babbling that resembles the rap voice but is still far removed from actually pleasing music. These “smeared”, ghostly artifacts seem to be characteristic of systems based on learned correlations.
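For readers unfamiliar with the term, the following sketch shows what a sampling temperature such as 0.9 commonly means for a model that predicts a probability distribution over the next audio value. It assumes the usual formulation in which the model's raw scores (logits) are divided by the temperature before the softmax; whether the implementation used here applies temperature in exactly this way is an assumption, and the function names are hypothetical.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [value / temperature for value in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_value(logits, temperature, rng):
    """Draw the index of the next audio value from the tempered distribution."""
    probabilities = apply_temperature(logits, temperature)
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

# Toy scores for three possible next values.
toy_logits = [2.0, 1.0, 0.0]
rng = random.Random(0)
for temperature in (0.9, 0.6):
    probabilities = apply_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probabilities],
          sample_next_value(toy_logits, temperature, rng))
```

Lower temperatures concentrate probability on the values the network considers most likely, so the output becomes more repetitive and less noisy; higher temperatures let unlikely values through more often.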
When we add the song “Can’t Truss It” to the training dataset and train for an additional twenty thousand iterations, the reproduction of our original sample changes dramatically. Instead of a very noisy reproduction of unintelligible rap, the network produces a sound like a man screaming “Don’t. Doooon’t”. Driven by the pressure to generalize, the network recognized that the important feature of “Don’t Believe the Hype” is the phrase “Don’t”, repeated many times in the song as a kind of chorus.
When we add the song “Bring the Noise” to the training dataset, the reproduction again changes dramatically. Sampling at a high temperature like 0.9 yields a series of characteristic “scratching” and high-pitched “shriek” noises, like those heard in “Don’t Believe the Hype” at the end of verses when the DJ rewinds the vinyl. Sampling at lower temperatures like 0.7 or 0.6 increases the prevalence of this “shriek” noise, sometimes as a continuous tone.