By: Alicia Wang [PL], Alena Chao, Eric Liu, Zane Mogannam, Chloe Wong, Iris Zhou
So, what is lofi bytes?
Picture this: it is midterm season, you have a ton of work to finish, and you need a long library grind session. You reach for your headphones, but instead of turning on Spotify, you open lofi bytes: an aesthetic web app that takes song samples and outputs chill lofi, complete with customizable background sounds and beats.
Over the Spring 2023 semester, our team built an integrated, user-friendly web application that lets users generate lofi tracks from input MIDI samples and customize them further with sounds of rain, fire, and cafe ambiance. This article outlines our process from education to final product, walks through everything we did, from training an ML model to building a full-stack application, and reflects on limitations, extensions, and further learning opportunities.
Check out our website at https://callaunchpad.github.io/lofi-bytes-app/! This is a semester project by Launchpad, a creative ML organization founded on the UC Berkeley campus.
What is our data?
When we started our project, we wanted a good idea of what kind of data we would be working with, since this would ultimately shape which ML model we used. There are two main types of music data. The first is the one you are probably most familiar with: digital audio, the representation of sound as raw waveforms. These files are large and information-dense, so generating new audio from them requires big models and long training and generation times. There has been previous work on generating raw audio (e.g., Jukebox from OpenAI), but that approach might not be the best fit for a fast, user-friendly application.
So, we turned to MIDI data instead. MIDI (Musical Instrument Digital Interface) is a standard that represents musical information digitally, storing details such as note pitch, duration, and velocity (roughly, how loud each note is played). You may have seen this representation of music online. It aligned with what we wanted to do: it is compact, allowed us to generate music quickly, and gave users more control over the beats we generated.
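As a concrete illustration, here is a small sketch of the kind of note-level information a MIDI file exposes. It assumes the pretty_midi library and a hypothetical sample.mid clip, neither of which is necessarily what we used.

```python
# Minimal sketch: inspecting a MIDI file's note data with pretty_midi
# (the library and the file name are assumptions for illustration).
import pretty_midi

midi = pretty_midi.PrettyMIDI("sample.mid")   # hypothetical lofi piano clip

for instrument in midi.instruments:
    for note in instrument.notes[:5]:         # peek at the first few notes
        print(
            f"pitch={note.pitch}, "
            f"duration={note.end - note.start:.2f}s, "
            f"velocity={note.velocity}"
        )
```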
The first problem: no data
One of the biggest problems we encountered from the start was the lack of lofi MIDI data to train our model on. Unfortunately, most of the lofi data we found was raw digital audio. We initially trained on a lofi dataset from Cymatics; however, it contained only 93 short tracks, so it is no surprise that our results were less than satisfactory.
The solution: just get more data!
Looking for external datasets was a dead end, so we had to get creative and make our own!
We started by making a YouTube playlist of lofi compilations, with each group member contributing a compilation featuring clean piano and minimal layered instrumentation or background noise, so the conversion process would give us cleaner MIDI. We then wrote scripts that automatically download these compilations as MP3s, chop them into 90-second segments, and convert those clips to MIDI files. The result was a new dataset of over 7 hours of lofi piano MIDI, which we later used to train our model for 500 epochs.
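For illustration, here is a rough sketch of what that pipeline could look like. The specific tools (yt-dlp for downloading, pydub for chopping, the basic-pitch CLI for audio-to-MIDI conversion), paths, and playlist URL are assumptions, not necessarily the scripts we actually wrote.

```python
# Hypothetical reconstruction of the data pipeline; the tools (yt-dlp, pydub,
# basic-pitch) and all paths/URLs are assumptions for illustration.
import subprocess
from pathlib import Path

from pydub import AudioSegment

PLAYLIST_URL = "https://www.youtube.com/playlist?list=..."  # placeholder playlist
RAW_DIR, CLIP_DIR, MIDI_DIR = Path("raw_mp3"), Path("clips"), Path("midi")
for d in (RAW_DIR, CLIP_DIR, MIDI_DIR):
    d.mkdir(exist_ok=True)

# 1. Download every compilation in the playlist as an MP3.
subprocess.run(
    ["yt-dlp", "-x", "--audio-format", "mp3",
     "-o", str(RAW_DIR / "%(title)s.%(ext)s"), PLAYLIST_URL],
    check=True,
)

# 2. Chop each MP3 into 90-second clips.
CLIP_MS = 90 * 1000
for mp3 in RAW_DIR.glob("*.mp3"):
    audio = AudioSegment.from_mp3(mp3)
    for i, start in enumerate(range(0, len(audio), CLIP_MS)):
        clip = audio[start:start + CLIP_MS]
        clip.export(CLIP_DIR / f"{mp3.stem}_{i}.mp3", format="mp3")

# 3. Transcribe each clip to MIDI (basic-pitch is one possible converter).
for clip in CLIP_DIR.glob("*.mp3"):
    subprocess.run(["basic-pitch", str(MIDI_DIR), str(clip)], check=True)
```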
Trial and error: our baseline LSTM model
To start off, we wanted to see how well a simple LSTM model would perform on our MIDI data. An LSTM (long short-term memory network) is good at recognizing and encoding long-range patterns, like those in our MIDI music. We built a simple 2-layer LSTM and first trained it on a simple MIDI piano dataset called Nottingham. While the model could generate a few beats of music following a primer, it eventually deteriorated into noise and lacked the long-term structure we wanted. Still, it was an excellent place to start: we knew we needed a more robust model and decided to turn to transformers to save the day.
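As a rough sketch of that baseline (not our exact code), here is a 2-layer LSTM in PyTorch trained to predict the next event in a tokenized MIDI sequence; the vocabulary size and dimensions are assumptions.

```python
# Sketch of a 2-layer LSTM baseline for next-event prediction on tokenized MIDI.
# Vocabulary size and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB_SIZE = 388  # assumed MIDI event vocabulary (note-on/off, time-shift, velocity)

class LSTMBaseline(nn.Module):
    def __init__(self, vocab=VOCAB_SIZE, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab)

    def forward(self, tokens):
        out, _ = self.lstm(self.embed(tokens))
        return self.head(out)   # logits for the next event at every position

model = LSTMBaseline()
batch = torch.randint(0, VOCAB_SIZE, (8, 64))   # 8 toy sequences of 64 events
logits = model(batch[:, :-1])                   # predict each following event
loss = nn.CrossEntropyLoss()(
    logits.reshape(-1, VOCAB_SIZE), batch[:, 1:].reshape(-1)
)
```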
Introducing the model of the day: MusicTransformer
After experimenting with different models, we landed on MusicTransformer, a model architecture developed by the Magenta team that could generate MIDI music with long-term structure.
In more detail, MusicTransformer is a neural network architecture that uses a transformer to generate polyphonic music with long-term coherence and structure. Transformers generally consist of layers that process an input sequence and generate an output sequence while tracking context and relationships across the entire input. This is especially handy for music, where the output should follow patterns established throughout the input.
Our model is trained on MIDI data, which encodes music as events with pitch, duration, velocity, channel, and so on. Because MIDI stores no waveform, the files are small, which makes it practical to work with large quantities of data. MusicTransformer takes in a sequence of MIDI events, referred to as a primer, and outputs new MIDI that follows the primer's patterns: its attention layers process the primer, extract features that capture the music's patterns and structure, and the model then generates a new sequence of MIDI events that continues those patterns.
One key feature of MusicTransformer is its ability to generate polyphonic music made up of multiple independent melodies or voices. The model uses conditional sampling, generating a new MIDI sequence step by step, with each step conditioned on everything generated so far. This is what lets the model produce music that matches the learned patterns of the primer.
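The sketch below shows what such a conditional sampling loop looks like in PyTorch. The tiny causal transformer, vocabulary size, and random primer are stand-ins for illustration, not MusicTransformer's actual implementation.

```python
# Hypothetical sketch of autoregressive (conditional) sampling: a tiny causal
# transformer stands in for MusicTransformer, and the primer is random tokens.
import torch
import torch.nn as nn

VOCAB_SIZE = 388   # assumed MIDI event vocabulary

class TinyCausalTransformer(nn.Module):
    def __init__(self, vocab=VOCAB_SIZE, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        # Causal mask so each position only attends to earlier events.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.encoder(self.embed(tokens), mask=mask)
        return self.head(h)

model = TinyCausalTransformer()
primer = torch.randint(0, VOCAB_SIZE, (1, 32))  # stand-in primer tokens

generated = primer
with torch.no_grad():
    for _ in range(64):                         # generate 64 new events
        logits = model(generated)[:, -1, :]     # logits for the next event only
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)   # sample an event
        generated = torch.cat([generated, next_token], dim=1)  # condition on it
```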
To give more weight to the most relevant parts of the primer when generating output, MusicTransformer uses self-attention. A self-attention layer computes attention scores between every pair of positions in the sequence and uses these scores to weigh how much each position contributes when generating each new event. In our case, self-attention captures the relationships between different musical events in the input sequence. By attending to the relevant events, the model can learn the complex dependencies and structure that make up the overall pattern of the primer, and then generate new music that follows those patterns.
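Here is a bare-bones sketch of those attention scores, computed for a toy sequence of embedded MIDI events. The shapes are made up, and MusicTransformer's real attention additionally encodes relative positions and uses multiple heads, which this sketch omits.

```python
# Toy single-head self-attention over embedded MIDI events (illustrative only).
import torch

seq_len, d_model = 16, 64
events = torch.randn(seq_len, d_model)      # stand-in embeddings of MIDI events

W_q = torch.randn(d_model, d_model)         # query, key, and value projections
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = events @ W_q, events @ W_k, events @ W_v
scores = Q @ K.T / d_model ** 0.5           # score between every pair of positions
weights = torch.softmax(scores, dim=-1)     # how strongly each event attends to the others
attended = weights @ V                      # context-aware representation of each event
```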
Results
Our results were not groundbreaking: evaluation accuracy plateaued at around 40%. However, the generated pieces definitely sounded like lofi music. Our model learned to play sustained chords following our primers and often introduced new melodies in the same key. It did sometimes play a couple of wrong notes, which we attributed to noise in our dataset: because we built the dataset from scraped and converted audio, the MIDI was often noisy, and our model unfortunately picked up on that noise.
Future Work
Although we are satisfied with our project, there is much more we could improve. A better, cleaner dataset of lofi music would greatly improve the model's generations, and we could fine-tune the model to produce more distinct melodies and chords. We were also looking at the Groove2Groove style-transfer model and wondering whether we could take a model that generates classical music and have it produce softer, jazzier lofi instead.
All about our website
Back-end
Our team used Flask to deploy our ML model as an API. Flask is a lightweight, easy-to-use web framework that lets developers build and deploy web applications quickly. By leveraging Flask's routing and request-handling capabilities, we expose our trained ML model as an endpoint that accepts incoming data and returns predictions in real time. This makes it possible to integrate the model into other applications, such as mobile apps, chatbots, and, in this case, our lofi bytes web application!
When a user uploads a MIDI file through our lofi bytes web app, an Axios POST request sends the file to the Flask back-end, where the ML model resides. The model generates an output MIDI file and returns it to our React front-end, and the generated lofi music is played for the user!
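As a sketch of what such an endpoint could look like (the route name, helper function, and response details are assumptions, not our production code):

```python
# Hypothetical Flask endpoint that accepts a primer MIDI upload and returns
# a generated MIDI file. Route name and helper are illustrative assumptions.
import io
from flask import Flask, request, send_file

app = Flask(__name__)

def run_music_transformer(primer_bytes: bytes) -> bytes:
    # Placeholder: decode the primer, sample a continuation with the trained
    # MusicTransformer, and re-encode the result as MIDI bytes.
    raise NotImplementedError

@app.route("/generate", methods=["POST"])        # hypothetical route
def generate():
    primer_bytes = request.files["midi"].read()  # MIDI uploaded by the front-end
    output_midi = run_music_transformer(primer_bytes)
    return send_file(
        io.BytesIO(output_midi),
        mimetype="audio/midi",
        download_name="lofi.mid",
    )

if __name__ == "__main__":
    app.run(port=5000)
```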
Front-end
React is a JavaScript library for composing separate components into a complex user interface. Its component model, props, and ecosystem of ready-made themes, buttons, and icons made it an obvious choice for building our website.
Front-End Architecture
- Home Screen contains the website skeleton: header, footer, margins, and background.
- Midi Generator contains the API connection: the user's uploaded MIDI file is sent to the Flask back-end, and the transformer-generated output is sent back.
- Synth contains the ambient sound interaction: users can adjust sliders to raise or lower the volume of drum beats, rain, cafe sounds, and fire.
Additionally, we used tone.js to play the MIDI music and the different ambiance audio tracks together. It also lets users adjust the individual volumes on our website, and we hope to introduce different instruments and sounds for the MIDI playback in the future.
Thank you for reading! Please check out our Github at https://github.com/callaunchpad/lofi-bytes.