This project aims to extend this line of research by predicting the genre and valence mood of an audio clip simultaneously, using a multi-output CNN to learn features from mel spectrograms generated from the audio.
Notably, an attention mechanism is applied in combination with the CNN to extract features from the audio samples. The structure is shown below. For the detailed model architecture, please refer to the paper.
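
To make the multi-output idea concrete, here is a minimal sketch in plain Python. It is not the project's model: the convolutional layers are omitted, and `frames` merely stands in for the per-time-step feature vectors a CNN would produce from a mel spectrogram. The dot-product attention pooling, the summed cross-entropy loss, the function names, and the class counts (8 genres, 2 mood classes) are all assumptions chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def attention_pool(frames, query):
    """Weight each time frame by softmax(frame . query), then sum over time."""
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights


def predict(frames, params):
    """One shared, attention-pooled embedding feeds two classification heads."""
    pooled, attn = attention_pool(frames, params["query"])
    genre_probs = softmax(linear(params["genre_head"], pooled))
    mood_probs = softmax(linear(params["mood_head"], pooled))
    return genre_probs, mood_probs, attn


def joint_loss(genre_probs, mood_probs, genre_label, mood_label):
    """Sum of both cross-entropy terms, so both tasks shape the shared layers."""
    return -math.log(genre_probs[genre_label]) - math.log(mood_probs[mood_label])


def init_params(dim, n_genres, n_moods, seed=0):
    """Random illustrative parameters; the sizes are hypothetical."""
    rng = random.Random(seed)

    def rand_row():
        return [rng.uniform(-0.5, 0.5) for _ in range(dim)]

    return {
        "query": rand_row(),
        "genre_head": [rand_row() for _ in range(n_genres)],
        "mood_head": [rand_row() for _ in range(n_moods)],
    }
```

Calling `predict(frames, init_params(dim=4, n_genres=8, n_moods=2))` on a list of 4-dimensional frame vectors returns one probability distribution per task plus the attention weights over time. In a real model the parameters would be trained by backpropagating `joint_loss` through both heads and the shared CNN.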