Audio Spectrogram Transformers in The Metaverse

Basic usage of AI/LLM to visualize research papers — with Prompts

Romesh Niriella
4 min read · Apr 16, 2024

WTF is an Audio Spectrogram Transformer?

First of all, the seed of my tree of thoughts:

AST: Audio Spectrogram Transformer

Yuan Gong, Yu-An Chung, James Glass MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA

Prompt: “usage of AST: Audio Spectrogram Transformer”

After attaching the above PDF source to ChatGPT, I prompted it with `usage of AST: Audio Spectrogram Transformer`.

A spectrogram is a visual representation of the spectrum of frequencies in a sound or other signal as they vary with time.
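To make that definition concrete, here is a minimal sketch of how a spectrogram can be computed with plain NumPy. It is my own illustration, not code from the AST paper: the function name `spectrogram`, the frame length of 256 samples, the hop of 128 samples, and the 1000 Hz test tone are all assumptions picked for the example.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via a short-time Fourier transform (illustrative sketch)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft yields frame_len // 2 + 1 frequency bins per frame.
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq_bins, time_frames)

sample_rate = 8000                       # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate  # one second of time stamps
tone = np.sin(2 * np.pi * 1000 * t)       # a pure 1000 Hz tone

spec = spectrogram(tone)
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / 256    # convert bin index back to Hz
```

Each column of `spec` is one moment in time and each row is one frequency band, so the pure tone shows up as a single bright horizontal line at `peak_hz`.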

For Visual Learners ♥

Being kind is always a good idea, especially to a machine. :)

vGPT4: How AST Works

The visualization captures the sequence from raw audio through to the output after transformer processing. Top: raw audio is converted into a spectrogram. Middle: the Transformer processes the spectrogram data points through multiple layers, focusing on different features in the spectrogram and learning complex patterns and relationships. Bottom: the output can be used for various purposes, such as classifying audio into different categories (e.g., music, speech, environmental sounds) or detecting specific events within the audio.
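The step between the top and middle panels can be sketched in a few lines. In the AST paper, the spectrogram (128 mel frequency bins) is cut into 16×16 patches that overlap by 6 in both directions, and each patch is projected to a 768-dimensional token before entering the Transformer. The code below is a rough sketch under those numbers: the function name `split_into_patches`, the random input, and the random projection matrix are hypothetical stand-ins, not the authors' implementation or trained weights.

```python
import numpy as np

def split_into_patches(spec, patch=16, stride=10):
    """Cut a (freq, time) spectrogram into flattened, overlapping square patches."""
    n_freq, n_time = spec.shape
    patches = [
        spec[f:f + patch, t:t + patch].ravel()
        for f in range(0, n_freq - patch + 1, stride)
        for t in range(0, n_time - patch + 1, stride)
    ]
    return np.stack(patches)  # shape: (n_patches, patch * patch)

rng = np.random.default_rng(0)

# Stand-in for a real log-mel spectrogram: 128 frequency bins x 100 time frames.
spec = rng.standard_normal((128, 100))

# Stride 10 with 16x16 patches gives an overlap of 6 in both dimensions.
patches = split_into_patches(spec)

# Random stand-in for the learned linear projection (hypothetical weights).
projection = rng.standard_normal((16 * 16, 768)) * 0.02
tokens = patches @ projection  # one 768-dimensional token per patch
```

The resulting `tokens` array is the sequence a Transformer encoder would consume. A real model would also add positional embeddings and a classification token, which this sketch leaves out.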

For Our Children ♥

The Listener


