How modern audio models beat the "short window can't see far" problem: two levels. A high level sketches a coarse plan (which register the tune sits in, cell by cell), then a low level fills in the actual notes, conditioned on that plan. The note context is as short as a flat model's, but now there is a long-range arc to follow. Think of it as a rough sketch of the melody first, then the rendering.
High level — the plan (generated first): register per cell
How it works. Two count tables. The high level is a Markov chain over a coarse token: each cell of 4 notes is labelled by its register band (low / mid / high), and we learn how bands tend to follow one another (rise, fall, arch). That band sequence is generated first, as the skeleton. The low level is a note-level Markov chain conditioned on the current band, P(next note | prev note, target band), so it picks notes that gravitate to the planned register. Flip to flat and the low level ignores the plan: the result still sounds locally fine, but the long-range shape collapses (watch "notes on-plan" fall toward chance).
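A minimal sketch of the two count tables described above, not the demo's actual code. Everything named here is an assumption made for illustration: the MIDI cut points in band_of, labelling a cell by the band of its mean pitch, the back-off used when a (prev note, band) context was never seen, and the toy training melody. The flat option pools counts over all bands, so the low level stops listening to the plan.

```python
import random
from collections import defaultdict

CELL = 4  # notes per cell, as in the text

def band_of(pitch, low_cut=60, high_cut=72):
    """Register band of a MIDI pitch. The cut points are assumptions."""
    return "low" if pitch < low_cut else "mid" if pitch < high_cut else "high"

def cell_bands(notes):
    """Label each full cell by the band of its mean pitch (an assumed rule)."""
    return [band_of(sum(notes[i:i + CELL]) / CELL)
            for i in range(0, len(notes) - CELL + 1, CELL)]

def train(notes):
    """Build the two count tables from one training melody."""
    bands = cell_bands(notes)
    high = defaultdict(lambda: defaultdict(int))  # high[prev_band][next_band]
    low = defaultdict(lambda: defaultdict(int))   # low[(prev_note, band)][next_note]
    for a, b in zip(bands, bands[1:]):
        high[a][b] += 1
    for i in range(1, len(bands) * CELL):
        low[(notes[i - 1], bands[i // CELL])][notes[i]] += 1
    return high, low

def sample(counts, rng):
    """Draw one key with probability proportional to its count."""
    return rng.choices(list(counts), weights=list(counts.values()))[0]

def merged(low, keep):
    """Sum next-note counts over every (prev_note, band) context that keep accepts."""
    pool = defaultdict(int)
    for ctx, nxt in low.items():
        if keep(ctx):
            for note, count in nxt.items():
                pool[note] += count
    return pool

def generate(high, low, n_cells, rng, start_band, start_note, flat=False):
    """Generate the band plan first, then fill in notes conditioned on it."""
    plan = [start_band]
    while len(plan) < n_cells:
        nxt = high.get(plan[-1])
        plan.append(sample(nxt, rng) if nxt else plan[-1])
    notes = [start_note]
    for i in range(1, n_cells * CELL):
        prev, band = notes[-1], plan[i // CELL]
        if flat:
            # Flat mode: ignore the plan, condition on the previous note only.
            table = merged(low, lambda ctx: ctx[0] == prev)
        else:
            # Assumed back-off: any note seen under this band if the context is new.
            table = low.get((prev, band)) or merged(low, lambda ctx: ctx[1] == band)
        notes.append(sample(table, rng) if table else prev)
    return plan, notes

def on_plan(notes, plan):
    """Fraction of notes whose own band matches the planned band of their cell."""
    return sum(band_of(n) == plan[i // CELL] for i, n in enumerate(notes)) / len(notes)

# Toy arch-shaped melody (made up for this sketch): low, mid, high, mid, repeated.
melody = [48, 50, 52, 55, 60, 62, 64, 67, 72, 74, 76, 79, 60, 62, 64, 67] * 4
high, low = train(melody)
for flat in (False, True):
    plan, notes = generate(high, low, 8, random.Random(0), "low", 48, flat=flat)
    print("flat" if flat else "hierarchical", plan, round(on_plan(notes, plan), 2))
```

Both runs use the same note counts and the same one-note context; the only difference is whether the planned band is part of the lookup key, which is the whole point of the two-level design.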
Like a rough sketch. Exactly your art analogy: block in the gesture and composition first, then render detail into it. You never start from the top-left pixel.
Tie to Neuron Lab & the big models. This is the skeleton of SampleRNN (a slow tier sets context for a fast tier), Jukebox (coarse top-level codes refined into finer ones) and MusicLM (coarse "semantic" tokens → fine "acoustic" tokens). The high level carries the long-range structure, so the low level only needs a short window. It is the same fix that lets the maze agent's history vector summarise a whole path in one key. Two takes on notes already live next door: Markov Melody (the flat version) and VSA Melody.