If you are already forcing displayed text into a bounding box somewhere, then it isn't much more of a step to replace the displayed texts with narration texts. With narration texts you can control the character animations yourself as needed. Alternatively, create the entire talk animation as a single animation for each direction & use a Lua script to force the frames, or create multiple animations for each direction & change the character animation index as needed.
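As a rough sketch of the "one animation per direction & force the frames with Lua" idea: the snippet below only works out which frame should be shown. The segment names, the frame ranges & the setFrame callback are all made up for illustration, and the actual engine call that applies the frame is deliberately left out, so treat it as a starting point rather than a working solution.

```lua
-- Hypothetical frame ranges inside one combined talk animation.
-- These numbers are placeholders, not values taken from any real project.
local segments = {
  idle  = { first = 1, last = 1 },
  talk  = { first = 2, last = 5 },
  shrug = { first = 6, last = 9 },
}

-- Return the frame to show for a segment after `tick` steps,
-- looping inside that segment's frame range.
local function frameFor(name, tick)
  local seg = segments[name]
  if not seg then
    error("unknown segment: " .. tostring(name))
  end
  local length = seg.last - seg.first + 1
  return seg.first + (tick % length)
end

-- Hand the chosen frame to a caller-supplied setter, so the
-- engine-specific part stays outside this sketch (assumption:
-- you write setFrame yourself for your own setup).
local function forceSegment(name, tick, setFrame)
  local frame = frameFor(name, tick)
  setFrame(frame)
  return frame
end
```

You would call forceSegment from whatever loop or text event you already use, passing in your own function that applies the frame to the character's active animation.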
To be honest, I'm not sure which method I would use to approach this. Another quick method could be multiple display texts containing the same text but with different pause values; between each text you change the animation index or the character outfit.
P.S: you are right that it is currently impossible to split body from talking animations, but that may change in a future version of vs (fingers crossed).
P.P.S: however, you could actually split it by having 2 characters in the same position. One could contain the head & the other could contain the body. As long as the background size of the sprites & the character animation centers are the same, you could swap the main character's outfit to that of the head & teleport the body character into the position & direction alignment of the main character. Using background texts would allow you to manipulate the secondary character as needed.
P.P.P.S: sorry for the long ass reply. I tend to ramble on when I'm theorizing or explaining various possibilities. But I hope some of what I said will help you find a solution.