Proper lip syncing is not easy to do, especially if you are planning on releasing your game in multiple languages. I think the Hanna-Barbera mouth animation approach is the simplest, as it uses about 7 different mouth shapes rather than one for each distinct sound.
http://sunewatts.dk/lipsync/lipsync/article_02.php
http://www.angryanimator.com/word/2010/11/26/tutorial-3-dialog/
In the game I'm working on we created faux lip syncing, which involved preventing the character's mouth from moving during the parts of the associated audio file where the character wasn't actually talking. Sometimes I forced it to play a specific mouth shape for words that began with certain sounds, but that wasn't too often.
My method involved opening up the speech files in Audacity & then splitting up each recording into chunks based on spoken parts & silence. I then highlighted each of these chunks to get its duration in milliseconds, which I made a note of, before finally getting the duration of the entire audio file. I used both of these values to control the pauses & length of the dialog. I had to create the display texts as background display texts so that I could update the animation frames at the same time. That killed left-click skipping of the texts though, so I ended up wrapping each display text in cutscene action parts so that they could be skipped via the ESC key.
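To give a rough idea of what those notes can turn into, here is a minimal sketch in plain Lua. It is not the code from my game & it doesn't touch any engine functions. The segments table, the durations in it & the function names totalDuration & isTalkingAt are all made up for illustration.

```lua
-- Hypothetical timing data for one audio file, measured by hand in Audacity.
-- Each entry is one chunk: talk = true for speech, false for silence; ms = duration.
local segments = {
  { talk = true,  ms = 1200 },
  { talk = false, ms = 400 },
  { talk = true,  ms = 1600 },
}

-- Sum of all chunk durations; this should match the length of the whole file.
local function totalDuration(chunks)
  local total = 0
  for _, chunk in ipairs(chunks) do
    total = total + chunk.ms
  end
  return total
end

-- Returns true if the mouth should animate at elapsedMs into the audio.
local function isTalkingAt(chunks, elapsedMs)
  local chunkStart = 0
  for _, chunk in ipairs(chunks) do
    local chunkEnd = chunkStart + chunk.ms
    if elapsedMs >= chunkStart and elapsedMs < chunkEnd then
      return chunk.talk
    end
    chunkStart = chunkEnd
  end
  return false -- outside the audio's time range: keep the mouth closed
end
```

Comparing totalDuration(segments) against the full file length you noted down is a cheap way to catch a chunk you measured wrong.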
Sorry for the long post. Anyway... the best approach to what you want to do is to create loads of Lua tables: one per display text for each language. After that you will need a loop of some kind to iterate through the pauses & whatever else is needed based on the active text & language.
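Something along these lines is what I mean by tables plus a loop. Treat it as a sketch only: the text id intro_01, the language keys, the timings & the buildTimeline function are all invented for the example, & how you find out the active text & language depends on your engine.

```lua
-- Hypothetical data: one table per display text, with one entry per language.
local lipsyncData = {
  intro_01 = {
    en = { { talk = true, ms = 900 },  { talk = false, ms = 300 }, { talk = true, ms = 1100 } },
    de = { { talk = true, ms = 1400 }, { talk = false, ms = 250 }, { talk = true, ms = 800 } },
  },
}

-- Turns the chunk list for a text/language pair into absolute start/stop times.
-- Returns nil if there is no data for that combination.
local function buildTimeline(data, textId, language)
  local perText = data[textId]
  local chunks = perText and perText[language]
  if not chunks then
    return nil
  end

  local timeline, clock = {}, 0
  for _, chunk in ipairs(chunks) do
    timeline[#timeline + 1] = {
      start = clock,
      stop = clock + chunk.ms,
      talk = chunk.talk,
    }
    clock = clock + chunk.ms
  end
  return timeline, clock -- clock now holds the total duration in ms
end
```

You would call it like local timeline, total = buildTimeline(lipsyncData, "intro_01", "de") & then walk through timeline to start & stop the mouth animation. Returning nil for a missing language lets you fall back to a plain talk animation instead of erroring.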
I'm not really sure how to tell you how to go about it. Daedalic used a similar method to what I just mentioned, but it's pretty complicated. They also (apparently) have a tool for calculating the mouth shapes based on the audio file. SimonS gave me a little tool for doing the same thing a little while back, but I found that it was often very inaccurate due to recording quality & the accent / tone of the person doing the talking.