There’s some talk and images circulating at the moment about the possibility that members of the Emerald viewer developer team may have gained administrative access to Second Life’s Vivox voice system through an exploit.




Well, another session with Snowglobe, and yet another uncontrolled network flood. This time, though, I was able to isolate the traffic that was flooding the connection from the other hurly-burly that happens on the wire during routine operation and SL sessions.

It was Vivox. 400 megabytes of traffic blown through in minutes. And I’m a non-voice user, on a non-voice-enabled parcel. Since I’m on a capped connection (as most users outside North America are), this was naturally of great concern to me. On the highest bandwidth plan available to me, I get only about 600MB of peak daytime usage, which must then be divided among the whole family.


May 20 2009

With an average of one billion minutes of talk-time per month pumping through Second Life’s voice system (which is essentially Vivox’s system rather than Linden Lab’s: Vivox makes it, Linden Lab brands it), Crap Mariner takes a look at what people are really wagging their tongues about.


Ever been really hacked off by those machines that put sentences together by splicing words? Telephone recordings, elevators, that sort of thing? Even if you’ve got a good ear, they’re generally really annoying to listen to.

The solution is also very simple, yet nobody ever seems to do it.

There are two major ways of stressing a word in normal speech. Say “nine seven nine”.

No, really. Give it a try.

See how the first nine sounds different from the last nine? The first nine and the seven use ongoing stress emphasis. That’s the way we say a word when another word is going to follow it. The last nine uses a different emphasis, because you’re going to stop speaking. It’s how we sound words that occur at the end of sentences, or when we’re otherwise done speaking. So there are two ways to say a word: Regular and final.

Those devices that speak by splicing together words always use only finals. Essentially, the voice actor spoke each word as a standalone (final) word; those recordings were chopped up and stored for the software to replay.

Wrong!

To make it sound right, and more natural, you record the speaker using regular emphasis as well.

Get them to repeat the word several times as a sentence, and chop out one from the middle that you like. Then take the last repetition and keep that as a final. That gives you two sound banks of recorded words: one set of regulars, and one set of finals. Then it’s just a matter of setting up your data tables so that each sentence selects a regular for every word except the last, which gets its sound from the list of finals.
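To make that selection step concrete, here is a minimal Python sketch of the two-bank lookup. The bank layout, file names, and WAV handling are illustrative assumptions on my part, not any particular vendor’s format.

    import wave

    # Hypothetical sound banks: one clip per word, recorded with
    # mid-sentence ("regular") and end-of-sentence ("final") emphasis.
    REGULAR = {"nine": "regular/nine.wav", "seven": "regular/seven.wav"}
    FINAL = {"nine": "final/nine.wav", "seven": "final/seven.wav"}

    def clips_for(words):
        # A regular clip for every word except the last, which gets a final.
        return [REGULAR[w] for w in words[:-1]] + [FINAL[words[-1]]]

    def splice(words, out_path="spoken.wav"):
        # Concatenate the chosen clips into one utterance. All clips are
        # assumed to share sample rate, channel count, and sample width.
        frames, params = [], None
        for path in clips_for(words):
            with wave.open(path, "rb") as clip:
                params = params or clip.getparams()
                frames.append(clip.readframes(clip.getnframes()))
        with wave.open(out_path, "wb") as out:
            out.setparams(params)
            for chunk in frames:
                out.writeframes(chunk)

    splice(["nine", "seven", "nine"])  # the last "nine" uses the final-emphasis clip

The point is that the table lookup, not any audio processing, does the work: the word’s position in the sentence alone decides which bank its clip comes from.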

It sounds so much more natural, is easier on the ear, and requires less concentration to understand. It’s also quite simple, doesn’t add much to the time with your voice actor, and only requires double the storage (and in many of these systems, the storage is vastly underutilized).

So why does nobody ever actually do this?