The purpose of this device is to give language a context, and to provide a tool for the hard of hearing. It may also be a tool for speech therapy and for the understanding of accents.
This project has two main components.
Firstly, the output box would have a number of tactile outputs, such as piezoelectric elements, each representing either a consonant or a vowel sound, and taking an input from the translation box.
Secondly, and most challenging, is the translation box, which will probably be Raspberry Pi based or similar. The main purpose of this box is to convert real-time speech into phoneme values that can be output at varying intensities. A secondary function would be to read a script file (which could be streaming at a venue, or output akin to an audiobook).
I thought I would try some manual transliteration while waiting on some components. I'll try syncing this with audio/video later. It is based on my Welsh/British accent.
All the world's a stage, And all the men and women merely players;
AO L SIL TH ER SIL W ER L D S SIL AE SIL S T EY JH SIL
AE N D SIL AO L SIL TH ER SIL M EH N SIL AE N D SIL W IH M EH N SIL M IY AE L IY SIL P L EY Y ER S SIL
Also, there are 39 phonemes (plus a silence), so my design will increase the outputs from 32 to 39.
I have also noticed that these are always sequential (one at a time), so each can be represented in 6 bits rather than the previous 32. For practical reasons each phoneme should be represented as 8 bits, or 2 hex characters.
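As a quick check of the 6-bit figure, here is a minimal sketch. The constant and function names are my own, chosen for illustration only.

```python
# 39 phonemes plus one silence symbol gives 40 distinct codes (0..39).
PHONEME_COUNT = 39
SYMBOL_COUNT = PHONEME_COUNT + 1

# Bits needed to hold the largest code value (39 = 0b100111).
bits_needed = (SYMBOL_COUNT - 1).bit_length()


def as_byte(code):
    """Show a code padded to 8 bits and as 2 hex characters."""
    return f"{code:08b}", f"{code:02X}"


print(bits_needed)    # 6
print(as_byte(39))    # ('00100111', '27')
```

Padding to 8 bits leaves the top two bits of each byte unused, which might later carry something like intensity, though that is only a thought and not part of the design so far.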
Examples of the decoding, taken from CMUSphinx, are laid out below, with hex values added.
HEX CMUBET IPA Example Translation
-- ------ --- ------- -----------
00 SIL . silence ...word-end [ SIL ] word-start...
01 AA ɑ odd AA D
02 AE æ at AE T
03 AH ʌ hut HH AH T
04 AO ɔ ought AO T
05 AW ɑʊ cow K AW
06 AY ɑɪ hide HH AY D
07 B b be B IY
08 CH ʧ cheese CH IY Z
09 D d dee D IY
0A DH ð thee DH IY
0B EH ɛ Ed EH D
0C ER ɜɹ hurt HH ER T
0D EY eɪ ate EY T
0E F f fee F IY
0F G ɡ green G R IY N
10 HH h he HH IY
11 IH ɪ it IH T
12 IY iː eat IY T
13 JH ʤ gee JH IY
14 K k key K IY
15 L l lee L IY
16 M m me M IY
17 N n knee N IY
18 NG ŋ ping P IH NG
19 OW oʊ oat OW T
1A OY ɔɪ toy T OY
1B P p pee P IY
1C R ɹ read R IY D
1D S s sea S IY
1E SH ʃ she SH IY
1F T t tea T IY
20 TH θ theta TH EY T AH
21 UH ʊ hood HH UH D
22 UW u two T UW
23 V v vee V IY
24 W w we W IY
25 Y j yield Y IY L D
26 Z z zee Z IY
27 ZH ʒ seizure S IY ZH ER
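Because the hex values simply follow the table order, the lookup can be built from a single list. The sketch below is one way the translation box might turn a transliterated line into bytes. The names CMUBET, CODE, encode and decode are my own and not part of CMUSphinx.

```python
# Symbols in the same order as the table above, so the list index is the hex value.
CMUBET = (
    "SIL AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH "
    "K L M N NG OW OY P R S SH T TH UH UW V W Y Z ZH"
).split()

CODE = {symbol: index for index, symbol in enumerate(CMUBET)}


def encode(transcript):
    """Turn a space-separated transliteration into one byte per phoneme."""
    return bytes(CODE[symbol] for symbol in transcript.split())


def decode(data):
    """Turn the bytes back into a space-separated transliteration."""
    return " ".join(CMUBET[value] for value in data)


print(encode("AO L SIL TH ER").hex())       # 041500200c
print(decode(bytes([0x10, 0x03, 0x1F])))    # HH AH T
```

An unknown symbol raises a KeyError here. A real script reader would probably want to report the offending line instead.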
My concept is to break language into 4 groups of 8 phonemes. Additional sounds would be created by combining these like you would colours. For example, the word 'of' would begin with the 'o' sound, and 'us' would begin with the 'u' sound, but the vowels in the word 'boot' would trigger both the 'o' and 'u'.
My example of how the channels are broken down.
Channel                1            2          3  4   5  6  7  8
---------------------  -----------  ---------  -  --  -  -  -  -
Soft consonants Quiet  C/S(+ch/sh)  F/Ph/Gh/J  H  Th  W  Y  *  *
Soft consonants Loud   L            M          N  R   V  X  Z  *
Hard consonants        B            C/K/Q      D  G   J  P  T  *
Vowels                 A            E          I  O   U  *  *  *
* denotes spare channels, which would probably be assigned later, or used as a modifier like an accent.
The output box would be fed via a serial stream at a suitable speed, to be determined by experiment. As fast speech is about 10 syllables per second, a sample rate of 50 phonemes per second should be the bare minimum.
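To get a feel for the serial speed, here is a rough calculation. It assumes an ordinary UART link framed as 8N1 (1 start bit, 8 data bits, 1 stop bit, so 10 bits on the wire per byte). That framing is my assumption, as the link type has not been decided. The function name is also just for illustration.

```python
SAMPLES_PER_SECOND = 50    # bare minimum sample rate from the text
WIRE_BITS_PER_BYTE = 10    # assumption: UART 8N1 framing


def line_rate(samples_per_second, payload_bytes):
    """Bits per second on the wire for a given payload size per sample."""
    return samples_per_second * payload_bytes * WIRE_BITS_PER_BYTE


print(line_rate(SAMPLES_PER_SECOND, 4))    # 2000, for a 32-bit sample
print(line_rate(SAMPLES_PER_SECOND, 1))    # 500, for a single 8-bit code
```

Either figure sits well under a common 9600 baud setting, so there should be room to raise the sample rate during experiments.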
If the signal runs as 1 bit per phoneme channel, then each sample would be 32 bits uncompressed, with each phoneme type (a group of 8 channels) being represented by 2 hexadecimal digits.
In this instance the word 'wigwam' would be:
00001000 00000000 00000000 00000000 #08000000 W
00000000 00000000 00000000 00100000 #00000020 "I"
00000000 00000000 00010000 00000000 #00001000 "G"
00001000 00000000 00000000 00000000 #08000000 W
00000000 00000000 00000000 10000000 #00000080 "A"
00000000 01000000 00000000 00000000 #00400000 "M"
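The same layout can be generated rather than written by hand. This is a minimal sketch assuming channel 1 is the leftmost bit of each byte and the groups run soft quiet, soft loud, hard, vowels from left to right. The short channel labels (for example "S" for C/S and "K" for C/K/Q) are placeholders of my own.

```python
# One row per group, left byte first; None marks a spare channel.
GROUPS = [
    ["S", "F", "H", "TH", "W", "Y", None, None],    # soft consonants, quiet
    ["L", "M", "N", "R", "V", "X", "Z", None],      # soft consonants, loud
    ["B", "K", "D", "G", "J", "P", "T", None],      # hard consonants
    ["A", "E", "I", "O", "U", None, None, None],    # vowels
]

# Map each label to its single bit within the 32-bit sample.
CHANNEL_BIT = {}
for g, row in enumerate(GROUPS):
    for c, label in enumerate(row):
        if label is not None:
            shift = (len(GROUPS) - 1 - g) * 8 + (7 - c)
            CHANNEL_BIT[label] = 1 << shift


def frame(*labels):
    """Combine one or more channels into a 32-bit sample by OR-ing their bits."""
    value = 0
    for label in labels:
        value |= CHANNEL_BIT[label]
    return value


def show(value):
    """Format a sample as four binary bytes followed by its hex value."""
    raw = f"{value:032b}"
    grouped = " ".join(raw[i:i + 8] for i in range(0, 32, 8))
    return f"{grouped} #{value:08X}"


for label in ["W", "I", "G", "W", "A", "M"]:
    print(show(frame(label)), label)

# Mixing channels like colours: 'boot' triggers both O and U.
print(show(frame("O", "U")))    # 00000000 00000000 00000000 00011000 #00000018
```

Because each channel owns one bit, combining sounds is just a bitwise OR, which is what makes the colour-mixing idea cheap to implement.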
As a scripted file, in the manner of standard subtitle files, each line starts as follows...
hh:mm:ss.mms
where hh is hours, mm is minutes, ss is seconds and mms is milliseconds.
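Reading the timestamp back could look like the sketch below. Only the hh:mm:ss.mms prefix comes from the format above. What follows it on the line is not yet defined, so the function hands the remainder back untouched, and the phoneme text in the usage line is purely a made-up example.

```python
import re

# Two digits each for hours, minutes, seconds, then three for milliseconds.
TIMESTAMP = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d{3})")


def split_timestamp(line):
    """Return (milliseconds from start, rest of line) for one script line."""
    match = TIMESTAMP.match(line)
    if match is None:
        raise ValueError(f"line does not start with hh:mm:ss.mms: {line!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    total_ms = ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis
    return total_ms, line[match.end():].strip()


print(split_timestamp("00:01:02.500 HH AH T"))    # (62500, 'HH AH T')
```

Working in whole milliseconds keeps the timing as integers, which should make it easier to schedule outputs against a clock on the Pi.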