-
Downsampling the Audio to 8 Ksps
05/18/2020 at 17:24
Summary
It occurred to me that the original 11025 sps files probably represent a higher-than-needed sampling rate. I downsampled to 8 ksps and gained back about 15 K of flash, along with some relaxed execution timing requirements.
Deets
It had bugged me for a bit that the sampling rate of the existing audio is 11025 sps. That is quite appropriate for the SP0256 in general, which is advertised as having a 5 KHz output bandwidth; however, on the -AL2 I find the signal uses less than that.
I discovered that an audio processing application I use has a 'batch convert' mode, so I batch converted all the samples to an 8 KHz sampling rate (appropriate anti-aliasing filtering is done beforehand). The app I am using is 'Cool Edit', which is long since discontinued (and has a kind of interesting back story), but I suspect that free alternatives such as Audacity would be similarly useful.
This reduced the audio size by about 25%, and it seems to sound more-or-less the same. I modded the Python PoC to re-compute the ADPCM version and added that to the project. The sample rate timer, TIM4, needed to be adjusted. Everything else worked as-is.
This allowed me to gain back 15 K of flash, and also relaxed the responsiveness requirements for preparing sample buffers (since they now last 64 ms, up from 46). I wasn't in a crisis in either of those two areas, but it's still nice to have a little more wiggle room.
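Adjusting TIM4 for the new rate is just arithmetic on the timer clock. A minimal sketch of the kind of change involved (the prescaler/period split here is my assumption; only the 8 kHz target comes from the project):

//TIM4 update rate = timer_clock / ((PSC + 1) * (ARR + 1))
//with a 72 MHz timer clock: 72 MHz / (1 * 9000) = 8000 Hz,
//versus 72 MHz / (1 * 6531) ~= 11024 Hz previously (approximate)
htim4.Init.Prescaler = 0;
htim4.Init.Period = 9000 - 1;    //8 ksps sample pacing
if ( HAL_OK != HAL_TIM_Base_Init ( &htim4 ) )
{
    Error_Handler();
}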
Next
Who knows? Is it done, now?
-
Using DMA for the PWM
05/17/2020 at 17:50
Summary
As an improvement, I switch to using DMA for driving the sample data output instead of code in an ISR.
Deets
The system worked and seemed responsive on the monitor, but it tweaked my spidey sense to have so much stuff going on in the ISR. The ISR gets triggered every sample -- 11025 times per second -- and each time context has to be saved/restored, and code run. The STM32F103 has a lovely DMA unit, so I might as well make use of it!
Initial Tests
Getting DMA running is a multi-step process of fiddling with stuff in CubeMX (in this case, setting the DMA tab in the TIM4 resource, configuring it as memory-to-peripheral, setting the input size to 'byte' and the output size to 'half word', and setting the memory address to increment, but not the peripheral address). This is all obscure because the documentation for CubeMX leaves a lot to be desired, so a bunch of experimentation had to be done to figure out the appropriate recipe.
After that, the right code incantations need to be spoken. Much as with CubeMX, the HAL library's documentation leaves a lot to be desired -- it has a 'doxygen' taste to it, so I find that it's more or less fruitless to go to the documentation, and rather I wind up reverse-engineering the source to see what the methods actually do. Really, I think that just programming at the register level and commenting the code would be clearer in almost all cases /except/ for USB, which is a beast. It definitely would cut down on the flash and RAM usage. Anyway, at length I found the relevant incantations, which in general are 'turn on the timer 4 clock', then 'set up the DMA controller to transfer', then 'set timer 4 to trigger the DMA' (this starts output immediately); a minimal sketch of these calls appears after the list below. There are several interrupts from the DMA that you need to handle:
- transfer complete -- you'll want to send another buffer
- half-transfer complete -- this is your heads-up to start preparing another buffer so you'll be ready to go when 'transfer complete' comes in
- transfer aborted -- if you care
- error -- if you care
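For reference, a minimal sketch of those incantations, assuming CubeMX generated a DMA handle named hdma_tim4_up and that the PWM compare register being fed is TIM3 channel 3 (those names, the sample buffer, and its length are my assumptions here):

//1) the TIM4 clock is already enabled by the CubeMX-generated init code
//2) set up the DMA controller: memory (8-bit samples) to peripheral (16-bit CCR)
HAL_DMA_Start_IT ( &hdma_tim4_up,
                   (uint32_t)g_abySampleBuf,     //source: PCM sample buffer
                   (uint32_t)&TIM3->CCR3,        //destination: PWM duty register
                   SAMPLE_BUF_LEN );
//3) let TIM4's update event trigger the DMA; transfers begin immediately
__HAL_TIM_ENABLE_DMA ( &htim4, TIM_DMA_UPDATE );
HAL_TIM_Base_Start ( &htim4 );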
The DMA has a nifty 'circular' mode that seems like it should be useful, but I can't figure out how I can actually make use of it in this project.
I did a trivial test of using DMA with the PCM test data: start a transfer, then start another in the 'transfer complete' event. I used the PCM 'hello' sample set from before, and it worked as expected.
On to Implementation...
My existing interrupt-driven design was based on receiving a 'half-complete' event that alerted the producing task to prepare additional buffer loads. This worked out neatly because the DMA similarly generates a 'half-complete' event and a 'fully complete' event. Most of the code from the timer ISR was reused/restructured into the 'fully complete' event, and the 'half-complete' simply sent a task notification to the SP0256 task. Actually, a bunch of code was deleted, since we didn't have to clock out individual samples -- we just needed to start a new transfer when the present one completed. The only addition was to 'kick start' the first transfer. This wasn't needed in the previous design because it was effectively always polling in the ISR as to whether there was a buffer to transfer at all. In this case, once it's done it's done, so you have to explicitly start the first transfer to kick off the process. So long as you keep producing buffers in a timely manner, the process is continuous.
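A rough sketch of how the two DMA events map onto the task notification scheme; the callbacks are hung off the DMA handle's function pointers before starting the transfer. The notification bit names, the task handle, and the helper are my own illustrations, not the project's actual identifiers:

#include "stm32f1xx_hal.h"
#include "FreeRTOS.h"
#include "task.h"

extern TaskHandle_t g_thSP0256;                  //assumed handle to the SP0256 task
void _startNextTransferIfAvailable ( void );     //hypothetical: chains to the 'next' buffer
//TNB_SAMPLES_HALF / TNB_SAMPLES_DONE: notification bits in the style of task_notification_bits.h

//registered before starting, e.g.:
//  hdma_tim4_up.XferHalfCpltCallback = _dmaHalfCplt;
//  hdma_tim4_up.XferCpltCallback = _dmaCplt;

static void _dmaHalfCplt ( DMA_HandleTypeDef* hdma )
{
    //heads-up: a good time to start preparing the 'next' buffer
    BaseType_t bHPTW = pdFALSE;
    xTaskNotifyFromISR ( g_thSP0256, TNB_SAMPLES_HALF, eSetBits, &bHPTW );
    portYIELD_FROM_ISR ( bHPTW );
}

static void _dmaCplt ( DMA_HandleTypeDef* hdma )
{
    //the 'active' buffer is spent; immediately chain to the 'next' one
    //(this is where most of the old TIM4 ISR logic ended up)
    _startNextTransferIfAvailable();
    BaseType_t bHPTW = pdFALSE;
    xTaskNotifyFromISR ( g_thSP0256, TNB_SAMPLES_DONE, eSetBits, &bHPTW );
    portYIELD_FROM_ISR ( bHPTW );
}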
Most of the time was spent figuring out how to use the DMA peripheral appropriately (the initial test), and the porting of the actual design fortunately went rather quickly, since I had done the interrupt-driven design in a sort of DMA-friendly way already. The build is bigger; thanks, HAL:
arm-none-eabi-size "BluePillSP0256AL2.elf"
   text    data     bss     dec     hex filename
 113328    2024   16496  131848   20308 BluePillSP0256AL2.elf
But I've still got a little more code space for something....
Next
One day I will implement the streaming of phoneme data/text directly into this without using the command line. Maybe tomorrow is that day?
-
Setting Up PWM Output
05/16/2020 at 21:50
Summary
For the first step of attempting to simulate the SP0256-AL2 (i.e., no physical chip needed), I wanted to test the PWM output by sending pre-recorded data. For the second step, I wanted to full-on simulate the physical chip.
Deets
First, I did a quicky experiment to prove that the hardware was configured correctly. This involved setting up PWM on TIM3, and using TIM4 to pace the samples out via interrupt. I set up TIM3 to have a similar frequency to the SP0256's (about 45 KHz), and 8-bit resolution (which matched my audio files). As such, I should be able to use the existing output filter from the SP0256 and just move a jumper between the two. I did an initial test with my scope both to verify that PWM was working and that TIM4 was interrupting at the expected rate.
I modded my Python prototype to do a couple of short text-to-speech conversions once again, but this time emitting a C source file with the data in both PCM format and ADPCM format. These were very short recordings of just the word "Hello", so I didn't really have to worry about flash space. I tested the PCM first, since that was a no-brainer; e.g. in my HAL_TIM_PeriodElapsedCallback() I just added the clause
/* USER CODE BEGIN Callback 1 */
else if (htim->Instance == TIM4)
{
    //quicky sample PCM test
    extern const uint8_t g_abyPCM[];    //len = 8027
    static int sl_nIdx = 0;
    __HAL_TIM_SET_COMPARE(&htim3, TIM_CHANNEL_3, g_abyPCM[sl_nIdx]);
    ++sl_nIdx;
    if ( 8027 == sl_nIdx )
    {    //loopy-doopy
        sl_nIdx = 0;
    }
}
/* USER CODE END Callback 1 */
so, it just outputs the current sample via PWM (__HAL_TIM_SET_COMPARE), increments the sample index, and wraps around at the end. This worked fine, so now it was time to try out the ADPCM approach.
First, I ported the ADPCM decoder routine and set it up to do the same thing with the ADPCM data. This was slightly more complicated: the ADPCM is by nybble rather than byte, and the algorithm is designed for signed 16-bit samples. But it wasn't too bad:
/* USER CODE BEGIN Callback 1 */
else if (htim->Instance == TIM4)
{
    //quicky sample ADPCM test
    struct ADPCMstate
    {
        int prevsample;
        int previndex;
    };
    int adpcm_decode_sample ( int code, struct ADPCMstate* state );
    //len = 4014, origlen = 8027
    extern const uint8_t g_abyADPCM[];
    static struct ADPCMstate state = { 0, 0 };
    static int sl_nIdxNyb = 0;
    int code = g_abyADPCM[sl_nIdxNyb>>1];
    if ( sl_nIdxNyb & 1 )
        code &= 0x0f;
    else    //upper nybble first
        code >>= 4;
    int samp = adpcm_decode_sample ( code, &state );
    uint8_t samp8 = (uint8_t) ( ( samp / 256 ) + 128 );
    __HAL_TIM_SET_COMPARE(&htim3, TIM_CHANNEL_3, samp8);
    ++sl_nIdxNyb;
    if ( 8027 == sl_nIdxNyb )
    {    //loopy-doopy
        sl_nIdxNyb = 0;
        state.prevsample = 0;
        state.previndex = 0;
    }
}
/* USER CODE END Callback 1 */
The sample data was half the size of the PCM (about 4 KiB), and the binary size increase was about 5 KiB over the baseline, so I infer that the ADPCM decoder incurs about 1 KiB of code. I can work with that.
Now it's time to get the hands really dirty, and emulate the SP0256-AL2. This was actually a bit of work -- both in coding and debugging.
The first thing I did was create a notion of 'mode' for the speech processor. There are two modes: 'physical' and 'simulated'. The existing task_sp0256 had a tight coupling with the hardware, but really there were just four points of contact: resetting the synth, strobing in data, determining if 'Load Request' (nLRQ) is asserted, and being notified when nLRQ transitions from a negated to an asserted state. I factored the code for the first three cases into a generic method that does one thing or the other based on the current mode, and the last case was inbound and already decoupled from hardware specifics. Now it was time to put flesh on the bones.
I declared another circular buffer to represent the fifo on the chip. Strictly, I don't think the chip actually has a fifo -- I think it just has one incoming byte that it can process, but it does offload that processing and allows accepting another phoneme before the first has completed processing, so it's vaguely like a two-phoneme-deep fifo. I had already coded to the assumption that there was a fifo, so it was straightforward for me to implement one with an arbitrary depth of 4. I don't know why I chose that, and really the overhead of the circular buffer structure greatly dwarfs the buffer size, so maybe I should make it bigger. I didn't want to make it too big, because in a way it is redundant relative to the existing buffering; however, I did want a fifo specific to the emulation, because otherwise there would be coupling that would complicate the existing physical chip management. I have RAM to spare, so there went 20 or so bytes -- I suspect a reasonable sacrifice for the sake of maintainability.
Next I needed to get the PWM outputting data. As per usual, I prefer to do less work in an ISR than more, and moreover in FreeRTOS there are complications when using synchronization objects from ISRs. I decided to use a two-buffer approach, where there is one buffer that is 'active' (i.e. the ISR is clocking samples out of it), and a 'next' buffer that is available to be filled at one's leisure. These buffers are PCM samples -- not ADPCM -- so the ISR really is just plucking values from the 'active' buffer and pushing them into the PWM, and then switching over to the 'next' buffer when it runs out. This did still require a couple of shared state variables between the 'task' (user-mode thread) and the ISR (supervisor-mode code). In this case, I chose to use 'critical sections' to make the access to the relevant data atomic. These (OK, they're not a distinct kernel object in FreeRTOS) are much lighter weight but much more heavy-handed than semaphores or whatnot -- basically they are a 'disable all interrupts' operation. Since I lock only around a couple of state variables so as to read-modify-update (with a little logic in the modify), it seemed like a reasonable approach.
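A minimal sketch of the sort of critical-section-protected state I'm talking about, using the FreeRTOS critical section macros (the variable and function names here are illustrative only):

#include <stddef.h>
#include <stdint.h>
#include "FreeRTOS.h"
#include "task.h"

static const uint8_t* volatile g_pbyNext;    //buffer staged by the task
static volatile int g_nNextReady;

//task side: stage a freshly filled buffer for the ISR to pick up
void stageNextBuffer ( const uint8_t* pbyBuf )
{
    taskENTER_CRITICAL();        //briefly mask interrupts
    g_pbyNext = pbyBuf;
    g_nNextReady = 1;
    taskEXIT_CRITICAL();
}

//ISR side: swap to the staged buffer when the active one runs out;
//returns the new buffer, or NULL if nothing was ready
const uint8_t* takeNextBufferFromISR ( void )
{
    UBaseType_t uxSaved = taskENTER_CRITICAL_FROM_ISR();
    const uint8_t* pbyRet = NULL;
    if ( g_nNextReady )
    {
        pbyRet = g_pbyNext;
        g_nNextReady = 0;
    }
    taskEXIT_CRITICAL_FROM_ISR ( uxSaved );
    return pbyRet;
}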
The design is still 'Hail Mary' with respect to being able to reliably produce pending buffers of samples such that the sample driving ISR is always fed so long as there are things to feed it. I think this is appropriate because if you fundamentally cannot provide buffers fast enough, then the system is a fail overall, anyway. However, you can amortize the cost of production on the part of the producer over time. So I added a couple task notifications: one for 'half-way done' and one for 'completed'. I didn't wind up using the latter for anything -- by the time you get that you are already too late. But the 'half-way done' notification is a useful heads-up that now would be a good time to produce another buffer, if possible, to have at the ready for the sample driver to automatically switch to when the current (active) buffer is completed.
An additional complication is that there are two buffers of samples, but samples are not bounded by phonemes. So the goal is to fill the sample buffer as much as possible, possibly (probably as it turns out) not completing a phoneme, and then later picking up where one left off in a particular phoneme before moving on to the start of another phoneme. So there's a little bit of state that keeps track of where the buffer filling routine left off in a previous call. It also has the logic to try to pluck more phonemes from the fifo to carry on filling the buffer to the maximum extent possible.
Because this device is strapped for RAM, I have provided for just two sample buffers of 512 bytes each. This means that at the present 11025 sps rate they hold just over 46 ms worth of samples, and since there is a 'half complete' notification, that is about a 23 ms heads-up that you need to get a buffer ready. I did a rudimentary profile of the '_prepareNextBufferLoad()' routine using the Cortex Debug unit's CPU cycle counter, DWT->CYCCNT, and found that it took 112,263 cycles to fill a 512-byte buffer. Running at 72 MHz, that translates to about 1.6 ms, and means about 220 clocks per sample. This is doing all the ADPCM stuff. I did repeat the test with all the various optimization options (except -O0, since that was too big for flash) and got very similar results. (Since this is a live system, there is noise in this test, because background interrupts would still be happening.) Either way, this seems more than fast enough to keep up. I can't remember the scheduling quantum on FreeRTOS, but I think it is something like 1 ms, so the task should definitely have an adequate opportunity to get notified of the need for a new bufferload, with also enough headroom to actually produce it. Mostly for my own comfort, I decided to raise the priority of the SP0256 task to 'high' so that it will take precedence over the monitor task, though I don't think this is really necessary.
I got all that stuff wired in (and debugged! lol) and tested. Now I can do phoneme and text-to-speech directly from the BluePill -- no physical SP0256-AL2 chip required! It uses the same LPF that the SP0256 uses, namely two 33 K resistors and two 0.022 uF caps. That output then goes through a DC blocking cap of 1 uF into a LM386 amp. I already had this on my breadboard, so it was just moving a wire from the SP0256 to the PB0 pin.
All built (in debug) results in:
arm-none-eabi-size "BluePillSP0256AL2.elf"
   text    data     bss     dec     hex filename
 112092    2012   16424  130528   1fde0 BluePillSP0256AL2.elf
So there is still some 18 KB flash left for improvements.
Next
Improvements
-
Compressing the Phoneme Data with ADPCM
05/15/2020 at 20:04
Summary
The phoneme data in its raw form is about 116 KB. This doesn't leave enough room for code, even on a 128 KB BluePill. ADPCM should reduce that size by half, but will it sound good enough? Back to Python land to prototype and find out.
Deets
As a stretch goal, I wanted to see if I could have the BluePill function as the SP0256-AL2 itself, without requiring the physical part. Way back, I had proven that concatenation of pre-recorded phonemes was a viable approximation to the actual digital filter originally used, but the recordings are too big for flash. Even with a 128 KiB BluePill, the inclusion of the uncompressed data would leave only 12 KiB of flash for everything else, which would not be enough for what we have now. So some sort of compression would be useful. But before that, you will need to have a 128 KiB BluePill. Do you have one? Almost certainly you do. Some background...
Getting 128 KiB Flash on the STM32F103C8
The 'C8 is spec'd as having 64 KiB flash, and this is what it reports over the debug (SWD/JTAG) interface. However, it is an open secret that the device actually has 128 KiB. I have never heard a definitive answer as to why this is, but it seems plausible that it was simply a marketing decision on ST's part to offer the lower-capacity device, but scrimp on wafer NRE and merely burn a fuse that causes it to report one way or the other. Maybe at one time there were actual 64 KiB devices -- it's a rather old part number -- but I've never seen one in life.
Also, one should be cautioned that there are counterfeits out there. China has a different viewpoint on trademarks and part numbers, and generally is of the opinion that if something behaves the same as a part number XXX, then you get to market it as a part number XXX. To a degree, I can appreciate this viewpoint regarding part numbers, however /marking/ the chip in a way that visually causes brand confusion strikes me as overtly deceptive. At any rate, I mention it because it seems plausible that a counterfeit could actually only have 64 KiB, since that is the advertised capability.
Some folks fairly recently made a tester to prove that you do have a 128 KiB device, and also to exercise the extra flash to give some confidence that it is reliable. I won't go into elaborate details other than to say you simply flash it on your BluePill and connect via USB CDC with a terminal and drive a menu. More details and download at this link:
stm32f103c8-diagnostics
Once you have satisfied yourself that you do have a closeted 128 KiB BluePill, you just need to make a couple hacks to make use of it.
- 1) modify your linker definition file to indicate the 128 KiB
For example, in this project the file is 'STM32F103C8Tx_FLASH.ld', and early in it is the line
FLASH (rx) : ORIGIN = 0x8000000, LENGTH = 64K
which is fairly obvious that you should change to
FLASH (rx) : ORIGIN = 0x8000000, LENGTH = 128K
And that's it for modifying your project! But this is not enough in itself, because your toolchain probably needs modifications. If you're using OpenOCD (and who isn't these days, at least under the covers), then you need to do some more.
- 2) modify your toolchain to ignore what the device reports
In this case, OpenOCD is being used by System Workbench for STM32, and deep within its bowels is the config file
stm32f1x.cfg
which must be modified. Note that for some reason there are two of these. Only one of them is active -- I can't remember which. You can modify both to be certain (or modify one at a time to figure out which one is the one System Workbench actually uses -- I think it's the one with 'st_scripts' in the name).
Way down around line 65 or so is where the flash stuff is kept. There will be some comments that will make sense. There is a line that by default is:
flash bank $_FLASHNAME stm32f1x 0x08000000 0 0 0 $_TARGETNAME
The first '0' is what tells OpenOCD 'ask the chip how much we have'. You can change that value to explicitly state that there is 128 KiB like this:
flash bank $_FLASHNAME stm32f1x 0x08000000 0x20000 0 0 $_TARGETNAME
And that's it for modifying your toolchain! This only needs to be done once, and all your projects will be affected. Note that it appears that the version number of System Workbench is in the path name, so I suspect it is entirely possible that if you upgrade System Workbench that you might have to re-apply this patch.
With that done, you can continue on merrily with 128 KiB flash available. Let's fill it up!
Back to Compressing Audio Files
One simple compression is 'differential pulse-code modulation' ('DPCM'). The PCM part just means a digitally sampled signal, so the interesting bit is the 'differential' part. The idea is that instead of storing the samples themselves, you store the differences between consecutive samples. If the signal doesn't change too wildly sample-to-sample, the differences will be much smaller than the magnitude of the signal itself, and so each difference can be encoded in fewer bits. This sort of works, but audio signals sometimes have wild swings, so a modification is 'adaptive differential pulse-code modulation' ('ADPCM'). The 'adaptive' part means that the magnitude of a step can grow or shrink; e.g. a '1' might mean a change of +1 now, but later on it might mean a change of +256. The adaptation works by changing that step size based on how far off the previous prediction was. This scheme is typically a fixed compromise, so there is no additional data in the stream indicating when to change the step size -- both the encoder and decoder make the adaptation in the same way. So again, only the differences are transmitted.
The algorithm has been around a while (e.g. CCITT G.721), and typically involved floating point calculations for the adaption curves, however the Interactive Multimedia Association (IMA) came up with a version that used table lookups instead of logarithms, so this is more tractable for embedded -- especially with no FPU!
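For the curious, the decode step of the IMA flavor is small enough to show in full. This is a generic rendition of the algorithm (matching the adpcm_decode_sample() signature the PWM test code calls), not necessarily byte-for-byte what ended up in my port:

#include <stdint.h>

struct ADPCMstate { int prevsample; int previndex; };

//standard IMA index adjustment and step size tables
static const int c_anIndexAdjust[8] = { -1, -1, -1, -1, 2, 4, 6, 8 };
static const int16_t c_anStepSize[89] = {
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31,
    34, 37, 41, 45, 50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143,
    157, 173, 190, 209, 230, 253, 279, 307, 337, 371, 408, 449, 494, 544, 598, 658,
    724, 796, 876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066, 2272, 2499, 2749, 3024,
    3327, 3660, 4026, 4428, 4871, 5358, 5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767 };

//decode one 4-bit code into a signed 16-bit sample, updating predictor state
int adpcm_decode_sample ( int code, struct ADPCMstate* state )
{
    int step = c_anStepSize[state->previndex];

    //reconstruct the difference from the 3 magnitude bits (plus a step/8 bias)
    int diff = step >> 3;
    if ( code & 1 ) diff += step >> 2;
    if ( code & 2 ) diff += step >> 1;
    if ( code & 4 ) diff += step;
    if ( code & 8 ) diff = -diff;    //sign bit

    //add to the previous prediction and clamp to the 16-bit range
    int pred = state->prevsample + diff;
    if ( pred > 32767 ) pred = 32767;
    else if ( pred < -32768 ) pred = -32768;

    //adapt the step size for next time
    int index = state->previndex + c_anIndexAdjust[code & 7];
    if ( index < 0 ) index = 0;
    else if ( index > 88 ) index = 88;

    state->prevsample = pred;
    state->previndex = index;
    return pred;
}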
Much as before, to prove the concept I used Python to flesh out an implementation of the encoder and decoder. Then I encoded all the phoneme data in ADPCM, and decoded it back to PCM. With this second version of the audio, I ran through the same test case to hear the straight-PCM version side-by-side against the ADPCM version. This is a lossy compression scheme, so the actual numbers are different, but audibly it sounds pretty much the same. E.g.: testcase_001a.wav. So that's promising!
After that, I added some code to the Python PoC script to emit C source that contains all the ADPCM data as constant arrays, and included it in the project to see what the flash impact is. It turns out to be about 58 KB. So there's hope; that will take up most of the newly unlocked flash, with a few K to spare. That, with the 12 or so K we already have, is probably enough to get the project completed.
Building with the ADPCM data shows we have broken the 64 KiB boundary with about 20 K to spare:
arm-none-eabi-size "BluePillSP0256AL2.elf"
text data bss dec hex filename
 109992    1984   15360  127336   1f168 BluePillSP0256AL2.elf
Next
PWM output of the sound data.
-
TTS Rulez Redux
05/14/2020 at 17:46
Summary
With the 'compact' form of the rules in hand, it is time to use them.
Deets
I ported the code that processes the rules into C. This was a bit more trouble than I anticipated because the Python version uses some conveniences of that environment -- especially dynamically sized arrays and string concatenation. Since this code is going to be running in an embedded environment, I wanted to avoid copying to temporary and dynamically allocated buffers as much as possible, and rather try to process directly out of any buffers or constant definitions. Additionally, there was a hack in the original rules that required a space to be prepended and appended to the word. This hack allowed using the space as a meta-character for 'Nothing', which was used to indicate that a context pattern needed to be at the very beginning or end of the text. I wound up creating a separate meta-character for that, '$', and updated all the rules accordingly. That addition caused me to generate a new distinct string, so I incurred a two-byte penalty, bringing the compactified rules to 9385 bytes.
Incrementally building the code shows these numbers for flash usage:
- 40816 baseline
- 50208 rules included; delta = 9392
- 51908 tts code; delta = 1700
- 51964 simple test code to use TTS to translate a sentence; delta = 56
So this is not too bad; about 2 KB for the actual code, and the simple test (which is fairly representative of how it would be used in practice) is quite small at about 56 bytes.
This means that there is about 12 KB more flash for code growth before the next crisis. I think this might be OK for the remaining stuff I have planned. I've got a little more than 7 KB of RAM left, and I think this will be enough, too, to finish things up.
The simple test code:
static const char achGettysburg[] = "four score and seven years ago our fathers brought forth on this continent \
a new nation, conceived in liberty, and dedicated to the proposition that all \
men are created equal.";
const char* pszText = achGettysburg;
int nTextLen = COUNTOF(achGettysburg);

//quicky test running through text
const char* pchWordStart, * pchWordEnd;
int eCvt;
while ( 0 == ( eCvt = pluckWord ( pszText, nTextLen, &pchWordStart, &pchWordEnd ) ) )
{
    int nWordLen = pchWordEnd - pchWordStart;
    static uint8_t sl_abyPhon[64];    //semi-arbitrarily sized long word
    int nProduced = ttsWord(pchWordStart, nWordLen, g_abyTTS, sl_abyPhon, COUNTOF(sl_abyPhon) );

    //stick on a space between words if there is not already a pause
    if ( sl_abyPhon[nProduced-1] > 4 )    //all pauses are code 0 - 4
    {
        sl_abyPhon[nProduced++] = '\x03';
        sl_abyPhon[nProduced++] = '\x02';
    }

    size_t nIdxPhon = 0;
    size_t nRemaining = nProduced;
    while ( nRemaining > 0 )
    {
        size_t nConsumed = SP0256_push ( &sl_abyPhon[nIdxPhon], nRemaining );
        nRemaining -= nConsumed;
        nIdxPhon += nConsumed;
        if ( 0 != nRemaining )
        {
            osDelay ( 200 );    //sleep a little to let the synth catch up
        }
    }

    //advance
    nTextLen -= pchWordEnd - pszText;
    pszText = pchWordEnd;
}
So the gist of using it is to crack the text word-by-word (there is a convenience function pluckWord() provided for this), and then for each word 'plucked' from the buffer, push it into ttsWord() to translate it into a phoneme sequence. You can then send this sequence off to the SP0256 task (or whatever).
I added some debug code to make it send the plucked word and text-to-speeched phoneme sequence to the serial for debugging. E.g. for the first sentence of the Gettysburg address:
four 28 35 33 03 02 FF OW ER2 PA4 PA3
score 37 08 35 33 03 02 SS KK3 OW ER2
and 1a 0b 15 03 02 AE NN1 DD1
seven 37 07 23 07 0b 03 02 SS EH VV EH NN1
years 0c 13 33 2b 03 02 IH IY ER2 ZZ
ago 1a 3d 35 03 02 AE GG2 OW
our 20 33 03 02 AW ER2
fathers 28 1a 36 01 34 2b 03 02 FF AE DH2 PA2 ER2 ZZ
brought 1c 27 17 0d 03 02 BB1 RR2 AO TT2
forth 28 17 17 33 1d 03 02 FF AO AO ER2 TH
on 17 0b 03 02 AO NN1
this 36 0c 0c 37 37 03 02 DH2 IH IH SS SS
continent 08 18 0b 0d 06 0b 07 0b 0d 03 02 KK3 AA NN1 TT2 AY NN1 EH NN1 TT2
a 07 14 03 02 EH EY
new 0b 1f 03 02 NN1 UW2
nation, 0b 14 00 25 0e 0b 04 NN1 EY PA1 SH RR1 NN1 PA5
conceived 08 18 0b 37 13 23 07 15 03 02 KK3 AA NN1 SS IY VV DD1
in 0c 0c 0b 03 02 IH IH NN1
liberty, 2d 0c 3f 34 0d 0c 04 LL IH BB2 ER2 TT2 IH PA5
and 1a 0b 15 03 02 AE NN1 DD1
dedicated 21 0c 21 0c 2a 1a 1a 00 0d 0c 15 03 02 DD2 IH DD2 IH KK1 AE AE PA1 TT2 IH DD1
to 0d 1f 03 02 TT2 UW2
the 12 13 03 02 UW2 IY
proposition 09 27 0e 0e 09 0e 2b 0c 00 25 0e 0b 03 02 PP RR2 RR1 RR1 PP RR1 ZZ IH PA1 SH RR1 NN1
that 36 1a 0d 03 02 DH2 AE TT2
all 17 2d 03 02 AO LL
men 10 07 0b 03 02 MM EH NN1
are 18 34 03 02 AA ER2
created 08 33 13 14 00 0d 0c 15 03 02 KK3 ER2 IY EY PA1 TT2 IH DD1
equal. 13 2a 2e 1a 2d 04 IY KK1 WW AE LL PA5 PA5 PA4
I did go ahead and wire in a command in the monitor for testing this stuff: 'sp' for 'speak'. You're meant to supply a sentence and it will parse and translate much as the code is shown above (with a little extra error checking).
Now I'm curious about simulating the SP0256-AL2 using a PWM output. In this way, you wouldn't need the physical chip to enjoy 1970's era speech synthesis output. This will be a challenge with the flash -- the audio files as-is are something like 144 KiB total -- /that/ won't fit! Also, although the chip (STM32F103C8) is designated and self-reports as having 64 KiB flash, it is an open secret that the device in fact has 128 KiB (same as the 'CB). I will exploit this to get the extra room I need if it all works out.
Next
Chasing another goose named 'SP0256-AL2 simulation'.
-
Text-to-Speech Rulez!
05/12/2020 at 20:18
Summary
For today's goose-chase, I am porting over the text-to-speech rules. Some effort was put forth towards reducing their flash footprint.
Deets
Having realized the primary impetus of the project, I'm faced with several other directions to take it next. Semi-arbitrarily, I decided to try getting text-to-speech capability in place. As mentioned in a previous post, I have some old TTS code which I ported to Python for a sanity check, and now I am porting it to C for inclusion on the BluePill. The first step is just encoding the rules as static data to be burned into flash. Transcoding the rules took a little over a day of mundane reformatting and some consideration of how to work with the C language itself; e.g. there's a bunch of variable-length arrays -- in the other languages the length is an intrinsic property of the array object, but that's not the case in C. For strings there is the implicit NUL terminator, but that is not the case for any other array. Eventually, I worked out some macros that exploit string merging to fake it enough to have a result that looks manageable.
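I haven't reproduced the actual macros, but the flavor of the string-merging trick is roughly this (my own sketch; the phoneme codes match the SP0256-AL2 values used elsewhere in these logs):

//each phoneme code is a one-character string literal...
#define PA1     "\x00"
#define TT2     "\x0d"
#define EY      "\x14"
//...so adjacent literals merge into a single string, and sizeof() recovers
//the length (minus the implicit NUL) without counting by hand:
#define PH(seq) { seq, sizeof(seq) - 1 }    /* PhonSeq initializer */

//e.g.:
//  PH(EY)          expands to { "\x14", 1 }
//  PH(EY PA1 TT2)  expands to { "\x14\x00\x0d", 3 }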
Straightforward inclusion of the rules as C-defined structures shows that they take 19,300 bytes of flash. This is too much. When I had originally written this code (and by 'written' I mean 'ported some existing work and extended it'; credits are in the source), it was for a platform called 'dotNet Micro Framework'. It was somewhat interesting, but it lacked a lot of const-friendliness, and tended to put things in live objects (i.e. in RAM) no matter how much 'readonly' qualifier you applied. So in that case I pre-processed the rules into an alternative form that would cause the compiler to leave almost all the stuff in flash. On that platform, I had an abundance of flash (and comparatively an abundance of RAM, too) relative to here. Those transformations are not meaningful here, but I wanted to see if a similar compactification could reduce the footprint. The gist would then be that the desktop app would be the 'master' copy of the text-to-speech rules, encoded in C structs/arrays in a straightforward way, and then they would be pre-processed into the compact form for embedded. That way the rules can continue to be developed and maintained in a sane way, albeit with the additional pre-processing step.
First I did some basic statistics including raw counts and distinct counts:
Rules: 706
strs: 2118, bins: 706
dstrs: 484, dbins: 400
So, 706 rules, 2118 strings (the various 'contexts'), and 706 phoneme sequences. Of the 2118 strings, 484 were distinct, and of the 706 phoneme sequences, 400 were distinct. This makes it seem like the strings could be reduced to about 25%, but really that is just the count. The devil is in the details. Truthfully, a lot of the strings are for exception cases, and these tend to be longer. So deduping short strings might not really squeeze that much. Having the program tabulate the lengths showed:
Rules: 706
strs: 2118, bins: 706
dstrs: 484, dbins: 400
strlen: 2783, binlen: 1688
dstrlen: 1549, dbinlen: 1246
So, 1549/2783 really reduces by about 44% rather than the hoped-for 75%. But that's still an improvement. A similar story is told for the phoneme binaries, at 26% rather than 43%. But it occurred to me that this is not considering the NUL terminators, so I reworked it:
Rules: 706
strs: 2118, bins: 706
dstrs: 484, dbins: 400
strlen: 4901, binlen: 2394
dstrlen: 2033, dbinlen: 1646
Here, the space reduction is better (58% vs 44%, and 31% vs 26%), but wow! the size, taking into consideration NUL terminators, really added some overhead! That's what a bunch of single-character strings/bins will do. But another tale is to be told: even disregarding de-duping, the total of strings and binaries is 4901 + 2394 = 7295 bytes. But comparing the flash size before and after including the unmodified C ruleset showed just over 19,000 bytes. So where did the other 12 KiB go? Well, it's in pointers and padding. The rule structure is straightforwardly defined like this:
//structures involved:
typedef struct PhonSeq
{
    const char* _phone;
    size_t _len;
} PhonSeq;

typedef struct TTSRule
{
    const char* _left;
    const char* _bracket;
    const char* _right;
    PhonSeq _phone;
} TTSRule;

//example rule:
const TTSRule r_a[] = {
    ...
    { "^^^", "a", "", { EY, 1 } },
    { "^.", "a", "^e", { EY, 1 } },
    { "^.", "a", "^i", { EY, 1 } },
    { "^^", "a", "", { AE, 1 } },
    { "^", "a", "^##", { EY, 1 } },
    ...
};
Then what actually gets created is a contiguous array of TTSRule struct that look like this:
{ & "^^^", & "a", & "", { & "EY", 1 } },
So, the entries in the array are pointers to nul-terminated strings that are elsewhere -- more overhead. As a quick calculation, if you take the 19,300 bytes known consumption, minus the 7295 expected consumption, you get 12,005 bytes, and dividing that by the 706 rules is just over 17 additional bytes per rule. Since pointers are 32-bits on this platform, the 4 pointers in the rule would be 16 bytes, which jibes with the 17 bytes of the quick calculation.
So, if we instead were to pack all our de-duped strings into a contiguous blob (ostensibly 2033 + 1646 = 3679 bytes), and were to use 16-bit indexes (instead of 32-bit pointers) to reference into this blob, then we should have an overhead of 706 * 4 * 2 = 5,648 bytes. So 3679 + 5,648 = 9,327. That's less than 19,300! It's still a bit disappointing since the majority of the ruleset blob is still structural overhead in the way of indices to data components. Due to the statistical nature of the indices, they might be ripe for entropy encoding, though I'm not going to bother with that just now.
The real results will be a little bit different from that exact number. One thing is that for performance I segregate the rules into groups based on the initial character of the 'bracket' context. That adds the overhead of another array of pointers; however, this is smaller in that there are only 27 elements, which is just 108 bytes. One additional overhead is determining the lengths of the rule groups, which presently involves a terminal 'sentinel' value -- a full rule of zeros. If, instead of holding pointers to the rule groups, I concatenate all the rule groups together and just keep offsets to the starts of the rule groups, then I can reduce the size of the pointers and also eliminate the sentinels. The size of a rule group is then calculated as the difference between the offsets of adjacent entries. Something has to be done for the last entry (since there is no subsequent entry), so I add a 'dummy' entry which is the offset to the next rule group if only there were one. That means there is an additional (27+1) * sizeof(uint16_t) = 56 bytes, bringing the grand total to 9,383 bytes. So, about half of what was originally there. I'd like to squeeze it more, but I'd also like to get coding, so I'm running with this for now to see if it is good enough.
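To make the compact form concrete, here is roughly what the packed representation looks like (the names and exact field layout are my own illustration; details like how a phoneme sequence's length is recovered from the blob are glossed over):

#include <stdint.h>
#include <stddef.h>

//all de-duped strings and phoneme sequences packed end-to-end in flash
extern const char g_achBlob[];

//a rule is now four 16-bit offsets into the blob instead of four 32-bit pointers
typedef struct PackedTTSRule
{
    uint16_t _left;       //offset of left context
    uint16_t _bracket;    //offset of bracket context
    uint16_t _right;      //offset of right context
    uint16_t _phone;      //offset of phoneme sequence
} PackedTTSRule;          //8 bytes per rule; 706 * 8 = 5,648 bytes

//all rule groups concatenated, with 27 + 1 offsets delimiting them, so
//group n occupies [g_anGroupStart[n], g_anGroupStart[n+1]) -- no sentinels
extern const PackedTTSRule g_aPackedRules[];
extern const uint16_t g_anGroupStart[27 + 1];

static inline size_t ruleGroupCount ( int nGroup )
{
    return (size_t)( g_anGroupStart[nGroup + 1] - g_anGroupStart[nGroup] );
}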
Since I need to preprocess the rules into this compact form, I made a separate C++ application for that. I will also have it implement the text-to-speech code against the compactified ruleset (which it will recompute each time it runs) so it can serve as a unit test of that code.
Next
Use the rules.
-
Speech Commands 101
05/11/2020 at 18:32
Summary
A command 'ph' is implemented to spew phonemes to the synthesizer, enabling basic experiments over the terminal.
Deets
Having gotten the SP0256 task in place, now it's time to use it. My eventual plan is to have the BluePill accept a binary stream over the serial port and direct it into the SP0256, but that will mean making a client-side app (I guess I could make one quickly enough with Python -- surely there is serial IO capability there).
In the short term, though, I decided I can implement a command on the command processor. This command takes a hex sequence which is the stream of phonemes. Since it's on the command processor, it's quite limited (the command processor has a hard line-length limit of 128 chars), but it's quite serviceable for interactive testing.
Because of the line length limitation, I shortened the name of the command to 'ph' and it takes a contiguous stream of hex chars. E.g., sending:
ph 1B072D350302
Will cause the synth to speak 'hello'.
Tada!
In the strictest sense, I have now achieved what was my original motivation for doing this project: set up to cause the real synth to generate audio that I can record, but of course the project has grown beyond that motivation and now I'm off chasing geese.
Next
Chase geese; maybe binary streaming of phonemes over the serial port (instead of the textual way here), or maybe text-to-speech.
-
Implementing the SP0256 'Task'
05/10/2020 at 15:40
Summary
A FreeRTOS 'task' (aka thread) is used to manage the interface with the physical SP0256 and stream phonemes from a circular buffer.
Deets
Since this project is ultimately going to have several functions, including the previously described command processor, but also the phoneme receiver and text-to-speech component, I decided to make the handler for the physical SP0256 a task-oriented component. The gist is that there is an interface where you 'push' phoneme data to your heart's content, and internally there is a 'thread' that removes that data and sends it on to the chip in a way that works with the chip's hardware flow control signals. I've used this approach in some other projects, and it helps keep the design/implementation modular and more loosely coupled.
In this case, there are several hardware resources that the SP0256 task manages:
- the 'address' lines. Since there are only 64 phonemes, I decided to relinquish the top two bits for other purposes, so this wound up using PA 0-5. These are not 5V tolerant, but being as they are strictly output, this is OK.
- the 'not Address Load' (nALD) line. This is what strobes the data into the SP0256. This is put on PB 1, which similarly is not 5V tolerant, but since it's going out of the BluePill and into the SP0256, this is OK.
- the 'not Load Request' (nLRQ) line. This is part of the hardware handshaking, and it goes low to indicate that it is OK to send data to the SP0256. Since this is an input, it needs to be on a 5V tolerant pin, and it is put on PB 11.
- the 'Standby' (SBY) line. This indicates that the SP0256 is finished with all phonemes, and could be put into low(er!)-power mode. I don't plan on using it, but nonetheless I wired it to PB 10 in case I change my mind.
- I also decided to manage the reset line explicitly, and I put that on PA 6 with an NPN transistor open collector. The data sheet seems to imply that Reset needs to go up to 5V, not just be a digital high, so that's why I did this.
This chip is really slow, and we are wiggling the lines programmatically, so I use some delay loops. One way I tend to do that on these ARM parts, when possible, is to use the 'Debug module'. This is an optional module intended for debugging, but one handy thing it has is a cycle counter. This is a 32-bit up-counter that is clocked by the CPU clock. By using this (if available on your particular part) I can avoid using up timer resources. For short delays, and even for profiling code, it can be quite handy. The module has to be explicitly enabled, and that is done very early in main().
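A minimal sketch of enabling and using that cycle counter, via the standard CMSIS register names (the helper names are mine):

#include "stm32f1xx.h"    //CMSIS core definitions: CoreDebug, DWT, SystemCoreClock

//done once, very early in main()
void cycleCounterInit ( void )
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;    //enable the trace/debug block
    DWT->CYCCNT = 0;
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;               //start the free-running cycle counter
}

//busy-wait some microseconds; unsigned subtraction makes it wrap-around safe
void delay_us ( uint32_t us )
{
    uint32_t nStart = DWT->CYCCNT;
    uint32_t nTicks = us * ( SystemCoreClock / 1000000UL );    //72 ticks/us at 72 MHz
    while ( ( DWT->CYCCNT - nStart ) < nTicks )
        ;    //spin
}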
I use a circular buffer to receive the phonemes from outside this module. This is some common code I have written that I use across projects. Since this is manipulated by two threads, I protect it with a mutex. OK, some things about FreeRTOS: many functions have two variants: an 'ordinary' variant, and an 'ISR-friendly' variant. The synchronization-related stuff in particular is in this class. Mutexes are what FreeRTOS calls a 'binary semaphore', and you use the semaphore-related functions to acquire and release them. HOWEVER, for reasons that are not clear to me, mutexes are incompatible with ISRs. If you really need to do mutual exclusion within an ISR, you must use the binary semaphore. FreeRTOS suggests that mutexes are useful for 'simple mutual exclusion'. Well, I think my application is 'simple', so I am going with the mutex, but I put a caveat in the comments on the API that the various methods are NOT to be called from an ISR. This isn't a problem for my project, but one day I may re-use this, forget, somehow deadlock the system, and have to spend time debugging. Best to comment.
Speaking of ISRs, there is presently one interrupt source used: the nLRQ line is configured as an EXTI source, on the falling edge. The idea is that when nLRQ falls, it means that the SP0256 can accept more data, so if there is pending data in the circular buffer, send it on. I don't do this work in the ISR, though. Here it would be a problem because of the mutex, but I generally avoid doing work in ISRs as a rule anyway, so that the ISR can return to the system as quickly as possible. Instead, I let the worker thread do it. So the ISR just signals for the worker thread to wake later and handle the data. There are several mechanisms for that in FreeRTOS, but the one I usually like to use is called 'Task Notifications'.
Task Notifications are a FreeRTOS concept, and essentially each task has a 32-bit value associated with it. You can interpret this 32-bit value in whatever way you want, but I (and presumably others) generally interpret them as a vector of flags. They are lightweight compared to other synchronization primitives, and have limitations, but for many use-cases they are sufficient. I usually have a single header 'task_notification_bits.h' that defines all the values for my project -- this is just my preference. The task calls 'xTaskNotifyWait()' which causes the thread to sleep until awakened by having a notification posted by 'xTaskNotify()' or 'xTaskNotifyFromISR()'. In my case I use the latter since I post the notification from the EXTI ISR. My bit definition is named 'TNB_LRQ' since this is the task notification bit for the 'Load Request Line'.
When the task awakens, it tests all the bits it knows about and handles them accordingly. (You must test ALL the bits you know about, because it is possible that more than one notification has been posted, and those bits are automatically cleared, so you don't get a second chance.) In this case the task dequeues phonemes from the circular buffer and pushes them into the SP0256 until either there are no more phonemes or the nLRQ line goes high (and we must stop for now). All this is done while holding the queue's mutex, so other threads are prevented from damaging the queue while we're using it. The process of dequeuing and sending is fast, so I just hold the mutex for the whole time rather than be more surgical around the dequeue operation only. One can view this task as the 'consumer' of the phoneme stream. I have provided a notification mechanism that allows some arbitrary code to be executed when the phoneme stream is depleted, though I don't imagine I will be using it. It's just habit for me to provide such.
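The shape of that task loop, sketched; TNB_LRQ is the project's notification bit for the load-request line, while the function and helper names here are illustrative:

void thrdSP0256 ( void const* argument )
{
    for ( ;; )
    {
        uint32_t ulNotifications = 0;
        //sleep until notification bits are posted (e.g. from the EXTI ISR)
        xTaskNotifyWait ( 0, 0xffffffffUL, &ulNotifications, portMAX_DELAY );

        //test ALL the bits; more than one may have been posted at once
        if ( ulNotifications & TNB_LRQ )
        {
            //nLRQ fell: feed queued phonemes until empty or nLRQ goes high again
            _feedPhonemesWhileSynthReady();    //hypothetical helper
        }
        //...other notification bits handled here...
    }
}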
Other tasks will 'produce' phonemes into the queue. This might be the serial port, or it could be the output of the text-to-speech module yet to be developed. This 'push' operation is slightly more involved than the 'pull' because several scenarios must be handled:
- the queue is not empty (and concomitantly is being serviced by the consumer thread), and the SP0256 is ready to accept data. In this case we want to first dequeue and feed the SP0256. We want to do this until the SP0256 is no longer ready to accept more data. This step is important, because we want the data to go to the synth in the order we pushed it. So we need to flush out as much previous data as possible before processing the current data.
- the queue is empty, and the SP0256 is ready to accept data. In this case we want to feed the SP0256 from the start of the current data, not even putting it in the queue, and continue doing that until the SP0256 brings the nLRQ line high (indicating it cannot accept more). It is important to feed as much data as possible for reasons that will soon be apparent.
- the SP0256 is not ready to accept data. In this case we simply want to add the data to the queue.
By handling those three scenarios in the sequence presented, the order of the phoneme stream will be maintained, and the queue will be used to buffer excess, which will be automatically handled for us. As mentioned, it is important to feed the synth to the point that it indicates it can take no more. The reason is that we need to make sure that eventually there will be a falling transition on the nLRQ line so that the interrupt handler will send the task notification to cause the rest of the data to be fed in. It is OK if the amount of data being pushed is not enough to make this happen, but if the amount of data IS enough, then we must cram it until we can't cram any more. Any excess data gets enqueued.
The last feature of the 'push' API is that it returns how much data was consumed in the call. This is important because it's possible that there is not enough room in the queue to take it all. By reporting how much was taken, the caller can make multiple sequential calls to eventually push it all in.
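A condensed sketch of the push logic handling those three scenarios in order (the SP0256_push() name matches the API used elsewhere in these logs; the internal helpers are illustrative):

//returns how many of nLen phonemes were actually consumed
size_t SP0256_push ( const uint8_t* pbyPhonemes, size_t nLen )
{
    size_t nConsumed = 0;
    _lockQueue();    //the mutex; this API must NOT be called from an ISR

    //1) flush previously queued phonemes first, to preserve ordering
    while ( ! _queueIsEmpty() && _synthReady() )
        _strobePhoneme ( _dequeue() );

    //2) if the queue is empty, cram new data straight into the synth until it
    //   raises nLRQ, guaranteeing a later falling edge (and hence an interrupt)
    while ( _queueIsEmpty() && _synthReady() && nConsumed < nLen )
        _strobePhoneme ( pbyPhonemes[nConsumed++] );

    //3) whatever is left gets enqueued, up to the available space
    while ( nConsumed < nLen && _enqueue ( pbyPhonemes[nConsumed] ) )
        ++nConsumed;

    _unlockQueue();
    return nConsumed;    //the caller may need to call again with the remainder
}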
OK, this activity took longer than I wanted because I misread the datasheet about the behaviour of the nLRQ line, so I had to re-write some of the code. Additionally, an annoyance with the STM32CubeMX application is that it is not sufficient to declare a pin as being an EXTI source -- you also have to go into the NVIC settings and turn on the EXTI interrupt. This has happened to me before, and it seems I never learn (or rather, I always forget). Strictly, you have to do this separate 'enable the interrupt source' step for other peripherals, too, but for some reason it seems more obvious in those cases.
At length, I was able to push a manually-crafted phoneme sequence in and have it play back correctly.
Next
The test was just with a hard-coded call to the 'push' API. Now I need to implement the code that will be making that call. This will be in a module that receives data over the serial port, and pushes that data into the SP0256 task.
-
Physical UART 'Monitor'
05/09/2020 at 18:06
Summary
Mini-update: debugging while connected via USB CDC drives me bonkers, so I made an alternative configuration where a physical UART is used.
Deets
USB involves a lot of stuff going on to maintain the connection between host and device. The on-chip peripheral handles some of that, but other parts are handled in code. Fortunately, I don't have to write most of that -- it's provided in the libraries -- however it does have to be running to maintain the connection. If it isn't, the host gives up trying to talk to the apparently malfunctioning device.
And that's the rub: when debugging (i.e. single stepping) the code, the servicing of the USB is concomitantly halted, and the host gives up on the connection. Some productive debugging can continue in many cases, however you will ultimately be faced with having to disconnect/reconnect the device to the host (i.e. 're-enumerate') to get the host to recognize it. Moreover, whatever application was using the USB will likely need to be restarted as well, because the old device handles are of no use anymore.
Because of this, I decided to spend a little time getting a physical UART running, and connecting it to a separate USB-to-serial bridge device (I keep a handful of FTDI boards on-hand for these sorts of things). In that way, the board can reset, cycle power, whatever, and not require futzing around on the host machine, because it's connected to the external device, not the actual project.
I had already made an adapter from the STM UART library code to my stream abstraction in a separate project, so I merged that in. At that point it was straightforward to simply bind the UART stream to the monitor instead of the USB CDC stream. I modded the project to switch between the two based on a preprocessor definition. I expect that USB CDC would be what is used in 'production' and the UART would be used in 'development', but who knows? There's probably general use for both.
Next
Back to implementing the SP0256 interface.
-
Building the Basic BluePill Interface
05/08/2020 at 22:18
Summary
For my first amazing feat, I am going to make the interface as originally planned: i.e. as a USB CDC to SP0256 'bridge'.
Deets
The original motivation of this mini project was to make a USB CDC to SP0256 'bridge' so I could record individual phoneme signals. An application on the PC would send the phoneme sequence over serial and then also record the output. As mentioned, I found a collection of recorded phonemes, but I'm going to carry on with this. Maybe I'll get some better recordings. Anyway, I have future plans beyond the original bridge, but this is a good first step towards those.
Of HALs and Hacks and Heaps
I've covered this before in other projects, but I'm not a big fan of the STM32 HAL libraries. I still use them anyway for projects like these because they are convenient, but they strike me as bloated. For example, after configuring the chip and generating the project and doing a build:
debug 'optimize for debug'
arm-none-eabi-size "BluePillSP0256AL2.elf"
   text    data     bss     dec     hex filename
  36140    1156   13576   50872    c6b8 BluePillSP0256AL2.elf
so, 36 K of flash (out of 64 K) is consumed, and something like 14.5 K of RAM (out of 24 K) is used. Yikes! Even doing a release build:
release, 'minimize size'
arm-none-eabi-size "BluePillSP0256AL2.elf"
   text    data     bss     dec     hex filename
  30856    1148   13552   45556    b1f4 BluePillSP0256AL2.elf
doesn't improve it much. But c'est la vie. For things like USB the HAL library is pretty much the only option unless maybe you went with an alternative library. The USB peripheral is a beast.
The HAL also has quirks, so I have a set of hacks which I use that make USB CDC and UART work the way I want them to. But since the code is generated, those hacks get overwritten. I have a batch file that re-applies them every time I re-generate the code. You will regenerate often in the beginning, because you'll change your mind about peripherals, etc. The batch file makes it tolerable.
The last hack is an enhancement -- I have my own heap (i.e. malloc) implementation. I needed this a while back for a library that required realloc(), and the one that comes with FreeRTOS does not provide that. Additionally, I added some debugging enhancements that let me do a 'heapwalk' of the blocks and also to fill blocks with a pattern so they are easier to visually inspect. The last bit of legerdemain with the heap is using a nifty feature of the gcc linker. You can tell the linker to redirect symbols to a 'wrapper' function that you must provide. This trick allows me to re-direct calls to malloc() that are even in pre-compiled code (e.g. libc) into my implementation. This is important, because otherwise you will have two heaps: the one you conscientiously use that you provide, and the default implementation that is in libc. I am less fond of the libc implementation because it will 'grow' the heap as needed upwards into the stack. There isn't a hard limit on the arena size (well, up until crashing -- that's a hard limit!).
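That linker feature is GNU ld's --wrap option. The gist, with illustrative names for the custom heap functions:

//linker flags (e.g. appended to the project's LDFLAGS):
//  -Wl,--wrap=malloc -Wl,--wrap=free -Wl,--wrap=realloc

#include <stddef.h>

void* my_heap_alloc ( size_t size );    //illustrative names for the custom heap
void  my_heap_free ( void* p );

//with --wrap=malloc, every reference to malloc() -- even ones already compiled
//into libc -- links against __wrap_malloc() instead; the original allocator
//remains reachable as __real_malloc() should it ever be wanted
void* __wrap_malloc ( size_t size )
{
    return my_heap_alloc ( size );
}

void __wrap_free ( void* p )
{
    my_heap_free ( p );
}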
Some folks have asked for my heap implementation. It's in the source (in the github project in the links section). It goes where FreeRTOS normally places heap_4.c:
.\Middlewares\Third_Party\FreeRTOS\Source\portable\MemMang\heap_x.c
the 'fixup.bat' does the deletion of heap_4.c and replaces it with heap_x.c. The linker stuff is documented in main.c near the top.
OK with all that setup, I generally start with a common design where the 'default' task (which apparently cannot be deleted via the tool, so I just work with it) handles things like the LEDs (of which this board has just one on the board itself) and periodically collecting debug statistics like heap usage, per-task stack usage, and anything else I might be interested in such as circular buffer usage. I find these useful for tuning in the final build. The release build conditionally omits that diagnostic code.
The other common thing I do is implement a 'monitor' task. This is a serial console through which I can issue commands to see what's going on and do configuration stuff. I typically do that over USB CDC, since that's the way I usually use the board; however, I've found that complicates debugging, because when you hit a breakpoint the USB state machine halts, you generally lose your connection on the PC, and you need to cycle power. I have since found it easier (when possible) to use a physical UART and an external UART-to-USB bridge (e.g. FTDI). That way the PC can stay connected to the USB on the separate bridge while the board is halted. This is fairly easy for me to do because part of my 'hacks' is to abstract UART and USB CDC into a common 'stream' interface. Then stuff that needs to use that IO can 'bind' to the relevant stream interface object and deal with it in a consistent way. As such, switching between physical UART and USB CDC is just a one-line change where the interface binding is performed.
The monitor task uses a common component of mine called the 'command processor'. It handles the low(er)-level data IO from the stream interface, and performs a lookup of commands that are stored in a table. Each command entry provides some help text, and also the function pointer to the handler for the command. Since this project is just starting, the set of commands is currently minimal:
- set -- set a persistent setting or list all the settings
- persist -- save settings to flash
- depersist -- read the settings from flash
- reboot -- restart the board as if you pressed 'reset'
- dump -- dump memory contents
- diag -- print some of the diagnostic statistics I mentioned earlier
- help -- get help on a specific command, or list all commands
I generally like to get a board brought up to this stage before proceeding with the project-specific stuff. And so I did!
Next
Implementing the SP0256 interface processor.