Aligned Strings with Attributes

Should we then call these ASA? It's a simple layer above the AS system.

The previous log, 12. Extension and type, threw a few ideas at the metaphorical wall and some have stuck.

Repurposing the trailing NULL seems to be the path of least algorithmic resistance and of maximal flexibility. Now it's a matter of actually defining these bits and managing them.

When a "POSIX" string is imported from outside a program's domain (user input, a file, a network socket...), it's just a Variable-Length Byte Array of type 1 (255 bytes max) or 2 (65535 bytes max). There are 1 or 2 bytes of size prefix and a NULL terminator byte (the suffix is not counted in the size but is always present). The contents could be... anything, including another NULL byte, so the string is considered unsafe.
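To make the layout concrete, here is a minimal sketch of a type 1 import, assuming the size byte comes first and the terminator sits right after the payload. The name vlba1_import is hypothetical, and alignment (the "A" in AS) is ignored here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: builds a type 1 array in dst as
   [size][payload ...][terminator].
   Returns the total number of bytes written, or 0 if it does not fit. */
static size_t vlba1_import(uint8_t *dst, size_t cap, const void *src, size_t len)
{
    if (len > 255 || cap < len + 2)
        return 0;
    dst[0] = (uint8_t)len;      /* 1-byte size prefix */
    memcpy(dst + 1, src, len);  /* payload, may contain NULL bytes */
    dst[1 + len] = 0;           /* terminator, not counted in the size */
    return len + 2;
}
```

The terminator starts at 0, which is exactly the "not tested yet" state that the attribute bits build on.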

The importation process must determine if the string contains valid ASCII characters and well-formed UTF-8 sequences. A lookup-table-driven FSM looks like a simple and practical method to filter invalid sequences or bytes, and collect the attributes.

We have already defined some attribute bits. The "tested" bit would be allocated to the MSB so that it can be checked by testing for a negative number. A non-negative value means the string has not been tested, so it's invalid, and this includes the default NULL value.

bit 7 : tested
bit 6 : UTF-8
bit 5 : malformed
bit 4 : 
bit 3 : 
bit 2 : 
bit 1 : 
bit 0 :
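The sign test on the "tested" bit could look like the sketch below. The mask names are the ones defined further down in this log, and the function name is hypothetical. The cast assumes a two's complement target, which is the case on every common compiler.

```c
#include <stdint.h>

#define MASK_ASTRA_TESTED (128)  /* bit 7 */
#define MASK_ASTRA_UTF8   ( 64)  /* bit 6 */
#define MASK_ASTRA_ERROR  ( 32)  /* bit 5, "malformed" */

/* Hypothetical helper: returns 1 when the terminator byte says the string
   has been scanned and no malformed byte or sequence was found. */
static int astra_is_usable(uint8_t terminator)
{
    /* Seen as a signed byte, a set bit 7 makes the value negative. */
    if ((int8_t)terminator >= 0)
        return 0;  /* never tested, which includes the default 0 */
    return (terminator & MASK_ASTRA_ERROR) == 0;
}
```

On many CPUs the first test compiles to a single sign-flag branch, which is the point of putting "tested" in the MSB.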

The "malformed" attribute is set when encountering a NULL byte, invalid UTF-8 bytes (F5–FF) or an invalid UTF-8 sequence.

So the UTF-8 attribute covers the prefix as well as the continuation ranges, but these individual ranges must also be provided to check for a valid sequence. Hence the lookup table is 16 bits wide: one byte is directly ORed with the attribute byte while the other helps update the FSM.
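Here is one way the 16-bit table and the FSM could fit together. It is only a sketch: the class codes, the function names, and the lo/hi range trick for the first continuation byte are my assumptions, not the final code. It also rejects 0xC0 and 0xC1, which can only start overlong forms.

```c
#include <stddef.h>
#include <stdint.h>

#define MASK_ASTRA_TESTED (128)
#define MASK_ASTRA_UTF8   ( 64)
#define MASK_ASTRA_ERROR  ( 32)

/* Hypothetical FSM classes, stored in the high byte of each entry. */
enum { CLS_ASCII, CLS_CONT, CLS_LEAD2, CLS_LEAD3, CLS_LEAD4, CLS_BAD };

static uint16_t lut[256];  /* high byte: FSM class, low byte: attribute bits */

static void lut_init(void)
{
    for (int b = 0; b < 256; b++) {
        unsigned cls, attr;
        if      (b == 0)   { cls = CLS_BAD;   attr = MASK_ASTRA_ERROR; }
        else if (b < 0x80) { cls = CLS_ASCII; attr = 0; }
        else if (b < 0xC0) { cls = CLS_CONT;  attr = MASK_ASTRA_UTF8; }
        else if (b < 0xC2) { cls = CLS_BAD;   attr = MASK_ASTRA_ERROR; }
        else if (b < 0xE0) { cls = CLS_LEAD2; attr = MASK_ASTRA_UTF8; }
        else if (b < 0xF0) { cls = CLS_LEAD3; attr = MASK_ASTRA_UTF8; }
        else if (b < 0xF5) { cls = CLS_LEAD4; attr = MASK_ASTRA_UTF8; }
        else               { cls = CLS_BAD;   attr = MASK_ASTRA_ERROR; }
        lut[b] = (uint16_t)((cls << 8) | attr);
    }
}

/* Scans len bytes and returns the attribute byte for the terminator. */
static uint8_t scan_attributes(const uint8_t *s, size_t len)
{
    uint8_t  attr    = MASK_ASTRA_TESTED;
    unsigned pending = 0;            /* continuation bytes still expected */
    unsigned lo = 0x80, hi = 0xBF;   /* range allowed for the next one */

    for (size_t i = 0; i < len; i++) {
        uint8_t  b   = s[i];
        unsigned cls = lut[b] >> 8;
        attr |= (uint8_t)(lut[b] & 0xFF);  /* low byte is ORed straight in */

        if (pending) {
            if (cls == CLS_CONT && b >= lo && b <= hi) {
                pending--;
                lo = 0x80; hi = 0xBF;
                continue;
            }
            attr |= MASK_ASTRA_ERROR;  /* sequence cut short or out of range */
            pending = 0;
            lo = 0x80; hi = 0xBF;
            /* b is now re-examined as the start of a new character */
        }
        switch (cls) {
        case CLS_ASCII: break;
        case CLS_LEAD2: pending = 1; break;
        case CLS_LEAD3: pending = 2;
            if (b == 0xE0) lo = 0xA0;  /* no overlong forms */
            if (b == 0xED) hi = 0x9F;  /* no surrogates */
            break;
        case CLS_LEAD4: pending = 3;
            if (b == 0xF0) lo = 0x90;  /* no overlong forms */
            if (b == 0xF4) hi = 0x8F;  /* nothing above U+10FFFF */
            break;
        default:                       /* stray continuation or bad byte */
            attr |= MASK_ASTRA_ERROR;
        }
    }
    if (pending)
        attr |= MASK_ASTRA_ERROR;      /* string ends inside a sequence */
    return attr;
}
```

Call lut_init() once before scanning. The table is filled at run time here only to keep the sketch portable; a const table with designated initializers works just as well.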

This still leaves 5 attribute bits to use in the terminal byte. Here are the candidates:

control and separators (not NULL, 1..32, 127, C0 and C1)
digits (0 through 9)
symbols/punctuations
lowercase
uppercase

I'm not sure whether separating uppercase and lowercase letters is pertinent, but it's possible. This could help some parsers. In this vein, maybe controls can be separated from separators (space, tabs). We can change the table at will later, or even provide a custom one.

The import function only checks the validity of sequences, but it could also optionally translate to UTF-16 or UTF-32 into a separate buffer, to save time.

For now, the code only contains

int utf8_encode_point(char *dst, uint32_t point)

but here we are doing the reverse: taking a bounded number of bytes and building the code point (and also checking that it is valid). Maybe the function name above should be changed to be more explicit.
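The reverse operation could be sketched as follows. The name utf8_decode_point and its signature are hypothetical, chosen only to mirror utf8_encode_point. U+0000 decodes fine here; flagging the NULL byte is left to the import scan.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counterpart of utf8_encode_point(): reads at most avail
   bytes from src, stores the code point in *point and returns the number
   of bytes consumed (1 to 4), or 0 if the sequence is truncated or invalid. */
static int utf8_decode_point(const unsigned char *src, size_t avail,
                             uint32_t *point)
{
    /* Smallest code point that legitimately needs each length. */
    static const uint32_t min_point[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t cp;
    int len;

    if (avail == 0)
        return 0;
    cp = src[0];
    if      (cp < 0x80)           { len = 1; }
    else if ((cp & 0xE0) == 0xC0) { len = 2; cp &= 0x1F; }
    else if ((cp & 0xF0) == 0xE0) { len = 3; cp &= 0x0F; }
    else if ((cp & 0xF8) == 0xF0) { len = 4; cp &= 0x07; }
    else return 0;                /* stray continuation, or F8..FF */

    if ((size_t)len > avail)
        return 0;                 /* not enough bytes left */
    for (int i = 1; i < len; i++) {
        if ((src[i] & 0xC0) != 0x80)
            return 0;             /* not a continuation byte */
        cp = (cp << 6) | (src[i] & 0x3Fu);
    }
    if (cp < min_point[len])          return 0;  /* overlong form */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate */
    if (cp > 0x10FFFF)                return 0;  /* beyond Unicode */
    *point = cp;
    return len;
}
```

Returning the consumed length lets the caller walk a buffer by adding the result to its pointer, and a 0 return doubles as the "malformed" signal.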

...

The current definitions for the terminator are

#define MASK_ASTRA_TESTED      (128)
#define MASK_ASTRA_UTF8        ( 64)
#define MASK_ASTRA_ERROR       ( 32)
#define MASK_ASTRA_CONTROL     ( 16)
#define MASK_ASTRA_SEPARATOR   (  8)
#define MASK_ASTRA_PUNCTUATION (  4)
#define MASK_ASTRA_DIGIT       (  2)
#define MASK_ASTRA_CHARACTER   (  1)

but more bits are needed for the lookup table, in particular for parsing the UTF-8 strings.

/* Range designators ([a ... b]) are a GNU C extension (GCC, Clang).
   A later entry overrides an earlier one, so 9 ends up as a separator. */
const uint8_t character_type_LUT[256] = {
  [    0          ] = MASK_ASTRA_ERROR,
  [    1 ...   31 ] = MASK_ASTRA_CONTROL,
  [    9          ] = MASK_ASTRA_SEPARATOR,   /* tab */
  [   32          ] = MASK_ASTRA_SEPARATOR,   /* space */
  [   33 ...   47 ] = MASK_ASTRA_PUNCTUATION,
  [   48 ...   57 ] = MASK_ASTRA_DIGIT,
  [   58 ...   64 ] = MASK_ASTRA_PUNCTUATION,
  [   65 ...   90 ] = MASK_ASTRA_CHARACTER,   /* uppercase */
  [   91 ...   96 ] = MASK_ASTRA_PUNCTUATION,
  [   97 ...  122 ] = MASK_ASTRA_CHARACTER,   /* lowercase */
  [  123 ...  126 ] = MASK_ASTRA_PUNCTUATION,
  [  127          ] = MASK_ASTRA_CONTROL,     /* DEL */
  [ 0x80 ... 0xBF ] = MASK_ASTRA_UTF8_CONTINUATION,
  [ 0xC0 ... 0xC1 ] = MASK_ASTRA_CONTROL,     /* note: never valid in UTF-8 */
  [ 0xC2 ... 0xDF ] = MASK_ASTRA_UTF8_PREFIX1,
  [ 0xE0 ... 0xEF ] = MASK_ASTRA_UTF8_PREFIX2,
  [ 0xF0 ... 0xF4 ] = MASK_ASTRA_UTF8_PREFIX3,
  [ 0xF5 ... 0xFF ] = MASK_ASTRA_ERROR
};

So this added MASK_ASTRA_UTF8_PREFIX1, MASK_ASTRA_UTF8_PREFIX2, MASK_ASTRA_UTF8_PREFIX3 and MASK_ASTRA_UTF8_CONTINUATION.

...

Looking back at AS20230117.tgz I already found one glaring bug. See you in the next log.
