Extension and type

The 4/8 types are great.

They are the basis for designing a language and an operating system. These applications need more than what the Aligned Strings format provides, though. The format is one thing but a type is required.

Usually, types are either inferred or implicit, defined by an API/ABI. But for my projects I have identified 4 specific uses of the aligned strings:

untyped, variable length array of bytes (so that counts as "no type", which is still a type, anyway)
type declaration / prototype (yes, it's going to be a thing)
ASCII (7-bit/safe) string (the basic type)
UTF8 string (can only be typed as such if certain conditions are checked)

I don't know how to specify the string type yet. The format already exhausts the coding space. Furthermore, typing is mostly critical for character strings because there are compatibility issues. In practice, the aligned format is usually well segregated, so there should be little confusion, except for characters: a character string could be one of

unknown
known 7-bit
known UTF8

and preserving this requires 2 bits...

There is not enough room in the type 0/1/2 aligned strings but the type 3 has some headroom.

From aligned_strings.h:

typedef struct {
  // only 24 bits available with the "fast" code
  uint32_t total_length __attribute__ ((aligned (4)));
  uint16_t nb_elements,
           flags;  // unused yet.
  // Up to here: 8 bytes.
  aStr_t elem[];
} aString_list;

So two flag bits can be reserved in flags, but this increases the burden and overhead for type-agnostic code, because this would only accept Type 3/list.

The list type is more flexible (and opens the possibility of parallel processing) but slower, which would burden Unicode-agnostic code. Forcing Unicode to type 3 only does not make sense since sub-strings need to be other types, including 8- and 16-bit sizes. After all it's possible to merge a type 3 into type 1 or type 2.

________

Another possibility is to steal a couple of bits from the size field. This allows types 1/2/3 to contain the Unicode flags in the LSBs, keeping the code common for them.

Consequences:

String sizes are reduced: 63 or 16383 bytes for types 1 and 2 (6 and 14 bits of size instead of 8 and 16). It's not a disastrous reduction, considering that type 2 is preferred.
Character strings will need slightly different code than other byte strings. Or else, the other variable byte arrays will use the 2 flag bits for other purposes.
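
A minimal sketch of this bit-stealing, assuming the size is stored as a plain 16-bit integer with the two flag bits in its LSBs. All the names here (astr_pack16, ASTR_FLAG_MASK and friends) are hypothetical and not part of aligned_strings.h:

```c
#include <stdint.h>

/* Hypothetical layout: the 2 LSBs of the size field carry the flags. */
#define ASTR_FLAG_BITS   2u
#define ASTR_FLAG_MASK   ((1u << ASTR_FLAG_BITS) - 1u)   /* 0x3 */
#define ASTR_MAX_SIZE8   (0xFFu   >> ASTR_FLAG_BITS)     /* 63 */
#define ASTR_MAX_SIZE16  (0xFFFFu >> ASTR_FLAG_BITS)     /* 16383 */

/* Combine a size (caller ensures size <= ASTR_MAX_SIZE16) and 2 flag bits. */
static inline uint16_t astr_pack16(uint16_t size, unsigned flags) {
  return (uint16_t)(((unsigned)size << ASTR_FLAG_BITS)
                    | (flags & ASTR_FLAG_MASK));
}

/* Recover the size: one extra shift on every length read. */
static inline unsigned astr_size16(uint16_t field) {
  return field >> ASTR_FLAG_BITS;
}

/* Recover the flags: one mask. */
static inline unsigned astr_flags16(uint16_t field) {
  return field & ASTR_FLAG_MASK;
}
```

The extra shift in astr_size16 is the "slightly different code" mentioned above: byte arrays that ignore the flags would still have to pay for it, or use a different accessor.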

But overall it's a step backwards, since the type flags use space that could have been used to encode the format type in the first place.

________

Where else can we hide the string's attributes?

The NULL terminator would be another good place, since it is specific to character strings and will not pollute the code for other formats of variable byte arrays. But:

This will make it incompatible with standard ASCIIZ: special functions must be added, such as toPOSIX and fromPOSIX.
Fetching the attribute might delay the program since the attribute byte is likely in a different cache line.

Or else, the first byte could be the attribute byte. This will make all pointer indexing wonky with a +1 everywhere and become a source of confusion and bugs.

Overall, this all must be compared to the actual cost of re-checking/re-scanning the whole string, every time the code needs to know in which domain it is...

________

Overall, it seems that "repurposing" the optional NULL terminator is the most convenient method.

fromPOSIX will scan the source string, allocate the correct amount of memory and set the correct attributes.
toPOSIX will simply discard the attributes by clearing the terminator byte back to NULL.
Too bad for the delay in fetch. Let's assume it's still in L1 or L2 at worst.
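
Here is a rough sketch of the two conversions. It is not the real aligned strings code: attr_string is a hypothetical stand-in (explicit length plus a heap buffer), the function names and ATTR_ values are assumptions that follow the 3-bit layout described further down, and the scan is simplified so that it never reports a malformed string:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical attribute bits (see the 3-bit layout later in this log). */
enum { ATTR_TESTED = 1, ATTR_UTF8 = 2, ATTR_MALFORMED = 4 };

/* Hypothetical stand-in for an aligned string: the byte at data[len],
   normally the terminator, holds the attributes instead. */
typedef struct {
  size_t len;
  unsigned char *data;
} attr_string;

/* Scan the source, allocate len+1 bytes, store the attributes.
   Simplification: any byte >= 0x80 is flagged as UTF-8 without
   validating the sequence, so ATTR_MALFORMED is never set here.
   Returns 0 on success, -1 if the allocation fails. */
static int from_posix(attr_string *dst, const char *src) {
  size_t len = strlen(src);
  unsigned char attr = ATTR_TESTED;
  for (size_t i = 0; i < len; i++)
    if ((unsigned char)src[i] >= 0x80)
      attr |= ATTR_UTF8;
  dst->data = malloc(len + 1);
  if (dst->data == NULL)
    return -1;
  memcpy(dst->data, src, len);
  dst->data[len] = attr;      /* attribute byte replaces the terminator */
  dst->len = len;
  return 0;
}

/* Discard the attributes: the last byte becomes a plain terminator again. */
static const char *to_posix(attr_string *s) {
  s->data[s->len] = 0;
  return (const char *)s->data;
}
```

After to_posix the attributes are gone, so getting them back means another full scan.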

The good news is that the NULL terminator is already handled (optionally) by the existing code, it would only need a little update.

During concatenation, the attributes should be easily "upgraded" to the proper values. This can be done by ORing them. Normally it is better to "greedily" upgrade rather than downgrade whenever possible. The following encoding does not work:

00 unknown status (default)
01 tested as ASCII only
10 tested as UTF-8
11 tested as malformed UTF-8

because ORing an ASCII string (01) with a UTF-8 string (10) gives 11, which would wrongly read as malformed. So 3 bits are required:

bit
 0 Tested
 1 UTF8
 2 Malformed

This way, a malformed string will "avalanche" to the whole result during concatenation.
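
A small sketch of the merge during concatenation, using the three bits above. The names ATTR_TESTED, ATTR_UTF8, ATTR_MALFORMED and attr_concat are hypothetical:

```c
#include <stdint.h>

/* One bit per property, numbered as in the table above. */
enum {
  ATTR_TESTED    = 1u << 0,
  ATTR_UTF8      = 1u << 1,
  ATTR_MALFORMED = 1u << 2
};

/* Attributes of the concatenation of two strings: a plain OR.
   Note: a plain OR also sets Tested when only one operand was tested;
   whether that case needs an AND on bit 0 is left open here. */
static inline uint8_t attr_concat(uint8_t a, uint8_t b) {
  return (uint8_t)(a | b);
}
```

With one bit per property, ASCII (Tested) merged with UTF-8 (Tested + UTF8) stays Tested + UTF8, and the Malformed bit only appears in the result if one of the operands already had it.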

While we're at it, we can even use more attribute bits to get "free data". Previously I talked about Bloom filters and it's in this spirit. Most attributes can be created with a simple lookup table, except the "malformed" one. UTF8 refers to specific entries in the table, while traditional ASCII is another zone. Within this zone, more regions can be identified, such as lower case and upper case characters, digits or punctuation.
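
To make the lookup-table idea concrete, here is a sketch under my own assumptions: the class bits (CLS_HIGH, CLS_LOWER, ...) and their positions are hypothetical, and the table only classifies single bytes, so it cannot detect malformed UTF-8:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical class bits, one per region of the byte space. */
enum {
  CLS_HIGH  = 1u << 0,  /* byte >= 0x80: belongs to a UTF-8 sequence */
  CLS_LOWER = 1u << 1,  /* a-z */
  CLS_UPPER = 1u << 2,  /* A-Z */
  CLS_DIGIT = 1u << 3,  /* 0-9 */
  CLS_PUNCT = 1u << 4   /* other printable ASCII, space excluded */
};

static uint8_t cls_table[256];

/* Fill the 256-entry table once, before any scan. */
static void cls_init(void) {
  for (int c = 0; c < 256; c++) {
    uint8_t m = 0;       /* controls and space get no class bit */
    if (c >= 0x80)                  m = CLS_HIGH;
    else if (c >= 'a' && c <= 'z')  m = CLS_LOWER;
    else if (c >= 'A' && c <= 'Z')  m = CLS_UPPER;
    else if (c >= '0' && c <= '9')  m = CLS_DIGIT;
    else if (c > ' ' && c < 0x7F)   m = CLS_PUNCT;
    cls_table[c] = m;
  }
}

/* OR the class of every byte: the result says which regions occur
   somewhere in the string, a bit like a tiny Bloom filter. */
static uint8_t cls_scan(const unsigned char *p, size_t len) {
  uint8_t acc = 0;
  for (size_t i = 0; i < len; i++)
    acc |= cls_table[p[i]];
  return acc;
}
```

A clear bit is a definite answer: if CLS_DIGIT is not set, a search for a digit can be skipped without reading the string.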
