
Extension and type

A project log for aStrA : Aligned Strings format with attributes

Developing a more flexible pointer and structure format that solves POSIX and C's historical problems.

Yann Guidon / YGDES, 08/29/2024 at 17:56

The 4/8 types are great.

They are the basis for designing a language and an operating system. These applications need more than what the Aligned Strings format provides, though. The format is one thing but a type is required.

Usually, types are either inferred or implicit, defined by an API/ABI. But for my projects I have identified 3 specific uses of the aligned strings:

I don't know how to specify the string type, yet. The format already exhausts the coding space. Furthermore, the typing thing is mostly critical for the character strings because there are compatibility issues. And in practice, the aligned format is usually well segregated, there should be little confusion, except for characters: a character string could be either

and preserving this requires 2 bits...

There is not enough room in the type 0/1/2 aligned strings but the type 3 has some headroom.

From aligned_strings.h:

typedef struct {
  // only 24 bits available with the "fast" code
  uint32_t total_length __attribute__ ((aligned (4)));
  uint16_t nb_elements,
           flags;  // unused yet.
  // Up to here: 8 bytes.
  aStr_t elem[];
} aString_list;

So two flag bits can be reserved in flags, but this increases the burden and overhead for type-agnostic code, because this would only accept Type 3/list.
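To make the idea concrete, here is a minimal sketch of two reserved bits in flags. The mask name, the accessor names and the aStr_t stand-in are assumptions for illustration; they are not taken from aligned_strings.h.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the real aStr_t of aligned_strings.h (assumption). */
typedef uintptr_t aStr_t;

typedef struct {
  uint32_t total_length __attribute__ ((aligned (4)));
  uint16_t nb_elements,
           flags;
  aStr_t elem[];
} aString_list;

/* Hypothetical: the two low bits of `flags` hold the character domain. */
#define ASTR_LIST_CHARSET_MASK 0x0003u

static inline unsigned list_get_charset(const aString_list *l) {
  return l->flags & ASTR_LIST_CHARSET_MASK;
}

static inline void list_set_charset(aString_list *l, unsigned cs) {
  assert(cs <= ASTR_LIST_CHARSET_MASK);
  /* Clear the two reserved bits, keep the 14 others, then set the new value. */
  l->flags = (uint16_t)((l->flags & ~ASTR_LIST_CHARSET_MASK) | cs);
}
```

The accessors only make sense for a type 3 list, which is exactly the burden described above: type-agnostic code would first have to check the type before it can read the attribute.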

The list type is more flexible (and opens the possibility of parallel processing) but slower, which would burden Unicode-agnostic code. Forcing Unicode to type 3 only does not make sense since sub-strings need to be other types, including 8- and 16-bit sizes. After all it's possible to merge a type 3 into type 1 or type 2.

 
________
 

Another possibility is to steal a couple of bits from the size field: This allows types 1/2/3 to contain the Unicode flags in the LSB, keeping the code common for them.
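A rough sketch of that packing, assuming a 32-bit size word for simplicity (the real size fields of types 1/2/3 are narrower, and all names here are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: [ length : upper 30 bits | flags : 2 LSB ] */
#define ASTR_FLAG_BITS 2u
#define ASTR_FLAG_MASK ((1u << ASTR_FLAG_BITS) - 1u)

/* Build the size word from a length and the two flag bits. */
static inline uint32_t size_pack(uint32_t length, unsigned flags) {
  assert(length <= (UINT32_MAX >> ASTR_FLAG_BITS)); /* 2 bits of range are lost */
  assert(flags <= ASTR_FLAG_MASK);
  return (length << ASTR_FLAG_BITS) | flags;
}

/* Recover the length: one extra shift on every access. */
static inline uint32_t size_length(uint32_t field) {
  return field >> ASTR_FLAG_BITS;
}

/* Recover the flags: one mask. */
static inline unsigned size_flags(uint32_t field) {
  return field & ASTR_FLAG_MASK;
}
```

Whatever the field width, two stolen bits divide the maximum length by 4 and add a shift to every length read.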

Consequences:

But overall it's a step backwards since the type flags use space that could have been used to encode the format type in the first place.

 
________
 

Where else can we hide the string's attributes?

The NULL terminator would be another good place, since it is specific to character strings and will not pollute the code for other formats of variable byte arrays. But

Or else, the first byte could be the attribute byte. This will make all pointer indexing wonky with a +1 everywhere and become a source of confusion and bugs.

Overall, this all must be compared to the actual cost of re-checking/re-scanning the whole string, every time the code needs to know in which domain it is...

 
________
 

Overall, it seems that "repurposing" the optional NULL terminator is the most convenient method.

The good news is that the NULL terminator is already handled (optionally) by the existing code, it would only need a little update.

During concatenation, the attributes should be easily "upgraded" to the proper values. This can be done by ORing them. Normally it is better to "greedily" upgrade rather than downgrade whenever possible. The following encoding does not work:

00 unknown status (default)
01 tested as ASCII only
10 tested as UTF-8
11 tested as malformed UTF-8

This fails because ORing an ASCII string (01) with a UTF-8 string (10) yields 11: the "malformed" attribute would be created by any mix of ASCII and UTF-8. So 3 bits are required:

bit
 0 Tested
 1 UTF8
 2 Malformed

This way, a malformed string will "avalanche" to the whole result during concatenation.
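A minimal sketch of the merge, using the bit positions listed above. The constant and function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Rejected 2-bit encoding: ORing ASCII and UTF-8 collides with "malformed". */
enum { OLD_ASCII = 1, OLD_UTF8 = 2, OLD_MALFORMED = 3 };
static_assert((OLD_ASCII | OLD_UTF8) == OLD_MALFORMED,
              "2-bit encoding: ASCII | UTF-8 reads as malformed");

/* 3-bit encoding: one independent bit per property. */
enum {
  ATTR_TESTED    = 1u << 0,
  ATTR_UTF8      = 1u << 1,
  ATTR_MALFORMED = 1u << 2
};

/* Attributes of a concatenation: a plain OR of both operands. */
static inline uint8_t attr_concat(uint8_t a, uint8_t b) {
  return (uint8_t)(a | b);
}
```

With this, ASCII (001) merged with UTF-8 (011) gives 011, and any operand carrying bit 2 marks the whole result as malformed. Note that a plain OR also propagates the Tested bit when only one operand was tested; how to treat that case is not settled here.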

While we're at it, we can even use more attribute bits to get "free data". Previously I talked about Bloom filters and it's in this spirit. Most attributes can be created with a simple lookup table, except the "malformed" one. UTF8 refers to specific entries in the table, while traditional ASCII is another zone. Within this zone, more regions can be identified such as lower case and upper case characters, digits or punctuation.
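Here is how such a table-driven scan could look. It is only a sketch: bits 3 to 6 and all the names are assumptions, and the "malformed" bit is left out on purpose since it needs a real UTF-8 validator rather than a per-byte lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
  ATTR_TESTED = 1u << 0, ATTR_UTF8  = 1u << 1, ATTR_MALFORMED = 1u << 2,
  /* Extra "free data" bits, hypothetical assignment: */
  ATTR_LOWER  = 1u << 3, ATTR_UPPER = 1u << 4,
  ATTR_DIGIT  = 1u << 5, ATTR_PUNCT = 1u << 6
};

static uint8_t attr_table[256];

/* Fill the table once: one attribute mask per byte value. */
static void attr_table_init(void) {
  for (int c = 0; c < 256; c++) {
    uint8_t a = 0;
    if (c >= 0x80)                  a = ATTR_UTF8;   /* non-ASCII zone */
    else if (c >= 'a' && c <= 'z')  a = ATTR_LOWER;
    else if (c >= 'A' && c <= 'Z')  a = ATTR_UPPER;
    else if (c >= '0' && c <= '9')  a = ATTR_DIGIT;
    else if (c > 0x20 && c < 0x7F)  a = ATTR_PUNCT;  /* other printable ASCII */
    attr_table[c] = a;
  }
}

/* OR the table entries of every byte, like a tiny Bloom filter.
   ATTR_MALFORMED is never set here: it needs a separate validation pass. */
static uint8_t attr_scan(const uint8_t *s, size_t len) {
  assert(s != NULL || len == 0);
  uint8_t a = ATTR_TESTED;
  for (size_t i = 0; i < len; i++)
    a |= attr_table[s[i]];
  return a;
}
```

After attr_table_init(), scanning "abc1" would set the Tested, lower case and digit bits. Like a Bloom filter, a cleared bit guarantees the absence of that class of character, while a set bit only says that at least one is present somewhere.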
