This project documents the design and implementation of a string format that is:
- flexible
- safe
- efficient in code space
- efficient in data space
- efficient in code execution
This is in part due to:
- a compact representation with only one cache line to hit,
- its flexible implementation that instantiates the required features only,
- a definition that is adapted to static and dynamic processing,
- compatibility with POSIX's ASCIIZ format (trailing NUL bytes are preserved),
- a trailing NUL byte that is replaced internally by a canary,
- code and data that can be accessed at any level of abstraction.
ASCII is the basic character encoding but Unicode is supported:
- as UTF-8 byte sequences in types 3/Flex3 lists,
- as attributes in Flex1 and Flex2 byte arrays,
- and as integer code points in type UP.
The attributes make it easier to distinguish the appropriate "domains", helping with ASCII-only processing without excluding Unicode. This makes the format suitable for basic, barebones utilities as well as more elaborate applications.
To define constant strings, the basic structure is defined as:
typedef struct {
  uint8_t len __attribute__ ((aligned (4)));
  char text[];
} aString_8;
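Getting the length of the array is as simple as loading the aligned 32-bit word that holds the length prefix and keeping only its low-order bytes, their number being given by the pointer's two least significant bits (this relies on a little-endian target):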
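static inline int aStr_length(aStr_t p) {
  // aStr_t is assumed to be an integer type wide enough for a pointer (e.g. uintptr_t)
  uint32_t *q = (uint32_t *)(p & ~3);   // aligned word that holds the length
  int LSB = (p & 3);                    // number of length bytes (pointer tag)
  return (*q) & ~((~0u) << (LSB<<3));   // unsigned mask: left-shifting a signed ~0 is undefined
}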
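As an illustration, the sketch below builds two small buffers that mimic the 8-bit and 16-bit layouts and decodes their lengths from the text pointer alone. It assumes a little-endian target and a C11 compiler; the names demo8, demo16 and demo_length are invented for this example, and memcpy stands in for the direct word load to stay clear of aliasing issues.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical buffers for this sketch: the length prefix sits on a
   4-byte boundary and the text starts right after it. */
static _Alignas(4) const unsigned char demo8[] = { 5, 'h', 'e', 'l', 'l', 'o', 0 };
/* 16-bit length 259 = 0x0103, stored little-endian; the text is left zeroed. */
static _Alignas(4) const unsigned char demo16[2 + 259 + 1] = { 0x03, 0x01 };

/* Decode the length from a tagged text pointer (little-endian only). */
static uint32_t demo_length(const void *text) {
    uintptr_t p   = (uintptr_t)text;
    unsigned  tag = (unsigned)(p & 3);   /* number of length bytes */
    uint32_t  word;
    assert(tag != 0);                    /* tag 0 is not handled by this sketch */
    /* Load the aligned 32-bit word that contains the length prefix. */
    memcpy(&word, (const void *)(p & ~(uintptr_t)3), sizeof word);
    /* Keep the low `tag` bytes; tag is at most 3, so the shift stays below 32. */
    return word & ~(~UINT32_C(0) << (tag * 8));
}
```

Here demo_length(demo8 + 1) sees a tag of 1 and keeps one byte, while demo_length(demo16 + 2) sees a tag of 2 and keeps two: the same function serves both widths without a branch.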
A variant (with bit 2 of the pointer set) adds another field that keeps the allocated size, which makes variable-size operations such as resizing possible:
// Same as above but with an extra 32-bit size field:
typedef struct {
  uint32_t allocated __attribute__ ((aligned (8)));
  uint8_t len;
  char text[];
} Flex_aString_8;
Of course, the pointer that the program manages is always the address of .text[]. The actual type is given by the pointer's least significant bits; the same principle works for the longer 16-bit version (types 2 and F2) and the lists (types 3 and F3).
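To make the pointer tagging concrete, here is a small sketch, assuming a C11 compiler. demo_flex8 is a hypothetical fixed-capacity mirror of Flex_aString_8 (so that it can be declared as an ordinary variable), and demo_is_flex, demo_allocated and demo_buf are names invented here. With an 8-byte-aligned header the text lands at offset 5, so the low three bits of the pointer read 101: bit 2 marks the Flex variant and the two lowest bits give the one-byte length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { DEMO_CAP = 27 };              /* hypothetical capacity for this sketch */

/* Hypothetical fixed-capacity mirror of Flex_aString_8. */
typedef struct {
    _Alignas(8) uint32_t allocated;  /* capacity of text[] in bytes */
    uint8_t  len;                    /* bytes currently in use */
    char     text[DEMO_CAP];
} demo_flex8;

/* The text must sit 5 bytes after an 8-byte boundary. */
static_assert(offsetof(demo_flex8, text) == 5, "unexpected padding");

/* Bit 2 of the text pointer marks the Flex variant. */
static int demo_is_flex(const char *text) {
    return ((uintptr_t)text & 4) != 0;
}

/* Read the capacity stored at the enclosing 8-byte boundary. */
static uint32_t demo_allocated(const char *text) {
    uint32_t a;
    assert(demo_is_flex(text));
    memcpy(&a, (const void *)((uintptr_t)text & ~(uintptr_t)7), sizeof a);
    return a;
}

static demo_flex8 demo_buf = { DEMO_CAP, 2, "hi" };
```

Only demo_buf.text needs to be passed around: demo_allocated(demo_buf.text) recovers the capacity by clearing the three tag bits.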
You can also dynamically declare/allocate Flex strings on the stack, inside a function.
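A possible shape for such a stack-allocated Flex string is sketched below. The macro DEMO_FLEX_ON_STACK, the type demo_flex8 and the functions demo_append and demo_greeting_length are assumptions made for this example, not the project's API. The sketch uses a fixed-capacity struct instead of the flexible array member so that it stays within standard C11, and it leaves the canary aside.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { DEMO_CAP = 27 };   /* hypothetical capacity: header + text = 32 bytes */

typedef struct {
    _Alignas(8) uint32_t allocated;  /* capacity of text[] in bytes */
    uint8_t  len;                    /* bytes currently in use */
    char     text[DEMO_CAP];
} demo_flex8;

/* Declare an empty Flex string as a local variable and expose the
   tagged text pointer under the given name. */
#define DEMO_FLEX_ON_STACK(name)                                     \
    demo_flex8 name##_storage = { .allocated = DEMO_CAP, .len = 0 }; \
    char *name = name##_storage.text

/* Append raw bytes if they fit; return 0 on success, -1 otherwise. */
static int demo_append(char *text, const char *src, size_t n) {
    /* Recover the header: it starts offsetof(text) = 5 bytes earlier. */
    demo_flex8 *h = (demo_flex8 *)(void *)(text - offsetof(demo_flex8, text));
    if (n > h->allocated - h->len) return -1;
    memcpy(h->text + h->len, src, n);
    h->len = (uint8_t)(h->len + n);
    return 0;
}

static int demo_greeting_length(void) {
    DEMO_FLEX_ON_STACK(s);            /* lives in this function's frame */
    if (demo_append(s, "hello, ", 7)) return -1;
    if (demo_append(s, "world", 5))   return -1;
    return s_storage.len;             /* 7 + 5 = 12 */
}
```

The storage disappears when the function returns, so the pointer s must not escape demo_greeting_length.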
-o-O-0-O-o-
Logs:
1. Dealing with re-alignment and 2 string types only
2. Context
3. 2023 : a new version
4. Evolution...
5. Merge works
6. Holding back a bit
7. More food for thoughts
8. Article !
9. Another possible extension
10. More than an error
11. Fuzzing and safety
12. Extension and type
13. Aligned Strings with Attributes
14. Attributes moved to Flex
15. It's not a liability, it's a feature!
16. The canary is singing.
17. The enhanced "aligned strings" format is named aStrA
.
.
.
Safety-wise, from the coding perspective, there should not be a type confusion between the fixed and the Flex versions... so it's up to the compiler to strictly enforce the type hierarchy.
Declaring the fixed version as "const" also helps.
Flex is a superset of fixed, and some operations (such as resizing) can only be performed on Flex. It is therefore possible to cast Flex to fixed (in certain cases), but the reverse is forbidden: fixed should not be cast to Flex.
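One way to let the compiler enforce this rule is to wrap the two kinds of pointer in distinct struct types. The sketch below does so under assumed names (demo_fixed, demo_flex, demo_storage, demo_flex_as_fixed, demo_length and demo_resize are all hypothetical). Because the wrappers are unrelated types, passing a fixed string to a Flex-only operation is rejected at compile time, while the Flex-to-fixed direction goes through one explicit helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Distinct wrapper types: the compiler will not convert between them. */
typedef struct { const char *text; } demo_fixed;  /* read-only view */
typedef struct { char *text; }       demo_flex;   /* resizable string */

/* Hypothetical backing storage with the Flex layout (text at offset 5). */
typedef struct {
    _Alignas(8) uint32_t allocated;
    uint8_t  len;
    char     text[11];
} demo_storage;

/* Allowed direction: view a Flex string as a fixed, read-only one. */
static demo_fixed demo_flex_as_fixed(demo_flex s) {
    demo_fixed f = { s.text };
    return f;
}

/* Works on any 8-bit string: the length byte sits just before the text. */
static uint8_t demo_length(demo_fixed s) {
    return ((const unsigned char *)s.text)[-1];
}

/* Flex-only operation: needs the allocated field, so it takes demo_flex. */
static int demo_resize(demo_flex s, uint8_t new_len) {
    demo_storage *h = (demo_storage *)(void *)(s.text - offsetof(demo_storage, text));
    if (new_len > h->allocated) return -1;   /* would overflow the buffer */
    h->len = new_len;
    return 0;
}

static demo_storage demo_buf = { 11, 3, "abc" };
```

With demo_flex s = { demo_buf.text }, the call demo_resize(s, 2) compiles, and so does demo_length(demo_flex_as_fixed(s)), whereas demo_resize(demo_flex_as_fixed(s), 2) is a type error. There is deliberately no helper in the other direction.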