-
The enhanced "aligned strings" format is named aStrA
09/13/2024 at 21:29 • 0 comments
aStrA means "aligned strings with attributes" but it's more than that. There are actually 3 important enhancements:
- The attributes, as previously mentioned, help process UTF-8 and separate it from the ASCII-only domain.
- The canary increases reliability and helps spot potential addressing errors (accidental or not).
- The dedicated stack helps manage variable-size strings without messing with the existing stack or the compiler.
All three are optional and might not be required together: you can still use the basic aligned strings format alone, but sooner or later at least one of these three "options" becomes necessary. So aStrA is a superset that will help develop a replacement for stdio.h and string(s).h. Of course it is still compatible (to a certain degree) with the traditional functions, but the aim is to provide the convenience of high-level languages while erasing the "original sins" of C. This will make the design of complex applications much easier!
-
The canary is singing.
09/09/2024 at 01:27 • 0 comments
This time, I have created a full program that serves both as a regression test and as a set of examples for potential users who wonder "how do I do this?" (not just a pair of small snippets). I want to test all the corner cases. Already, several preprocessor issues have been raised and I fortunately found solutions, which makes the code even better.
The archive contains a way to write to stdout without printf:
write(1, (void*)s3, aStrA_length_fast(s3));
It will be wrapped in more code for concision, but you can see the central idea of the format: it might look a bit more cumbersome than traditional C but it leaves no room for interpretation.
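For instance, a tiny wrapper could hide the boilerplate. This is just a sketch, assuming the aStr_t type and aStrA_length_fast() provided by aligned_strings.h:

#include <unistd.h>             // write()
#include "aligned_strings.h"    // aStr_t, aStrA_length_fast()

// Hypothetical helper: dump an aligned string to stdout in one syscall,
// no NUL scan, no format string.
static inline void aStrA_puts(aStr_t s)
{
    write(1, (const void *)s, aStrA_length_fast(s));
}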
Now all the type 1 and type 2 strings have a canary that takes the place of the NULL terminator, and its value is defined at compile time. It was a pain to deal with because the value must be given in octal and then transformed into a valid string and a C literal. Fun times, but it works with almost no drawback (just be aware that string constants will still have a NULL terminator because I can't find a way to prevent the compiler from adding it).
Now the canary sings; it's only a matter of discipline to check it regularly, before bugs snowball. It may sound easier to simply forget about it, for the sake of speed and compactness, and trust all the data, but the canary checks will be performed by the library anyway.
-
It's not a liability, it's a feature!
09/07/2024 at 23:26 • 0 comments
(as noted at the end of the last log...)
Since the attributes can be moved to the Flex types, the terminal byte does not need to be kept. But I keep it for a different reason: now it's a canary.
The fixed/constant strings can (usually) be trusted, and the compiler adds a NULL terminator byte that can't easily be changed. So the default canary is 0.
However a good canary can be changed at will: updated by a counter, the PID or a PRNG. Flex types must be postfixed by something other than 0. A global variable contains the current canary, which is set and tested at each update of a Flex type:
uint8_t aStrA_canary_value = 42;
It's "good for security" and of course, for now, for safety as it could help spot bugs faster and prevent the propagation of altered data. Can we say now that aStrA is more secure than the rest ? :-)
Transmitting strings from one software module to another raises the problem that the canary's value differs between the modules (unless agreed otherwise). The "portable" way is to clear the canary before transmission, then update it during/after validation upon reception, when the attributes are (re)computed, because, you know, you never know.
I think it's a better use of the trailing byte than what I intended a few days before. The attributes are easier to address and the canary is where it is meant to be, less used than the attributes.
Consequence: the library must provide at least the following functions
int aStrA_set_canary(p, c);
int aStrA_check_canary(p);
int aStrA_clear_canary(p);
such that a string can be moved to/from POSIX domains.
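As a usage sketch (the transport and validation helpers below are hypothetical; only the aStrA_* calls and aStrA_canary_value come from this log):

// Sending side: neutralise the module-local canary, since the other module
// does not know our aStrA_canary_value.
void export_string(aStr_t s)
{
    aStrA_clear_canary(s);
    send_string(s);                  // hypothetical transport
}

// Receiving side: re-validate the payload (recompute the attributes), then
// stamp the local canary so later checks can catch corruption.
void import_string(aStr_t s)
{
    revalidate(s);                   // hypothetical attribute (re)scan
    aStrA_set_canary(s, aStrA_canary_value);
}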
...
aligned_strings.20240908.h shows how far it's going. I managed to add the canary to string constants though (damned CPP macros and their weird limitations!), at the price of a compile-time value and an extra character (the NULL just takes up useless room).
But you can compile with a different canary with
-DCANARY_BYTE_OCTAL=123
That's a decent compromise.
....
Finally, I reduced the "helper" functions to only two, more general, ones:

int aStrA_set_canary(aStr_t p, uint8_t c) {
  ASTRA_CHECK_STRING(p, (MASK_LSB_UP|MASK_LSB_LIST) )
  // update the attribute
  *(uint8_t *)(p+len_p) = c;
  return 0;
}

int aStrA_get_canary(aStr_t p) {
  ASTRA_CHECK_STRING(p, (MASK_LSB_UP|MASK_LSB_LIST) )
  // read the attribute
  return *(uint8_t *)(p+len_p);
}
That should work.
They share the same code, defined in the macro ASTRA_CHECK_STRING, which can be reused in other places and optimised in one location for all users. Yay.
-
Attributes moved to Flex
09/07/2024 at 01:42 • 0 comments
I'm reviewing, updating and rewriting the core files, and something has appeared. The UTF-8 attributes can be restricted to the Flex Types 1 and 2 because their "allocated" field is also limited to (approx.) 8 and 16 bits. This leaves 15 MSB for other uses and the attributes fit there perfectly. They do not need to be in the trailing/terminator byte, whose address is quite cumbersome to compute.
The trick is that "fixed" strings are mostly constants and their contents are already known (more or less). We trust the compilers enough to create the appropriate UTF-8 sequences, right? The purpose of the attributes is to help filter external data, which are not yet known and are likely to contain errors. Usually the size is not known in advance so they are received in a Flex buffer, hence there is no need to put the attributes in the fixed formats.
During reception, the incoming data can be processed in 2 ways:
- Either the buffer is pre-allocated and the data are correctly aligned (i.e. read() with the address of the aligned buffer and the maximum buffer size),
- or: the data are available somewhere and need to be aligned to fit the format.
In the first case, the attributes are reconstructed by a simple scan; in the second case, the scan can be merged with the re-aligning string copy.
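Here is a minimal sketch of the second case, merging the copy and the scan; the 256-entry lookup table is passed in (something like the character_type_LUT shown in the log below) and the returned byte goes into the string's attributes:

#include <stdint.h>
#include <stddef.h>

// Copy the unaligned source into the (already allocated, aligned) payload
// area while ORing the per-byte classification into one attribute byte.
uint8_t copy_and_scan(uint8_t *dst, const uint8_t *src, size_t len,
                      const uint8_t lut[256])
{
    uint8_t attr = 0;
    for (size_t i = 0; i < len; i++) {
        attr |= lut[src[i]];    // collect the attribute bits on the fly
        dst[i] = src[i];        // re-align the payload
    }
    return attr;
}

Validating the UTF-8 sequences themselves needs the FSM described in the log below.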
...
This changes quite a few things, as I have already started removing the optionality of the NULL terminator. Again, it becomes useless. And it is good news for the computation of the attribute's address, because that code was getting awkward, almost defeating the point of the Aligned Strings in the first place.
-
Aligned Strings with Attributes
09/06/2024 at 00:37 • 0 comments
Should we then call these ASA? It's a simple layer above the AS system.
The previous log, 12. Extension and type, throws a few ideas at the metaphorical wall and some have stuck.
Repurposing the trailing NULL seems to be the path of least algorithmic resistance and of maximal flexibility. Now it's a matter of actually defining these bits and managing them.
When a "POSIX" string is imported from outside a program's domain (user input, a file, a network socket...), it's just a Variable-Length Byte Array of type 1 (255 bytes max) or 2 (65535). There is 1 or 2 bytes for size prefix and a NULL terminator byte (suffix is not counted in the size but always present. The contents could be... anything, including another NULL byte. So it's considered unsafe.
The import process must determine whether the string contains valid ASCII characters and well-formed UTF-8 sequences. A lookup-table-driven FSM looks like a simple and practical method to filter invalid bytes or sequences and collect the attributes.
We have already defined some attribute bits. The "tested" bit would be allocated to the MSB to ease testing as a negative number. A positive value means the string has not been tested, so it is not validated, including the default NULL value.
bit 7 : tested
bit 6 : UTF-8
bit 5 : malformed
bit 4 :
bit 3 :
bit 2 :
bit 1 :
bit 0 :
The "malformed" attribute is set when encountering a NULL byte, invalid UTF-8 bytes (F5–FF) or an invalid UTF-8 sequence.
So the UTF-8 attribute covers the prefix as well as the continuation ranges, but these individual ranges must also be distinguished to check for a valid sequence. Hence the lookup table is 16 bits wide: one byte is directly ORed into the attribute byte while the other helps update the FSM.
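A rough sketch of the idea (not the actual table or the actual bit assignments, which are still in flux): the low byte of each 16-bit entry is ORed into the attribute byte, while the high byte tells the FSM how many continuation bytes must follow. Structural checks only; strict overlong/surrogate rejection would need a finer-grained table.

#include <stdint.h>
#include <stddef.h>

#define ATTR_TESTED    0x80   // hypothetical names, mirroring the MASK_ASTRA_* idea
#define ATTR_UTF8      0x40
#define ATTR_MALFORMED 0x20

// Stand-in for the 16-bit lookup table: high byte = FSM hint (number of
// continuation bytes expected, 0xFF = "is a continuation byte"),
// low byte = attribute bits to OR in.
static uint16_t classify(uint8_t c)
{
    if (c == 0)    return ATTR_MALFORMED;             // stray NULL byte
    if (c < 0x80)  return 0;                          // plain ASCII
    if (c < 0xC0)  return (0xFFu << 8) | ATTR_UTF8;   // continuation byte
    if (c < 0xC2)  return ATTR_MALFORMED;             // C0/C1: invalid UTF-8 leads
    if (c < 0xE0)  return (1u << 8) | ATTR_UTF8;      // 2-byte sequence lead
    if (c < 0xF0)  return (2u << 8) | ATTR_UTF8;      // 3-byte sequence lead
    if (c < 0xF5)  return (3u << 8) | ATTR_UTF8;      // 4-byte sequence lead
    return ATTR_MALFORMED;                            // F5..FF: never valid
}

uint8_t scan_attributes(const uint8_t *buf, size_t len)
{
    uint8_t attr = ATTR_TESTED;
    unsigned expected = 0;                   // continuation bytes still due
    for (size_t i = 0; i < len; i++) {
        uint16_t e = classify(buf[i]);
        attr |= (uint8_t)e;                  // low byte: attribute bits
        uint8_t hint = (uint8_t)(e >> 8);    // high byte: FSM update
        if (hint == 0xFF) {                  // continuation byte
            if (expected == 0) attr |= ATTR_MALFORMED;   // orphan continuation
            else expected--;
        } else {
            if (expected != 0) attr |= ATTR_MALFORMED;   // sequence cut short
            expected = hint;
        }
    }
    if (expected != 0) attr |= ATTR_MALFORMED;           // truncated at the end
    return attr;
}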
This still leaves 5 attribute bits to use in the terminal byte. Here are the candidates:
- control and separators (not NULL, 1..32, 127, C0 and C1)
- digits (0 through 9)
- symbols/punctuations
- lowercase
- uppercase
I'm not sure whether separating uppercase and lowercase letters is pertinent, but it's possible. This could help some parsers. In this vein, maybe controls can be separated from separators (space, tabs). We can change the table at will later, or even provide a custom one.
The import function only checks the validity of the sequences, but there could also be an option to translate to UTF-16 or UTF-32 in a separate buffer on the fly, to save time.
For now, the code only contains
int utf8_encode_point(char *dst, uint32_t point)
but here we are doing the reverse: taking a bounded number of bytes and building the code point (and also checking that it is valid). Maybe the function name above should be changed to be more explicit.
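A possible shape for that reverse function, as a sketch (the name and error convention are mine, not the actual code): it consumes at most max bytes, rejects malformed or truncated sequences, overlongs, surrogates and values above U+10FFFF, and returns the number of bytes consumed or -1.

#include <stdint.h>

int utf8_decode_point(const char *src, int max, uint32_t *point)
{
    const uint8_t *s = (const uint8_t *)src;
    if (max < 1) return -1;
    uint8_t b0 = s[0];
    int len;
    uint32_t cp;
    if (b0 < 0x80) { *point = b0; return 1; }                       // ASCII fast path
    else if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; cp = b0 & 0x1F; }
    else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; }
    else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; }
    else return -1;                      // 80..C1 and F5..FF cannot start a sequence
    if (max < len) return -1;            // truncated sequence
    for (int i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return -1;        // not a continuation byte
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if ((len == 3 && cp < 0x800) || (len == 4 && cp < 0x10000)   // overlong
     || (cp >= 0xD800 && cp <= 0xDFFF)                           // UTF-16 surrogate
     || (cp > 0x10FFFF)) return -1;                              // out of range
    *point = cp;
    return len;
}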
...
The current definitions for the terminator byte are:
#define MASK_ASTRA_TESTED      (128)
#define MASK_ASTRA_UTF8        ( 64)
#define MASK_ASTRA_ERROR       ( 32)
#define MASK_ASTRA_CONTROL     ( 16)
#define MASK_ASTRA_SEPARATOR   (  8)
#define MASK_ASTRA_PUNCTUATION (  4)
#define MASK_ASTRA_DIGIT       (  2)
#define MASK_ASTRA_CHARACTER   (  1)
but more bits are needed for the lookup table, in particular for parsing the UTF-8 strings.
const uint8_t character_type_LUT[256] = {
  [ 0             ] = MASK_ASTRA_ERROR,
  [ 1   ... 31    ] = MASK_ASTRA_CONTROL,
  [ 9             ] = MASK_ASTRA_SEPARATOR,
  [ 32            ] = MASK_ASTRA_SEPARATOR,
  [ 33  ... 47    ] = MASK_ASTRA_PUNCTUATION,
  [ 48  ... 57    ] = MASK_ASTRA_DIGIT,
  [ 58  ... 64    ] = MASK_ASTRA_PUNCTUATION,
  [ 65  ... 90    ] = MASK_ASTRA_CHARACTER,
  [ 91  ... 96    ] = MASK_ASTRA_PUNCTUATION,
  [ 97  ... 122   ] = MASK_ASTRA_CHARACTER,
  [ 123 ... 126   ] = MASK_ASTRA_PUNCTUATION,
  [ 127           ] = MASK_ASTRA_CONTROL,
  [ 0x80 ... 0xBF ] = MASK_ASTRA_UTF8_CONTINUATION,
  [ 0xC0 ... 0xC1 ] = MASK_ASTRA_CONTROL,
  [ 0xC2 ... 0xDF ] = MASK_ASTRA_UTF8_PREFIX1,
  [ 0xE0 ... 0xEF ] = MASK_ASTRA_UTF8_PREFIX2,
  [ 0xF0 ... 0xF4 ] = MASK_ASTRA_UTF8_PREFIX3,
  [ 0xF5 ... 0xFF ] = MASK_ASTRA_ERROR
};
So this added MASK_ASTRA_UTF8_PREFIX1, MASK_ASTRA_UTF8_PREFIX2, MASK_ASTRA_UTF8_PREFIX3 and MASK_ASTRA_UTF8_CONTINUATION.
...
Looking back at AS20230117.tgz I already found one glaring bug. See you in the next log.
-
Extension and type
08/29/2024 at 17:56 • 0 comments
The 4/8 types are great.
They are the basis for designing a language and an operating system. These applications need more than what the Aligned Strings format provides, though. The format is one thing but a type is required.
Usually, types are either inferred or implicit, defined by an API/ABI. But for my projects I have identified four specific uses of the aligned strings:
- untyped, variable length array of bytes (so that counts as "no type", which is still a type, anyway)
- type declaration / prototype (yes, it's going to be a thing)
- ASCII (7-bit/safe) string (the basic type)
- UTF8 string (can only be typed as such if certain conditions are checked)
I don't know how to specify the string type yet. The format already exhausts the coding space. Furthermore, typing mostly matters for the character strings because there are compatibility issues. And in practice, the aligned format is usually well segregated, so there should be little confusion, except for characters: a character string could be either
- unknown
- known 7-bit
- known UTF8
and preserving this requires 2 bits...
There is not enough room in the type 0/1/2 aligned strings but the type 3 has some headroom.
From aligned_strings.h:
typedef struct {  // only 24 bits available with the "fast" code
  uint32_t total_length __attribute__ ((aligned (4)));
  uint16_t nb_elements, flags;  // unused yet.
  // Up to here: 8 bytes.
  aStr_t elem[];
} aString_list;
So two flag bits can be reserved in flags, but this increases the burden and overhead for type-agnostic code, because this would only accept Type 3/list.
The list type is more flexible (and opens the possibility of parallel processing) but slower, which would burden Unicode-agnostic code. Forcing Unicode to type 3 only does not make sense since sub-strings need to be other types, including 8- and 16-bit sizes. After all it's possible to merge a type 3 into type 1 or type 2.
________
Another possibility is to steal a couple of bits from the size field: This allows types 1/2/3 to contain the Unicode flags in the LSB, keeping the code common for them.
Consequences:
- String sizes are reduced: 63 or 16383 bytes for types 1 and 2. It's not a disastrous reduction, considering that type 2 is preferred.
- Character strings will need slightly different code than other byte strings. Or else, the other variable byte arrays will use the 2 flag bits for other purposes.
But overall it's a step backwards, since the type flags consume space that the base format could have used in the first place.
________
Where else can we hide the string's attributes?
The NULL terminator would be another good place, since it is specific to character strings and will not pollute the code for other formats of variable byte arrays. But
- This will make it incompatible with standard ASCIIZ: special functions such as toPOSIX and fromPOSIX must be added.
- Fetching the attribute might delay the program since the attribute byte is likely in a different cache line.
Or else, the first byte could be the attribute byte. This will make all pointer indexing wonky with a +1 everywhere and become a source of confusion and bugs.
Overall, this all must be compared to the actual cost of re-checking/re-scanning the whole string, every time the code needs to know in which domain it is...
________
Overall, it seems that "repurposing" the optional NULL terminator is the most convenient method.
- FromPOSIX will scan the source string, allocate the correct amount of memory, set the correct attributes
- ToPOSIX will simply discard the attributes by clearing the NULL,
- Too bad for the delay in fetch. Let's assume it's still in L1 or L2 at worst.
The good news is that the NULL terminator is already handled (optionally) by the existing code, it would only need a little update.
During concatenation, the attributes should be easily "upgraded" to the proper values. This can be done by ORing them. Normally it is better to "greedily" upgrade rather than downgrade whenever possible. The following encoding does not work:
00 : unknown status (default)
01 : tested as ASCII only
10 : tested as UTF-8
11 : tested as malformed UTF-8
because the "malformed" code would be created by merging ASCII & UTF-8: ORing 01 with 10 gives 11, falsely flagging the result as malformed. So 3 bits are required:
bit 0 : tested
bit 1 : UTF-8
bit 2 : malformed
This way, a malformed string will "avalanche" into the whole result during concatenation.
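A minimal sketch of that merge, with hypothetical mask names for the 3-bit layout above:

#include <stdint.h>

#define ATTR_TESTED    (1u << 0)
#define ATTR_UTF8      (1u << 1)
#define ATTR_MALFORMED (1u << 2)

// Attribute byte of a concatenation = OR of the operands' attribute bytes,
// so the malformed bit of either operand propagates to the result.
static inline uint8_t merge_attributes(uint8_t a, uint8_t b)
{
    return a | b;
}
// e.g. merge_attributes(ATTR_TESTED | ATTR_UTF8,
//                       ATTR_TESTED | ATTR_MALFORMED)
//      == (ATTR_TESTED | ATTR_UTF8 | ATTR_MALFORMED)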
While we're at it, we can even use more attribute bits to get "free data". I previously talked about Bloom filters, and this is in the same spirit. Most attributes can be derived from a simple lookup table, except the "malformed" one. UTF-8 maps to specific entries in the table, while traditional ASCII is another zone. Within this zone, more regions can be identified, such as lowercase and uppercase characters, digits or punctuation.
-
Fuzzing and safety
03/22/2024 at 01:14 • 0 comments
The aligned format is great, as long as it is used inside a safe and controlled context.
It can get ugly though when an "aligned string" pointer is transmitted through an untrusted piece of code. This unsafe code could be prevented from dereferencing the string's value but this is not enough. If the pointer itself is modified, all kinds of problems arise.
Receiving a pointer to an aligned string from a dubious source can be less dangerous if the type is restricted. The type 2 (16-bit length) is the safest and it's easy to filter. The Type 3 creates indirect (de)references and the flexible types should be cast back to constant strings (it might not be possible to modify or reallocate the target anyway).
Use Type 2.
-
More than an error
03/11/2024 at 03:04 • 1 comment
Type 0 is NULL/Error and detected from the three LSB.
But the remaining MSB are not checked...
So if the MSB are not cleared, this could lead to yet another type, 8-byte aligned. What could it be?
-
Another possible extension
09/25/2023 at 01:12 • 0 comments
The recent article Faster String Processing With Bloom Filters got my attention. It refers to https://codeconfessions.substack.com/p/cpython-bloom-filter-usage. And I thought: What if...
It is possible to create a larger structure that contains an even larger field, to hold either a Bloom sieve or a hash/signature, for 2 use cases: sub-string search or string matching. For example, if a string is known to be tested for equality at some later point, a programming language / compiler would "tag" the string so the hash is computed once, when the value is assigned. Then the actual test becomes a simple scalar comparison, followed by a real comparison in case of a hit, which might save some CPU cycles when misses are likely.
Similarly, before performing a sub-string search, the strings (possibly many, again) can be pre-processed so only the 64-bit Bloom filter is used at first, to sieve candidates faster than the naïve method.
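A rough sketch of both ideas, kept in a separate structure as discussed just below (all names here are hypothetical, and the hash itself would come from whatever function the application prefers):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    uint64_t hash;          // scalar signature for equality tests
    uint64_t bloom;         // 64-bit Bloom sieve of the byte values
    const uint8_t *bytes;
    size_t len;
} tagged_string;

// One bit per byte class: enough to rule out many candidates cheaply.
static uint64_t bloom_of(const uint8_t *s, size_t len)
{
    uint64_t b = 0;
    for (size_t i = 0; i < len; i++)
        b |= 1ull << (s[i] & 63);
    return b;
}

// Equality: compare the scalar signatures first, confirm only on a hit.
static int tagged_equal(const tagged_string *a, const tagged_string *b)
{
    if (a->len != b->len || a->hash != b->hash) return 0;   // cheap miss
    return memcmp(a->bytes, b->bytes, a->len) == 0;
}

// Sub-string sieve: if the needle uses a byte class that is absent from the
// haystack, the expensive search can be skipped entirely.
static int may_contain(const tagged_string *hay, const tagged_string *needle)
{
    return (needle->bloom & ~hay->bloom) == 0;
}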
However, with already 3 LSB used, such an extension is impractical. The memory losses from alignment might become ridiculous. The hash or sieve values would be handled by the programmer, case-by-case at first, in a separate structure.
And speaking of hashes, I might have another trick up my sleeve: look up the #PEAC Pisano with End-Around Carry algorithm!
-
Article!
07/01/2023 at 17:25 • 0 comments
I have published a thorough article on this subject in GNU/Linux Magazine France, March 2023: "Sus à l’ASCIIZ, vive les chaînes alignées !" (roughly: "Down with ASCIIZ, long live aligned strings!"), with a summary of all the design decisions and more.