aStrA : Aligned Strings format with attributes

Description

This format is compatible with ASCIIZ formatted strings and has a size prefix (like "Pascal strings") which ranges from 0 to 255 or 64KiB (encoded in 1 or 2 bytes).
The type is not encoded in the prefix itself but in the 2 LSB of the pointer to the string. And this encoding is still congruent with the real start of the byte string !
Additionally, type 3 encodes a "list" structure that contains pointers to sub-strings, to ease with concatenation and corner cases.
The last format is type 0 which encodes a single Unicode point.

An extended version adds "Flexible" strings (through the 3rd LSB of the pointer) that can be dynamically resized within the range that was previously allocated. This also supports a canary byte and attributes.

This solves quite a few problems inherent with the existing Pascal and ASCIIZ formats !

Details

This project documents the design and implementation of a string format that is

flexible
safe
efficient in code space
efficient in data space
efficient in code execution

This is in part due to

a compact representation with only one cache line to hit,
its flexible implementation that instantiates the required features only,
a definition that is adapted to static and dynamic processing,
compatibility with POSIX's ASCIIZ format (trailing NULL bytes are preserved),
trailing NULL byte is replaced internally by a canary,
code and data that can be accessed at any level of abstraction.

ASCII is the basic character encoding but Unicode is supported:

as UTF-8 byte sequences in types 3/Flex3 lists,
as attributes in Flex1 and Flex2 byte arrays,
and as integer code points in type UP.

The attributes make it easier to distinguish the appropriate "domains", helping with ASCII-only processing without excluding UNICODE. This makes it suitable from basic, barebones utilities as well as more elaborate applications.

To define constant strings, the basic structure is defined as:

typedef struct {
  uint8_t len  __attribute__ ((aligned (4)));
  char text[];
} aString_8;

Getting the length of the array is as simple as getting the first word and masking off a number of bytes given by the pointer's LSB:

static inline int aStr_length(aStr_t p) {
  uint32_t *q = (uint32_t *)(p & ~3);
  int LSB =                 (p &  3);
  return (*q) & ~((~0) << (LSB<<3));
}

A variant (with bit 2 of the pointer set) adds another field to keep the allocated size, helping to perform variable-sized operations:

// Same as above but with an extra 32-bit size field:
typedef struct {
  uint32_t allocated __attribute__ ((aligned (8)));
  uint8_t len;
  char text[];
} Flex_aString_8;

Of course, the pointer that the program manages is always the address of .text[]. The actual type is given by the pointer's LSB: the same principle works for the longer 16-bit version (types 2 and F2) and the lists (types 3 and F3 ).

And you can dynamically declare/allocate Flex strings on the stack, inside a function.

-o-O-0-O-o-

Logs:
1. Dealing with re-alignment and 2 string types only
2. Context
3. 2023 : a new version
4. Evolution...
5. Merge works
6. Holding back a bit
7. More food for thoughts
8. Article !
9. Another possible extension
10. More than an error
11. Fuzzing and safety
12. Extension and type
13. Aligned Strings with Attributes
14. Attributes moved to Flex
15. It's not a liability, it's a feature!
16. The canary is singing.
17. The enhanced "aligned strings" format is named aStrA
18. Lengths
19. Merging
20. Attributes
21. Memory allocation
.
.

Files

AS20241229.tgz

resizeable stack top object (no test yet)

x-compressed-tar - 11.91 kB - 12/29/2024 at 11:11

Download

AS20241228.tgz

Adding a first version of the auxiliary stack allocator.

x-compressed-tar - 11.46 kB - 12/28/2024 at 08:46

Download

AS20241128.tgz

better version :-D

x-compressed-tar - 10.01 kB - 11/28/2024 at 04:11

Download

AS20241126.tgz

utf8 filter works

x-compressed-tar - 9.72 kB - 11/26/2024 at 02:46

Download

AS20241121.tgz

merge and ARGSTR work !

x-compressed-tar - 8.33 kB - 11/23/2024 at 07:28

Download

Project Logs

Collapse

Memory allocation
Yann Guidon / YGDES • 12/28/2024 at 08:51 • 0 comments

The latest archive AS20241228.tgz adds a first version of a stack-based arena allocator, using the same mark-new-release ideas as in Turbo Pascal.

This is a sort of reboot of the ideas of #A stack.. It's still a dumb mark-new-release allocator but it is multithread-compatible, you can locate the buffer anywhere you like in data memory.

Attributes

Yann Guidon / YGDES • 12/16/2024 at 05:08 • 0 comments

So far, the attributes describe the types of characters that are found in a string:

// The attributes/flags
#define MASK_ASTRA_TESTED      (  1)
#define MASK_ASTRA_UTF8        (  2)
#define MASK_ASTRA_UTF32       (  4)
#define MASK_ASTRA_ERROR       (  8)
#define MASK_ASTRA_CONTROL     ( 16)
#define MASK_ASTRA_CHAR_UPPER  ( 32)
#define MASK_ASTRA_CHAR_LOWER  ( 64)
#define MASK_ASTRA_DIGIT       (128)
#define MASK_ASTRA_PUNCTUATION (256)
#define MASK_ASTRA_SEPARATOR   (512)
#define MASK_ASTRA_CHARACTER  \
    ( MASK_ASTRA_CHAR_UPPER | MASK_ASTRA_CHAR_LOWER )

Some hidden masks are defined to drive the UTF-8 decoder. But it occurred to me lately that the remaining bits could be used for other purposes, in particular to help with the list types :

// Flags used by 3/F3 type :
#define MASK_ASTRA_TYPE0   (4096)   // List contains Unicode points
#define MASK_ASTRA_TYPE1   (8192)   // List contains 255-byte string pointer
#define MASK_ASTRA_TYPE2   (16384)  // List contains 65535-byte string pointer
#define MASK_ASTRA_TYPE3   (32768)  // List contains reference to another list

The 4 bits must be cleared in types F1 and F2. Think of them as canaries.

For types 3/F3, this helps with processing certain types of lists, for example if only
MASK_ASTRA_TYPE0 is present, then it's an equivalent to UTF-32/UCS4.

Some implementations don't want MASK_ASTRA_TYPE3 in a list to prevent recursive definitions.

This leaves only 2 bits for extensions...

...

20241228 : only 1 bit left for extension, as one is now reserved to signal that the string does not contain characters but ... types.

Merging
Yann Guidon / YGDES • 11/23/2024 at 07:50 • 0 comments
The long awaited update ! AS20241121.tgz is uploaded and the little demo works !

It can do the following:
```
      ASTRING8(s1,"hello ")
      ASTRING8(s2,"world\n")
FLEX_ASTRING16(s4,"0123456789", 42)
EMPTY_ASTRING8(buffer, 100)

aStrA_merge(buffer, s1, s2, s4, UPOINT('\n'),
      UPOINT(0x1F60A), ARGSTR("\n# the end.\n"));
write(0, (void*)buffer, aStrA_length_fast(buffer));
```
The lists are not yet handled, that will be a next milestone, but the core features are coming to life.

The ARGSTR argument type is a required hack since C and the proprocessor don't allow me to create the required anonymous structures. The temporary solution is to define another composite format: the size of the string is given as a pure number (<<3) and distinguished from NULL. This precedes the actual pointer which could be misaligned, end with \0 etc.

This is "dangerous" because there is a little "hack" to make it work : sizeof counts the ending \0 so sizeof("")=1. This ensures that the size parameter (sizeof << 3) is always different from NULL and the value is decremented just before use. I hope this hack will be removed later...
```
// declare a constant string in an argument
#define ARGSTR(VALUE)  sizeof(VALUE) << 3, VALUE
// No canary, no structure: it's passed directly on the call stack.
```
So the macro replaces the string with a pair (size then value) separated by a comma.

Another format-adapting macro adds the terminal NULL to the varargs of merge:
```
// Let's first make sure the list of arguments
// is ALWAYS correctly terminated:
#define aStrA_merge(DESTINATION, ARGUMENTS... ) \
        aStrA_merge_body(DESTINATION, ARGUMENTS , NULL)

// Arguments : only aStrA_t types !
int aStrA_merge_body(aStrA_t dest, ...) {
```
So it's a big hack so far, trying to get around C's limitations...

But it works so far and the implementation will evolve, while the user-facing features get polished.
Lengths
Yann Guidon / YGDES • 11/22/2024 at 05:40 • 0 comments
What is the minimum length of an aligned string ? Zero.

Since there is the "padding" (-4) code, that represents a zero-size string, it would have been tempting to bump the Type1 and Type2 strings by 1, extending the range to 256 or 65536 bytes. You can imagine the mess it would cause throughout the whole code... just for a tiny convenience.

But what is the maximum size of an aligned string ? The largest size field is in the Type3 "list" with a whole 32-bit number but there are two issues:
1. the "fast" code only uses 24 bits due to the shift-based size computation. So a string can contain up to 16 million bytes.
2. the "slow" code also needs to return an error code, which is a negative value, so the aStrA_length_slow function is an int, a signed value. This means that strings longer than 2 billion bytes will be interpreted as errors.
But seriously, who has ever seen or managed strings longer than 64K ?

If you encounter a string longer than 16M, then
- Either something is very wrong and you have hit a bug
- Somebody is trying to inject data
- You're trying to display or edit a very large file and you can't bother to process it by chunks.
All in all, it seems that the 16M imposed by the "fast" algorithm is reasonable. In fact, in the current code, the aStrA_length_slow function checks the bits 24-31 of the size as a canary that should be 0. Which is fine since the list format can't really have a similar canary as Types1 and 2.
The enhanced "aligned strings" format is named aStrA
Yann Guidon / YGDES • 09/13/2024 at 21:29 • 0 comments
aStrA means "aligned strings with attributes" but it's more than that. There are actually 3 important enhancements:
1. The attributes, as previously mentioned, help processing UTF-8 and separate it from the ASCII-only domain.
2. The canary increases reliability and spot eventual addressing errors (involuntary or not).
3. The dedicated stack helps manage variable-size strings without messing with the existing stack or the compiler.
All three are optional and might not be required together, you can still use the basic aligned strings format alone but soon, at least one of these three "options" becomes necessary. So aStrA is a superset that will help develop a replacement for stdio.h and string(s).h. Of course it is still compatible (to a certain degree) with the traditional functions but the aim is to provide the convenience of high-level languages while erasing the "original sins" of C. This will make the design of complex applications much easier !
To mark the change,
```
typedef uintptr_t aStr_t;
```
becomes
```
typedef uintptr_t aStrA_t;
```
The canary is singing.
Yann Guidon / YGDES • 09/09/2024 at 01:27 • 0 comments
This time, I create a full program to both serve as regression/test as well as examples for potential users who wonder "how do I do this ?" (not just a pair of small examples). I want to test all the corner cases. Already, several preprocessor issues have been raised and I fortunately found solutions, which makes the code even better.

The archive contains a way to write to stdout without printf:
```
write(0, (void*)s3, aStrA_length_fast(s3));
```
It will be integrated in more code for concision but you see the central idea of the format: it might look a bit more cumbersome than traditional C but it leaves no space for interpretation.
Now all the types 1 and 2 have a canary that uses the place of the NULL terminator, and the value is defined at compile time. It was a pain to deal with it because it is required to be in octal format and then transformed as a valid string and a C literal. Fun times but it works with almost no drawback (just be aware that string constants will still have NULL terminator because I can't find a way to prevent the compiler from doing it).
Now the canary sings, it's only a matter of discipline to check it regularly, before bugs snowball. It will sound easier to simply forget it, for the sake of speed and compactness, and trust all the data but canary checks would be performed by the library.
It's not a liability, it's a feature!
Yann Guidon / YGDES • 09/07/2024 at 23:26 • 0 comments
(as noted at the end of the last log...)

Since the attributes can be moved to the Flex types, the terminal byte does not need to be kept. But I keep it for a different reason: now it's a canary.

The fixed/constant ones can be trusted (usually) and the compiler usually adds a NULL terminator byte that can't be easily changed. So the default canary is 0.

However a good canary can be changed arbitrarily: updated arbitrarily by a counter, PID or by a PRNG. Flex types must be postfixed by something else than 0. There is a global variable that contains the current canary that is set and tested at each update of a Flex type:
```
uint8_t aStrA_canary_value = 42;
```
It's "good for security" and of course, for now, for safety as it could help spot bugs faster and prevent the propagation of altered data. Can we say now that aStrA is more secure than the rest ? :-)

Transmitting strings from one software module to another creates the problem that the canary's value is different between the modules (unless agreed on otherwise). The "portable" way is to clear the canary before transmission, then updating it while/after the validation upon reception, when the attributes are (re)computed, because, you know, you never know.

I think it's a better use of the trailing byte than what I intended a few days before. The attributes are easier to address and the canary is where it is meant to be, less used than the attributes.

Consequence: the library must provide at least the following functions
```
int aStrA_set_canary(p, c);
int aStrA_check_canary(p);
int aStrA_clear_canary(p);
```
such that a string can be moved to/from POSIX domains.

...

aligned_strings.20240908.h shows how far it's going. I managed to add the canary to strings constants though (damned CPP macros and their weird limitations!) at the price of a compile-time value and an extra character (the NULL just takes useless room).

But you can compile with a different canary with
```
-DCANARY_BYTE_OCTAL=123
```
That's a decent compromise.

....
Finaly I reduced the "helper" functions to only two, more general, ones:
```
int aStrA_set_canary(aStr_t p, uint8_t c) {
  ASTRA_CHECK_STRING(p, (MASK_LSB_UP|MASK_LSB_LIST) )
  // update the attribute
  *(uint8_t *)(p+len_p)=c;
  return 0;
}

int aStrA_get_canary(aStr_t p) {
  ASTRA_CHECK_STRING(p, (MASK_LSB_UP|MASK_LSB_LIST) )
  // read the attribute
  return *(uint8_t *)(p+len_p);
};
```
That should work.

They share the same code defined in the macro ASTRA_CHECK_STRING which can be reused in other places and could be optimised at one location for all users. Yay.
Attributes moved to Flex
Yann Guidon / YGDES • 09/07/2024 at 01:42 • 0 comments
I'm reviewing and updating and rewriting the core files and now, something appears. The UTF-8 attributes can be restricted to the Flex Types 1 and 2 because their "allocated" field is also limited to (approx.) 8 and 16 bits. This leaves 15 MSB for other uses and the attributes are perfect. These do not need to be in the trailing/terminator byte, which is quite cumbersome to compute.

The trick is that "fixed" strings are mostly constants and their contents are already known (more or less). We trust the compilers enough to create the appropriate utf-8 sequence, right ? The purpose of the attributes is to help filter external data, that are not yet known and are likely to have errors. Usually the size is not known in advance so they are received in a Flex buffer, hence why it's not required to put the attributes in a fixed format.

During reception, the incoming data is processed with 2 methods:
- Either the buffer is pre-allocated and the data are correctly aligned (ie read() with the address of the aligned buffer and the maximum buffer size)
- or : the data are available somewhere and need to be aligned to fit the format.
In the first case, the attributes are reconstructed by a simple scan, the second case can merge de scan and the re-aligning string copy.

...
This changes quite a few things, as I have already started removing the optionality of the NULL terminator. Again, it becomes useless. And it's good to speed up the computation of the address of the attribute, because the code was getting awkward, almost mooting the point of the Aligned Strings in the first place.
Aligned Strings with Attributes
Yann Guidon / YGDES • 09/06/2024 at 00:37 • 0 comments
Should then we call these ASA ? it's a simple layer above the AS system.

The previous log 12. Extension and type throws a few ideas at the metaphoric wall and some have stuck.

Repurposing the trailing NULL seems to be the path of least algorithmic resistance and of maximal flexibility. Now it's a matter of actually defining these bits and managing them.

When a "POSIX" string is imported from outside a program's domain (user input, a file, a network socket...), it's just a Variable-Length Byte Array of type 1 (255 bytes max) or 2 (65535). There is 1 or 2 bytes for size prefix and a NULL terminator byte (suffix is not counted in the size but always present. The contents could be... anything, including another NULL byte. So it's considered unsafe.

The importation process must determine if the string contains valid ASCII characters and well-formed UTF-8 sequences. A lookup-table-driven FSM looks like a simple and practical method to filter invalid sequences or bytes, and collect the attributes.

We already have defined some attribute bits. The "tested" bit would be allocate to the MSB to ease testing as a negative number. A positive number would not have been tested so it's invalid, including the default NULL value.
```
bit 7 : tested
bit 6 : UTF-8
bit 5 : malformed
bit 4 : 
bit 3 : 
bit 2 : 
bit 1 : 
bit 0 : 
```
The "malformed" attribute is set when encountering a NULL byte, invalid UTF-8 bytes (F5–FF) or an invalid UTF-8 sequence.

So the UTF-8 attribute covers the prefix as well as the continuation ranges, but these individual ranges must also be provided to check for a valid sequence. Hence the lookup table is 16-bit wide : one byte is directly ORed with the attribute byte while the other helps update the FSM.

This still leaves 5 attribute bits to use in the terminal byte. Here are the candidates:
- control and separators (not NULL, 1..32, 127, C0 and C1)
- digits (0 through 9)
- symbols/punctuations
- lowercase
- uppercase
I'm not sure whether separating uppercase and lowercase letters is pertinent, but it's possible. This could help some parsers. In this vein, maybe controls can be separated from separators (space, tabs). We can change the table at will later, or even provide a custom one.

The import function only checks the validity of sequences but there could also be an option to eventually translate to UTF-16 or UTF-32 in a separate buffer, to save time.

For now, the code only contains
```
int utf8_encode_point(char *dst, uint32_t point)
```
but here we are doing the reverse: taking a bounded number of bytes and building the code point (and also check this is valid). Maybe the function name above should be changed to be more explicit.

...

The current definitions for the terminator is
```
#define MASK_ASTRA_TESTED      (128)
#define MASK_ASTRA_UTF8        ( 64)
#define MASK_ASTRA_ERROR       ( 32)
#define MASK_ASTRA_CONTROL     ( 16)
#define MASK_ASTRA_SEPARATOR   (  8)
#define MASK_ASTRA_PUNCTUATION (  4)
#define MASK_ASTRA_DIGIT       (  2)
#define MASK_ASTRA_CHARACTER   (  1)
```
but more bits are needed for the lookup table, in particular for parsing the utf-8 strings.
```
const uint8_t character_type_LUT[256] = {
  [    0          ] = MASK_ASTRA_ERROR,
  [    1 ...   31 ] = MASK_ASTRA_CONTROL,
  [    9          ] = MASK_ASTRA_SEPARATOR,
  [   32          ] = MASK_ASTRA_SEPARATOR,
  [   33 ...   47 ] = MASK_ASTRA_PUNCTUATION,
  [   48 ...   57 ] = MASK_ASTRA_DIGIT,
  [   58 ...   64 ] = MASK_ASTRA_PUNCTUATION,
  [   65 ...   90 ] = MASK_ASTRA_CHARACTER,
  [   91 ...   96 ] = MASK_ASTRA_PUNCTUATION,
  [   97 ...  122 ] = MASK_ASTRA_CHARACTER,
  [  123 ...  126 ] = MASK_ASTRA_PUNCTUATION,
  [  127          ] = MASK_ASTRA_CONTROL,
  [ 0x80 ... 0xBF ] = MASK_ASTRA_UTF8_CONTINUATION,
  [ 0xC0 ... 0xC1 ] = MASK_ASTRA_CONTROL,
  [ 0xC2 ... OxDF ] = MASK_ASTRA_UTF8_PREFIX1,
  [ 0xE0 ... OxEF ] = MASK_ASTRA_UTF8_PREFIX2,
  [ 0xF0 ... OxF4 ] = MASK_ASTRA_UTF8_PREFIX3,
  [ 0xF5 ... 0xFF ] = MASK_ASTRA_ERROR
}
```
So this added MASK_ASTRA_UTF8_PREFIX1, MASK_ASTRA_UTF8_PREFIX2, MASK_ASTRA_UTF8_PREFIX3 and MASK_ASTRA_UTF8_CONTINUATION.

...

Looking back...
Read more »
Extension and type
Yann Guidon / YGDES • 08/29/2024 at 17:56 • 0 comments
The 4/8 types are great.

They are the basis for designing a language and an operating system. These applications need more than what the Aligned Strings format provides, though. The format is one thing but a type is required.

Usually, types are either inferred or implicit, defined by an API/ABI. But for my projects I have identified 3 specific uses of the aligned strings :
- untyped, variable length array of bytes (so that counts as "no type", which is still a type, anyway)
- type declaration / prototype (yes, it's going to be a thing)
- ASCII (7-bit/safe) string (the basic type)
- UTF8 string (can only be typed as such if certain conditions are checked)
I don't know how to specify the string type, yet. The format already exhausts the coding space. Furthermore, the typing thing is mostly critical for the character strings because there are compatibility issues. And in practice, the aligned format is usually well segregated, there should be little confusion, except for characters: a character string could be either
- unknown
- known 7-bit
- known UTF8
and preserving this requires 2 bits...

There is not enough room in the type 0/1/2 aligned strings but the type 3 has some headroom.

From aligned_strings.h:
```
typedef struct {
  // only 24 bits available with the "fast" code
  uint32_t total_length __attribute__ ((aligned (4)));
  uint16_t nb_elements,
           flags;  // unused yet.
  // Up to here: 8 bytes.
  aStr_t elem[];
} aString_list;
```
So two flag bits can be reserved in flags, but this increases the burden and overhead for type-agnostic code, because this would only accept Type 3/list.

The list type is more flexible (and opens the possibility of parallel processing) but slower, which would burden Unicode-agnostic code. Forcing Unicode to type 3 only does not make sense since sub-strings need to be other types, including 8- and 16-bit sizes. After all it's possible to merge a type 3 into type 1 or type 2.

________

Another possibility is to steal a couple of bits from the size field: This allows types 1/2/3 to contain the Unicode flags in the LSB, keeping the code common for them.

Consequences:
- String sizes are reduced : 63 or 16383 bytes for types 1 and 2. It's not a disastrous reduction, considering that type 2 is preferred.
- Character strings will need slightly different code than other byte strings. Or else, the other variable byte arrays will use the 2 flag bits for other purposes.
But overall it's a step backwards since the type flags use space that could have been used to format type in the first place.

________

Where else can we hide the string's attributes ?

The NULL terminator would be another good place, since it is specific to character strings and will not pollute the code for other formats of variable byte arrays. But
- This will make it incompatible with standard ASCIIZ : special functions must be added such as toPOSIX and fromPOSIX.
- Fetching the attribute might delay the program since the attribute byte is likely in a different cache line.
Or else, the first byte could be the attribute byte. This will make all pointer indexing wonky with a +1 everywhere and become a source of confusion and bugs.

Overall, this all must be compared to the actual cost of re-checking/re-scanning the whole string, every time the code needs to know in which domain it is...

________

Overall, it seems that "repurposing" the optional NULL terminator is the most convenient method.
- FromPOSIX will scan the source string, allocate the correct amount of memory, set the correct attributes
- ToPOSIX will simply discard the attributes by clearing the NULL,
- Too bad for the delay in fetch. Let's assume it's still in L1 or L2 at worst.
The good news is that the NULL terminator is already handled (optionally) by the existing code, it would only need a little update.

During concatenation, the attributes should be easily "upgraded" to the proper values. This can be done by ORing them. Normally it is...
Read more »

View all 21 project logs

Discussions

Yann Guidon / YGDES wrote 11/16/2024 at 02:24

Strings in Rust are so simple ! :-P

https://youtu.be/CpvzeyzgQdw?t=1304

Are you sure? yes | no

Yann Guidon / YGDES wrote 11/12/2024 at 02:40

Safety-wise, from the coding perspective, there should not be a type confusion between the fixed and the Flex versions... so it's up to the compiler to strictly enforce the type hierarchy.

Having the fixed specified as "const" also helps.

Flex is a superset of fixed, some operations (resizing) can only be performed on Flex, so it's possible to cast Flex to fix (in certain cases) but the reverse is forbidden : fixed should not be cast to Flex.

Are you sure? yes | no

Yann Guidon / YGDES wrote 02/10/2023 at 13:53

https://hackaday.com/2023/02/10/modernizing-c-arrays-for-greater-memory-safety/

https://developers.redhat.com/articles/2022/09/29/benefits-limitations-flexible-array-members#nonconforming_compiler_extensions

https://people.kernel.org/kees/bounded-flexible-arrays-in-c

Are you sure? yes | no

Yann Guidon / YGDES wrote 02/02/2023 at 21:26

https://www.joelonsoftware.com/2001/12/11/back-to-basics/

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/22/2023 at 12:04

https://v8.dev/blog/pointer-compression :-)

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/22/2023 at 12:05

from the link :

The tag bits serve a dual purpose: they signal either strong/weak pointers to objects located in V8 heap, or a small integer. Hence, the value of an integer can be stored directly in the tagged value, without having to allocate additional storage for it.

V8 always allocates objects in the heap at word-aligned addresses, which allows it to use the 2 (or 3, depending on the machine word size) least significant bits for tagging. On 32-bit architectures, V8 uses the least significant bit to distinguish Smis from heap object pointers. For heap pointers, it uses the second least significant bit to distinguish strong references from weak ones:

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/15/2023 at 23:51

15 years ago : http://www.and.org/vstr/comparison lists 28 string libraries. I don't see any with a similar pointer trick.

Are you sure? yes | no

Gravis wrote 01/13/2023 at 15:05

Unless I'm missing something, the benefit of type 1 is so low that differentiating between it an type 2 is almost certainly going to require more code than the single bytes saved in each string.

However, if you keep type 1 then you need to bulletproof implementation that will recoup the cost of the extra code needed in only a couple dozen strings. However, if you have dozens of strings then it's unlikely to be a small program making the savings moot. Finally, if you every merge a type 1 and 2 then the code for that is going to require you to use hundreds of type 1 strings to recoup the savings. It would be logical do do away with type 1 as it's almost certain that type 2 strings will be used for a "how to use this program" aka "--help" string.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/13/2023 at 16:36

The cost for having both type 1 and 2 is a few mask & shift operations.

inline int aStr_length_fast(aStr_t p) {
uint32_t *q = (uint32_t *)(p & ~3);
int LSB = p & 3;
return (*q) & ~((~0) << (LSB<<3));
}

This works for types 1 , 2 and 3 (with type 3 limited to 24b/16M).

Are you sure? yes | no

Gravis wrote 01/13/2023 at 19:06

The problem is that you have to do it for all the functions. It may only require a handful of operations but when it's in multiple functions, you start needing dozens of strings to balance the space savings.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/14/2023 at 01:05

@Gravis OTOH don't you already have to call strln() at these places ?

In modern OoOx CPUs (since the Pentium III) what costs time is random memory accesses and execution flow breaks (calls, jumps, conditional branches in particular).

In the above case, there is no branch, and memory access is to the very same cache line (by construction) so there is only a marginal cost.

Are you sure? yes | no

Gravis wrote 01/14/2023 at 02:27

@Yann Guidon / YGDES

There is more to programming than ye olde C strings. There are C++ strings and NT has the UNICODE_STRING type. The days of relying on strlen are long gone.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/14/2023 at 03:52

@Gravis I know this well, what is your point ?

https://en.wikipedia.org/wiki/String_(computer_science)#Representations

https://en.wikipedia.org/wiki/Comparison_of_programming_languages_(string_functions)

The last link compares about 50 languages.

Furthermore, the class of applications I try to address, including the constraints, is probably not well stated. I swap between many projects and drop ideas here and there...

Are you sure? yes | no

Gravis wrote 01/14/2023 at 06:04

@Yann Guidon / YGDES

"if you keep type 1 then you need to bulletproof implementation that will recoup the cost of the extra code needed in only a couple dozen strings. However, if you have dozens of strings then it's unlikely to be a small program making the savings moot." - Me

"The cost for having both type 1 and 2 is a few mask & shift operations." - You

"The problem is that you have to do it for all the functions." - Me

The point being that the benefits of type 1 are outweighed by it's requirements.

You went on about needing strlen and I pointed out that that's no longer the case as many string implementations exist that hold the length of the string making the benefit a moot point.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/14/2023 at 20:14

@Gravis

"many string implementations exist that hold the length of the string making the benefit a moot point."

FORTH calls them "counted strings" and these were already well accepted in the 60s (before PDP-7 dwellers at Bell Labs selected null-terminated for their pet project and language). That's also called "Pascal style" and certainly something else by other languages.

So yes there are many string implementations, I never pretended inventing them. Now, maybe you don't have to interact with POSIX often, or related tools or interfaces.

Due to alignment, Type 1 provides a marginal benefit, but what would you do with a byte prefix instead ?

The project is not as documented and elaborated as some of my other projects so you might miss some context and other elements. At least there is preliminary example code and I'm working on more elaborated features. They probably don't fit your needs and you are luckily not affected by the tyranic anachronism of <string.h>, and that's good for you.

Are you sure? yes | no

Gravis wrote 01/15/2023 at 22:29

@Yann Guidon / YGDES

I'm more acquainted with POSIX than you may have thought. There are many POSIX compatible implementations already and there is literally no need to rely on POSIX string functions.

Here's the description for the "Better String" library: https://github.com/websnarf/bstrlib/blob/master/bstrlib.txt

If nothing else, you could use bstrlib as a better starting point.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/15/2023 at 23:28

@Gravis

Thanks for the link ! And I presume nothing about your skills and knowledge.

"there is literally no need to rely on POSIX string functions."

I'd have a few things to say about this but I'll try to stay on-topic.

Of course I know I'm not the first one to complain about strings. I didn't know Paul's lib, which has not taken over the traditional libs yet. His author Paul Hsieh is not a mere beginner as I used to read him on c.l.a.x back in the Usenet days. I also used his references for x86 asm optimisation and followed his exploration of hashing algos. So I have deep respect for him.

Now let's simply read the link you sent :

A bstring is basically a header which wraps a pointer to a char buffer. Lets start with the declaration of a struct tagbstring:

struct tagbstring {
int mlen;
int slen;
unsigned char * data;
};

OK that's it, just like in almost every other HLL or interpreter. Delphi adds a refcounter, but so far nothing breathtaking:

* you have to carry a whole struct around, which is less convenient to return from a function than a plain pointer,

* constant strings have mlen=slen,

* the pointer uses 4/8 more bytes to the struct,

* the effort is mostly the same,

* and it adds one level of indirection.

I'm sure Paul's work is a few leagues above mine and he takes a reasonable idea to turn it into a great piece of code that I'd be happy to see integrated into POSIX for example. But you failed to convice me and, replying to

"you could use bstrlib as a better starting point."

I know it's better than standard C (and there is quite a lot to get inspiration from) but I see no benefit over the aligned strings. And it's not a NIH syndrome.

Are you sure? yes | no

Gravis wrote 01/16/2023 at 00:47

@Yann Guidon / YGDES

"you have to carry a whole struct around, which is less convenient to return from a function than a plain pointer,"

What are you talking about? It's a pointer to a struct. Your stuff can be represented the exact same way. You're giving me the feeling that you don't quite grasp how compilers deal with structs.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/16/2023 at 01:46

@Gravis

"You're giving me the feeling that you don't quite grasp how compilers deal with structs."

Feelings are treacherous. The C syntax allows it and it opens a whole can of worms. If it could, should you? You know C's x86 ABI returns the function's result in EAX (or RAX), which clearly can't fit a struct. You can and will play around. Each compiler will do their own thing, copying or dereferencing, sometimes in inefficient or unexpected ways. Whether it returns a pointer (to a struct allocated where ?) or bulk data (that are filled/copied where?) creates secondary issues.

Funnily, I used to decompile and single-step binaries generated from Ada-like VHDL (have a deep look at GHDL) and that crazy language family has no problem returning a struct or more, using other techniques even on a GCC backend. It makes me feel "too tight" when I must code in C again.

"It's a pointer to a struct. Your stuff can be represented the exact same way."

No.

Read again:

struct tagbstring {
int mlen;
int slen;
unsigned char * data;
};

You see, at the end, it's a pointer.

Now:

typedef struct {
uint32_t allocated __attribute__ ((aligned (8)));
uint16_t len;
char text[];
} Flex_aString_16;

You should see that "text" is not a pointer, it's the start of the array, using C99's "Flexible array" notation, that was indeed designed for this purpose. No indirection. You can printf() the pointer directly with a simple (char*) cast to the value, which points to Flex_aString_16.text. The pointer is self-sufficient and that's the whole point.

Anyway, that's a stimulating discussion, thank you.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/13/2023 at 16:40

OTOH, compilers have to align, sometimes more than required so the saving of Type1 is almost irrelevant but it keeps compatibility with some other format.

And with the forced alignment the savings are not huge and only for size=4n+3

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/13/2023 at 17:24

If you dropped type 1, what would you use if for ? I think I have covered the bases so far.

Are you sure? yes | no

Gravis wrote 01/13/2023 at 19:24

Honestly, I think you are attacking the wrong problem. Strings aren't really a problem when it comes to memory with the exception of very special programs. I would say the real issue is efficient allocation and management of memory pools in order to minimize memory fragmentation.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/14/2023 at 01:09

@Gravis I agree that memory allocation is a huge .... "problem" doesn't even come close to the scale of this catastrophic mess.

I'm working on it too, already.

Are you sure? yes | no

Gravis wrote 01/14/2023 at 02:33

@Yann Guidon / YGDES

Then you should look at "two-level segregated fit" because it's a good solve. You will need to be exceptionally clever to make an allocation method that could even compete with it.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/14/2023 at 03:56

I started to work on another aspect/side of the problem.

Are you sure? yes | no

Thomas Castello wrote 01/10/2023 at 09:47

Hello word world!

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/10/2023 at 09:56

Welcome !

Right now I write example code and the primary functions, to be uploaded here soon.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/09/2023 at 22:23

https://www.youtube.com/watch?v=Qyn1qxi73u8

Even Dave's Garage is talking about this obsolete dumpster fire...

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/04/2023 at 06:36

I'm back on the subject, dudes ! With more hindsight and it's slowly coming together...

Are you sure? yes | no

Ken Yap wrote 05/18/2020 at 00:22

Why not just use a 16-bit length field all the time and avoid the complexity of deciding whether which variant it is, in return for wasting one byte? What you lose in data space you may save in code space.

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/18/2020 at 17:33

this would be possible if the context was well confined in a language framework for example (such as Python, JS and others).
In Delphi for example, there is a 32b prefix but everything is aligned.

In my case I want to perform operations that differ from the safe&sane environment of managed languages and I want to avoid "shifting" blocks all the time. I want to do thing "in place" as much as possible. So the byte prefix is useful BUT often not large enough. 2 bytes forces alignment of the start of the character area.

I'll have to write a log about the use cases & environment...

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/18/2020 at 00:11

Notes for later : Some random googling and readings...

https://en.m.wikipedia.org/wiki/Null-terminated_string
https://stackoverflow.com/questions/7648947/declaring-pascal-style-strings-in-c
https://stackoverflow.com/questions/4418708/whats-the-rationale-for-null-terminated-strings
https://en.m.wikipedia.org/wiki/String_(computer_science)#Length-prefixed
https://en.m.wikipedia.org/wiki/Rope_(data_structure)
https://queue.acm.org/detail.cfm?id=2010365
https://wiki.freepascal.org/Character_and_string_types
http://web.archive.org/web/20151125114234/http://www.codexterity.com/delphistrings.htm
https://wiki.freepascal.org/TStringList-TStrings_Tutorial
http://www.paulgraham.com/hundred.html

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/18/2020 at 00:31

https://bobobobo.wordpress.com/2008/01/28/how-to-use-variable-argument-lists-va_list/

This one will be handy later :-)

Are you sure? yes | no

Ken Yap wrote 05/15/2020 at 09:00

Actually you can make the format more useful by extending the record backwards:

-4: uint_24: capacity, uint_8: flags

0: length (0-24b), uint_8 data[]

length: \0

This allows you to carry information about the capacity of the buffer in the record, for passing to str(n)* and snprintf routines. The flags could encode options like extend on demand, etc. which wrapper functions could use. Thus you still have compatibility with C library functions but can support enhanced behaviour.

Since this constitutes public disclosure, this extension cannot be patented from now on. 😉

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/15/2020 at 10:51

yes, it is of course possible to extend backwards.

Now, I don't know how to tell a C compiler how to do it.

in C, I used to use a struct for Pascal-type strings, and it starts at address 0.

My idea is to use a 3rd bit of the pointer to flag the extra fields, which will take 4 more bytes like in you case.

Are you sure? yes | no

Ken Yap wrote 05/15/2020 at 11:13

There are various ways. One is to use an enclosing struct:

struct {

uint_32 capflags;

uint_8 data[];

}

Then mask the low 2 bits and subtract sizeof(uint_32) from the address you have to get to the capflags.

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/15/2020 at 11:21

If you align to 8 bytes, you don't need to subtract the offset from the address :-)

Are you sure? yes | no

Ken Yap wrote 05/15/2020 at 11:34

Yes, but that might be a waste of memory and you would have to make sure that any heap allocation routines return that alignment.

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/17/2020 at 20:56

@Ken Yap you can always re-align a pointer to a block you allocated.

Or even allocate a large block and make your own allocation system with the desired granularity :-)

Are you sure? yes | no

Ken Yap wrote 05/17/2020 at 23:01

Of course, but you have to overallocate in malloc then round up to the desired alignment. So 4 byte alignment may already be the default and you have to ensure 8 byte alignment with this method.

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/18/2020 at 00:10

never mind : i'm about to settle for 16-bits alignment :-P

Are you sure? yes | no

Ken Yap wrote 05/14/2020 at 23:36

You can probably define a C++ class to handle these "ystrings" and to convert to and from C strings. Conversion to a C string is easy, just return the pointer. Conversion from a C string is more work, you have to allocate space and do a copy to ensure you have the alignment you want. It makes operations like ystrlen O(1) rather O(n) complexity.

Another consideration is whether you will normalise ystrings after an operation. For example all three represent the same string:

0 0 5 h e l l o \0

0 5 h e l l o \0

5 h e l l o \0

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/15/2020 at 00:11

C++ is not for me... I'm having more fun with its bastard grandkid JavaScript, who rediscovers life's rules and hardships after turning 20 ;-)

"ystring" does not roll on my tongue but... nah :-P

Yes, conversion to C is easy. That's the point. It would ease communication with POSIX stuff (hopefully).

Conversion from C is less frequent and would occur in more deterministic situations. Most of the runtime cost (for my cases) is when strings are concatenated, when stuff is assembled, so the net gain should be positive. "reshifting" would occur only at the last moment when the whole batch is serialised/streamed/concatenated, normally only once.

You made a mistake with the degenerate cases : it's little endian ;-)

5 h e l l o \0

5 0 h e l l o \0

5 0 0 h e l l o \0

This really helps because the function only has to read the first 32-bits word and mask according to the pointer.

so normalisation is not critical : there is only one or two bytes of penalty for using a larger format, while enforcing a strict resize all the time is costly due to the constant reshifting.

If there is an overflow, you can avoid the cost of shifting by making a tree of strings.

Are you sure? yes | no

Ken Yap wrote 05/15/2020 at 00:22

But concatenation of C strings is already O(n) so can't get any better than that. Only if you have spare space behind one string can you reduce that to the length of one string. Anyway most of the work in s(n)printf is converting the data to characters so saving a bit of string manupulation won't make much difference to CPU work.

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/15/2020 at 10:48

managing memory during multiple concatenations can be a P.I.T.A. and it happens a lot...

Remember : I am implementing a HTTP server, which has to manage quite a number of strings of various and variable sizes...

You can only do so much with snprintf() in this case : you can say what is too much but this is arbitrary... and you have to deal with the eventual error anyway.
If you concatenate with a tree first, you get the size of the whole string to allocate before serialising it. And you can process/manage the error before the serialisation happen.

Are you sure? yes | no

David wrote 12/01/2020 at 17:58

@Ken Yap : this format actually can get better than that. It can concatenate two strings into one "non-ASCIIZ-compatible two-pointer" format "ystring" in O(1) time. You are right that creating an ASCIIZ string of length N by concatenation will inevitably take at least O(N) no matter how you do it. However, concatenating N separate strings of average length M with strcat() requires O( M*N^2 ) ( Schlemiel the Painter's Algorithm https://www.joelonsoftware.com/2001/12/11/back-to-basics/ ) but O( N + N*M ) by making a tree of two-pointer nodes, and flatting down to one long ASCIIZ string only once at the end.

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/09/2023 at 18:21

Well, @Ken Yap it seems I have settled on a flexible mix of features, look at the latest version :-)

I'm currently writing design decision docs, I'll post a summary here later.

Are you sure? yes | no

Ken Yap wrote 01/09/2023 at 22:49

I hope you will thoroughly audit your implementation for weaknesses because imagine what havoc an explotable bug in such a fundamental data structure shared library would cause if it's widely adopted. You could become famous for the wrong reasons. ☹

Are you sure? yes | no

Yann Guidon / YGDES wrote 01/09/2023 at 23:05

@Ken Yap that's what regression tests and fuzzing are, right ?

Are you sure? yes | no

Ken Yap wrote 01/09/2023 at 23:08

That's only the start of it. If you look at CVEs even things like Perl's sort becoming quadratic on certain inputs was regarded as a weakness.

Are you sure? yes | no

ziggurat29 wrote 05/14/2020 at 17:52

I like this. I may use it myself in the future. Please do not patent! lol

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/14/2020 at 18:25

There is no risk : a patent can't apply on a format. Only on methods.
I write everything under AGPLv3 (unless required by a client).
Many software use a prefixed format with size being 4 bytes, which is a waste but keeps the code short and fast. I'm trying to find a compromise...

Tell us what you think :-)

Are you sure? yes | no

ziggurat29 wrote 05/14/2020 at 19:01

I had occasion to use pascal-style strings just yesterday. It wouldn't have helped in that particular case because all the data were small so I could get away with a fixed size of 8-bits for the length, but I like your adaptable solution for other cases where it might need to spill over to 16-bits or more occasionally, rather than bumping up all lengths to 16-bit just because a few happened to be long.
Vaguely it reminds me of how UTF8 achieves bisexuality with the ANSI world, but your use of the lower two bits of the pointer to encode the length of the length is particularly clever.
I was joking about patents, of course, however this is in fact a 'method'.

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/14/2020 at 19:44

Feel free to use that trick :-)

If you only have 1 and 2 bytes for the size, you can simply test the LSB of the pointer and select the appropriate processing code. I am considering that "degraded mode" for 16-bits platforms.

I am thinking a lot about it, now, it's my new pet idea and a lot of things are flashing through my head, we'll see in the next days and weeks after the brightest ideas have percolated to the top...

Using the LSB to encode meta-informations is not new : I think some LISP and IA machines use this technique (for example, add a flag to indicate the type of a node in a linked list or tree) but the "smart" idea here is that the pointer serves 2 purposes at once ! It can indicate the format and still point to the start of the string ! It sounded so cool that I wondered why nobody did it before and I am still digesting a lot of documentation on the subject.

Are you sure? yes | no

ziggurat29 wrote 05/14/2020 at 20:03

Yes I was thinking that a better analogy than UTF8 might be ARM Thumb instructions: the LSB of the address indicates whether the target code (of a call or interrupt vector) is actually Thumb instructions. Since instructions can never be on an odd address, they re-purposed that otherwise unused bit. (I know this well from disassembling, lol)
Your encoding technique is still useful on 32-bit platforms.
It will be interesting to hear your thoughts about it's applicability to trees and linked lists when you get to that.

Are you sure? yes | no

Yann Guidon / YGDES wrote 05/14/2020 at 20:54

Yes, playing with pointers is a risky business...

I've been a CPU architecture geek for a long while and it is always tempting to use "some bits here and there" for "this and that"... And it often backfires.

Here it might be a sweet spot, we'll see. Anything is better than sprintf() ;-)

Are you sure? yes | no

aStrA : Aligned Strings format with attributes

Description

Details

Files

AS20241229.tgz

AS20241228.tgz

AS20241128.tgz

AS20241126.tgz

AS20241121.tgz

Project Logs

Collapse

Memory allocation

Attributes

Merging

Lengths

The enhanced "aligned strings" format is named aStrA

The canary is singing.

It's not a liability, it's a feature!

Attributes moved to Flex

Aligned Strings with Attributes

Extension and type

Discussions

Similar Projects

Simple Compiler

3RGB image lossless compression format

Targeting SDCC to the 8080

limited-code hacks/ideas

aStrA : Aligned Strings format with attributes

Become a Hackaday.io member

Just one more thing

Description

Details

Files

AS20241229.tgz

AS20241228.tgz

AS20241128.tgz

AS20241126.tgz

AS20241121.tgz

Project Logs Collapse

Enjoy this project?

Discussions

Become a Hackaday.io Member

Similar Projects

Does this project spark your interest?

Report project as inappropriate

Send message

Remove Member

Project Logs

Collapse