Close
0%
0%

Let's Build a Modern LLM - Entirely from Scratch

While there are several open source versions of various AI applications out there, there is still very little out there for doing training.

Similar projects worth following
This project will attempt to combine the features of several projects that existed more than twenty years ago, and which were created for other purposes, such as trying to predict the Lottery, or for attempting to turn audio into sheet music, for example, into a fully integrated AI, which is just as capable as anything that not so open, you know who or anyone else is doing.

If you have been around these parts long enough, you know how the classic chatbots like Eliza, Parry, and Megahal work.  Then there were other important approaches to AI, like Perceptrons, SHRDLU, Mathematica, Watson, and so on.  Some of those things have open source versions, and some do not.  One thing that they all had, and perhaps still have in common, was, and perhaps there still is the fact that there always seems to be a "great breakthrough" in AI, which almost always seems as if it were as great a leap forward, as it would be, that is to say, as if sustainable controlled fusion had also been solved.  And then of course, no, not really.  Better luck next time.

Then perhaps one of the biggest breakthroughs ever came with the landmark observation, "What if attention is all you really need?"  Purely attentional-based systems, however they actually work, do have their own sets of problems, and among those, of course, is a lack of efficiency.  And yet there are some very interesting things that can be said about "attention" in and of itself!  So maybe a digression into the history of all of this will prove to be worthwhile.

Most people reading this were not even born when Don Lancaster wrote the article for the very first build-it-yourself "TV Typewriter", as if all that mattered at the time was putting "your message on the screen."  

Ah, the TV Typewriter!

Somewhere, somehow, back in the day, beyond the valley of the shadows, in the once upon a long time ago, long before I ever wanted to write a novel where every chapter began with "It was a dark and stormy night, and as the swamp thing staggered from the crypt, suddenly there was a need for words", there was, of course, this:

Now that I have your attention.  Yeah, that thing.  Back when Bill Gates was still in high school.  Then one thing happened, and another, another, and we all know what happened.  Even if Eliza came out in the 60's, and then SHRDLU amazed so many more, while requiring a PDP-10 so as to enable a user to interact with a wire-frame virtual world that allowed the input of commands, like "Pick up the red block and place it under the blue pyramid."

Now, all of a sudden, here it is, 2026, and vibe coding seems to be the new meme.  Or perhaps the new mess, since obviously, we are at a crossroads.  Some people are saying that up to 50% of white-collar jobs will be gone within the next couple of years, and then what?  Will AI also somehow eliminate the need for half the world's population?  Will the presence of data centers in rural areas lead to vanishing water supplies, dead fish, and no more birds?

What if open-source training can be done "at home", and if models could be shared, Wiki-style?  Would that work?  Would that solve the data center problem?

Well, that is all good and fine, if it should turn out to be possible, but how do we know what is or is not possible, that is, until we try some things for ourselves?  So let's take a look at something different, I think, from anything that anyone else is doing, at least insofar as "open source" is concerned.  What if we wanted to try to use AI, let's say, to try to "predict the Lottery!?!"

Now some people will be wondering, of course, what does trying to predict the Lottery have to do with LLMs?  Perhaps the answer will be obvious to some, while for others, we will need to take a very deep dive into the whole theory of just how "training an LLM works", that is, if it actually does.  Well, as it turns out, there are certain Lottery systems that have a great deal in common with how LLMs work.  Well, maybe.  Like what if we had a program that looks at the draw history for a particular game that we are interested in, to find potential "hot pairs", that might be useful in trying to create not just lists of so-called hot numbers. Rather, we might want to find...

Read more »

  • What, This Thing Is Nuclear? No, Not Really!

    glgorman06/09/2026 at 01:38 0 comments

    But I like to begin my log entries with a catchy title, nonetheless.  So, in any case, maybe it is time to look at some code.  For Lotto-type games, I created a source file called lotto.cpp, which does all of the heavy lifting, insofar as looking at a bunch of stuff that involves examining pair histories, looking for triplets containing particular pairs, and so on.   So the C++ class for "draw" that is going to be rewritten for that part of the model starts out looking like this:

    class draw
    {
        int score;
        int number;
    
        int x [6];
        char str [32];
    
    public:
        
        draw ()
        {
            memset (this,0,sizeof (draw));    
        };
        void sort ();
        draw (int a, int b, int c, int d, int e, int g = 0)
        {
            memset (this,0,sizeof (draw));
            x[0]=a; x[1]=b; x[2]=c; x[3]=d;
            x[4]=e; x[5]=g; sort ();
        }
        draw (char*);
    
        bool operator > (draw &d);
        int &operator [] (int arg)
        {
            return x[arg];
        }
        void test (char *, size_t max);
        bool increment (int limit = _G::MAX_NUMBER);
        bool match_copair (int p, int q, int r, int s);
        void dump (char *buff, size_t max);
        void calc_pairs ();
        void calc_trips ();
        
    };
    class draw
    {
        int score;
        int number;
    
        int x [6];
        char str [32];
    
    public:
        
        draw ()
        {
            memset (this,0,sizeof (draw));    
        };
        void sort ();
        draw (int a, int b, int c, int d, int e, int g = 0)
        {
        memset (this,0,sizeof (draw));
        x[0]=a; x[1]=b; x[2]=c; x[3]=d;
        x[4]=e; x[5]=g; sort ();
        }
        draw (char*);
    
        bool operator > (draw &d);
        int &operator [] (int arg)
        {
        return x[arg];
        }
        void test (char *, size_t max);
        bool increment (int limit = _G::MAX_NUMBER);
        bool match_copair (int p, int q, int r, int s);
        void dump (char *buff, size_t max);
        void calc_pairs ();
        void calc_trips ();
    };

    There is, of course, a lot of room for improvement, and a lot of things that need to be changed, since clearly it is nowhere nearly as simple as using find and replace to change the word "game" to "context" and "draw" to "reading frame" as one might want it to be.  One of the first things that I want to get rid of, therefore, is the reliance on the use of sscanf to parse a line from a Lottery history file into an entry into a game history database.  That needs to go badly!  Especially if we are going to be "tokenizing" things like the Federalist Papers, or articles from Wikipedia, or whatever else we might encounter.  So, this is going to present an interesting challenge, in that respect, i.e., make improvements to the original code so that it will work with more history files for more types of games. At the same time, enabling the algorithms that we are using to try to make "predictions" about what pairs with whatever, to work with other types of tokens, that is according to a more robust probability model.

    void CLottoProDoc::find_copairs (int p, int q, int SEARCH_DEPTH)
    {
        _int8 copairs1 [64][64];
        _int8 copairs2 [64][64];
    
        int i,j,k,l,N,p1,q1;
        char str [DEFAULT_BUFFER_SIZE];
        sprintf_s (str,DEFAULT_BUFFER_SIZE,"searching for copairs %02d-%02d ... \r\n\r\n",p,q);
        strcat_s (outbuff,OUTPUT_BUFFER_SIZE,str);    
    
        for (i=0;i<_G::MAX_NUMBER+1;i++)
        for (j=i;j<_G::MAX_NUMBER+1;j++) {
            copairs1 [i][j]=0;
        copairs2 [i][j]=0;
        }
        for...
    Read more »

  • Simple Simon met a "pi" man - going to the fair.

    glgorman06/08/2026 at 01:48 0 comments

    I was thinking about doing another log entry about Lottery systems, and the probability that certain groups of numbers will come up, so as to take a closer look at certain types of problems, like what is the actual probability that someone's birthday will come up, on any particular Lotto type game, as well as the more general problem with respect to the probability that any particular set of 2, or even 3 numbers might come up, like 3 and 14, of course.  Yet that leads to the more general problem of "how long will it be on average for 3-14 and 15 to come up", depending on the number of balls in the particular game, and the number of numbers picked in a particular draw.  But that will have to wait.  Maybe I will write a short C++ program that prepares a chart for the waiting time for each variation, with fans of e = 2.71828, which can be played on a Lottery as 2-7-18-28, along with "whatever" taking the absolute longest, that is, besides matching all five or six numbers, according to the game.

    Instead, I want to look at something else, something that is, that I hope we can get some help with, that is, in the realm of "vibe coding," perhaps.

    Yet if in the previous example, I was thinking about how the phrase "I LOVE A MYSTERY" might encode the sequence 5-8-30-14, or else the phrase "I LIVE OUT MY DREAMS" might encode the set 5-8-13-14-30, it is easy to see that neither of these came up on the California Super Lotto last night, even if the five and eight did, but oh so sorry, no 13, and no 14, no hidden pi, or any pie for that matter to be had.  No pi, pie, too bad so sad, and better luck next time?

    Or else wait a minute.  I didn't even play the Super Lotto last night, and not just because I am not even in California right now.  At least the 5 and 8, and then I noticed something that I find interesting, nonetheless.  Something that fits in nicely with my examples "DOWN THE PRIMROSE PATH", "DOWN THE HATCH", "DOWN THE RABBIT HOLE" and of course "DOWN THE DRAIN" and that would of course also mean that I should have mentioned something along the lines of "DOWN BY THE RIVER SIDE" as an example of something that can be, and perhaps should be done when we have "lists of words" that we somehow "need" to use in a sentence, or in a composition, as a whole, and where we need to be a bit more flexible with word order, or else when we need to accommodate the need to have an occasional word or phrase inserted into the response.

    So I spotted something, something kind of fun, since I was working on a meme that begins with 5 and 8 representing the word "LOVE", yet what about "FLOWERS" which encodes the sequence 8-5-40, which of course does not work, with last night's Super Lotto, yet "A FLOWER WILTS!" does encode the sequence 5-8-45-10. Just as easy to think of as the "easy e = 2-17-18-28", and just as hard to predict, of course!

    A flower wilts.
    A flower wilts when it has no water.
    A flower wilts when it has no air.
    A flower wilts in one year!
    A flower wilts.

    Ah poetry!  Using "random numbers as seed values!"

    Easier said than done?

    Q.E.D

    For now!

  • The Birthday Problem and other Roads Less Travelled

    glgorman05/31/2026 at 02:21 0 comments

    Looking at the Fantasy Five Birthday problem, vs. the Power Ball birthday problem, and how to calculate the "Birthday Problem" for any Lotto-type game.  In the California Fantasy Five game, there are 39 numbers to choose from.  So, if someone's birthday was 12-25, then what is the chance that 12-25 will come up on the California Fantasy Five, which, as stated, has 39 numbers, vs. the Power Ball, which has 69 numbers?  According to my classic HP-15 calculator, the birthday probability for the California Fantasy 5 game is approximately 1 in 74.10, whereas the "Birthday Probability" for the Powerball is approximately 1 in 234.60.

    One way to do this calculation is, of course, to use the "choose function C" to compute the value C(39,5)/C(37,3) for the California Fantasy Five, or to likewise simply compute C(69,5)/C(67,3) for the Powerball, obtaining the 234.60 value in that case, as stated.  So this would appear that we could find the general case for any Lotto-type game as C(M, N)/(M-2, N-2), which uses either a bit of multiplication and division to get done, or else we could perhaps generalize it a bit further, by discovering a pattern that seems to exist in the calculations, when it is written out by hand, so that it also presents an opportunity to perform some optimization. 

    Thus, it would appear that the problem of finding the probability of having two specific numbers come up in a five-numbered draw set actually generalizes to the value: N*(N-1)/20.  This, of course, might have some interesting implications.

    With an LLM, of course, we are going to want to replace some Lottery concepts like "draw set" with "reading frame," and we are going to want to think about the idea of grammars that might have 10's of thousands of symbols or more, so that maybe problems of this type are going to turn out to be important when we think about memory allocation, thread pool size, and so on.  Yet drawing from the examples of "down the rabbit hole", "down the primrose path", "down the hatch," and "down the drain", it should be easy to contemplate that there should be some way of developing metrics for measuring the probability that two randomly chosen symbols will occur within some context, so as to be able to do some work in the problem of scaling things like "neuronal weights" associated with relevance, or significance, that is, within the framework of the hopefully more general transformer based attentional systems.

    Maybe.  Or we can go back to asking "What happened to the other dollar?"

  • Which way do we go from here?

    glgorman05/24/2026 at 16:23 0 comments

    There is a whole bunch of stuff that I want to do, like talk about the "phone book trick from the movie Rain Man", and how there is an urban legend that it was actually done by the magician and memory expert Harry Lorayne, who was also one of the authors of a best-selling book entitled "The Memory Book."  Even if Lorayne didn't actually memorize the entire Manhattan telephone directory, it was said that he did memorize quite a few pages, just to prove that it could be done.  

    This is something that I want to talk about in a different context, i.e., with respect to Lottery systems, whether they work or not.  Yet they will have utility, I think, in finding interesting and novel ways to approach the problem of designing an LLM from scratch.

    Since this is going to take a while to explain, perhaps it would be best to start with an excerpt directly from the book, or just simply borrow a description for one popular system for memorizing numbers from the Wikipedia article on the so-called "Mnemonic Major System."

    So here is "The system", shamelessly borrowed from Wikipedia:

    --------------------------------------------------------------------------------------------------------------------------------------------------------------

    0/s/, /z/s, soft c, zZero begins with z (and /z/). Upper case S and Z, as well as lower case s and z, have zero vertical strokes each, as with the numeral 0. The alveolar fricatives /s/ and /z/ form a voiceless and voiced pair.
    1/t/, /d/, /θ/, /ð/t, d, th (as in thing and this)Upper case T and D, as well as lower case t and d have one vertical stroke each, as with the numeral 1. The alveolar stops /t/ and /d/ form a voiceless and voiced pair, as do the similar-sounding dental fricatives /θ/ and /ð/, though some variant systems may omit the latter pair.
    2/n/nUpper case N and lower case n each have two vertical strokes and two points on the baseline.
    3/m/mLower case m has three vertical strokes. Both upper case M and lower case m each have three points on the baseline and look like the numeral 3 on its side.
    4/r/rFour ends with r (and /r/ in rhotic accents).
    5/l/lL is the Roman numeral for 50. Among the five digits of one's left hand, the thumb and index fingers also form an L.
    6/tʃ/, /dʒ/, /ʒ/, /ʃ/ch (as in cheese), j, soft g, shUpper case G looks like the numeral 6 and lower case g looks like the numeral 6 rotated 180°. Lower case script j tends to have a lower loop, like the numeral 6. In some serif fonts, upper case CH, SH and ZH each have six serifs. CHurch has six letters. The postalveolar affricates /tʃ/ and /dʒ/ form a voiceless and voiced pair, as do the similar-sounding postalveolar fricatives /ʃ/ and /ʒ/.
    7/k/, /ɡ/k, hard c, q, hard g, ch (as in loch),Both upper case K and lower case k look like two small 7s on their sides. In some fonts, the lower-right part of the upper case G looks like a 7. G is also the 7th letter of the alphabet. The velar stops /k/ and /ɡ/ form a voiceless and voiced pair.
    8/f/, /v/f, ph (as in phone), vLower case script f, which tends to have an upper and lower loop, looks like a figure-8. The labiodental fricatives /f/ and /v/ form a voiceless and voiced pair.
    9/p/, /b/p, bUpper case P and lower case p look like the numeral 9 flipped horizontally. Lower case b looks like the numeral 9 turned 180°. The labial stops /p/ and /b/ form a voiceless and voiced pair.

    --------------------------------------------------------------------------------------------------------------------------------------------------------------

    Now what do we do with it?  Well when taking a short phrase like "I LOVE A MYSTERY", or "I LIVE OUT MY DREAMS", we can easily see that the first sentence encodes the sequence 5-8-30-14 and the second sequence encodes the pattern 5-8-13-14-30, that is to say, if we have a constraint that what we want to do is to try to find a way to create some kind of "magic cookie" as it were, that somehow encodes a set...

    Read more »

  • Another Day - Another 3000 Words: Maybe

    glgorman04/29/2026 at 21:29 0 comments

    The way that I figure it, what if I could write 3000 words each day, of "new material" which is somehow relevant to training an LLM, by whatever means?  Then it would take "only" 333.33 days to get to one million words of hopefully, new, original, and just as important, potentially "useful" content.  Yet there are other things that an LLM might be trained on, like MIDI data derived from audio input.

    Or patterns in DNA, maybe?

    I mean, why not?  If it seems like a reasonable thing to try, and if the advent of vibe coding might somehow make the otherwise gargantuan task of retooling something like 100,000 lines of code, spread across at least 20 libraries, actually seem feasible.  Then maybe that is yet another domain for AI, which is not only highly important, but also relevant, even if at the same time it might seem like such explorations are perhaps being underserved, at least within the Open Source community.

    I know that I am going to want to be doing things with "transformers"; however, they actually work.  Yet, there is also the obvious, yet seemingly all too often overlooked, task of getting the data into the program, to begin with.  Maybe that isn't as hard as it might seem.

    Yet there is also this idea of generating as much "metadata" as possible, maybe, based on traditional methods.  Not just because it is possible to do, but because, maybe, just maybe, the transformer model would be a lot more efficient if some of the initial heavy lifting were done according to standard coding techniques.

    Interestingly enough.  When I took some Atari BASIC code and experimented with using a "genetic" algorithm to approximate a mathematical function, I found that I got better results if I ran 32 models in parallel, and let them compete with each other, let's say for 64 iterations each, than I was getting if I ran a single model for 2048 iterations.

    So I think that there are even more avenues that need to be explored, as far as how this sort of thing might work out in the long run.  Clearly, at least in this particular case, a competition among the so called "mixture of experts", multi-model approach, seems to work best.

  • Somewhere in the Once Upon a Long Time Ago

    glgorman04/29/2026 at 02:36 0 comments

    Right!  Whatever I just said. 

    Well, anyway.  

    What if this project isn't really about trying to predict the Lottery, and it is especially not about trying to predict any current game using a database of numbers from as long ago as 2012, which hasn't been updated in at least that long?  

    No, that's not the point.

    That is only the beginning.

    Really, an LLM, or so we are told, is not so much about words or numbers as it is said to be about "tokens".  Whatever those tokens represent is, well, whatever they represent.  Maybe macroblocks from a JPEG of a medical X-ray, or maybe waveform data from a seismograph, or "whatever."  That, of course, was the big surprise, with the discovery that "maybe attention is all you need."  Yet somehow, the "pairs of pairs" meme seems to have stuck with researchers as one of the best predictors, at least as far as text generation is concerned.  So does it make sense to try to, in effect, create some kind of "lottery-like" application that increases the number of balls in a particular game to some "reasonable number" like 65535, or something like that, along with an appropriate increase in the size of the "reading frame?"

    Maybe.  Maybe there are a lot of things that need to be tried in parallel, and it is just as important to come up with methodical techniques for formalizing the parallelization of whatever processes are being put to the test.  Something else comes to mind, such as whether to try to store the topology of a neural network in a traditional fashion, such as by extending the capabilities of Spice, or Ki-Cad, or whether it is actually much easier to do something like that with a relational database like MySQL . Yet, that of course is where the newfound meme of "vibe coding" might come in handy, that is to say, as things grow, and grow, and grow.  Which they will.

    Likewise, we might want to add or delete a source from the training set.

    So a "make system" is also needed.

    This thing is huge, but not impossible.  

    Yes, there will be back propagation and gradient descent.

    Maybe even a seemingly infinite number of monkeys, all leaping about.  Having private "conversations" with each other, and so on.

    This is an ambitious project.

    In the meantime, thank you for choosing Johnny Cab.  Hope you enjoy the ride!

View all 6 project logs

Enjoy this project?

Share

Discussions

Similar Projects

Does this project spark your interest?

Become a member to follow this project and never miss any updates