28taidan - Stardance

@28taidan on BadGPT · 1 day ago

8h 53m 47s logged

I implemented self-attention! I will try to explain it here, but will probably do a terrible job: basically, it lets the model decide which tokens in the input are more important than the others. The first thing I did was scrap the wavenet model and switch to a simple bigram model. The second thing I implemented was bag of words, which edits the input tokens so every element is the average of that element and all that came before it. This lets all of the tokens “talk” to each other. This is all fine and dandy, but since everything is averaged you don’t know where things came from, and positional data is lost, so we lose a lot of resolution, which sucks. So we have to keep the positional data. Instead of just averaging all the elements that came before, we expand each averaged element into a row that lists all the elements instead of averaging them into one number. So the first row could be [1, 0, 0], the second [1, 2, 0], the third [1, 2, 3], etc. We do this by masking with the tril function from numpy (very cool). But this is problematic, because all of the elements are given the exact same weight. A letter 34 characters back gets the same weight as the last character in the input (which is the letter immediately preceding the predicted letter, the output). To fix this, we add self-attention! This is modeled after the groundbreaking paper Attention Is All You Need, which started the whole AI hype thing. How it works is each token goes through a linear layer (which multiplies it by learned weights) called the key, and another called the query. The query is kind of like what that token is looking for, and the key is kind of like what that token has. Each token’s query is multiplied with the keys of the other tokens, and that tells you which tokens have the stuff that token is looking for, so they can be given more weight.
That is a single head of attention; there are multiple heads that can specialize in different things. This, in theory, should increase performance. But it also drastically increases the amount of time it takes to train. The model only has ~56k parameters, which is about half of the wavenet model, but takes a lot more time to do its thing. This is the output from the results of more than 13 hours of training. As you can see from the loss graph, the loss was decreasing at a steady rate, so more training could probably improve it. Plus, this was at a highish learning rate the whole time (0.1), so decreasing it would probably further improve it. But I can’t do anything on my computer while this is training or it runs out of RAM, and I have to use my computer now, so I will be moving on for now.
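Here is a minimal NumPy sketch of the two steps described above: the equal-weight averaging done with a tril mask, then one masked attention head. The sizes, the random weights, and the names `W_q`, `W_k`, `W_v` are assumptions for illustration. The value projection comes from the paper rather than from the explanation above, and this is not the actual BadGPT code.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, head_size = 4, 8, 16          # tokens, embedding size, head size (all assumed)
x = rng.normal(size=(T, C))         # one sequence of T token embeddings

# Causal mask: row t has ones for positions 0..t and zeros for the future.
mask = np.tril(np.ones((T, T)))

# "Bag of words": every earlier token gets exactly the same weight.
avg_weights = mask / mask.sum(axis=1, keepdims=True)
x_avg = avg_weights @ x             # row t is the mean of x[0..t]

# One attention head. The projections would be learned; here they are random.
W_q = rng.normal(size=(C, head_size)) * 0.1
W_k = rng.normal(size=(C, head_size)) * 0.1
W_v = rng.normal(size=(C, head_size)) * 0.1
q, k, v = x @ W_q, x @ W_k, x @ W_v

scores = (q @ k.T) / np.sqrt(head_size)         # how well each query matches each key
scores = np.where(mask == 1, scores, -np.inf)   # a token may not look at the future
scores -= scores.max(axis=1, keepdims=True)     # keeps exp() numerically stable
att_weights = np.exp(scores)
att_weights /= att_weights.sum(axis=1, keepdims=True)   # softmax over each row
out = att_weights @ v               # weighted mix of values, shape (T, head_size)
```

Both `avg_weights` and `att_weights` are lower-triangular with rows that sum to 1. The only difference is that the attention weights depend on the data instead of being uniform.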

@28taidan on BadGPT · 8 days ago

6h 32m 50s logged

I increased the block size from 8 characters to 64. This means that instead of the model taking in the previous 8 characters and outputting the next one, it takes in the previous 64 characters and outputs the next. This means that it should be able to keep track of longer words so it doesn’t get lost in them. However, it doesn’t seem to be improving the model too much, although that might be because I haven’t trained it as much as the others. I’ve done roughly 200,000 epochs, which took a long time, since this model is much larger (110,415 parameters vs 45,000) and has 6 hidden layers as opposed to 2, which means it is a lot slower. As you can see in the loss graph, it probably could make further gains, although it seems to be mostly good.
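To make the block size concrete, here is a small sketch of how training examples could be cut from a text with a sliding window. The function name `build_examples` and the sample string are made up for illustration, and the real data pipeline may differ.

```python
def build_examples(text, block_size):
    """Return (context, next_char) pairs using a sliding window over text."""
    examples = []
    for i in range(len(text) - block_size):
        context = text[i:i + block_size]    # the previous block_size characters
        target = text[i + block_size]       # the character the model must predict
        examples.append((context, target))
    return examples


sample = "to be or not to be"
small = build_examples(sample, block_size=8)
print(small[0])   # first context window and the character that follows it
```

With `block_size=64` the same function would need at least 65 characters of text to produce a single example, which is one reason a bigger block size only pays off on longer training text.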

@28taidan on BadGPT · 10 days ago

1h 33m 47s logged

Trained the wavenet architecture on a file with just Shakespeare in it. It’s got the structure down well, and sometimes can do words and wordlike outputs, but generally spouts straight gibberish. I will try to follow a lecture to build a better Tokenizer and Transformer next time. Right now each character is treated as a token and embedded into 10 dimensions; it would be better if pairs of characters were treated as tokens and embedded. Increasing the context size could also help; it currently takes in the previous 8 characters and outputs 1. This is its output:

@28taidan on BadGPT · 11 days ago

8h 33m logged

Implemented the wavenet architecture. Usually, during the flattening phase, the tensor goes from the shape (batch num, block size, embedding dimensions) to (batch num, block size * embedding dimensions). This squishes all of the embedded characters into one dimension. Wavenet does that more gradually: it slowly merges the different letters. So the first two letter embeddings get squished into one, then the next two, etc. In the next layer you have the first four, then the next four, until by the final layer you have just one group of all the letters, which is the same as the original method, just a much more gradual approach. The first image is the training loss; the huge cliff is when the learning rate decreased. The second image is a visualization of the wavenet architecture from the paper. You can see that in each layer, nearby groups of input get merged together. In the original method, the graph would look like all of the inputs going to the 1st dot in the hidden layer, then continuing up. Next I will be actually building a GPT instead of this names project.
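The shape change described above can be sketched with plain reshapes. The helper names `flatten_all` and `flatten_consecutive` and the sizes (batch 4, block size 8, 10 embedding dimensions) are assumptions for illustration. In the real model a Linear layer sits between successive merges, so the later merges act on hidden activations rather than raw embeddings.

```python
import numpy as np

def flatten_all(x):
    """Original method: (batch, block, emb) -> (batch, block * emb)."""
    B, T, C = x.shape
    return x.reshape(B, T * C)

def flatten_consecutive(x, n):
    """Gradual method: merge every n neighbours, giving (batch, block // n, emb * n)."""
    B, T, C = x.shape
    return x.reshape(B, T // n, C * n)

x = np.arange(4 * 8 * 10, dtype=float).reshape(4, 8, 10)
print(flatten_all(x).shape)              # everything squished at once
print(flatten_consecutive(x, 2).shape)   # only pairs of letters merged
```

After one `flatten_consecutive(x, 2)`, group 0 of each example holds the embeddings of letters 0 and 1 side by side, group 1 holds letters 2 and 3, and so on.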

@28taidan on BadGPT · 13 days ago

6h 44m 24s logged

I did a bunch of things to prep for implementing a wavenet architecture, which basically changes the model so instead of taking in all the characters of input at once and squishing them all into the first layer, it gradually puts them in at each separate layer. First, I put more stuff in classes! This didn’t really change the functionality, but instead of embedding the characters and flattening the model in separate chunks, I put them in their own classes so they can get put in the list of layers, and we can just run the input through each layer in the list, and compare it with the target at the end, instead of having half the layers in the list and half as separate chunks, which is a lot smoother. So the entire model is just:

model = Sequential([  
    Embedding(vocab_size, n_embd),
    FlattenConsecutive(2), Linear(n_embd * 2, n_hidden, bias=False), BatchNorm1D(n_hidden), Tanh(),  
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden, bias=False), BatchNorm1D(n_hidden), Tanh(),  
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden, bias=False), BatchNorm1D(n_hidden), Tanh(),  
    Linear(n_hidden, vocab_size),
])

Which I think is really cool. I also improved the loss graph, previously it looked like the screenshot on the right, because each iteration’s loss is graphed. Because our batch size is 32, which means in each iteration we look at just 32 inputs/output, a good deal of what the loss will be depends on how lucky we are with the input/output. Over a great number of iterations, it averages out into the correct value, though, so to fix the graph, we instead plot the average of every 1000 iterations, which lets us see a lot more detail that otherwise would be lost in the noise (the two loss graphs I added are not of the same model, but you get the idea). Anyway now I’m going to actually implement the wavenet architecture!
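The loss smoothing is simple enough to sketch in a few lines. This assumes the per-iteration losses are kept in a list or array; the function name `smooth_losses` and the fake noisy loss curve are made up for illustration.

```python
import numpy as np

def smooth_losses(losses, window=1000):
    """Average every `window` consecutive losses into one point for plotting."""
    losses = np.asarray(losses, dtype=float)
    usable = (len(losses) // window) * window    # drop the incomplete tail
    return losses[:usable].reshape(-1, window).mean(axis=1)

# Fake data: a slowly falling loss buried in per-batch noise.
rng = np.random.default_rng(0)
steps = 5000
noisy = np.linspace(3.0, 2.0, steps) + rng.normal(scale=0.3, size=steps)
smoothed = smooth_losses(noisy)
print(smoothed.shape)   # one point per 1000 iterations
```

Plotting `smoothed` instead of `noisy` gives one point per 1000 iterations, so the downward trend is visible instead of a thick noisy band.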

@28taidan on BadGPT · 17 days ago

3h 13m 48s logged

I followed Karpathy’s lecture on backpropagation, where he implemented backpropagation manually. In the series, he’s using PyTorch, which does it for you, which is why he’s doing it manually. I’m not using PyTorch, so the only way to do it is to do it manually, so I didn’t get to add too much to the code; I just verified that I had implemented it correctly and learned a bunch. It is still pretty confusing though, and I couldn’t solve any of the exercises in the lecture without watching his explanations. Anyway here’s a name it generated that I think is pretty cool.

@28taidan on BadGPT · 18 days ago

44m 18s logged

Added a demo page! You can check it out here: https://luchik28.github.io/BadGPT/. I saved the model weights, built a script that reconstructs the model using those weights, and hosted it using github pages.

@28taidan on BadGPT · 18 days ago

3h 7m 1s logged

Put my model into the Layer class, so I can easily create lots of layers. The graph shows the activation distribution by layer. Each layer is pretty similar, as you can see. This devlog marks the end of the fourth lecture in Andrej Karpathy’s makemore series, out of 7, so we’re getting there! That’s just under 8 hours of lecture watched (so I spend roughly 5 minutes coding for every minute of lecture watched). Pretty interesting.

@28taidan on Vex V5 Match Analyzer · 23 days ago

2h 4m 16s logged

Wrote all of the initial code, am currently testing it on livestream data from the 2026 Kalahari tournament (roughly 24 hours of video, and 350 matches). It goes very slowly, it’s been running for like 5 hours and still has a good chunk to go. The screenshot is of the visualization software with a demo match loaded.

@28taidan on BadGPT · 23 days ago

13h 51m 3s logged

I was playing around with how changing the number of epochs and other variables affected the output, and seeing how that differed from before I implemented batch normalization, so I tried to put the epochs up a lot. I ended up training it for 20,000,000 epochs, which took like 8 hours, which is why this devlog has so much time logged. It didn’t really improve it much; you can see by the loss graph it didn’t go down very much at all, so I’ll need to change something in how the code works. It was still fun though.

@28taidan on BadGPT · 26 days ago

2h 28m 27s logged

I worked on switching to a Jupyter notebook instead of just a Python file. This made it easier since I didn’t have to constantly retrain the model, and I could also print stuff out inline. I also added batch normalization, which rescales the values (really their standard deviation) before they go into the activation function (tanh) so that they land in the range where tanh isn’t flat. tanh squishes all values between -1 and 1, so if all of the numbers are much greater than 1 or much less than -1, then they all essentially get squished to either -1 or 1, which makes the model lose a lot of the data. But if you scale all the numbers so that they are closer to 0, then the outputs of the activation function are a lot more spread out and the data is preserved. (I got the photo from google, I think it’s helpful when visualizing it).
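A quick way to see the effect is to count how many tanh outputs are stuck near -1 or 1 before and after normalizing. This is a simplified sketch: it assumes a batch of 32 with 100 hidden units and deliberately oversized inputs, and it leaves out the learned scale/shift and running statistics that a full batch normalization layer has.

```python
import numpy as np

def batchnorm(h, eps=1e-5):
    """Normalize each column (hidden unit) to mean 0 and std 1 across the batch."""
    mean = h.mean(axis=0, keepdims=True)
    var = h.var(axis=0, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
pre = rng.normal(scale=10.0, size=(32, 100))   # pre-activations that are far too large

# Fraction of tanh outputs that are basically pinned at -1 or 1.
saturated_before = np.mean(np.abs(np.tanh(pre)) > 0.99)
saturated_after = np.mean(np.abs(np.tanh(batchnorm(pre))) > 0.99)
print(saturated_before, saturated_after)
```

Note that the normalization subtracts the batch mean as well as dividing by the standard deviation, so it both centers and rescales the values.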

@28taidan on BadGPT · about 1 month ago

1h 3m 1s logged

I made a bunch of little improvements, including increasing the character encodings to 10 dimensions up from 2, increasing the epochs to 200,000, and also decreasing the learning rate from 0.1 to 0.01 for the last 50,000 epochs. Although the loss is significantly lower than Andrej Karpathy’s makemore (which I modeled my model after), its names are worse. Karpathy got 2.17 ish on the testing dataset, while I got 1.776. However, Karpathy’s model generated slightly abnormal but reasonably word-like names, like “jeron” and “ham”, while mine were a little more random (“xxxhorlis” is one of them). I’m not sure how to improve from here. I don’t think increasing the epochs will do much: it already took around 10 minutes to train this model, and I think we’ll get diminishing returns. I will, as always, continue on to the next video in Karpathy’s series, hoping that his tips will help improve it.
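The learning rate drop can be written as a tiny schedule function. This sketch uses the numbers from this devlog (200,000 steps, with the last 50,000 at 0.01) but runs them on a toy one-parameter problem, which is an assumption for illustration and has nothing to do with the real model. The names `learning_rate` and `sgd_step` are made up.

```python
def learning_rate(step, total_steps=200_000, decay_steps=50_000):
    """0.1 for most of training, then 0.01 for the final decay_steps steps."""
    return 0.1 if step < total_steps - decay_steps else 0.01

def sgd_step(params, grads, lr):
    """Move each parameter a small step against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]

# Toy problem (not the real model): minimize (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0
for step in range(200):
    lr = learning_rate(step, total_steps=200, decay_steps=50)
    grad = 2 * (w - 3)
    (w,) = sgd_step([w], [grad], lr)
print(w)   # ends up very close to 3
```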

@28taidan on BadGPT · about 1 month ago

2h 50m 37s logged

I tuned the learning rate (the magnitude of how much it changes the weights, which is multiplied by the gradient of each weight to get the final change), added better printouts when training (a neat progress bar/tracking, the loss, and a timer), and also used matplotlib to plot the embeddings. The graph is how the model arranged each letter in space. What’s interesting is that similar letters (like the ending character “.” and “e”, which both are commonly found near the end of the word) are in similar spots, so the model has sorted out some sort of meaning from their placements, which I think is pretty cool. The names are still quite nonsensical. I think I might have to enlarge the model a little bit to improve performance, which means waiting longer for it to train :(

@28taidan on BadGPT · about 1 month ago

5h 49m 47s logged

I improved my model based on Karpathy’s makemore part two tutorial and a 2003 paper by Bengio et al. (link is https://jmlr.org/papers/volume3/tmp/bengio03a.pdf); it was really interesting. I increased the input size, so it takes in the previous 3 letters as an input instead of 1, and encoded each letter in two dimensions instead of using one-hot encoding. I also had to start using NumPy because I needed efficient matrix multiplication (NumPy does it in C in a heavily optimized fashion; I considered trying to do this myself but decided it was out of the scope of the project). Doing it in just Python would be way too slow.
I also optimized it a little bit, having it select 32 examples randomly from the data set for each epoch, instead of just using the whole thing, which made it a lot faster while still training over the whole set. Hopefully this will make it produce more accurate results!
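Here is a sketch of those two ideas in NumPy: looking up a 2-dimensional embedding for each of the 3 previous letters, and picking 32 random examples per step. The dataset here is random placeholder indices and the variable names are assumptions, not the actual BadGPT code.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_embd, block_size, batch_size = 27, 2, 3, 32

# Placeholder dataset: each row holds the indices of the 3 previous letters.
X = rng.integers(0, vocab_size, size=(1000, block_size))
Y = rng.integers(0, vocab_size, size=1000)       # index of the letter that follows

C = rng.normal(size=(vocab_size, n_embd))        # embedding table, one row per letter

# Minibatch: 32 random examples instead of the whole dataset.
ix = rng.integers(0, X.shape[0], size=batch_size)
Xb, Yb = X[ix], Y[ix]

emb = C[Xb]                                            # (32, 3, 2): a 2-D vector per letter
flat = emb.reshape(batch_size, block_size * n_embd)    # (32, 6): input to the first layer

# The lookup gives the same result as multiplying one-hot vectors by the table.
one_hot = np.eye(vocab_size)[Xb]                       # (32, 3, 27)
```

The last two lines are only there to show the connection to one-hot encoding: `one_hot @ C` equals `emb`, but the direct lookup skips a multiplication that is almost entirely zeros.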

Ship

@28taidan on The Family Plan · about 1 month ago

The Family Plan helps manage logistics for families that have to manage driving their kids around to different extracurriculars all the time. Just put in when drivers are available, and when and where kids need to be, and it calculates the most optimal plan to get everyone where they need to be with as little time in the car as possible.
I was inspired to make this project because my mom spends hours every weekend trying to come up with a plan for how to get me and my siblings to all of our various after school activities. I made this to try to simplify and speed up the process, and also include the rest of the family in that process (since they can simply put in when they are available and when they need to be picked up/dropped off).
Something that was challenging was figuring out how to get the most optimal plan. I knew that it would involve getting people to spend the least amount of time in the car (adjusting for tiers, or how important a person’s time is), but it was complicated to figure out how to do that. The key realization was that you can’t decide who carpools by looking at directions on a map. Two kids heading “opposite” ways might still share a ride if one stop is on the road to the other. So instead the app asks a routing engine for real drive times and pools rides whenever the extra driving is small enough to be worth it. This breaks slightly whenever the destinations don’t have an address (since the engine I use for that, OSRM, is incomplete), since then it doesn’t have drive times. However, it works well enough for now. To use, just go to the website (hosted on Vercel), and create a new family. Add some drivers and kids, and play around with the routing system. Hopefully this is useful!

4 devlogs
10h
10.69x multiplier
105 Stardust

@28taidan on The Family Plan · about 1 month ago

4h 46m 20s logged

Finished the planning logic, and some final bug fixes before shipping. The planning works by assigning tiers to each person, which prioritizes their time over other people’s time. So if, say, someone really doesn’t like driving people around, you can give them a high tier and it will try to protect their time more. It then runs through all the possible plans (which is really inefficient but works, since there aren’t too many possible plans) to find one that works well. I also did some bug fixes, cleaning up the UI and other things from the fixes. Some things I still might add in the future are allowing people to integrate their calendars, emailing/texting all the per-person weekly plans to people directly instead of just downloading the images, and better tracking for where each driver is (it assumes the driver starts from home, but this is unrealistic, since the driver probably goes to work or something in the middle of the day. But they wouldn’t put that in the plan, since it doesn’t have anything to do with driving their kid around), so I might try to find out how to fix that. But I want to ship it first to get feedback on the core process.
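The "try every plan" idea can be sketched like this. Everything here is a simplified assumption for illustration: the ride names, the drive times, the tier numbers, and the cost formula (tier multiplied by minutes) are made up, and the sketch ignores time conflicts and carpooling, which the real app has to handle.

```python
from itertools import product

def best_plan(rides, drivers, minutes, tier):
    """Try every driver-per-ride assignment and keep the cheapest one.

    minutes[(driver, ride)] is the drive time; a missing entry means the
    driver is unavailable for that ride. tier[driver] weights their time.
    """
    best, best_cost = None, float("inf")
    for assignment in product(drivers, repeat=len(rides)):
        cost, feasible = 0.0, True
        for ride, driver in zip(rides, assignment):
            m = minutes.get((driver, ride))
            if m is None:            # driver can't do this ride, so skip the plan
                feasible = False
                break
            cost += tier[driver] * m
        if feasible and cost < best_cost:
            best, best_cost = dict(zip(rides, assignment)), cost
    return best, best_cost

rides = ["soccer", "piano"]
drivers = ["mom", "dad"]
minutes = {("mom", "soccer"): 20, ("dad", "soccer"): 30,
           ("mom", "piano"): 15, ("dad", "piano"): 10}
plan, cost = best_plan(rides, drivers, minutes, tier={"mom": 2.0, "dad": 1.0})
print(plan, cost)
```

The number of plans grows as drivers to the power of rides, which is why this only works while there aren’t too many possible plans.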

@28taidan on BadGPT · about 1 month ago

1h 5m 18s logged

I finally got it to generate some names! I watched the first half of Andrej Karpathy’s makemore video, and built a 2D array counting how many times each letter follows each other letter in the training set. Then, when generating a name, it takes the row of counts for the previous letter in the name, multiplies each count by a random number from 0 to 1, and chooses the letter with the highest result as the next letter. This doesn’t work too well, so I will spend some time improving it, but it’s a start! Some names it generated are sampled below.
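Here is a small sketch of the counting and the "multiply by a random number, take the max" picker, next to a picker that samples in proportion to the counts. The two-name dataset, the "." start/end marker, and all function names are assumptions for illustration, not the actual code.

```python
import random
from collections import defaultdict

def count_bigrams(names):
    """counts[a][b] = how many times letter b follows letter a ('.' marks start/end)."""
    counts = defaultdict(lambda: defaultdict(int))
    for name in names:
        chars = ["."] + list(name) + ["."]
        for prev, nxt in zip(chars, chars[1:]):
            counts[prev][nxt] += 1
    return counts

def next_scaled_max(counts, prev, rng):
    """The method from this devlog: scale each count by a random number, take the max."""
    return max(counts[prev], key=lambda ch: counts[prev][ch] * rng.random())

def next_proportional(counts, prev, rng):
    """Alternative: pick each letter with probability proportional to its count."""
    letters = list(counts[prev])
    weights = [counts[prev][ch] for ch in letters]
    return rng.choices(letters, weights=weights, k=1)[0]

def generate(counts, pick, rng, max_len=20):
    name, prev = "", "."
    while len(name) < max_len:
        prev = pick(counts, prev, rng)
        if prev == ".":              # end-of-name marker
            break
        name += prev
    return name

names = ["emma", "ava"]
counts = count_bigrams(names)
rng = random.Random(0)
print(generate(counts, next_scaled_max, rng), generate(counts, next_proportional, rng))
```

The two pickers do not give the same distribution. With counts of 3 and 1, the scaled-max picker chooses the rarer letter 1/6 of the time, while proportional sampling chooses it 1/4 of the time, so the scaled-max version leans harder toward common letters.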

@28taidan on BadGPT · about 2 months ago

5h 20m 19s logged

Having completed the MLP I described in my last devlog, I set out to try to follow the next tutorial in the list I found. However, after I listened to the description of it, which is basically just a model that can take in a database of words and produce more words that sound similar, I foolishly believed I could try to build something similar without following the tutorial. I did take the database of names from the tutorial, since I will need those to train the model.

I started by transforming the inputs (the previous letter in the word) into numbers, and then the output (a number) back into a letter. My first attempt just put in the ASCII code for the letter, and the rounded down output number. The first issue was I used the entire database to train, which took forever and quickly hit the maximum recursion depth, so I switched to using only the first 500 examples (using the entire database, combined with super high differences between the output and the expected output resulted in the initial loss being 1,934,198,245, compared to a loss of like 4 when just testing my MLP).

After that, I got it to actually finish training without crashing. However, the loss didn’t go down, and it always outputted a null character. I realized I had been really stupid: the tanh function at the end of every neuron in the MLP squishes the result between -1 and 1, so the result could only ever land on ASCII code 0 or 1, which are non-printing control characters (0 is the null character).

So instead, I changed it so that instead of outputting the ASCII code directly, it outputs a 27-long array, where each index represents a letter (and the last represents end of name) and each value is the probability of that letter being chosen. To decode the results, it just has to find the highest probability, check that it’s not the last element, and add 97 to its index to get the ASCII code (because the letter “a” is at 97, so if it’s index 0 and meant to be “a” it has to adjust). This led it to actually output letters, but it always outputted the same letter, as shown in the screenshot. The loss dropped rapidly and then plateaued at around 2000.
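The decoding step described above fits in a few lines. The function name `decode` and the returned `None` for end of name are choices made for this sketch.

```python
def decode(probs):
    """Turn a 27-long list of probabilities into a letter, or None for end of name."""
    best = max(range(len(probs)), key=lambda i: probs[i])   # index of the highest value
    if best == len(probs) - 1:
        return None                   # the last slot means "end of name"
    return chr(best + 97)             # "a" is ASCII 97, so index 0 maps to "a"

example = [0.0] * 27
example[4] = 0.9                      # highest probability at index 4
print(decode(example))
```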

I found some bugs (gradient accumulating over time instead of resetting each epoch) and made some changes (instead of inputting an ASCII code, it inputs in the same format as the output), but nothing had any effect. After thinking, I realized that the model is just picking the most common letter, and always outputting it. That’s why the loss can’t get any lower than 2000, because that’s the limit of how good this strategy can get. However, I have no idea how to fix this issue, and somehow teach the model the meaning behind the letters, so I will accept defeat and watch the tutorial, then implement what I learned.

@28taidan on BadGPT · about 2 months ago

4h 47m 32s logged

I implemented an MLP, I watched the entire Micrograd explanation by Andrej Karpathy, which was really interesting. I followed along, and played around with tuning my model (that does absolutely nothing useful) to predict stuff better.
It has Value objects, which store the data, the gradient, and the operation and previous Values used to create them, and which define all the operations. This can be used to calculate the gradient (or how much each Value object affected the last Value in the chain of operations), which is basically all you need for backpropagation (which is where it goes back and calculates how much each weight/bias affects the final output).
It also has a class for each neuron, which is a collection of Values that act as the inputs, the weights (which are randomly generated at the start), and the bias for that neuron.
These neurons make up layers, and the layers make up the MLP (multilayer perceptron). I created a couple arrays of inputs, and the desired output, which has the output for each input array. To train it, you have the model predict something, then calculate the loss (or the sum of how far away each prediction was squared), then do the backpropagation (go through each parameter, meaning each weight and bias, and find out how much and in what direction it affected the loss function), and finally change each parameter slightly so it moves the loss in the right direction. As you can see in the screenshot, I got it down pretty low. This will be useful because when I am training my LLM, I will need to use a neural network like this, and get the loss down as low as possible so it predicts better words and produces better results.
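A stripped-down sketch of a Value object in the style of micrograd shows how the pieces fit: each Value remembers what it was made from and how to pass its gradient back. Only addition, multiplication, and tanh are included, and this is a simplified illustration rather than the code from this project.

```python
import math

class Value:
    def __init__(self, data, _children=(), _op=""):
        self.data = data
        self.grad = 0.0                    # how much this Value affects the final output
        self._prev = set(_children)        # the Values used to create this one
        self._op = _op                     # the operation that created it
        self._backward = lambda: None      # pushes this Value's grad to its children

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), "+")
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), "*")
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), "tanh")
        def _backward():
            self.grad += (1 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        order, seen = [], set()
        def visit(v):                      # topological sort: children before parents
            if v not in seen:
                seen.add(v)
                for child in v._prev:
                    visit(child)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):          # walk from the output back to the inputs
            v._backward()

# One neuron with one input: out = tanh(x * w + b)
x, w, b = Value(0.5), Value(2.0), Value(-1.0)
out = (x * w + b).tanh()
out.backward()
print(x.grad, w.grad, b.grad)
```

The gradients use `+=` so that a Value used in more than one place collects contributions from every use. That also means they have to be reset to zero before each new backward pass.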
Karpathy’s video is at https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&index=2. It’s really quite amazing: I learned so much from it, and I haven’t even learned Calculus yet, which he states is a prerequisite in the description. I highly recommend it if you’re interested. Next I will need to work on implementing transformers (which take in the input (words), and derive meaning from them by encoding them into a vector).

@28taidan on The Family Plan · about 2 months ago

2h 13m 32s logged

Working on UX improvements!

I fixed a bunch of bugs, for example overlapping events saying “I’m available” are combined into one, for simplicity. The idea is that each person goes in to put in their availability, so I made it that you choose a person and edit only their location/availability, so they don’t have to see stuff that’s not relevant to them (it gets greyed out).

Next, I am working on the core of the project, the planning logic. The idea is to build an algorithm to find the lowest amount of total hours everyone spends in the car. I think I’d probably need some kind of greedy algorithm, but I’m not sure yet.
