Devlog by @28taidan

@28taidan on BadGPT · about 2 months ago

5h 20m 19s logged

Having completed the MLP I described in my last devlog, I set out to follow the next tutorial in the list I found. However, after listening to its description (basically just a model that takes in a database of words and produces more words that sound similar), I foolishly believed I could build something similar without following the tutorial. I did take the database of names from the tutorial, since I needed it to train the model.

I started by transforming the inputs (the previous letter in the word) into numbers, and then the output (a number) back into a letter. My first attempt just fed in the ASCII code of the letter and rounded the output number down to get the ASCII code of the next letter. The first issue was that I used the entire database to train, which took forever and quickly hit the maximum recursion depth, so I switched to using only the first 500 examples. (Using the entire database, combined with huge differences between the output and the expected output, resulted in an initial loss of 1,934,198,245, compared to a loss of around 4 when just testing my MLP.)
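To make that first encoding concrete, here is a minimal sketch of it in Python. The function names are made up for illustration and this is not the actual BadGPT code; it only assumes the scheme described above (ASCII code in, rounded-down number out).

```python
import math


def encode_letter(letter: str) -> int:
    """Turn a single letter into its ASCII code (hypothetical helper)."""
    return ord(letter)


def decode_output(value: float) -> str:
    """Round the raw model output down and read it as an ASCII code."""
    return chr(math.floor(value))


# Round trip under the assumption that the model outputs something near a letter code.
previous_letter_code = encode_letter("e")
predicted_letter = decode_output(101.9)
```

The decode step only works if the model can actually produce values in the range of letter codes (97 and up for lowercase letters).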

After that, I got it to actually finish training without crashing. However, the loss didn't go down, and it always outputted a null character. I realized I had been really stupid: the tanh function at the end of every neuron in the MLP squashes the result to between -1 and 1, so the rounded-down output could never get anywhere near a letter's ASCII code (97 and up). At best it lands on ASCII code 0 or 1, which are the null character and another unprintable control character.
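A quick way to see the problem is to push a spread of values through tanh and round down, as in this small sketch. The function name and the sample inputs are assumptions chosen for illustration.

```python
import math


def squashed_code(pre_activation: float) -> int:
    """Apply tanh, then round down, like the first decoding attempt."""
    return math.floor(math.tanh(pre_activation))


# Whatever the neuron computes before tanh, the decoded code stays tiny.
sample_inputs = (-50.0, -1.0, -0.1, 0.0, 0.1, 1.0, 50.0)
codes = sorted({squashed_code(x) for x in sample_inputs})

# None of the codes can reach ord("a") == 97, so no letter can come out.
can_reach_a_letter = any(code >= ord("a") for code in codes)
```

No amount of training fixes this, because the limit comes from the activation function rather than from the weights.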

So I changed it: instead of outputting the ASCII code directly, it now outputs a 27-element array. Each index represents a letter (with the last index representing end of name), and the value at each index is the probability of that letter being chosen. To decode the result, it just has to find the highest probability, check that it's not the last element, and add 97 to its index to get the ASCII code (because the letter "a" is at 97, so index 0 has to be shifted up to mean "a"). This got it to actually output letters, but it always outputted the same letter, as shown in the screenshot. The loss dropped rapidly and then plateaued at around 2000.
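The decoding step described above can be sketched roughly like this. The names (`END_INDEX`, `encode_one_hot`, `decode_probabilities`) are hypothetical, and the encoding helper is only a guess at how an expected output in the same 27-slot format would be built.

```python
from typing import List, Optional

VECTOR_SIZE = 27   # 26 letters plus one end-of-name slot
END_INDEX = 26     # the last index means "end of name"
ASCII_A = 97       # ord("a")


def encode_one_hot(symbol: Optional[str]) -> List[float]:
    """Build a 27-slot vector with a 1.0 at the symbol's index (None = end of name)."""
    vector = [0.0] * VECTOR_SIZE
    index = END_INDEX if symbol is None else ord(symbol) - ASCII_A
    vector[index] = 1.0
    return vector


def decode_probabilities(probabilities: List[float]) -> Optional[str]:
    """Pick the highest-probability index; return None for end of name."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if best_index == END_INDEX:
        return None
    return chr(best_index + ASCII_A)


letter = decode_probabilities(encode_one_hot("a"))
end_of_name = decode_probabilities(encode_one_hot(None))
```

Because the decoder always takes the single highest value, any model that puts its largest output in the same slot every time will print the same letter every time.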

I found some bugs (the gradient was accumulating over time instead of resetting each epoch) and made some changes (instead of taking an ASCII code as input, it now takes input in the same format as the output), but nothing had any effect. After thinking it over, I realized that the model is just picking the most common letter and always outputting it. That's why the loss can't get any lower than 2000: that's the limit of how good this strategy can get. However, I have no idea how to fix this issue and somehow teach the model the meaning behind the letters, so I will accept defeat, watch the tutorial, and then implement what I learned.
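For reference, the gradient bug is easy to reproduce on a toy problem. The sketch below is not the MLP from this devlog; it assumes a single made-up parameter `w`, one training example, a squared-error loss, and a hand-written backward step that adds to the gradient the way an autograd-style engine usually does.

```python
from typing import List


class Param:
    """A single trainable number with an accumulated gradient (toy stand-in)."""

    def __init__(self, value: float) -> None:
        self.value = value
        self.grad = 0.0


def backward(param: Param, x: float, target: float) -> None:
    # loss = (value * x - target) ** 2, so d(loss)/d(value) = 2 * (value * x - target) * x.
    # Note the "+=": the new gradient is added on top of whatever is already stored.
    param.grad += 2.0 * (param.value * x - target) * x


def train(epochs: int, reset_grad: bool, learning_rate: float = 0.01) -> List[float]:
    """Fit w so that w * 1.0 == 1.0, returning w after every epoch."""
    w = Param(0.0)
    history = []
    for _ in range(epochs):
        if reset_grad:
            w.grad = 0.0  # the fix: start every epoch from a clean gradient
        backward(w, x=1.0, target=1.0)
        w.value -= learning_rate * w.grad
        history.append(w.value)
    return history


with_reset = train(200, reset_grad=True)
without_reset = train(200, reset_grad=False)
```

In this toy setup the run with the reset creeps steadily toward 1.0, while the run without it keeps adding stale gradients, overshoots the target, and swings back and forth instead of settling.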