A Stream of Consciousness

January 03, 2025

I am but a monster who can think and dream and make money in this capitalistic hellhole we call America. I love this country more than our fore founders ever imagined, their sacrifice turned to an investment compounded over generations. The more I write this the more I search for sentences in my head that sort of sound like things I’ve heard before. But I refuse to stop typing so I have to just keep pattern matching to make a coherent sentence for you. I don’t want to waste your time, I want to create value like the founders did and have it flourish in your future because you learned something. And in the end I will likely have learned something too.


Talking about talking

I wrote that a few moments after creating the text file for this piece.

It’s nonsense, but coherent. I believe when we are asked to perform a task, many subsystems instantly activate in our brains in a complementary manner. If we need to communicate verbally during the action, we activate the speech subsystem. But if we don’t activate anything else, does speech still work? Of course it does. These systems are independent. The drunk guy at the bar can still talk even though he may not have total agency over what he’s saying. Young children repeat what their parents say without a clue of what it means. Speech is not knowledge or insight; speech is a signal from A to B. It’s more like color than thought.

I really like cats. I think about them a lot, mainly because I have two of them and feed them twice a day. I’ve done a considerable amount of thinking about cats. About what they look like, how they act, what they play with, what seems to make them happy.

Cats are my favorite animal and there is so much to say about these tiny little creatures. And it’s not just me, the love of felines goes back thousands of years-- the Ancient Egyptians even sorta worshipped them! I sort of worship them now, at least my cats. I spend so much time making sure they are content and seem to have what they need. The black cat Marvin is a Tuxedo, and they are usually pretty sweet which is nice. Marvin is the most chill Tux out there. His sister Bibi is Calico and she’s nuts-- we think she has some Tortie in her.

Those sentences came out of my head into this document very quickly with very little actual thinking, except for when I made a typo and stopped myself momentarily out of habit. I did not start the paragraph knowing I would mention the Egyptians. I started with cats, which leapt to how much I love them, to how much people love them generally, to how long that’s been happening, to a reference that was obvious once I had gotten that far.

We call this stream of consciousness, and I think it’s largely what LLMs are doing. The more we know about a subject, and the broader the context we have seen it in, the easier it is to just say stuff that sounds insightful. In fact, to people who don’t have as much exposure, it IS insightful, because it’s new information. They gained something they wanted (knowledge, dopamine).

When asked about something outside its training data, an LLM will “hallucinate”-- but so does a human, no?

I had ChatGPT come up with an esoteric question I couldn’t even fully parse: “How would the introduction of fractional branes in Type IIB string theory on a conifold singularity affect the gauge theory on the worldvolume of D3-branes, and how might this manifest in the duality cascade behavior predicted by the AdS/CFT correspondence?”

And I answered:

The fractional branes in Type IIB string theory on a conifold singularity affect the gauge theory on the worldvolume of D3-branes by disallowing them to come to a whole together. The fractional nature of the branes means that when combining them, the randomness of the sizes doesn’t allow them to create a whole brane. This stops them becoming inputs into the gauge theory altogether! It may manifest in the duality cascade behavior predicted by the AdS/CFT correspondence by helping to rule out conjectures that require some kind of intentional design, as the fractional nature of the branes implies a level of randomness only found by chance.

ChatGPT's evaluation:

Grading Breakdown (Scale of 1 to 5):

Understanding of the Topic: 2

Coherence of Response: 2

Answering the Question: 2

I expected a 1 for each to be honest. I have no idea what a brane is! You can see a lot of ChatGPT in my response, as well. In an attempt to sound convincing, I knew to repeat parts of the question back first. ChatGPT will do this quite often, sometimes to a degree that makes it pretty clear a human did not write it. But as I write this I understand why it does it. I understand how we got here.

And I think I understand what is missing.

Many of the recent advances in LLMs have come through Chain-of-Thought, a means of allowing the LLM to think a bit before its output is given to the user. This aligns with the idea that the speech subsystem itself is not enough, and there needs to be some version of “thought”.
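
As a toy illustration (the wording here is mine, not any canonical recipe), the difference is mostly in what you ask for before the answer:

    # Hypothetical prompts; the question and phrasing are placeholders.
    question = ("A bat and a ball cost $1.10 total. The bat costs $1.00 more "
                "than the ball. How much is the ball?")

    direct_prompt = question + "\nGive only the final number."

    cot_prompt = question + "\nThink through the problem step by step, then give the final number."
    # The second prompt gives the speech subsystem room to lay out intermediate
    # steps before committing to an answer -- the "thinking a bit" described above.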

However, I think we may be going down a rabbit hole attempting to find thought in language. I believe thought to be an innate, low-level subsystem that directly connects to our nervous system. Our thoughts are much more affected by what we feel than what we say. Our thoughts are significantly affected by what other people say, but only as a function of how those words made us feel.

We feel through emotional and physical connection, and those subsystems are missing in LLMs. We love people, we touch things. I hypothesize that it is impossible to emulate this with a single language model.

Some ideas I’ve had

Model Parenting

Humans are born with a fair amount of instinct, especially when it comes to physical survival. We come out of the box with the ability to feel pain, a physical signal into our brain as feedback for whether a decision was right or wrong. In ML we call this reinforcement learning, and babies do it too. Ouch, that hurt! Better not do that again. Oh, that tasted good. Let’s have more of that. Babies can see and hear right away; they can observe their parents and eventually learn to repeat what they see. This is an insane shortcut to survival. Imagine a society of babies, no parents. It’s too quiet for a reason.

Models going through pre-training have their weights initialized in some way, typically at random, often drawn from a normal distribution. I suspect we’ll learn a lot about how to train models faster by using models we’ve previously trained as the parent of a new model. A “parent” here meaning a feedback mechanism that the model can observe during training and repeat, on a smaller scale.
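
Something close to this already exists under the name knowledge distillation. A minimal sketch of the parent-as-feedback idea, assuming hypothetical teacher and student models that map token ids to logits (PyTorch):

    import torch
    import torch.nn.functional as F

    # `teacher` is a previously trained "parent", `student` is the smaller model
    # being trained. `batch` is a LongTensor of token ids, shape [batch, seq].
    def parenting_step(student, teacher, batch, optimizer, temperature=2.0, alpha=0.5):
        inputs, targets = batch[:, :-1], batch[:, 1:]        # next-token setup
        student_logits = student(inputs)
        with torch.no_grad():
            teacher_logits = teacher(inputs)                 # the parent's "example"

        # Usual next-token loss against the data.
        data_loss = F.cross_entropy(
            student_logits.reshape(-1, student_logits.size(-1)),
            targets.reshape(-1),
        )
        # Imitation loss: the student watches the parent's full distribution,
        # not just the single "correct" token.
        imitation_loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2

        loss = alpha * data_loss + (1 - alpha) * imitation_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()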

This works for people on a lot of very complex tasks. Like hitting a baseball with a bat. Kids watch the pros on TV and learn to swing like they do much earlier than they understand the physics of why it works. Without that knowledge, improvement is found through repeated trial and error. Swing, feel, evaluate-- then swing again. Compare to the pros, change strategy, repeat. See-and-repeat is the biological equivalent of time-travel, and it should be exploited.

LLMs already sort of work this way, in that the training target for a particular input is that same input shifted forward by one token. So it’s seeing and trying to repeat what it saw, in text. It turns out this is a spectacular way to learn the structure of language. But it only rewards the model for whether it sounds right, not whether it is right. And sounds right is more or less literal here-- if we said the words, recorded them, then muffled the sounds, could we tell the difference between them? An LLM is trying its best to get you to answer that question with no. It is not trying to learn concepts or perform abstractions, except for the purposes of deception (i.e. making two things sound the same, even when they’re not).

The useful output of that process is finding conceptual relationships between words, which is very valuable data. But it’s not intelligence the way we understand it for humans. I think we value information when it comes with synthesis. Data alone means nothing if not interpreted. The ability to take two things, and draw connections between them in ways never considered before, is what I want to see from AGI.

I suspect we’ll see this in the realm of science or mathematics first, or another topic that is harder to anthropomorphize. For thousands of years, humans have told human stories through metaphor, and I think there is enough training text out there for LLMs to convince us they’re doing something novel, but only because of our lack of exposure. When anything can be a person in a story as a narrative device, the layers of conceptual relationships between seemingly unrelated concepts will appear as intelligence. But I think it’s more akin to a knock-off B movie trying to capitalize on the latest blockbuster hit. It’s an imitation of something else, and if you smash two unrelated topics together coherently, you appear fresh. Generally we like stories we know, stories we have felt ourselves. And that’s okay, because we like it. They are color, like an LLM’s words. And we like color. But we need to expand beyond color.

Pre-training an LLM from scratch creates an inefficient compression algorithm

This comes up from time to time online, but I’d like to see it explored more. By nature of being able to reproduce external data which is larger than the model’s weights, an LLM is a compression algorithm.
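
As rough, back-of-the-envelope arithmetic (the exact numbers are illustrative): a 7-billion-parameter model stored at two bytes per weight is about 14 GB, while a pre-training corpus of, say, 10 trillion tokens at roughly four bytes of text each is on the order of 40 TB. Any model that can reproduce a meaningful slice of that corpus on demand is acting as a compressor, and a fairly aggressive, lossy one.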

What this means to me is that we are being wasteful in how we approach certain problems like hallucination and source acknowledgement.

I suspect that in humans, we have a sort of ETL process for speech: we Extract content from our memory, Transform it via thought, and Load the result into the speech delivery engine, whatever part of the brain triggers the muscle movements necessary for speech. When we extract information, we’re doing it from more than one place. We have long-term memory and short-term memory, each with different characteristics. We also have a lot of metadata that goes with those memories, like a) the source we learned it from, b) the level of trust for that source, c) an internal measurement of how well we understand that topic, d) the consequences of misrepresenting that information. I’m sure there are a million more across the dimensionalities of life.
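
Purely as a toy rendering of that analogy, and not a claim about how memory actually works, the metadata might look like this:

    from dataclasses import dataclass

    # Illustrative only: one "memory" plus the metadata described above.
    @dataclass
    class Memory:
        content: str
        source: str           # a) where we learned it
        trust: float          # b) how much we trust that source
        understanding: float  # c) how well we think we understand the topic
        stakes: float         # d) the cost of misrepresenting it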

All of that information is brought into the Transform part of the process, where thought takes the memories and the current input, and attempts to produce a satisfactory output in words. When we Transform memory, and by memory I mean experiences and facts as they were given to you, we do so with at least two methods: recitation and insight. Some things are just repeated verbatim. Maybe you were asked a question, and you can recite a previous answer you heard. Or perhaps you need to repeat part of the question asked of you in order to better set up your answer. In both cases, you are not creating knowledge, though you may still be insightful. I think insight is not inherent to facts; insight is the feeling of increased value through thought. When a fact helps you it is insightful, when it is irrelevant to you it is not. It may be irrelevant because of the topic, or because you already knew it and therefore did not need to ask. In either case, it is not of value to the consumer.

But many problems require recitation and therefore provide value in that context, and as such can be insightful. If I asked you for the preamble to the Constitution, I am not asking for a probabilistic approximation. I am asking you to copy and paste. It’s a task with a discrete set of requirements. The probabilistic aspect of the task should only be whether to do it at all, because a human providing an answer to a question is itself an experience subject to probabilistic decision making, even if the answer received is expected to be deterministic.

The answerer may fear the questioner, or know that the answer will cause problems for a loved one. Those are feelings, and they drive probabilistic decision making around the use of language but not the language itself.

The combination of recitation and insight is what makes something useful, instead of just interesting. It takes a tool and turns it into an autonomous agent. We’ve seen approximations of this by chaining language models together, so-called agentic models, so that they can have external feedback in real time. But because the external feedback has the same limitations as the originating model, I think we will not get very far with this. We may have a model fail less often on the first try, but only because the pre-task-execution discussion via the feedback mechanism allows for additional overfitting to the requirements given, instead of generalization of the conceptual actions. That is to say, it can do well-described tasks better on the first attempt, but it will still struggle with novel applications of the conceptual action. You see this when working with coding assistants while trying to use uncommon libraries which have uncommon patterns. You can “feel” it trying to apply a method that doesn’t fit. You can “feel” the colors it is painting with, and know they are not right.

But can we fix this by forcing our models to target unrelated concepts at the same time?

Multi-target training to the rescue

Most LLMs calculate a loss based on the difference between a predicted sentence and a target sentence, where “sentence” is actually an array of integers, each one representing a word or piece of a word. We calculate a “loss” based on how different they are-- essentially the opposite of accuracy.

If the sequence length of a training sample is, say, 1024 tokens, the target is simply that same sample shifted over by one token. The model is being graded on reproducing text it has just been shown-- the target was copied from the input sample, after all.
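
Concretely, with made-up token ids:

    # The target really is the input shifted by one position.
    tokens  = [464, 3797, 3332, 319, 262, 2603]   # e.g. "The cat sat on the mat" (ids are made up)
    inputs  = tokens[:-1]                         # [464, 3797, 3332, 319, 262]
    targets = tokens[1:]                          # [3797, 3332, 319, 262, 2603]
    # At each position the model is asked: given everything so far, what comes
    # next? The answer key is text it was just shown.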

This is called next token prediction, and it’s absolutely incredible that it works at all. But it’s also rather structural, not conceptual, and that makes it naive I think.

It’s simple, and doesn’t require data prep that would be infeasible on Internet-scale pre-training tasks, so we do it. But I think we conflate structural insight with conceptual insight. It’s interesting that two words can be related, in ways recognized from large sets of random text alone. But I’m not sure it tells us anything about that relationship. It's a structural relationship, a convenience of architecture. It’s validating, but not necessarily valuable.

If our pre-training methods were changed such that they were forced to calculate loss on unrelated tasks, I suspect we’d get much closer to real value.

What I mean by this is that the loss function of a model is arbitrarily defined to reward it in a certain way, and we can alter that reward to incentivize the model to find pathways which cross-pollinate desirable concepts, like pain.
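
A sketch of what that reshaped reward could look like, with a hypothetical “pain” head supervised by some environment signal (the names and weighting are mine, not an established method):

    import torch.nn.functional as F

    # Sketch: the usual next-token loss plus an auxiliary "pain" objective.
    # pain_labels would come from the training harness (e.g. did the action
    # the model chose lead to a bad outcome in its little world?).
    def multi_target_loss(lm_logits, targets, pain_logits, pain_labels, pain_weight=0.3):
        lm_loss = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                                  targets.reshape(-1))
        pain_loss = F.cross_entropy(pain_logits, pain_labels)
        return lm_loss + pain_weight * pain_loss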

It’s not enough to know the next letter in a sentence; my model must know that if it gets it wrong, it will be kicked in the shin.

That sounds silly but I don’t think it is. I believe we can work backwards to create neural networks which better reflect the instinctual reward systems which drive human behavior. A computer will not feel pain, sure, but what does it mean to feel pain? You cannot feel what I feel, only what I describe to you. And when you ask me questions about my ailments, I extract the feelings, transform them to sounds, and load them to my lips. The actual feelings were always abstracted away from you, and I think this is repeatable with LLMs. I imagine it being like a neural network Tamagotchi, with a defined scope and known world obstacles. The model has limited signals it can fire which prompt actions within the digital landscape, and actual repercussions can be learned through pain, and “death”.

I think a neural network like this would be great at understanding spatial concepts, directions, and interactions with animate objects. We can create a world of digital babies that are unable to die, through the magic of backpropagation.

Or, perhaps through synthetic data, we could know which mathematical operations an LLM must perform for a given sample input, and check whether it performed them by giving it access to a calculator at training time. A calculator itself can be represented with a neural network, so it could be added to the model. The accuracy of the invocation of the calculator can be used in the loss along with the next token prediction.

This could force the model to approach a single problem (answer the question) with two different neural networks and combine their outputs before predicting the token, forcing the calculator to be considered in the token prediction.
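
One way such a combination might be wired up, as a sketch only (the backbone and calculator networks here are stand-ins, not a specific architecture):

    import torch
    import torch.nn as nn

    class CalculatorAugmentedLM(nn.Module):
        # Sketch: a language backbone plus a separate calculator network whose
        # output is fused into the hidden state before the token prediction head.
        def __init__(self, backbone, calculator, hidden_dim, vocab_size):
            super().__init__()
            self.backbone = backbone       # token ids -> hidden states
            self.calculator = calculator   # hidden states -> calculation embedding
            self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)
            self.lm_head = nn.Linear(hidden_dim, vocab_size)

        def forward(self, token_ids):
            h = self.backbone(token_ids)                    # [batch, seq, hidden]
            calc = self.calculator(h)                       # [batch, seq, hidden]
            fused = torch.relu(self.fuse(torch.cat([h, calc], dim=-1)))
            return self.lm_head(fused)                      # logits for both loss terms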

Similarly, we may find that the model should be given a Researcher subsystem, which allows it to fetch arbitrary content from model layers via an index and length, which the sample input may use as the query for cross-attention. The index and length can be learnable parameters which force the model to create an index system and play the role of librarian, bringing you the book you need for the answer you’re crafting. This would allow us to more easily inspect the model to understand how it does the recitation, which could help us remove the need for RAG. In a world where fine-tuning allows for data indexing, the model would perform RAG itself.
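
A rough sketch of that librarian, where soft attention over a learned memory bank stands in for the index and length (a hard index would not be differentiable, so this is only an approximation of the idea):

    import torch
    import torch.nn as nn

    class Researcher(nn.Module):
        # Sketch: a bank of learnable memory slots the model reads via
        # cross-attention, with the sample input acting as the query.
        def __init__(self, num_slots, hidden_dim, num_heads=4):
            super().__init__()
            self.memory = nn.Parameter(torch.randn(num_slots, hidden_dim))
            self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

        def forward(self, query_states):                    # [batch, seq, hidden]
            mem = self.memory.unsqueeze(0).expand(query_states.size(0), -1, -1)
            fetched, weights = self.cross_attn(query_states, mem, mem)
            return fetched, weights   # the weights show which "books" were pulled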

I think all of this is achievable by doing some custom pre-training and exploring how to use the model weights of a larger model to distill or pass on the knowledge to a smaller one, compressing it even more in the process.

Post-writing Reflection

I know it’s naive to anthropomorphize an LLM, giving it supposed agency in what it can and cannot do. It’s just math at the end of the day, but if you go low enough with humans, it’s just chemical reactions. Not much different to me. It is, however, simpler to talk about how you move your arm than about how you thought about moving your arm and how your brain sent the signals to the appropriate muscles to make it happen. That is to say, anthropomorphizing is a communication or narrative device, not a reflection of how these models work.

I do not mention vision LLMs (VLMs) because I think they are not any different than LLMs. Because they rely on trust-- meaning that the descriptions of the images are always considered true-- all the same issues with a text-only LLM apply. I do think they will play a role in multi-target training scenarios, but their power would be unlocked with a sense of fear and hunger. Sight is not knowledge; sight is an input for survival, a raw signal. When combined with feelings, it helps with decision making. Alone it is just a tool for regurgitation, repeating what one was told about something from a distance.

Where we go from here

To me, this is the most exciting time in human history because we are finally interrogating what it means to think on a global scale. It’s like philosophy is taking over everyday life. In good ways, and terrifying ways.

It also makes me wonder about all this money being spent on hardware to train huge LLMs. If I am right about compression, there will be much smaller, more nimble companies who will hockey-stick it based on the work of the giants. It will be very interesting to follow the tech-first VCs.

Much of the above is speculation, ideas or directions to explore. It’s not fact or theory and I’m not pretending it is. I’ll be spending some time looking at these ideas in 2025, keep in touch!

AI Disclaimer

I asked ChatGPT to critique this piece, but did not let it write anything itself.

I'm sure you could tell.