Sanctuary AI, founded by Geordie Rose and Suzanne Gildert, is one of the world's leading companies working on humanoid robotics and AGI (artificial general intelligence). They have an interesting thought experiment about how much data it would take to generate a human-like general intelligence. However, this would be an NPC (non-player character) style general intelligence: it would have no explicit internal world model. It would be like a perfect LLM (large language model) for text, one that could perfectly predict the next word. It would appear to have intelligence, and its results would be intelligent.
Jeff Hawkins, an AI pioneer, theorized that our brains constantly predict the future. He laid out this theory in his book "On Intelligence".
If this is true, then a robot moving through the world has to be able to predict what will happen next, even if today the people programming the robots are actually doing all that hard work for them.
We are talking about the future because movement is about the future: it is about where you will be in space.
A lot of sensory input and response is pattern prediction.
Putting these models into practice requires a motor layer.
A data-driven approach to humanoid robot motion control would take a robot's experiential data (all of its senses, all of its positions, and all of the information about where it is in the world) and build models on that data.
AI researchers tried this in the early days, and it did not work very well. Flash forward to modern times, however, and large language models are good future predictors, using text data for text prediction.
The technologies underlying large language models predict the future of text based solely on text. They do not rely on hand-built priors. Older approaches to text AI parsed sentences into verbs and nouns and performed linguistic analysis; LLMs just dump a bunch of text into a bucket.
What if we took video frames of your entire life? As an approximation: 32 million seconds per year x 30 frames per second x 70 years. That is roughly a billion frames per year for 70 years, or about 70 billion frames.
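The frame arithmetic above can be checked with a few lines (using the post's round numbers; a year is actually closer to 31.5 million seconds):

```python
# Back-of-envelope frame count for one human lifetime,
# using the round numbers from the post.
SECONDS_PER_YEAR = 32_000_000   # approximation; exact is ~31.5 million
FPS = 30                        # "frames" of experience per second
YEARS = 70

frames_per_year = SECONDS_PER_YEAR * FPS      # just under a billion
frames_per_life = frames_per_year * YEARS     # roughly 70 billion

print(f"{frames_per_year:,} frames per year")
print(f"{frames_per_life:,} frames per lifetime")
```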
The frame is not just a video snapshot of your visual perception. Geordie defines the life frame as a snapshot of all the data coming from you: sight, touch, hearing, everything. It includes where your limbs are, which is commonly called proprioception. It is everything you can feel in your body, even what you feel in your digestive system. It is a snapshot of your experiential immersion in the world, 30 times a second.
How much data is in a frame?
It is a hard problem, because the more tokens you use, the more fidelity you have to the underlying raw data. You can choose to tokenize at different levels of detail, so as the designer of this data stream you have to decide how to tokenize. It is a bit like deciding what level of compression to use on a JPEG image.
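One hypothetical way to picture that fidelity trade-off is to quantize a continuous sensor reading into a token vocabulary of different sizes: more tokens (finer bins) means less reconstruction error, just as lower JPEG compression means less image loss. The numbers below are illustrative, not from the post.

```python
# Illustrative sketch: tokenizing a continuous sensor value (e.g. a joint
# angle scaled to [0, 1]) by binning it. Vocabulary size controls fidelity.
def tokenize(value, vocab_size):
    """Map a reading in [0, 1] to a discrete token id (bin index)."""
    return min(int(value * vocab_size), vocab_size - 1)

def detokenize(token, vocab_size):
    """Reconstruct the bin-center value from a token id."""
    return (token + 0.5) / vocab_size

reading = 0.7371  # a made-up sensor reading
for vocab in (16, 256, 65536):
    t = tokenize(reading, vocab)
    error = abs(reading - detokenize(t, vocab))
    print(f"vocab={vocab:6d} token={t:6d} reconstruction error={error:.6f}")
```

The error shrinks as the vocabulary grows, which is exactly why the designer's tokenization choice trades data volume against fidelity.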
Let's say there are a thousand tokens per frame. When we tokenize each of the frames, we are talking about on the order of a hundred trillion tokens for your life (70 billion frames x 1,000 tokens is 70 trillion, rounded up).
Let's say I wanted to take that data set, assuming I had it, and train a transformer-like prediction model on all of that data, which is a sequence in time. Given any set of frames as a prompt, the model just predicts the next few frames. For a body running this model, predicting a frame means: I am going to hallucinate what is about to happen, and then I am going to do it. I simply act based on my prediction.
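The act-on-your-prediction loop can be sketched in miniature. A real system would use a transformer over frame tokens; here a simple frequency table over a tiny made-up alphabet of "frames" stands in, just to show the closed loop where the prediction becomes the action.

```python
# Toy sketch of "predict the next frame, then act it out".
from collections import Counter, defaultdict

# Hypothetical tokenized life data: one character = one frame.
history = "ABABABCABABAB"

# Learn next-frame frequencies from the "life data".
model = defaultdict(Counter)
for prev, nxt in zip(history, history[1:]):
    model[prev][nxt] += 1

def predict_next(frame):
    """Most frequent successor of this frame in the training data."""
    return model[frame].most_common(1)[0][0]

# Closed loop: the "robot" hallucinates the next frame and then enacts it,
# feeding its own action back in as the next prompt.
frame = "A"
rollout = [frame]
for _ in range(6):
    frame = predict_next(frame)   # the prediction becomes the action
    rollout.append(frame)
print("".join(rollout))  # prints "ABABABA"
```

The model has no notion of what "A" or "B" mean; it only knows what tends to come next, which is the whole point of the thought experiment.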
There is a general rule of thumb in machine learning that you want roughly 10 times more data than parameters. If my life data set has a hundred trillion tokens, that supports a 10-trillion-parameter model.
The human brain has about 100 trillion synapses. It is weird that the estimates are even close: one lifetime supports a 10-trillion-parameter model, so ten lifetimes of data would match the size of the human brain. Capturing ten lifetimes means on the order of 700 years of experience, call it roughly a thousand years.
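The whole scaling chain, using the post's round figures:

```python
# Chain of rough estimates from the post (round numbers throughout).
TOKENS_PER_LIFE = 100e12     # ~70B frames x 1,000 tokens, rounded up to 100 trillion
DATA_TO_PARAMS = 10          # rule of thumb: ~10x more tokens than parameters
BRAIN_SYNAPSES = 100e12      # ~100 trillion synapses in the human brain
YEARS_PER_LIFE = 70

params_supported = TOKENS_PER_LIFE / DATA_TO_PARAMS   # a 10-trillion-param model
lifetimes_needed = BRAIN_SYNAPSES / params_supported  # lifetimes to reach brain scale
years_needed = lifetimes_needed * YEARS_PER_LIFE      # total years of experience

print(f"params supported by one life: {params_supported:.0e}")   # 1e+13
print(f"lifetimes to brain scale: {lifetimes_needed:.0f}")       # 10
print(f"years of experience: {years_needed:.0f}")                # 700
```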
This would be enough data to train a model roughly the size of the human brain: a model with no priors at all, nothing except the experiential data of a person that we train it on.
A robot running this type of model would behave just like a person, but all it is doing is predicting the next frame. It has no explicit inner world model and no explicit understanding of anything.
Nextbigfuture notes that this robot would be the perfect NPC.
Brian Wang is a Futurist Thought Leader and a popular Science blogger with 1 million readers per month. His blog Nextbigfuture.com is ranked #1 Science News Blog. It covers many disruptive technologies and trends including Space, Robotics, Artificial Intelligence, Medicine, Anti-aging Biotechnology, and Nanotechnology.
Known for identifying cutting edge technologies, he is currently a Co-Founder of a startup and fundraiser for high potential early-stage companies. He is the Head of Research for Allocations for deep technology investments and an Angel Investor at Space Angels.
A frequent speaker at corporations, he has been a TEDx speaker, a Singularity University speaker and guest at numerous interviews for radio and podcasts. He is open to public speaking and advising engagements.