
Why AI art struggles with hands

Apr 27, 2024
You're asked to create a post-apocalyptic astronaut giraffe. Generated. Genghis Khan playing a guitar solo, pixel art. Generated. A man holding a delicious apple... What's wrong with his hands? Why can't AI art make hands? It doesn't matter which AI art model you use. If you ask for a man holding a delicious apple, his hands will look weird holding it. Why is this so difficult? It seems pretty easy, right? We have this strange situation where AI art instantly creates Abraham Lincoln dressed as the glamorous David Bowie... but it struggles with a woman holding a cell phone. This is not just a strange problem.
Struggling with AI art's hands can actually teach you something more important... about how AI art works. I mean, what's so hard about this? I asked an artist who has taught thousands of people how to draw hands from imagination. Before someone becomes, or begins training to be, an artist... before any official training... it's pattern recognition. You grow up seeing a lot of hands and you start to know what hands look like. You learn what things look like by living in the world and recognizing patterns. An AI is similar, but with key differences. Imagine an AI is like you... but trapped in a museum since birth.
The only thing the machine has to learn from is the images... and the little labels next to them. Apple: a red apple on a brown table. It's like the images you see on the web and the descriptions that accompany them. It's similar to how you learn, but locked in that museum. If you want to understand an apple, you can spin it in your hand. You can look at it whenever you want. If the AI wants to understand an apple, it has to find another image of an apple in the museum. Pattern recognition has allowed both AI and people to draw decent apples... but the processes differ.
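To make the museum idea a bit more concrete: a text-to-image model only ever sees pairs of an image and a short caption. Here is a minimal Python sketch of what that training data looks like (the file names and captions are made up for illustration):

```python
# Hypothetical slice of the kind of data a text-to-image model trains on:
# each example is just an image plus a short, loose caption. Nothing in it
# tells the model how the objects in the picture actually work.
training_pairs = [
    ("museum/apple_001.jpg", "a red apple on a brown table"),
    ("museum/umbrella_014.jpg", "a person holding an umbrella"),
    ("museum/portrait_382.jpg", "a smiling woman, studio portrait"),
]

for image_path, caption in training_pairs:
    # During training, the model learns to associate pixel patterns in the
    # image with the words in the caption -- and nothing beyond that.
    print(f"{image_path}: {caption}")
```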
You start training to become an artist and now you say, "Okay, now I have to learn the rules." And that's where it becomes very different from how AI learns. To draw something complicated, artists tend to simplify things into basic shapes. And then when you look at a hand... you pretty much have the big blocky part of the palm, right? You have the front, you have the back, and then you have the thickness. So you can turn it into a box with some thickness. An artist can then add as much style, texture, and detail as they desire.
AI works differently. Look at this hand. The shapes are strange, but the AI has done a great job rendering the light and texture here. Remember, AI knows how things look, but not how they work. So those patterns in the pixels are easy for it to capture. But it never learned that fingers don't bend like that. It doesn't simplify the forms. Remember, it's stuck in the museum, so it's just trying to guess where the hand-shaped pixels should go, without knowing how hands work the way we do. But listen, I find this kind of unsatisfying. I mean, I'm basically saying that AI can't draw hands because it's not a person.
But the AI also knows nothing about construction, and it can still make a beautiful New York City skyscraper. To understand this better, I spoke with two people who have worked with generative art models. Yilun Du is a graduate student whose heart is in robotics. But, you know, AI art is a big deal now, so he got pulled into it. Because of how popular these models have been in generative art... I've been working on that too. And I talked to Roy Shilkrot, who has a super varied résumé but has been teaching about generative art since 2018.
The good students who come in... they're trying to break those models, take them to the next level. Talking to them helped me identify three big reasons. Not all the reasons, but three big reasons why hands are difficult for AI art models: the size and quality of the data, the way hands behave, and the low margin of error. For data size, let's go back to the museum idea. The museum the robot is in has a lot of rooms dedicated to faces... but not as many rooms for hands. That means it has less to learn from.
As an example, available datasets such as Flickr-Faces-HQ have 70,000 faces. 70,000. And this popular one has 200,000 photographs of celebrities' faces... annotated for lots of details, such as glasses or pointy noses. There are some great hand datasets that could really help a model understand hands, like this one with 11,000 hands. But these may not have been used to train art-making AIs. This scarcity of data is compounded by the quality and complexity of the data. The hand data in the art museum isn't annotated to show how hands work, the way the celebrities' pointy noses are. What the labels say is... there is a picture, and there is a person in the picture, and that person is holding an umbrella.
You don't give the machine many clues by just saying it's a person holding an umbrella. The thumb goes along one side of the handle and the fingers curl around... and then the thumb covers the index finger but not the others. This is all made worse because hands do a lot of things compared to, say... faces. There's a fairly standard face pose: the portrait photo. There are lots of photos like this on the internet and everything is very well centered, right? Like, the eyes are always around here. That layout always exists. That's not true for hands, which can do this and this and that.
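To see how thin that umbrella caption really is, compare it with the kind of per-image attribute labels face datasets carry. A rough sketch (the attribute names and file paths are illustrative, not taken from any specific dataset):

```python
# Face datasets often ship with detailed per-image attribute labels...
face_example = {
    "image": "faces/000042.jpg",
    "attributes": {"eyeglasses": True, "pointy_nose": True, "smiling": False},
}

# ...while a typical caption for a hand image says almost nothing about
# the grip the picture actually shows.
hand_example = {
    "image": "museum/umbrella_014.jpg",
    "caption": "a person holding an umbrella",
    # What the caption never says: which fingers curl around the handle,
    # where the thumb sits, or how many fingers are even visible.
}

print(face_example["attributes"])
print(hand_example["caption"])
```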
I swear I'm sober right now. Stan mentioned this too. Like, how many fingers do you see right now? Like two or three. It's like it doesn't know there are five, because sometimes there are two, sometimes three, sometimes four, sometimes five. You can see these problems in AI's hands, but the weirdness runs through all of AI art. Just look at horses. They can also have three legs, five legs, six legs. The model doesn't learn to account for this because there is too much diversity and it is not as biased as we are. Okay. Did you hear that last part he said?
Good, because it's really important. It is not as biased as we are. We care a lot about hands and we need them to be perfect. There is a low margin of error. But since the model doesn't understand hands, hasn't seen many, and because hands behave strangely... it makes images that are like the hands it saw in the museum, but not an exact hand. That's enough for a lot of things, but not for hands. Here, let me give you some examples. Come here. So I prompted, "make me a person with exactly five freckles." This one is from DALL-E 2.
This one is from Stable Diffusion and this one is from Midjourney. So it's like, you know, great job. You have a redheaded person. They're more likely to have freckles. But there aren't exactly five freckles here. Here that doesn't really matter, because we see a freckled face. But hands are held to a higher standard. Look again at our man holding apples. I made three other variations. All the hands are strange, but don't look at them for now. It changed the stripes on the shirt, the buttons, the style of apple... None of that matters, because it's stripe-like, button-like, and apple-like.
But hand-like is not enough. I came away from this thinking a couple of things: A, AI art is basically bad at art, we can just see it in the hands... and B, it will never get better. But both of those are a bit wrong. I will say that the newest AI art generator out at the time of this video is Midjourney version 5, and they've sure made some progress with hands... but it's not totally fixed yet. Just don't ask the AI for someone holding an umbrella. I think they spend a lot of time on some things that you appreciate, which is why you like the pictures, and a lot of things that you don't really even notice.
I think for a lot of natural scenes or something like that, I feel like the models might be better at it than people. And they are working on two things. First, making the AI look at many more images, which requires more computing power. They're trying to solve that at scale, because if you want to train with more than a handful of images... if you want to train with more than 100 images, that would require enormous resources to retrain the model itself. The other solution could be to invite more people... into the museum. There is an interesting analogue.
So, have you heard of ChatGPT? The big difference was that it basically used human feedback. They generated many, many sentences and asked people to rate which ones are good and which ones are not. Basically, they tune the model so that it generates sentences that are convincing to people. I guess it would take a lot of engineering to get people to label that much data. But I think if we could similarly get people to rate... how good the images generated by these models are, then a lot of these problems would actually go away. Because you're simply training the models to do what people like.
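Roy's ChatGPT analogy is about preference feedback: show people pairs of outputs, ask which one they prefer, and tune the model toward the winners. Here is a very rough sketch of just the rating-collection step (the function names are hypothetical, and the human judgment is faked with a random choice so the sketch runs):

```python
# Hypothetical sketch of collecting preference feedback, loosely inspired by
# how ChatGPT-style models are tuned with human ratings. Not a real pipeline.
import random

def human_prefers(image_a, image_b):
    # Stand-in for a person clicking "this hand looks better".
    return random.choice([image_a, image_b])

def collect_preferences(model_outputs, num_comparisons=3):
    """Gather (winner, loser) pairs that a reward model could later learn from."""
    preferences = []
    for _ in range(num_comparisons):
        a, b = random.sample(model_outputs, 2)
        winner = human_prefers(a, b)
        loser = b if winner is a else a
        preferences.append((winner, loser))
    return preferences

outputs = ["hand_v1.png", "hand_v2.png", "hand_v3.png", "hand_v4.png"]
for winner, loser in collect_preferences(outputs):
    print(f"people preferred {winner} over {loser}")
```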
And it's not just hands... teeth and abs, too. Anything that has a pattern... a lot of something. It doesn't know the "there should be this many" rule, because it's trained on varying quantities.
