Narratives

112: Emad Mostaque - AI, Alignment, and Stable Diffusion

In this episode, we're joined by Emad Mostaque, founder and CEO of Stability.ai, to talk about his recent release of Stable Diffusion (an open-source text-to-image diffusion model), AI ethics, AI alignment, and what large AI models mean for the future.

This episode was co-hosted by Lars Doucet.  


William Jarvis 0:05

Hey folks, welcome to Narratives. Narratives is a podcast exploring the ways in which the world is better than in the past, the ways it is worse than in the past, and the paths toward a better, more definite vision of the future. I'm your host, William Jarvis. And I want to thank you for taking the time out of your day to listen to this episode. I hope you enjoy it. You can find show notes, transcripts and videos at narrativespodcast.com. Additionally, in this episode, my friend Lars Doucet joins us as a co-host.

Unknown Speaker 0:40

Well, how are you doing today?

emad 0:42

Um, all right, so a lovely, foggy day in London.

Unknown Speaker 0:46

Awesome. Awesome. Well, thank you so much for taking the time to come on the show, we really appreciate it. Do you mind giving us a brief bio and some of the big ideas you're interested in?

emad 0:54

Sure. So I'm a recovering hedge fund manager. So I spent most of my career making rich people richer.

Over the last few years, I decided to make the world a better place. So when I was a hedge fund manager, I used to focus on emerging markets, video games and AI. And that was a lot of fun.

And what I really tried to do the last few years is to bring that all together to make an impact. So one of the key things is an education initiative, where we're teaching kids literacy and numeracy in refugee camps and other places, in 13 months at one hour a day. And over the last few years, it's all been about artificial intelligence: first being the lead architect of the United Nations-backed COVID AI initiative, and then building Stability over the last year to make sure the future of AI is open and free, and that it unlocks all our collective potential. My small mission.

Unknown Speaker 1:46

Very cool, I love that. Can you tell us about what Stability is and why it's important?

emad 1:50

So I kind of looked to the future, and the AI that was coming, this transformer-based foundation model AI that's getting to human level, and I was like, that's cool, you know, it's going to be amazing. But then I realized that it could only come from big companies, and it'd be closed. And Web 2.0, as it were, the classic web, it was AI taking advantage of our attention everywhere, right? We were the product. That's why I was like, holy crap, it's going to be much worse. We need to have something open, so we can have our own AIs. And it needs not just the big models but reusable models. And wouldn't it be cool if it actually reflected the diversity of humanity, as opposed to, you know, all this AI ethics stuff about models that don't recognize black people because they're only trained on white people, and things like that? What if we could actually make AI diverse and open? It's got so powerful, it can really advance us as a race and help fix some of this broken stuff. Because society is just chaos. So I said, what's the opposite of chaos? Stability. Let's call it that, and let's get people from all around the world to build cool stuff.

William Jarvis 2:55

That's great. That's great.

Unknown Speaker 2:57

You just released a product, which has been really cool. We've been using it here at Narratives to generate some art. It's been great. Can you talk a little bit about the product you just released, open onto the web?

emad 3:07

Yeah. So there's this thing about models and products. So Stable Diffusion is the model itself. It's a collaboration that we did with a whole bunch of people, and the way we kind of built it, we took 100,000 gigabytes of images and compressed them into a two-gigabyte file that can recreate any of those and variations. I like to call it a trillion images in your pocket, running on a MacBook Air M1 or anything like that. It's the first truly global text-to-image model that's good. So it's fast enough, good enough, and I'd say cheap enough, because you don't need special hardware, and the whole ecosystem can build around this. So that was kind of our wedge product. And then we've got audio and video and other things coming. But we think this will make a big difference.
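
For readers who want to see what "a trillion images in your pocket" looks like in practice, here is a minimal sketch using the publicly released CompVis/stable-diffusion-v1-4 weights and the Hugging Face diffusers package. It is illustrative only, and argument names can differ slightly between diffusers versions.

```python
# Minimal text-to-image sketch with Stable Diffusion.
# Assumes `pip install diffusers transformers torch` and a GPU with ~8 GB of memory.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,   # half precision so the weights fit on a consumer GPU
)
pipe = pipe.to("cuda")           # or "mps" on Apple Silicon, "cpu" if you are patient

prompt = "a Tesla Roadster in the style of Starry Night"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("roadster_starry_night.png")
```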

Lars Doucet 3:49

So can we talk a little bit about the openness aspect of it? And I want to, later in this conversation, dig into some of the controversies too, and just meet them head on. But let's start with the open aspect. You know, currently, for people who haven't been following the AI art train, you have, what is it, OpenAI's product, which is DALL-E. And then you have Google's product, I forget what they call it — I don't think anyone can even use it, they've just written a paper about it. And then there are some other lesser models out there. There's also Midjourney, which was quite impressive. And then there's your stuff. Those are the ones I'm aware of. Can you talk a little bit about why you specifically decided to make yours open, what you mean by open, and how that contrasts with some of the other models that are out there?

emad 4:44

Yeah, so my team for the last 18 months have been building up the AI art scene, and so the models that underlie Midjourney and Wombo and all of these other things are actually from my team. We released them MIT open source, so anyone can use them without attribution or anything like that, because they were fantastic and they could do really descriptive art. But by themselves, they weren't good enough to do potentially challenging art, shall we say. The models have got to the point now where they can basically do photorealism and things like that. So in the release of this model, which is also much, much faster and which we wanted to put out to the world, we had to have some special considerations around usage. This is one of the reasons that Google hasn't released their model and DALL-E is behind a wall — which, sure, you can do amazing things with, in that it gives you wings as an artist, it's like a motorcycle for the mind, but it's blatantly paid. But then what if people use it for bad things? And this has been the whole ethics debate, focused on bad rather than good. And it's a complex one — like, you know, won't people use it for bad things, because it's good enough? So the release of this is not a classical open source model. Instead, it's a public-release model with a range of mitigations, such as a bad-stuff classifier that's there by default, and an ethical use policy. So it's a public release, and anyone can use it, but it's not truly like you can use it for anything — obviously, you shouldn't use it for anything illegal, anything unethical. But our view was that by releasing this, so anyone could take it and extend it, provided they stick with the license, it would spark a wave of creativity around the world, because there's nothing like being able to develop new experiences on your own computer, without any specialist hardware — and again, it was a big effort to get it down to that. There's nothing like being able to create something that is striking, you know, like Robin Williams, say, or a Tesla Roadster in the style of Starry Night — it can do that and understand that. So our view is very different, because, you know, we have a different approach to what is beneficial, what is right, and what is open, versus a big tech company, which is a lot more cautious. You know, again, our focus is on unlocking potential in a reasonable and ethical way.

Lars Doucet 6:57

Right. So what are some of those reasonable restrictions? You talked about that — you have fewer restrictions than OpenAI, but you still have some. So what is an example of things I'm not allowed to do with Stable Diffusion? And I mean that in the legal sense, not in the sense that I can download your model and remove the NSFW filter.

emad 7:15

Yeah. So you know, this is a question of legality, morality and ethics, right? On legality, you're not allowed to do anything illegal, full stop. The government will come after you. Right. Then morality —

Lars Doucet 7:28

But that goes without saying — like, if I create, you know, some class of images that's illegal in my country, the police will just enforce that. You don't have to enforce that.

emad 7:38

Yeah. But I mean, some of the pushback we've been getting is: what if people use this to create hate-crime images or illegal stuff? And the classical thing of open source is that you don't kind of restrict it, because the assumption is people will use it legally. It is another tool, like Photoshop, which you can also use to create this. But this is particularly pertinent because of the speed. And so there is a question: is it ethical to release something that can create bad stuff at that speed? Our view is that the good stuff far outweighs that. Again, there are ethical use restrictions beyond the legality, but we also specifically mention the legality of this, you know — but it's not a straightforward or simple thing. The ethical or moral side is a little bit different, because, you know, in France you can have topless ladies, in India you definitely can't — how do you adapt to different things? So rather than having a giant filter and deciding what is ethical and what isn't, what's moral and what isn't — as opposed to legality, which is separate — we took just a snapshot of the internet. So one of the ways I describe this is as a generative search engine. Like when you go on Google Images and you type in a search, you're doing it for a purpose, to get that image back. This can generate any image from a scrape of the internet, which is biased, and does have not-safe-for-work and other things, and it's up to you how you use it. We're very big on personal agency in kind of usage, whereas a lot of the big tech companies are a lot more paternalistic, in that they're concerned about the downstream impacts, for ethical and business reasons as well. This model was also released by the University of Heidelberg CompVis team, who are our academic partners and were one of the leads on this, and so it falls under European legislation. And then we are a UK-based company, which falls under UK legislation. And then people are using it all around the world, which falls under different legislations. There is the model itself, and then there are implementations of it, like DreamStudio, where you're not allowed to create not-safe-for-work content, and that has its own separate policy. The reality is that this is an emerging field. It's happening so fast that we don't have all the answers. One of the things we want to do is broaden the conversation in particular, so people can both create amazing things. There are restrictions and guardrails legally, and then, finally, there's the CreativeML Open RAIL-M license that we put it out with. You have to include the license if you use it, or if you show it to end users, stating that they have this ethical responsibility and reminding them this is a very powerful tool. That's a bit not normal, and it is a legal obligation, although the likelihood of it being enforced — it's going to be complicated to find and fine people. Instead, we call out people in public if they don't include it, and usually they add it afterwards, because it's not an onerous thing. But again, it's just like: you have a powerful tool, please use it responsibly.

Lars Doucet 10:21

What about this: you've made it, you know, much more open than the existing things, although it doesn't follow the technical definition of open source. So there's this other notion with these big models that are trained on these huge datasets — I can run Stable Diffusion on my machine, but I can't train, you know, bajillions of images on my machine. So how much of a barrier is that? That seems like a technical barrier to making it available to just every Joe Schmo. But what do you think of that barrier? Will it come down? And what have your efforts been in that regard?

emad 10:56

No, I mean, every Joe Schmo can actually use it. And if you've got maybe a 3090, you'll be able to fine-tune it on images. It might take a while. But it's actually accessible. Like —

Lars Doucet 11:06

What? What does fine-tuning mean? Are you saying that I can actually, like, change the training? Even with my local resources? Yeah, interesting.

emad 11:14

So what you do is you take this gigantic 100,000 gigabytes of images, and you compress it down to two gigabytes, and this neural network has what's known as latent spaces, or hidden layers of meaning, that understand things in context, because it forms, again, these hidden layers of meaning. So it's a compressed brain. Then what you can do is add neurons on, so you can fine-tune it, for example, on textures, and then you have a model that's texture-based. So we'll be releasing in the next few weeks things to be able to create textures, or pixel-based art, or watercolors, or classical medieval paintings — it learns from those as you add them to the training dataset, over and over again. And again, it does require resources, but it will work on a 3090 Nvidia graphics card, which I think is amazing, because people will create their own sets. To create the base model, though, that foundation-layer foundation model, you do need a freaking great big supercomputer — although, again, we took down massively the amount of compute needed for this model. Other models we're training are far larger; our supercomputer is among the fastest in the world. But this one required like hundreds of thousands of dollars of compute, versus the millions or tens of millions for other models before.
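
To make the fine-tuning idea concrete, here is a heavily simplified sketch of the standard latent-diffusion training step that community fine-tunes (textures, pixel art, anime) are built on: encode images into the VAE's latent space, add noise, and teach the UNet to predict that noise. It assumes the Hugging Face diffusers components for the v1.4 checkpoint and a hypothetical `my_texture_dataloader` yielding preprocessed image tensors and captions; it is an illustration of the technique, not Stability's actual training code, and scheduler/loading details vary by diffusers version.

```python
# Sketch of one fine-tuning step for a latent text-to-image diffusion model (illustrative only).
import torch, torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "CompVis/stable-diffusion-v1-4"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").cuda().eval()
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").cuda().eval()
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").cuda().train()
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)   # only the UNet is updated

for images, captions in my_texture_dataloader:              # hypothetical loader; images scaled to [-1, 1]
    with torch.no_grad():
        # Compress images into the latent space ("the compressed brain").
        latents = vae.encode(images.cuda()).latent_dist.sample() * 0.18215
        tokens = tokenizer(list(captions), padding="max_length", truncation=True,
                           max_length=tokenizer.model_max_length, return_tensors="pt")
        text_emb = text_encoder(tokens.input_ids.cuda())[0]
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                              (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(noise_pred, noise)                     # learn to predict the added noise
    loss.backward(); optimizer.step(); optimizer.zero_grad()
```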

Lars Doucet 12:31

I mean, in the scheme of company financing, hundreds of thousands sounds — I mean, the number of zeros makes a big difference. Yeah.

emad 12:38

And 100%. Again, like, we have one of the fastest supercomputers in the world, so we can do massive model training. But we wanted this to be accessible. And the model training code is out there, so people can take it and train their own models from scratch, or they can use our base and train new models. So like, right now, as of today, someone just trained an anime diffusion model — they took the Danbooru dataset, they trained it on 75,000 images, and they've implemented that, and that's now available to the world as a specialized anime model. And we expect there to be a Ghanaian model, a Malawian model, you know, we expect there to be models for all kinds of things. I'm sure people will make specific models. Like, our business model itself is that we're going into content providers, and we're saying, let's turn your static, boring content intelligent. So you can have video game characters, you can have beloved IP, you can have all of Bollywood in custom models, and then you have a range of accessibility to content that's smart.

Lars Doucet 13:45

Can you lay out your specific vision, going back to the goods? You know, a lot of people focus on the bads, but you want to focus on the goods. What are the goods as you envision them? And in this week since it's been released, all these advances have come out — is it what you expected? Talk about your vision for why this is going to be brilliant, and how you would put that in your own words.

emad 14:09

I have a vision of an intelligent internet, where every person, company, culture and country has their own AI, across modalities, talking to each other, augmenting our potential. And it connects our combined information and compresses it to knowledge, and in our context, it gives us wisdom. We don't have that right now; we don't have anyone looking out for us. Instead, it's centralized, and it flows through Facebook's and others' pipes, right? So that's what I'm trying to make — make it so anyone can build these models for themselves, whether a company, country or culture. And that's the big vision of the future. In the week that we've had since launch — maybe it's two weeks, actually two weeks to the day; I think we were meant to have this recording on the day of release and it was a bit mad rearranging, it's probably better this way — we've had 100,000 developers downloading it, and amazing stuff, from animation to being able to flip it and figure out the words that create an image — textual inversion — and all sorts of things have been created by the community, which we've supported. We support people whether they use our API or not, right, because it doesn't matter — a billion people are going to use this. Just to put this in context, this wave of innovation has occurred with just 100,000 downloads; there have been 25 million downloads across all our language models. Every dev in the world is going to be trying this out, and we're doing hackathons across the world, and it's just going to go exponential with the stuff that people create. Especially because they realize this is like a little tiny file, but it's like a universal translator in a way. People are just putting words in and seeing what comes out. But it's most powerful when it's in a pipeline. Like, think about it from a games developer perspective: you don't call this all the time, you just call it for character creation, you know, or character adjustment or something like that. You feed in a prompt, and then an image of any type will come out the other side. How cool is that? And players can share it and all these other things, because the same input going in — a couple of bytes, just a one-sentence prompt — can create a masterpiece on the other side. How insane is that? So I'm looking forward to people pushing the envelope and making insane things. And I reckon we can get this file down to 100 megabytes, and get it to real time as well — from one frame a second to 24 frames a second at 100 megabytes, that's my target. If you get that, then you can create a movie on the fly just from a description. And you've seen that with, like, Xander Steenbrugge: he had 64 different prompts, and it was the history of the world, and that's all there was. It had frame interpolation between the different prompts in the latent space, and it started from the Stone Age and went to the year 3000 AD, seamlessly from one to the other, just from 64 sentences. How cool is that?
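
The "history of the world" style of animation he describes comes down to interpolating between prompts in the model's latent space and decoding each intermediate point as a frame. Here is a minimal sketch of that idea, reusing the `pipe` loaded in the earlier sketch; spherical interpolation of the starting noise is one common community approach (the real videos also blend the text embeddings, which is omitted here), so treat this as an assumption-laden illustration.

```python
# Sketch: morph between two scenes by generating frames from interpolated starting noise.
# Assumes the StableDiffusionPipeline `pipe` from the earlier sketch is already loaded on GPU.
import torch

def slerp(t, a, b):
    """Spherical interpolation between two noise tensors."""
    a_n, b_n = a.flatten() / a.norm(), b.flatten() / b.norm()
    omega = torch.acos((a_n * b_n).sum().clamp(-1, 1))
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

shape = (1, pipe.unet.config.in_channels, 64, 64)           # latent resolution for 512x512 images
noise_a = torch.randn(shape, device="cuda", dtype=torch.float16)
noise_b = torch.randn(shape, device="cuda", dtype=torch.float16)

prompts = ["a stone age village, cave painting style", "a city in the year 3000, concept art"]
for i in range(24):                                         # 24 in-between frames
    t = i / 23
    latents = slerp(t, noise_a, noise_b)
    prompt = prompts[0] if t < 0.5 else prompts[1]          # crude hand-off; real tools blend text embeddings too
    frame = pipe(prompt, latents=latents, num_inference_steps=50).images[0]
    frame.save(f"frame_{i:03d}.png")
```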

Lars Doucet 16:49

That's incredibly amazing. Now, I'd be remiss if I didn't acknowledge the other side of this debate, which is that a lot of my friends have really polarized pretty quickly around the existence of these tools. There are a lot of people who feel kind of threatened by this, and I'm sure you've received a lot of email over the last two weeks about this sort of thing. There now seem to be all sorts of people creating all sorts of interesting positions, like: I will use this, but I will never use it to recreate the style of a living artist; or it's unethical to use generative AI art for reasons; or whatever. And I was wondering, what arguments have you heard? And are there any that you find persuasive or coherent, even if you don't agree with them specifically?

emad 17:40

I think they're understandable. Anytime you have a big change, there will always be fears, and you can't dismiss them out of hand. I think it's a complex thing, and again, more voices need to be heard in this, because previously it was just people deciding that you couldn't have the technology, and other technologies are out there. The question is, whose responsibility is it? We want to be a leading voice in having these discussions, so we're working with various parties on things like artist fingerprinting tools, so they can opt out if they want — because style isn't copyrighted, but you still might not want your style copied, you know. There are other people saying that there are valid discussions around the dangers of this, misinformation and other things. What we're trying to focus on is building tools to combat misinformation and raising awareness of this, because we know state actors have this. This is something we found with our language model papers, where a lot of the pushback was: what if this is used for bots? Well, guess what — the bad actors have the bots, and they had access to these models before they went open source; open source kind of wasn't the thing. So I think the main things here are that, you know, people fear for their livelihood, and people fear the negative impacts. Those are the two main arguments here. I think with the livelihood, there'll be brand new industries created from this — like, if you get in now, you're in the early doors of a technology that's going from 5 million to a billion people, and it'll be in everything we see. There's no way you won't make money off that if you're trying to make money, and it automates the boring parts of art in some ways, in my opinion — again, you may disagree with that. On the nature of the negative stuff, I trust the community more than I trust large corporations and institutions to deal with that effectively. I have a really large amount of trust in humanity, and again, we've seen people building all sorts of amazing stuff, and people using it for not-safe-for-work but still legal stuff. You know, again, it's up to us as a society to figure out what's good or bad, as opposed to it being decided for us, I think.

Lars Doucet 19:37

What is an example of something that you were surprised by, and also delighted by, that came out of this? Like something you would hold up as a shining example of what your technology can do, and it wasn't possible before.

emad 19:56

Image to image. Taking children's little sketches and turning them into beautiful art is one of the most amazing things — like seeing it come alive in front of them. And no, I didn't realize that it'd be quite like that. But you can just draw a basic shape and then all of a sudden, boom — or take, like, a Lego thing that you built, and it becomes an entire scene in the style of GTA 6 or something like that. That was just awesome. Also the other side, turning images into text — I didn't think we'd get to that quite so fast. We sponsored research in there, but now you can go both ways, and the possibilities of that are just mind-boggling. You know, like, I posted a tweet — I was watching Silicon Valley on HBO, and I was like, we created a compression engine to create a new internet. I'm freaking Pied Piper, in a way, if we can go both ways. I hope I'm not Erlich or Dinesh.
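
The sketch-to-art trick he mentions is the image-to-image mode: instead of starting from pure noise, the drawing is encoded into the latent space, partially noised, and then denoised toward the text prompt. A rough sketch with the diffusers img2img pipeline follows; the input filename and prompt are made-up examples, and argument names have shifted between diffusers versions, so treat it as illustrative.

```python
# Sketch: turn a child's drawing into a rendered scene with image-to-image diffusion.
# Assumes diffusers is installed; "kids_drawing.png" is a hypothetical input file.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

drawing = Image.open("kids_drawing.png").convert("RGB").resize((512, 512))
result = pipe(
    prompt="a castle on a hill at sunset, detailed fantasy concept art",
    image=drawing,        # called init_image in early diffusers releases
    strength=0.75,        # how far to move away from the original drawing (0 = keep, 1 = ignore)
    guidance_scale=7.5,
).images[0]
result.save("castle_render.png")
```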

Unknown Speaker 20:49

Emad, I wanted to ask you now about AI alignment. A lot of people are pretty concerned with whether AI is aligned with human interests, and what could go wrong if that's not the case. You know, you've pushed a lot of innovation in the space, which has made people more concerned, because they think the timelines are getting closer. You know, a lot of people think we're about 15 years away from human-grade AGI, artificial general intelligence. What do you think about AI alignment? How concerned should we be? And if you are concerned, what are you doing about it at Stability?

emad 21:20

Right. So I think that's a great question. I think AI alignment is one of those difficult ones — you don't know until you've actually seen it, and no one really knows what it looks like, right? So if you look at the Metaculus kind of projections for it, it's like five years out, ten years out — nobody knows, because of the pace that we're going. If I told you a year ago that you could make a cat knight or whatever at this level of quality, you'd be like, that won't happen, right? Like when GPT-4 hits, and it can do AP Bio exams or whatever it's going to do, you'll be like, wait, what? Like being able to solve IMO problems — I remember when that seemed really hard, and now they can kind of do it. I think the original EleutherAI team — some of them are at Stability now, and some of them went to set up Conjecture, to do alignment as an alignment startup, and Anthropic. It's been a big focus of ours. But what I personally believe — and again, this is my personal belief — is that the best way to do alignment and reflect human values is to make it so AI is human, in terms of the diversity of humanity, or humane, shall we say. So our approach has been to allow smaller models to be there for everyone — every person, culture, company, country, effectively, across the modalities. The closest thing to it that I see is Gato. So Gato by DeepMind is a giant autoregressive model, about 1.2 billion parameters, based on reinforcement learning, that shows elements of generalization. But if you don't have in your dataset, like, Indian ethics and Chinese ethics and other things, what are you going to have, right? That's going to be very different than if you have a Western-oriented Pile, or Common Crawl, or the things where you're stacking layers. This is also the implication of the DeepMind Chinchilla scaling paper. So the naive interpretation of it is that it's about training for more epochs to ensure compute-optimal kind of outcomes — and one of the things we're doing is a Chinchilla-optimal language model suite. But the real upshot of it is that it's about better data. And what does better data look like? So one of the things that we're doing in our education initiative — you know, we've been educating kids in 13 months at one hour a day in refugee camps, and the aim and remit is to educate millions of children, and we're going to build the best open source system, inviting the world to do that. But the output of that is also ideal for creating better AI models that reflect local diversity and culture, because it learns how to learn. And if you've got that happening across the world, I think you're much more likely to have aligned outcomes than if you've just got one type of dataset, which is the internet. And our internet has been optimized for engagement, outrage, and negative things, because those are the things that sell ads. So we need to create better datasets that are more diverse, and better models that are more diverse, and then we're less likely to all be turned into paperclips and die, which would be great.

Unknown Speaker 24:12

That really would be great, if we can avoid that. I'm curious — it seems like one of the next big challenges is going to be, you know, where do you get data from? You know, Lars has talked about this a lot: what happens when you kind of run out of data that's available on the internet? What do you do next, and how can these things be put together at large? Do you want to talk about that a little bit? Lars, you might be able to frame this question well.

Lars Doucet 24:34

Yeah, there are two kind of aspects to it. One, you know, you mentioned Chinchilla, and there was an article on LessWrong by nostalgebraist about Chinchilla's wild implications, kind of implying that it might actually be the case that we are now data-bound rather than scale-bound — that instead of just stacking more and more layers, things are going to go more in the direction of needing more data and better-quality data, and that's where the returns are. Do you agree with that assessment? Or is there more to it than that?

emad 25:06

No, I agree 100%. I think you've seen the last elements of this kind of pure scaling, as it were. So obviously you see PaLM, and then there are some big models coming, shall we say, that show generalization above a trillion parameters. But they're not flexible and dynamic enough. So one of the things about putting out our models is that a lot of people use them and then combine them in different ways to have various outcomes. But it's about data quality, because we've moved from big data, and the like, to big models that use structured data. And we've seen elements of improvement in that progression — this is why, again, going from the Pile, to LAION-400M, and then LAION-5B, moving to more structured datasets will lead to better results, because it's, again, reflective of humanity. If you sit down in front of a TV all day and absorb information, that's great, but if you have a structured lesson that's optimized, it's always better. And that's what we need to learn and train on. That's why, like I said, our approach is that we build national-level models with open datasets for each country. In the middle, we go to the broadcasters and IP and other content holders, and build models for all these companies and corporations. And from the bottom up, we build educational models — for Malawi, for Bangladesh, for other countries — where it learns how to learn between five and eighteen, because that's the best time to teach. In fact, it's reinforcement learning with human feedback, right? So GPT-3 is 175 billion parameters; when you see how it's used — how you use the models, how the models are used — you can compress it down to 1.3 billion parameters, which is what's behind the OpenAI API. And we've got a whole lab around this, on contrastive learning, called Carper, that really focuses on these elements, to build on the code models and other things that we have coming up.

Lars Doucet 26:49

What about the other aspect of data? There's this notion — I use the metaphor of pre-1945 steel, which is that steel manufactured after 1945 has radiation in it because of atomic bomb tests, and so for certain applications there are certain kinds of steel where you have to go back and source historical steel, because only those don't have the radiation. Silver didn't use to tarnish before the Industrial Revolution; now it does because there's too much sulfur in the atmosphere. So I use this to draw a metaphor with pre-2021 data, right? At least in the domain of art, because now Stability.ai and DALL-E and all of this are adding AI-generated art into the datasets, whereas before, if you just scraped the internet for images, you could assume they weren't created by an AI, and after the inflection point they are. Is there any kind of danger of stagnation or weird effects on your model now that you're consuming your own models' output?

emad 27:53

Well, we did that for Stable Diffusion. So the Stable Diffusion alpha led to the Simulacra bot aesthetic dataset: we got people to rate the outputs, and then we fed that back into the model. That's why it's so aesthetic, and it's compressed. You know, we had an element of kind of RLHF at the end, effectively — just like some of the new upcoming big models, from people that will not be named, have that as the final step. Instruct elements really help guide the final part of tuning for these models. So models are dogfooding the previous models' outputs, as it were. Going forward, this explosion of images that you see — it's not about the amount of data, it's about the structure of the data. Again, if you look, for example, at EleutherAI releasing the Pile: the Pile was nearly a terabyte of highly structured data, arXiv and PubMed and things like that. That's been used by Microsoft and Nvidia and dozens of others, because someone actually bothered to structure it. You don't need terabytes. Like, how many exabytes of things do we have? We have massive amounts. It's about structured data. And that was highly structured, and better than the Common Crawl that was used in the original GPT-2 and, to an extent, GPT-3 — GPT-3 had some data structuring in it.

Lars Doucet 29:00

By structure, you mean clean? You mean clean data, specifically, right?

emad 29:04

Clean data. Similar to how one of the things we then did is we took the aesthetic scoring of this dataset that we released, Simulacra Aesthetic Captions, and we used the CLIP model to figure out the aesthetic subset of LAION — from 2 billion images down to 600 million images that were aesthetic. Because things like PowerPoint slides are not aesthetic — but it does catch other things; like, Pokémon are not aesthetic, so it's very bad at Pokémon, as an example. We need to find a way to fine-tune them back in. Faces and things, it's very good at, because that's what people post. And cats — it's very good at cats.
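
The aesthetic filtering he describes works roughly like this: embed each image with CLIP, feed the embedding to a small learned "aesthetic score" head trained on human ratings, and keep only images above a threshold. Here is a hedged sketch of that general technique using the OpenAI clip package; the `aesthetic_head.pt` checkpoint, the directory path, and the 5.0 threshold are hypothetical stand-ins, not the actual LAION-Aesthetics predictor.

```python
# Sketch: score images with CLIP + a linear aesthetic head and keep the "aesthetic" subset.
# Assumes `pip install git+https://github.com/openai/CLIP.git`; "aesthetic_head.pt" is a
# hypothetical placeholder for a head trained on human ratings (e.g. Simulacra-style data).
import glob
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

aesthetic_head = torch.nn.Linear(768, 1)                          # ViT-L/14 image embeddings are 768-d
aesthetic_head.load_state_dict(torch.load("aesthetic_head.pt"))   # hypothetical trained weights
aesthetic_head = aesthetic_head.to(device).eval()

keep = []
for path in glob.glob("scraped_images/*.jpg"):                    # hypothetical scrape directory
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        emb = model.encode_image(image).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)                # score normalized embeddings
        score = aesthetic_head(emb).item()                        # roughly "how pleasing is this image"
    if score > 5.0:                                               # threshold is an assumption; tune per dataset
        keep.append(path)

print(f"kept {len(keep)} aesthetic images")
```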

Lars Doucet 29:41

Yeah. So when somebody talks about diversity and stuff — you know, aesthetics is a highly subjective field. Like, is the answer there to create a lot of different models that are fine-tuned on, say, people who think Pokémon are the most beautiful images in the world?

emad 29:56

Yes, the way is to allow a diversity of models, as opposed to having one model that tries to capture diversity. So the way that DALL-E 2 has tried to do it is they randomly add, for non-gendered words, genders and races. So you'll get a female Indian sumo wrestler when you type in sumo wrestler — which could exist, you know, probably not many of them, to be honest. Whereas if you have —

Lars Doucet 30:20

Like, they literally just add the words to the end of your prompt, silently, semi-randomly?

emad 30:26

Yes. And you can tell, because you can say "a man holding up a sign", and then the sign will say, like, "black" or "Latin American" and things like that, because the system picks up on those appended words.
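
What he is describing — quietly appending demographic terms to prompts that mention people — can be reproduced in a few lines. This is a toy sketch of the general technique only; the word lists, trigger terms, and probability are invented for illustration and are not OpenAI's actual implementation.

```python
# Toy sketch of silent prompt augmentation for "diversity" (not OpenAI's actual code or word lists).
import random

PERSON_WORDS = {"person", "man", "woman", "doctor", "ceo", "sumo wrestler"}          # assumed triggers
APPEND_TERMS = ["female", "male", "Black", "Asian", "Hispanic", "South Asian"]       # assumed examples

def augment_prompt(prompt: str) -> str:
    """Randomly append a demographic term when the prompt seems to be about a person."""
    if any(word in prompt.lower() for word in PERSON_WORDS) and random.random() < 0.5:
        return f"{prompt}, {random.choice(APPEND_TERMS)}"
    return prompt

# The "sign" trick: because the term is appended as plain text, a prompt like
# "a man holding up a sign that says" can end up rendering the appended word on the sign.
print(augment_prompt("a sumo wrestler"))
```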

Lars Doucet 30:38

Oh, that's hilarious. Interesting. Yeah,

emad 30:39

I mean, when you look, the Stable Diffusion output is the raw output of the model, because we wanted to see how people did with that before we added all the attributes. Midjourney takes Stable Diffusion, or latent diffusion, and then adds processing steps on the front and processing steps inside. Similarly, DALL-E does both of those. So when you prompt Stable Diffusion correctly, you see something like the Midjourney beta, which is beautiful and amazing, you know. But we don't want to filter inputs and outputs; we leave that up to the end user. But then, like I said, it's going to move from people using it to people creating their own — except Stable Diffusion 1.4 isn't really what you should be fine-tuning on, because in the next few months you'll get version 2 and version 3 of Stable Diffusion, and that's probably what you should start training on. To be honest, this is just a test at the moment. But this diversity — again, we can think about it like this: you have data, then you have information, then you have knowledge, and then you have wisdom. That's the way that I kind of put it. So the 100,000 gigabytes of images that we have from LAION for the actual training set, that's information. It compresses into knowledge, because it figures out the interconnections and latent spaces between words. But the wisdom comes from how you use it to figure out your own aesthetic in your own context, whether that's a person or a community or something else, and then you create your anime model or your texture model or something else. And that means that then, when you combine these models together, you can learn from them. So it's like you put out the framework, and then you bring it all back in to create a more generalized model that's based on the sub-model datasets. It's almost a teacher-model kind of thing, to teach the student. But then also, it's reminiscent of the mixture-of-experts approach, right? And this is why some of the more interesting architectures coming out have gone back to kind of MoE from highly condensed model architectures.

Lars Doucet 32:25

So one of my thoughts here is, you know, it's accelerating so fast — right after release, all these applications came out. I'm most interested in how this gets used in pipelines, right? We're already seeing Photoshop plugins where someone's like, give me this, give me that, then put them together, erase, infill, outfill, you know? Is it going to be possible for this to be used in video and animation? Like, am I going to be able to create a model sheet of a character, have it understand that the cheekbones look like this, and there are always exactly this many eyelashes, and little details like that, and then be able to, you know, get a walk cycle, do things like animate on twos, animate on ones — are we going to be able to fine-tune it to that degree? Or is there going to have to be a lot of tooling in the middle to handle that sort of thing?

emad 33:16

Well, both, right. You can actually literally create anything with it now, because you have a generalized text-to-image architecture, but also an image-to-text architecture. So if you utilize something like StyleCLIP or something like that, you can kind of do dynamic adjustments. But being able to understand what in an image is what — you will move to the point where you don't need to use your fingers anymore. You can just say, I want this adjusted in this way, and then it will adjust it, just like the neural filters in Adobe Photoshop, right? For the animation architectures and things like that, again, we already have examples of being able to use Codex and other code models to build stuff in Blender. You know, what is a 3D file? A GLB file is JSON plus a bunch of flat textures, and so you kind of have that covered with text already. So I think all these architectures will be able to go 2D to 3D to animated. DreamStudio already has animation built in, because the team that built it is the same team that built Majesty Diffusion and Disco Diffusion and these. So I included an example on my Twitter, a sneak peek of the animation features, that can do 120 frames every four seconds to create seamless animation. So some of these things aren't integrated yet — the kind of dynamic prompt analysis, the ability to do inpainting with just targeted words as opposed to drawing — these will all evolve, and again, people will mix and match the pipeline so they have different models for different things. This is also important, because one of the things I recommend to people is that the Stable Diffusion model has a very small CLIP. We have a ViT-G coming out this week and then a ViT-H that will make it go way beyond DALL-E 2. Why not use it with a model with a different architecture, like the VQGAN architecture that you've got in Craiyon, or DALL-E Mini? Use that as the first step, and then use Stable Diffusion for the second step — people have shown what type of amazing output you can get if you do multi-step outputs with the best models there, and then do dynamic targeting of these various elements, to do text to 3D mesh and other things. And you can use existing libraries — don't try to one-shot everything, or even try to do everything with one model; just be intelligent and realize what it is: it's this translation engine.

Lars Doucet 35:21

Right. So you've built all this amazing technology, and then you've made it open, within the confines of how that's possible with this sort of technology. Can you describe a little bit about how your business actually works? You know, obviously, Google and the others have decided that it's best to keep it closed and sell it as software as a service. But what is your plan to turn this into, you know, a sustainable business?

emad 35:46

So for example, we announced our Indian partnership with Eros last week. We have an exclusive partnership on all the Bollywood assets, and we're going to turn them all intelligent, to have dynamic Bollywood things, so you can create your own Bollywood music videos and audio and images, and we have a revenue share on that. We have world-class industrial APIs, so if you want to do 100 images in four seconds, you can do that through our API, which you can't do on your local GPU. But with DreamStudio, the next version of it, you'll be able to use your local GPU or a cloud GPU with our brand new interface. So we've got a product strategy, which has some amazing products, like DreamStudio for animation and prosumers. We have an API strategy, where we have some of the biggest companies in the world plugging into our API, and lots of announcements on that, and we will be the lowest-cost API provider, because we've got that scale. Plus, you get the latest models from our API before they're released — so I think today we're releasing 1.5 via the API, and in a couple of weeks we'll release that publicly, we've just got a few things to sort, and then version 3 will be shortly after, and things like that, plus our specialized inpainting models and other things, plus brand models from various brands that you can incorporate. Those won't be open source, because they're not benchmark models, they're just fine-tunes that we did like anyone else; we have a commitment to releasing the benchmark models to build the ecosystem within a couple of weeks, you know, of them being available. And then the final thing, like I said, is the deployed offering: if you are a brand with an asset — like a game studio, or a luxury brand, or a cartoon or something like that — you can train your own model, but that's hard, so we can just go in and train it for you. And then we support the community in training their own models, and then people will build businesses for the SME sector, training models for that, and then those will be available through our marketplace — Stability-approved models, effectively.

Unknown Speaker 37:32

That's great. That's great. Emad, I'm curious, what do the next 10 years look like for, you know, Stability AI? What do you want that to look like? Do you foresee just, you know, running this for a long time? Do you have any big goals within the next decade?

emad 37:46

Yes. In the next decade, every child should have access to the absolute best education, health care, and resources they need. I want every country to be running on open source intelligent architecture, to make people happier. I think we'll have real-time Ready Player One type experiences for people who want to share and communicate, and it's going to be freaking awesome. That's kind of it. I love it, I love it.

Lars Doucet 38:12

So one question is about your educational stuff, which we kind of skipped right past in the middle there while we were narrowing in on the AI art stuff. Have you got an actual educational platform that you've released now, or is that still in the works?

emad 38:28

We've been deploying it for years. So we have randomized controlled trials with the United Nations and UNICEF and the International Rescue Committee and others, showing efficacy: 76% of children in refugee camps get literacy and numeracy in 13 months at one hour a day. And now the remit is to take that and invite the world — to say, if you have an entire country (we can't say which one just yet, but it's been mentioned in some places, or multiple ones) that we have to educate, and you control the hardware, software, deployment and curriculum to a degree, how can you give them the biggest potential and make them happiest? And I think that's something a lot of people will be interested in participating in as a global project. And, you know, that will be better than what we have right now, because what we have is non-personalized education, non-feedback education. So we have a platform now that's dumb, and it still does great. So let's make it smart, and help them achieve their potential.

Lars Doucet 39:22

Right. So how do you feel about intellectual property in your brave new artificially intelligent world? You know, part of your message has been: you don't want to be closed, you want to be open. The old companies want to basically create a mainframe that you connect to to do your AI. You're talking more about the people sort of owning the AIs, and rather than have one that purports to do everything, have diverse models. How does intellectual property fit into this? Who owns this stuff? Who owns the data? Who owns the models, as you see it, if you could just wave a magic wand?

emad 39:56

So the benchmark models at the country and international level — those are built on open datasets and mining of public information, right, in line with UK and EU laws that are very specific on that. But then the individual person's models — that's their own data, that's their own copyright, and things like that. For companies that we go into, the output is owned by that company and then licensed. So if you want to use Bollywood stuff, they retain the license, but they license it to you to use in Photoshop, to have that dramatic picture of Shah Rukh Khan or whatever, you know. So that's the vision that I have, whereby ownership goes back to the individual, and the company retains it. If you use stuff that's copyrighted — if you create something that's copyrighted, just like in Paint Shop or something like that, and you try and sell it — then you're violating copyright, you know, and then Disney will come after you if you do Mickey Mouse or whatever. Again, do things that are right, but at least there will now be licensed options. And I think this is the thing: when you have infinite abundance, you need to have an element of authenticity. So this is why — I don't think it's NFTs, it's maybe something else, or maybe it is NFTs, you know, in terms of just that shortcut to support — because right now, what we did is we went to infinite abundance with things like Spotify to get the music, and, for example, you have a million listens and you get $2,000. That's not right. There have to be better ways to create this, make it dynamic, make it intelligent. Maybe you need blockchains, maybe you just need to have a trusted protocol. But that's something that's a little far away. Let's get people building models first — the primitives — and then figure out how to get them to talk to each other.

Unknown Speaker 41:29

I love it. I love it. Well, Emad, thank you so much for joining us today. Where can people find you? Where should we send them?

emad 41:37

So you can go to stability.ai — our communities are there. You know, we have Harmonai coming online this month for music (we're going to release their first models, Dance Diffusion, in a couple of weeks), LAION for images, EleutherAI for language, OpenBioML for our protein folding and other work, and many other communities kind of coming up. Our socials are all going live and you can find it all from there, or you can just follow me on Twitter — and of course the Stable Diffusion Discord, please join us there.

Unknown Speaker 42:06

Awesome, thanks a ton, Emad. We really appreciate it.

Lars Doucet 42:09

Thanks very much, everyone.

emad 42:10

Thanks, guys.

William Jarvis 42:11

Bye. Special thanks to our sponsor, Bismarck Analysis, for the support. Bismarck Analysis creates the Bismarck Brief, a newsletter with intelligence-grade analysis of key industries, organizations, and live players. You can subscribe to the Bismarck Brief at brief.bismarckanalysis.com. Thanks for listening. We'll be back next week with a new episode of Narratives. Special thanks to Donovan Dorrance, our audio editor. You can check out Donovan's work in music at donovandorrance.com.

Transcribed by https://otter.ai

Narratives
Narratives is a project exploring the ways in which the world is better than it has been, the ways that it is worse, and the paths toward making a better, more definite future.
Narratives is hosted by Will Jarvis. For more information, and more episodes, visit www.narrativespodcast.com