minimaxir 6 hours ago

Whisk itself (https://labs.google/fx/tools/whisk) was released a few months ago under the radar as a demo for Imagen 3 and it's actually fun to play with and surprisingly robust given its particular implementation.

It uses a prompt transmutation trick (it converts the uploaded images into textual descriptions; you can verify this by viewing the description of the uploaded image) plus the strength of Imagen 3's genuinely modern text encoder, which can adhere to those long transmuted descriptions for Subject/Scene/Style.
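
The transmutation pipeline, as I understand it, can be sketched in a few lines (an illustration only - the `caption` function below is a stand-in lookup table, not Whisk's actual captioning model):

```python
# Sketch of the "prompt transmutation" idea: each uploaded image is first
# captioned, then the captions are merged into one long text prompt that
# a text-to-image model with a strong text encoder can follow.

def caption(image_id: str) -> str:
    # Stand-in for a real image-captioning model.
    fake_captions = {
        "subject.png": "a small orange cat with green eyes",
        "scene.png": "a rainy neon-lit city street at night",
        "style.png": "flat pastel illustration with soft grain",
    }
    return fake_captions[image_id]

def build_prompt(subject_img: str, scene_img: str, style_img: str) -> str:
    # Merge the three transmuted descriptions into a single prompt.
    return (
        f"Subject: {caption(subject_img)}. "
        f"Scene: {caption(scene_img)}. "
        f"Style: {caption(style_img)}."
    )

prompt = build_prompt("subject.png", "scene.png", "style.png")
print(prompt)
```

The point is that the image never reaches the generator directly; only its text description does, which is why you can read (and edit) what the tool "saw".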

  • torginus 4 hours ago

    Why text? why not encode the image into some latent space representation, so that it can survive a round-trip more or less faithfully?

    • minimaxir 4 hours ago

      Because Imagen 3 is a text-to-image model, not an image-to-image model, the inputs have to be some form of text. Multimodal models such as 4o image generation or Gemini 2.0, which can take both text and image inputs, do encode image inputs into a latent space through a Vision Transformer, but not reversibly or losslessly.
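
      A toy illustration of why that encoding is lossy (not Gemini's actual architecture - the patch size, dimensions, and random projection below are all made up): a ViT-style encoder chops the image into patches and projects each patch down to a smaller embedding, so information is discarded and the mapping can't be inverted exactly.

```python
import numpy as np

# A 32x32 grayscale "image" is split into 8x8 patches and each patch is
# projected from 64 pixel values down to a 16-number embedding, so the
# projection cannot be inverted exactly.
rng = np.random.default_rng(0)
image = rng.random((32, 32))

patch, dim = 8, 16
proj = rng.random((patch * patch, dim))  # stand-in for learned weights

patches = [
    image[i:i + patch, j:j + patch].reshape(-1)
    for i in range(0, 32, patch)
    for j in range(0, 32, patch)
]
tokens = np.stack(patches) @ proj  # 16 patch tokens, 16 dims each

print(image.size, tokens.size)  # 1024 pixels -> 256 latent numbers
```

      Real encoders are trained rather than random, but the dimensionality reduction (and hence the loss) is the same in spirit.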

    • flkenosad 3 hours ago

      Text might honestly be the best latent space representation.

  • cubefox 4 hours ago

    > This tool isn’t available in your country yet

    > Enter your email to be notified when it becomes available

    (Submit)

    > We can't collect your emails at the moment

  • strangattractor 5 hours ago

    Google is the new Microsoft in the sense that they can Embrace, extend, and extinguish their competition. No matter what xAI or OpenAI or "anything"AI tries to build, Google will eventually copy it and destroy them at scale. AI (or A1, as our Secretary of Education calls it) is interesting because it is more difficult to protect the IP other than as trade secrets.

    • mritun 4 hours ago

      > Google will eventually copy…

      Weird take, given that Google basically invented, and released through well-written papers and open-source software, the modern deep learning stack that all the others build on.

      Google was being dissed because they failed to make any product and were increasingly looking like a Kodak/Xerox-style one-trick pony. It seems they have woken up from whatever slumber they were in.

      • Workaccount2 3 hours ago

        They didn't entirely drop the ball since they did develop TPUs in anticipation of heavy ML workloads in the future. They tripped over themselves getting an LLM out, but quickly recovered primarily because they didn't have to run to nvidia and beg for chips like everyone else in the field is stuck doing.

      • strangattractor 3 hours ago

        Like MS, Google is ubiquitous - search is much like Office, and DOS before that. Anything OpenAI or the other AI competitors create would normally be protected by patents, for instance. Not so with AI models. Google has the clout/know-how to respond with similar technology, adding it to their ubiquitous search. People are both lazy and cheap. They will always go with cheaper and good enough.

        • harrall 3 hours ago

          Google invented the technology.

          https://en.m.wikipedia.org/wiki/Attention_Is_All_You_Need

          OpenAI was the copycat.

          If Google had patented this technique, OpenAI wouldn’t have existed.

          • strangattractor an hour ago

            How do you patent it? What specific "practical, real-world application" does AGI purport to solve? All these algorithms work by using massive amounts of data. They all do it the same way, or close to the same way.

            "Algorithms can be patented when integrated into specific technical applications or physical implementations. The U.S. Patent Office assesses algorithm-based patent applications based on their practical benefits and technological contributions.

            Pure mathematical formulas or abstract algorithms cannot be patented. To be eligible, an algorithm must address a tangible technical challenge or enhance computational performance measurably.

            Patenting an AI algorithm means protecting how it transforms data into a practical, real-world application. Although pure mathematical formulas or abstract ideas aren’t eligible for patents, algorithms can be embedded in a specific process or device." [1]

            [1] https://patentlawyer.io/can-you-patent-an-algorithm/#:~:text...

wewewedxfgdf 3 hours ago

1: Press release about amazing AI development.

2: "Try it now!" the release always says.

3: I go try it.

4: Doesn't work. In this case, I give it a prompt to make a video and literally nothing happens; it goes back to the prompt. In the case of the breathtakingly astonishing Gemini 2.5 coding, attaching a source code file to the prompt gives "file type not supported".

That's the pattern - I've come to expect it and was not disappointed with Google Gemini 2.5 coding nor with this video thing they are promoting here.

  • throwup238 3 hours ago

    On the contrary I had completely written off Google until a few days ago.

    Gemini 2.5 Pro is finally competitive with GPT/Claude, their Deep Research is better and has a 20/day limit rather than 10/month, and now with a single run of Veo 2 I’ve gotten a much better and coherent video than from dozens of attempts at Sora. They finally seem to have gotten their heads collectively unstuck from their rear end (but yeah it sucks not having access).

  • martinald 2 hours ago

    I really don't know why Google especially seems to struggle with this so much.

    While Google have really been 'cooking' recently, every launch they do is like that. Gemini 2.5 was great, but for some reason they launched it on the web first (which still didn't list it), then a day or so later in the app, at which point I had thought it was total vapourware.

    This is the same - I have a Gemini Advanced subscription, but it is nowhere to be seen on mobile or in the app. If you're having scale/rollout issues, how hard is it to put the model somewhere and say 'coming really soon'? Otherwise you don't know whether it's not launched yet or you're just missing where to find it.

  • siva7 3 hours ago

    you're using it wrong. change the file extension to .txt instead

    • bornfreddy 3 hours ago

      I can't tell if this is sarcasm or helpful advice.

      • Workaccount2 3 hours ago

        It's how you have to do it. The gemini model is excellent, but the implementation/chat environment seems like it was thrown together in a weekend as an afterthought.

        You cannot upload a .py file, but if you change the name to "main.txt" you can upload it, and it will automatically be treated as "main.py". Not sure how this hasn't been fixed yet, but it is Google, so...
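
        The workaround as shell commands (the filenames here are just examples):

```shell
echo 'print("hello")' > main.py    # some Python source you want to share
cp main.py main.txt                # same bytes, but an accepted extension
cmp -s main.py main.txt && echo "same content"   # prints: same content
```

        Then upload main.txt in the chat UI; the model reads the contents as Python regardless of the name.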

xnx 6 hours ago

This is amazing. I wouldn't think that something as computationally expensive as generating 8 second videos would be available outside of paid API anytime soon.

torginus 4 hours ago

I am not really technical in this domain, but why is everything text-to-X?

Wouldn't it be possible to draw a rough sketch of a terrain, drop a picture of the character, draw a 3D spline for the walk path, while having a traditional keyframe style editor, and give certain points some keyframe actions (like character A turns on his flashlight at frame 60) - in short, something that allows minute creative control just like current tools do?

  • nodja 3 hours ago

    Dataset.

    To train these models you need inputs and expected outputs. For text-image pairs there exist vast amounts of data (in the billions). The models are trained on text + noise to output a denoised image.

    Datasets of sketch-image pairs are significantly smaller, but you can finetune an already-trained text->image model on the smaller dataset by replacing the noise with a sketch (or anything else, really). The quality of the finetuned model's output will depend heavily on the base text->image model. You only need several thousand samples to create a decent (but not excellent) finetune.

    You can even do it without finetuning the base model, by training a separate network that applies on top of the base text->image model's weights. This lets you have a model that can essentially wear many hats and do all kinds of image transformations without affecting the performance of the base model. These are called ControlNets and are popular with the Stable Diffusion family of models, but the general technique can be applied to almost any model.
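
    The trick that makes these bolt-on conditioning branches safe can be shown in a few lines (a toy sketch with made-up shapes, not a real diffusion model): the control branch's output layer is zero-initialized, so at initialization the combined model behaves exactly like the frozen base model, and training only gradually mixes the condition in.

```python
import numpy as np

# Frozen base network plus a conditioning branch whose final layer is
# zero-initialized, so before finetuning the combined model's output is
# identical to the base model's.
rng = np.random.default_rng(1)

W_base = rng.random((4, 4))   # frozen base weights
W_ctrl = rng.random((4, 4))   # trainable control branch
W_zero = np.zeros((4, 4))     # zero-initialized output layer

def base(x):
    return W_base @ x

def with_control(x, cond):
    # The control branch reads the condition (e.g. a sketch); its
    # contribution passes through the zero layer before being added.
    return base(x) + W_zero @ (W_ctrl @ cond)

x = rng.random(4)
cond = rng.random(4)

# At initialization the control branch contributes nothing.
print(np.allclose(with_control(x, cond), base(x)))  # True
```

    Once training updates the zero layer, the condition starts steering the output without the base weights ever changing.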

  • wepple 36 minutes ago

    LLMs were entirely text not that long ago.

    Multi modality is new; you won’t have to wait too long until they can do what you’re describing.

  • minimaxir 4 hours ago

    Everything is text-to-X because it's less friction and therefore more fun. It's more a marketing thing.

    There are many workflows for using generative AI to adhere to specific functional requirements (the entire ComfyUI ecosystem, which includes tools such as LoRAs/ControlNet/InstantID for persistence) and there are many startups which abstract out generative AI pipelines for specific use cases. Those aren't fun, though.

  • Rebelgecko 4 hours ago

    You can do image+text as well (although maybe the results are better if you do raw image to prompted image to video?)

delichon 6 hours ago

I think I would buy "yes" shares in a Polymarket event that predicts a motion picture created by a single person grossing more than $100M by 2027.

  • tracerbulletx 5 hours ago

    Everyone keeps ignoring supply and demand when talking about the impacts of AI. Let's just assume it really gets so good you can do this and it doesn't suck.

    Yes, the costs will get so low that there will be almost no barrier to making content. But if there is no barrier to making content, the ROI will be massive, so everyone will be doing it. You'll more or less be able to have the exact movie you want in your head, on demand, and even if you want a bespoke movie from an artist with great taste and a point of view, there will be 10,000 of them every year.

    • motoxpro 4 hours ago

      Totally agree.

      This is what Instagram and YouTube did and we got MrBeast and Kylie Jenner making billions of dollars. The cost of creating content is tapping record on your phone and the traditional "quality" as defined by visuals doesn't matter (see Quibi). Viral videos are selfies recorded in the bedroom.

      When you lower the barrier to entry, things get more heterogeneous, not less. So you have bigger outcomes, not smaller, because the playing field expands. TikTok was built on surfacing the 1 good video from a pool of tens of millions. The platforms that surface the best content will be even more important.

      It's a little disheartening, I think, for people to think that the only reason they can't be creative is money, time, or technical skill, but in reality, it's just that they aren't that creative.

      So yes, everyone can create content in a world of AI, but not everyone is a good content creator/director/artist (or has the vision), same as it is now.

      • SirMaster 4 hours ago

        Will the AI itself never be a good content creator/director/artist?

        People are always out there trying to convince others that AI is better than humans at X. How close is it to being better than humans at being a content creator itself? Or how long before that threshold is crossed?

        • Workaccount2 3 hours ago

          It will always be subjective. There will always be holdouts who will denounce any AI work as "bad" simply because it was created by AI.

          Even when AI is objectively better and dominates in blind ratings tests, there will still be a strong market for "authentic" media.

          For instance, we already have factories that churn out wares that are cheaper, stronger, better looking, and longer lasting than "hand made", yet people still seek out malformed $60 coffee mugs from the local artisan section in country shops.

      • darepublic 3 hours ago

        I don't think Mr Beast is particularly creative. He makes common denominator crap that appeals to kids. I expect the same of Kylie Jenner

        • motoxpro 2 hours ago

          You may not like them, as another poster said, it's all subjective.

          That doesn't mean they aren't incredibly good at what they do and that millions (billions) of people have tried to do what they have and failed.

          One of the reasons it's "common denominator crap" is that the blob of the internet has hundreds of millions of videos copying MrBeast, and the Jenners/Kardashians created an entire generation of people who wanted to be influencers. Most of the copies are slop.

          Once they are entrenched, they can continue to produce "crap", as you call it, because they have distribution. The copies don't work because they aren't novel, which makes people feel like it doesn't take talent and is the algorithm's fault, until the next person to be "creative" gets distribution and the cycle repeats.

          There is just a lot less creativity than people imagine. It's not a right that we all have as humans; it's rare. 8.2 billion people on earth, 365 days in a year, 3 trillion shots on goal, and only a few hundred novel discoveries, art creations, companies, and ideas come from it.

    • yorwba 5 hours ago

      And one of those 10,000 will have a multimillion-dollar marketing budget, and people will be talking about it online and remixing it into memes, and it will make a lot more money than the second-most-popular movie, even though there's no discernible quality difference.

      • Asraelite 5 hours ago

        It will basically be like the rise of indie games. Every now and again you get something like Among Us which is low quality but good enough to be enjoyable and with the right combination of luck and timing it becomes insanely popular.

        • Wowfunhappy 5 hours ago

          Not just Among Us. You also get Minecraft!

    • barrenko 5 hours ago

      To quote Nikita Bier, never underestimate how many people just want to watch Netflix and die.

    • GloamingNiblets 4 hours ago

      A good parallel is writing books. Books can cost little to write and publish, but their success is Pareto distributed, not Normally distributed.

      • mlboss an hour ago

      Writing a book is really expensive. You have to think, and put words on paper that engage the reader. It is really hard.

    • panarky 4 hours ago

      That was the story with CGI too, that there would be overwhelming supply that drives prices and value toward zero.

      And yet Marvel exists.

      Turns out in a world of infinite supply, value comes from story, character, branding, marketing and celebrity. Those factors in combination have very limited supply and people still pay up.

      I don't see any reason why AI-gen video is any different.

      • tracerbulletx 2 hours ago

        It's still quite difficult and extremely time-consuming to create a visual effect, and the technique of filming actors and blending them in is additionally quite difficult. If you get to the point where one person can make a movie, yes, you will be limited by your own creativity, but the number of people who can do that is still a lot greater than the number who can do that and also manage a 200-million-dollar production and get an end product that meets their vision.

    • nmilo 5 hours ago

      It will be like YouTube. Distribution will be hard and most of it will be slop but every now and then you’ll discover something so good and so creative and it couldn’t have possibly existed before that it makes the whole experience worth it. The best creative works are led by one person and I’m excited to see what people can come up with.

    • googlryas 3 hours ago

      A lot of people can't actually say what kind of movie they want, until they see it. And even if there are 100,000 releases every year in every genre, virality will probably still exist where even if random, one of those movies is going to get more popular than the rest and then everyone will "need" to see it.

    • mvdtnz 5 hours ago

      Most of us have no idea what movies we want. The most delightful films are a total surprise (other than the drones who watch every Marvel film of course).

  • jddj 5 hours ago

    I came to the same sort of conclusion when watching Kitsune, which I think was one person and VEO https://vimeo.com/1047370252

    Granted, 5 minutes isn't 1h30 but it's not a million miles away either.

    • xrd 5 hours ago

      It's fantastic.

      I just watched Kitsune, thanks for sharing.

      It reminds me why Flow was so good.

      Flow was great because I could see the shader artifacts. It was the opposite of a Disney model, it was not polished and perfect.

      That's why I loved it. Disney would never do a movie with a plot like Flow. They would write and rewrite it and it would be a perfect example of humanity, but totally devoid of the humanity behind it.

      It is ironic that this new coming wave of AI-generated (or AI-assisted) films feels like it has more human craftsmanship than Disney films, when honestly it is the opposite. Disney has incredible and brilliant animators, but all of that is crushed beneath the merchandising and gross behemoth of the Disney corporation.

      I used to love seeing independent films. Those art house theaters really only exist in places like Portland, OR these days. But, I'm excited about the next wave of film because it'll permit small storytelling, and that's going to be great.

    • msabalau 2 hours ago

      Kitsune is great!

      I've been a VideoFX tester, and have made a couple of five minute shorts. You end up having to generate a lot of shots that you throw away. This is a lot easier to bear if you are tester without really strict monthly limits, or having to pay to get past them.

      Also, there are all sorts of things you have to juggle or sidestep related to character consistency and sound synchronization. There'll be all sorts of improvements there, but I suspect getting to 90 minutes isn't really just a question of spending more time and generations. Right now I think a strong option for solo aspiring AI filmmakers is to work on a number of small projects to master the art, and tackle longer projects when the tooling is better.

    • gh0stcat 4 hours ago

      This is actually so amateurish and cliche it's painful. The fact that people like this shows that art never had a chance when the masses have no taste. It makes me depressed for artists and the future.

      • switchbak 3 hours ago

        Well sure, but we're in the early stages here smashing bones together. When a few million bored teenagers bang at this, I bet you'll see perspectives you've never thought of. It'd be like having someone in the 1920's listen to Nirvana - just a completely different experience.

        Given the dreck coming out of Hollywood, I'm open to that, even if other folks have to wade through a million shitty videos for me to get it.

      • jddj 4 hours ago

        It won't be novel the 100th or 1000th or millionth time, and standards will rise accordingly. But for now it is, or at least 2 months ago it was.

        Someone created that relatively coherent 5min animated story largely by communicating with a computer in natural language.

        The masses have had plenty worse

  • xnx 6 hours ago

    We've got a pretty good datapoint along that trajectory with Flow. Almost entirely one person and has grossed $36 million. https://en.wikipedia.org/wiki/Flow_(2024_film)

    • jsheard 6 hours ago

      It was a small team for sure but not a one man show, there are 22 credits for the animation work alone, plus 13 more for sound and music, not counting the director.

      • xnx 4 hours ago

        > Almost entirely one person

        It is closer to one than the staff counts of other animated films. It's a good data point to keep in mind as AI tools enable ever smaller teams to do more.

    • karolist 5 hours ago

      Went to the cinema with my kids for the 2nd time to watch this one, was pleasantly surprised to read this movie was done using Blender, highly recommended.

    • mattfrommars 5 hours ago

      One person? What do you mean? It literally says in the wiki more than one.

      This isn't solo dev game project.

  • NitpickLawyer 6 hours ago

    I think you might need qualifiers on that. Are we talking an unknown / unrelated person living in the proverbial basement, or are we talking a famous movie director? I could see Spielberg or Cameron managing to make something like that happen on their name + AI alone.

    If we're talking regular people, the best chance would be someone like Andy Weir, blogging their way to a successful book, and working on the side on a video project. I wouldn't be surprised if something along these lines happens sooner or later.

  • bookofjoe 3 hours ago

    Me too. Sam Altman recently predicted that we will see a one-person unicorn company in the near future.

  • SirMaster 4 hours ago

    Well, text generation is way ahead of video generation. Have we seen anyone create something like a best-selling or high-grossing novel with an LLM yet?

    • delichon 3 hours ago

      That's why going from one person to zero persons will be so hard. But one Kubrick/Carmack and a bunch of AI could make a compelling movie now.

  • silksowed 6 hours ago

    very excited to play around. will be attempting to see if i can get character coherence between runs. the issue with the 8s limit is it's hard to stitch clips together if characters are not consistent. good for short-form distribution but not youtube mini series or eventual movies. another comment about IP licensing is indeed an issue, but it's why i am looking towards classical works beyond their copyright dates. my goal is to eventually work from short form, to youtube, to eventual short films. tools are limited in their current form but the future is promising if i get started now.

  • colesantiago 5 hours ago

    My prediction is on track to this and this was made only 4 months ago.

    https://news.ycombinator.com/item?id=42368951

    • delichon 5 hours ago

      There may be a solo (not Han) movie good enough to compete in five years, but I doubt that Academy voters will be that welcoming of the tech that can obliterate most of their jobs by then.

      • switchbak 3 hours ago

        If AI can fix the terrible ending that was Game of Thrones, then perhaps it won't have been a complete waste after all.

      • kridsdale1 4 hours ago

        Based on the training data being pop culture, we may even get a good Han Solo movie from tools like this. Starring young Ford.

  • kevingadd 6 hours ago

    I think the obstacles there are distribution and IP rights. I think we will see content like that find widespread appeal and success but actually turning it into $100m in revenue requires having the copyright (at present, not possible for AI-generated content) and being able to convince a distributor to invest in it. Those both seem like really tough things to solve.

    • delichon 5 hours ago

      Purely AI-generated content -- with no human authorship -- is not eligible for US copyright protection. However, if a human contributes meaningfully to the final output (editing, selection, arrangement, etc.), it becomes eligible. See Thaler v. Perlmutter (2023).

      • Workaccount2 5 hours ago

        >is not eligible for US copyright protection

        Once industry adopts AI generation, which it will, a new law will be quickly signed.

        In a way, not allowing copyright of AI material really only serves a tiny group of people. "We want to empower everyone to bring their ideas to market, not just those with the ability to draw them" is not a particularly evil or amoral sentiment.

        • gh0stcat 3 hours ago

          As if the ability is not attainable, people want to be put on top of the mountain without any effort.

          • Workaccount2 3 hours ago

            When society climbs mountains, it eventually builds elevators. It's core functionality, and the reason why we are so advanced. Just take a moment to realize how many peaks you already sit on top of without even thinking about it. Your home is overflowing with cheap wares from mountains ascended ages ago that you now have "no effort" access to.

    • r58lf 2 hours ago

      Yeah, people underestimate how hard it is to get a movie into a theatre AND get people to pay for a ticket.

    Hollywood can barely get any well-made movies past $100 million these days unless it's based on some well-known franchise (Minecraft, Captain America, Snow White) or has some well-known actor.

smallnix 6 hours ago

Brave to make ads with the Ghibli style. Would have thought that's burned by now.

  • gh0stcat 4 hours ago

    No one has any morals or soul at this point. It's all garbage in, garbage out.

  • minimaxir 5 hours ago

    Looking at the video, I think there's shenanigans afoot. The anime picture they input as a sample image is more generic anime, but the example output image is clearly Ghibli-esque in the same vein as the 4o image generations.

byearthithatius 5 hours ago

Very impressive release compared to what was possible even a single year ago. It feels like we are in a great state right now with respect to ML where all the big companies are competing and pushing each other to make the tech better. This is rare nowadays in America (or in general).

Zopieux 36 minutes ago

No thanks. Just stop.

volkk 4 hours ago

this is semi-relevant -- and I do love how technically amazing this all is -- but a massive caveat from someone who's been dabbling hard in this space (images + video): I cannot emphasize enough how draining text-2-<whatever> is. even when a result comes out that's kind of cool, I feel nothing, because it wasn't really me who did it.

I would say 97% of the time the results are not what I want (and of course that's the case; it's just textual input), so I change the text slightly, and a whole new thing comes out that is once again incorrect, and then I sit there for 5 minutes while some new slop churns out of the slop factory. All of this back and forth drains not only my wallet/credits but my patience and my soul. I really don't know how these "tools" are ever supposed to help creatives, short of generating short-form ad content that few people really want to work on anyway. So far the only products spawning from these tools are tiktok/general internet spam companies.

The closest thing that I've bumped into that actually feels like it empowers artists is https://github.com/Acly/krita-ai-diffusion that plugs into Krita and uses a combination of img2img with masking and txt2img. A slightly more rewarding feedback loop

hu3 3 hours ago

Is there a tool to generate AI videos that doesn't change the original picture so much?

Whisk redraws the entire thing, and the output barely resembles the source picture.

bk496 4 hours ago

I wonder what takes more compute power: this or a blender render farm?

ninininino 3 hours ago

As usual with Gen AI the curated demo itself displays misunderstanding and failure to meet the prompt. In the "Glacial Cavern" demo, the "candy figures" are not within the ice walls but are in the foreground/center of the scene.

These things are great (I am not being sarcastic, I mean it when I say great) if and only if you don't actually care about all of your requirements being met, but if exactness matters they are mind-bogglingly frustrating because you'll get so close to what you want but some important detail is wrong.

transformi 6 hours ago

So many bugs... Couldn't generate even one video on AI-studio (due to errors in the platform). Shame on Google for those poor releases.