Thesis: I want to create "humanity-affirming" art.

0.1 In a world where every single modality has been subsumed/consumed by AI, we have fewer and fewer aspects, virtues, and actuations that make us uniquely human, that affirm our ontological uniqueness.

0.2 Art is the vehicle through which I can prove to myself that I am human. Embodiment is the only weapon we have left, and I want to explore this tension/theme at CAI studio.

1.

With the advent of self-supervised learning, advanced generative models (e.g., VAEs, diffusion models), and autoregressive transformers, every modality - text, image, audio - can be "modeled" and thus "solved" with AI.

1.0 The main blocker for AI from the 1980s to 2015 was the constraint of human labels - costly, noisy, about ten dollars a pop, unable to scale. The largest datasets were only in the O(10k)-O(100k) example range (e.g., MNIST, with 60k images of handwritten digits, each 28x28 pixels). Even later breakthroughs like ImageNet, with over 14 million human-labeled images across thousands of categories, represented a massive effort in curated data.

1.1 Then came self-supervised learning: by using web-scale data and intrinsic relationships, or "views," of the same concept, we could get these labels for free. The cost per label dropped to zero.

1.2 Text was the first modality to be modeled, through next-token prediction (models like GPT) and masked-token prediction (BERT). The decoder-only task is simple: predict the next token given the sequence of preceding tokens. This way, every text corpus and every post on the web becomes training data for an autoregressive transformer.
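
The objective is small enough to sketch in a few lines. Below is a minimal, illustrative PyTorch sketch of this decoder-only next-token loss - not any particular production model; the sizes and two-layer architecture are made-up assumptions:

```python
# Minimal sketch of the decoder-only next-token objective:
# the text itself supplies the labels, so supervision is free.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, seq_len = 50_000, 512, 128

class TinyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        L = tokens.size(1)
        # Causal mask: position t may only attend to positions <= t.
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.lm_head(h)

model = TinyDecoder()
tokens = torch.randint(0, vocab_size, (4, seq_len))    # any tokenized web text
logits = model(tokens[:, :-1])                          # predict each next token...
loss = F.cross_entropy(logits.reshape(-1, vocab_size),  # ...against the shifted input
                       tokens[:, 1:].reshape(-1))
loss.backward()
```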

1.3 Images were next to go, first modeled with ViTs (Vision Transformers; paper: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"), then via self-reconstruction as a pretraining task (masked autoencoders, MAEs).
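
To make that concrete, here is a small illustrative sketch in the spirit of ViT/MAE - not the original code; the image size, patch size, and 75% mask ratio are assumptions - showing how an image becomes a sequence of patch "tokens" and how most of them can be masked for self-reconstruction:

```python
# An image becomes a sequence of 16x16 patch tokens; after that it is
# "just text" to a transformer.
import torch

image = torch.randn(1, 3, 224, 224)          # (batch, channels, H, W)
patch = 16

# Split into non-overlapping 16x16 patches and flatten each into a vector.
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)  # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * patch * patch)

# A linear projection turns each flattened patch into a token embedding.
proj = torch.nn.Linear(3 * patch * patch, 512)
tokens = proj(patches)                        # (1, 196, 512): a 196-token "sentence"

# MAE-style self-supervision: keep a random ~25% of tokens visible and train
# the model to reconstruct the rest -- labels come from the image itself.
keep = torch.rand(1, 196).argsort(dim=1)[:, : int(196 * 0.25)]
visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, 512))
```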

1.4 Videos are just images with a temporal component. See spatio-temporal transformers.

1.5 Audio is last on the chopping block, mainly due to the lack of high-quality text-grounded data. That will be solved soon through scaffolding synthetic models and synthetic labelers.

1.6 We now have so much data, and pretrain at such large scale, that we measure datasets in the language of "tokens." Seven trillion tokens can be learned/compressed into ~80 GB of model weights that represent the breadth and depth of human aesthetic, conceptual, and rational thought throughout history.
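
As a back-of-the-envelope check on those numbers (purely illustrative arithmetic, not a measured result):

```python
# Compression ratio implied by ~7T training tokens vs ~80 GB of weights.
tokens = 7e12
weight_bytes = 80e9
bits_per_token = weight_bytes * 8 / tokens
print(f"~{bits_per_token:.2f} bits of weight storage per training token")  # ~0.09
```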

1.7 Text, in this case, is the "glue" that associates the modalities with one another. Whenever a modality has >=100M examples paired with text, we can train contrastive models (late fusion) and/or multimodal LMs (early fusion) on that pairing. CLIP is an example of contrastive late fusion that folds the text and image modalities into the same embedding space. Next came CLAP for audio.
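
A minimal sketch of the contrastive late-fusion recipe behind models like CLIP/CLAP - illustrative only; the encoders are omitted and the temperature is fixed rather than learned, which are simplifying assumptions:

```python
# Symmetric InfoNCE loss: matched (image, text) pairs are pulled together
# in a shared embedding space, mismatched pairs pushed apart.
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so similarity is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))           # i-th image matches i-th text

    # Symmetric cross-entropy: image->text and text->image.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with stand-in encoder outputs (real encoders omitted).
img = torch.randn(32, 512)
txt = torch.randn(32, 512)
loss = clip_style_loss(img, txt)
```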

1.8 Every concept has a different projection into modality space.

1.8.1 For example, the "cat concept" can be projected into audio ("the cat purring"), text ("the semiotic symbols of the letters C-A-T"), and images ("a picture of a cat").

1.8.2 Within a modality you can have separate subprojections across different symmetries - e.g., images are skew/shift/scale-invariant, so different images (a tail, whiskers, patterned fur) can each evoke the concept of a cat.

1.8.3 By abstracting one level up, we can learn any concept, generate any concept, and condition on any concept by associating each modality with the others, with text as the least common denominator, using various views and projections of that concept.

1.9 Why am I an expert in this?

1.9.1 In my four years at Waymo, I've worked extensively on training contrastive image-text models.

1.9.2 I've also used self-supervised models to model aesthetic preference (e.g., in music with approaches like MusicRL), audio quality, and description coherency.

1.9.3 I deeply understand audio, image and text. What novel combinatorial explorations can occur if we combine other modalities, not just these conventional ones?

1.9.4 Dance? Scent? Action? Emotion? What if we combine all modalities at once? Can we condition dance or rhythm on music? Live visuals on audio? How can we explore the latent space of each modality in novel ways? How might we break generative models by exploring latent spaces of low probability density? With text as glue, and every modality reducible to a latent n-dimensional multivariate Gaussian, the possibilities are endless.
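
One concrete, speculative way to probe those low-density regions: sample latents, then push them away from the Gaussian prior's typical set before decoding. The `decoder` below is a hypothetical stand-in for any generative model with a standard Gaussian latent prior (e.g., a VAE decoder); the dimensions and scaling factors are assumptions:

```python
# Explore low-probability-density regions of a latent Gaussian prior.
import torch

def sample_latents(n, dim, radius_scale=1.0):
    """Draw latents, optionally pushed toward the tails of the prior."""
    z = torch.randn(n, dim)
    # radius_scale > 1 moves samples off the typical set, into regions
    # the model rarely (or never) saw during training.
    return z * radius_scale

typical = sample_latents(8, 128, radius_scale=1.0)   # near the typical set
atypical = sample_latents(8, 128, radius_scale=3.0)  # deep in the tails
# outputs = decoder(atypical)  # hypothetical: decode and see where the model breaks
```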

2.

"I think, therefore I am."

2.1 Have you ever thought that the entire world is a simulation? How do we prove that we aren't in one right now?

2.2 This basic adage used to be the surest way to affirm that we exist. But since the release of GPT-3 and ever more powerful reasoning models (DeepSeek-R1, Gemini 2.5, RLVR-trained models), this is no longer the case.

2.3 We are no longer the only automata capable of unique and coherent thought! Every philosophical treatise holds this as an unmovable prior. The era in which humans are the only beings capable of self-reflection, of art, of understanding, of poetry - all aspects we used to think exemplified the "pinnacle" of our humanity - is over.

3.

Now given the above, what is left? Embodiment. "I feel, therefore I am."

3.1 This understanding aligns with research on trauma and somatic experience, notably Bessel van der Kolk's "The Body Keeps the Score: Brain, Mind, and Body in the Healing of Trauma", which emphasizes how deeply emotional experiences and trauma are imprinted in our physical being.

3.2 Humans are sensory machines. Emotions are found in the body.

3.3 When you feel nervous, your heart contracts. When you fall in love, your cheeks blush and serotonin is released. When you feel shame, you want to hide in the corner.

3.4 I've experienced grief - I lost three close family members over the past three years. I've also experienced ecstasy - from the early-morning raves of the Brooklyn underground to motorcycling the California coastline solo.

3.5 I know the embedding space of emotion and how crucial a role the body plays in shaping that experience - an understanding deepened by my work at Hume AI.

3.6 Without a soft plastic bag filled with organs and modulating touch and olfactory sensors, robots will never feel emotion the way we do; they can simulate it and emulate prosody, but they will never understand.

3.7 I want to explore this. This is the only thing we have left. This is the last frontier.
