Welcome to Research Log #026! We document weekly research progress across the various initiatives in the Manifold Research Group, and highlight breakthroughs from the broader research community that we find interesting in the Pulse of AI!
We’re growing our core team and pursuing new projects. If you’re interested in working together, join the conversation on Discord and check out our GitHub.
NEKO
The NEKO Project aims to build the first large-scale, Open Source "Generalist" Model, trained on numerous modalities including control and robotics tasks. You can learn more about it here.
- This week, we’ve primarily been focused on debugging and improving aspects of VQA performance before merging into the master version of the model, in preparation for a larger training run!
Agent Survey
The Agent Survey is an effort to comprehensively understand the current state of LLM-based AI agents, compare them to formalizations of AI agents from other areas like RL and Cognitive Science, and identify new areas for research and capability improvements. We’re running this as a community survey, so if you’d like to get involved, check out the roadmap and join the Discord channel!
- We’ve made some progress in compiling V0 of our survey, found here.
- We’re also exploring agent definitions, and compiling work into the review material tracker.
Tool Use Models
The Tool Use Model Project is an effort to build an open-source large multimodal model capable of using digital tools. We’re starting by exploring the ability to use APIs. Get involved with the project on Discord!
- We’re currently in the middle of surveying available action models (turns out there are very few) and related datasets. We’ve compiled some example datasets here.
Pulse of AI
There have been some exciting advancements this week: a completely open-source model has entered the arena, a new text-to-video model blows other results out of the water, and text-to-3D mesh generation is now 10x faster than before. Read on for this week’s Pulse of AI!
Dolma & OLMo
Dolma and OLMo are a new dataset and state-of-the-art LLM from AllenAI. This is one of the most exciting releases in a while because everything about it is open source: the pretraining data is completely open, and OLMo ships checkpoints every 1,000 steps through the end of model training. These are truly free and open-source models!
Dolma is a corpus of three trillion tokens drawn from multiple sources, from classics like Common Crawl and C4 through Wikipedia and Project Gutenberg. Before cleaning and careful selection, it started as 200TB of raw text. Dolma was created because there is a need for bigger, higher-quality open-source datasets; it can be seen as a competitor to C4, The Pile, and the RedPajama datasets.
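At that scale you stream the corpus rather than download it. Here’s a minimal sketch of peeking at Dolma with the Hugging Face `datasets` library; the repo id, config, and field name are assumptions on our part, so check AllenAI’s dataset page for the exact names and any license gating:

```python
from datasets import load_dataset

# Stream Dolma instead of downloading it; at trillions of tokens, a
# full local copy is impractical. The repo id "allenai/dolma" is an
# assumption; see the AllenAI page for the published name and terms.
dolma = load_dataset("allenai/dolma", split="train", streaming=True)

# Inspect a handful of documents without pulling the whole corpus.
for i, doc in enumerate(dolma):
    print(doc["text"][:200])  # "text" field assumed; check the schema
    if i == 2:
        break
```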
OLMo is a family of language models trained on the Dolma dataset. The architecture is a decoder-only transformer, and two models have been released with a third awaiting release: the available models have 1B and 7B parameters, with a 65B model still training.
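Because the weights are on the Hugging Face Hub, trying OLMo out takes a few lines of `transformers` code. A minimal sketch, assuming the 1B checkpoint lives at `allenai/OLMo-1B` and currently needs `trust_remote_code` (consult the model card for the up-to-date loading instructions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id and the trust_remote_code flag are assumptions; loading
# details may change as OLMo matures, so check the model card.
name = "allenai/OLMo-1B"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)

# Generate a short continuation as a smoke test.
inputs = tokenizer("Open language models are", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```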
OLMo and Dolma are completely open source; even the tooling can be downloaded. If you want to read more, you can check out the Dolma paper here and the OLMo paper here.
Lumiere
Lumiere is a new text-to-video diffusion model from Google that basically blows the competition out of the water. The model is more realistic than previous work and can be used for stylized generation, conditional generation, image-to-video, inpainting, and cinemagraphs.
The model is based on a standard text-to-image U-Net that they “inflate” into a space-time U-Net. The crux of the model is an attention-based inflation block, while the remaining downsample and upsample blocks are convolution-based. They trained the model on 30M videos along with their text captions.
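To make “inflation” concrete, here is a hypothetical PyTorch sketch of the general idea (Google’s code isn’t public, so this is our own simplification, not theirs): spatial layers from the pretrained image U-Net run on each frame independently, and a newly added temporal layer mixes information across frames. All names and dimensions below are illustrative.

```python
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    """Illustrative sketch of inflating a 2D image block for video."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        # Spatial conv, conceptually inherited from the image model.
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Newly added temporal attention over the frame axis.
        self.temporal = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        b, c, t, h, w = x.shape
        # Spatial pass: fold time into batch, run the 2D conv per frame.
        xs = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        xs = self.spatial(xs).reshape(b, t, c, h, w)
        # Temporal pass: fold space into batch, attend across frames.
        xt = xs.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        attn, _ = self.temporal(self.norm(xt), self.norm(xt), self.norm(xt))
        xt = xt + attn  # residual connection around the temporal layer
        return xt.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)

# Smoke test: 2 videos, 8 frames, 16x16 latents, 32 channels.
video = torch.randn(2, 32, 8, 16, 16)
print(SpaceTimeBlock(32)(video).shape)  # torch.Size([2, 32, 8, 16, 16])
```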
If you want to read more, you can check out the webpage or their paper.
AToM: Amortized Text-to-Mesh using 2D Diffusion
AToM is a new text-to-3D object model from researchers at Snap Research and KAUST. The advantages of this new model are that it is roughly 10x faster than prior approaches and up to 4x more accurate than the prior state of the art on some datasets.
The model has two different pipelines, one for inference and one for training. Inference works by encoding the text prompt into a triplane, from which a 3D network generates meshes. Training has two stages: the first stage trains the 3D generation itself, and the second stage adds mesh optimization so that the output is more refined.
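The triplane representation AToM builds on is a standard trick worth unpacking: 3D space is factorized into three axis-aligned feature planes, and a point’s feature is assembled by projecting it onto each plane, bilinearly sampling, and summing. A minimal, illustrative sketch of that lookup (our own, not the authors’ code):

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """Query per-point features from a triplane.

    planes: (3, C, H, W) feature planes for XY, XZ, YZ
    points: (N, 3) coordinates in [-1, 1]
    returns: (N, C) features, one vector per query point
    """
    # Project (x, y, z) onto the XY, XZ, and YZ planes.
    coords = torch.stack(
        [points[:, [0, 1]], points[:, [0, 2]], points[:, [1, 2]]]
    )  # (3, N, 2)
    # grid_sample expects sampling grids shaped (B, H_out, W_out, 2).
    grid = coords.unsqueeze(1)  # (3, 1, N, 2)
    feats = F.grid_sample(planes, grid, mode="bilinear", align_corners=True)
    # (3, C, 1, N) -> sum the three plane contributions -> (N, C)
    return feats.squeeze(2).sum(dim=0).transpose(0, 1)

# Smoke test: three 8-channel planes, 5 random query points.
planes = torch.randn(3, 8, 32, 32)
points = torch.rand(5, 3) * 2 - 1
print(sample_triplane(planes, points).shape)  # torch.Size([5, 8])
```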
If you want to read more, the paper is here, and we hope they will eventually open source their weights.
If you want to see more of our updates as we work to explore and advance the field of Intelligent Systems, follow us on Twitter, LinkedIn, and Mastodon!