We introduce Vera 1.6, a unified multimodal language model developed by Cortex Research and purpose-built for agentic applications. Vera 1.6 employs a novel hybrid architecture combining Gated Delta Networks with Sparse Mixture-of-Experts (SMoE), enabling highly efficient inference at scale. Trained on a proprietary 150B-token synthetic dataset and further aligned via Reinforcement Learning, the model achieves strong performance across instruction following, graduate-level reasoning, multilingual understanding, agentic tool use, and multimodal comprehension. With a 1M-token context window, native vision and video processing, and support for 201+ languages, Vera 1.6 is designed as the backbone for production-grade agentic AI systems.
Introduction
Deploying frontier models in production agentic systems — where models must autonomously execute multi-step tasks, interact with external tools, navigate web interfaces, and maintain coherence over extremely long contexts — presents unique architectural and training challenges that general-purpose models are not optimally designed to address.
Vera 1.6 is Cortex Research's second-generation multimodal model in the Vera series, architected specifically for agentic workloads. Vera 1.6 introduces a redesigned hybrid attention mechanism combining Gated Delta Networks for efficient linear attention with periodic Gated Attention layers, paired with a Sparse Mixture-of-Experts feed-forward structure. This design enables favourable compute-performance trade-offs at deployment scale.
Model Architecture
Vera 1.6 is a Causal Language Model augmented with a Vision Encoder, forming a unified Vision-Language backbone. The core language model adopts a hierarchical hybrid design comprising 40 layers organised into 10 macro-blocks. Each macro-block contains four sub-blocks that interleave efficient linear and full quadratic attention with sparse feed-forward routing, in a 3:1 ratio of linear-to-quadratic attention — reducing per-token FLOPs for the majority of layers while periodic full attention layers maintain global context integration across the 1M-token window.
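The macro-block layout above can be sketched as a simple layer schedule. This is an illustrative reconstruction, not the production implementation: the position of the full-attention sub-block within each macro-block is an assumption.

```python
def build_layer_schedule(num_macro_blocks: int = 10, sub_blocks: int = 4):
    """Return the attention type of each layer (40 = 10 macro-blocks x 4 sub-blocks)."""
    schedule = []
    for _ in range(num_macro_blocks):
        # Three linear (Gated DeltaNet) sub-blocks for every full quadratic
        # attention sub-block: the 3:1 ratio described above. Placing the
        # full-attention layer last in each macro-block is an assumption.
        schedule.extend(["linear"] * (sub_blocks - 1) + ["full"])
    return schedule

layers = build_layer_schedule()
assert len(layers) == 40
assert layers.count("linear") == 30 and layers.count("full") == 10
```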
Gated Delta Networks
The primary attention mechanism is the Gated Delta Network (GDN), a form of linear recurrent attention achieving O(L) inference complexity with respect to sequence length L. This is critical at 1M-token context lengths where quadratic complexity would be prohibitively expensive. The architecture uses 32 value (V) heads and 16 query/key (QK) heads, each with a head dimension of 128. The asymmetric head count increases representational capacity in the value projection while reducing the overhead of the recurrent state update computation.
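As an illustrative sketch of why this is O(L): the gated delta rule maintains a constant-size state matrix per head and updates it once per token. The single-head form below follows the generic gated delta rule from the literature; the gate parameterisation and dimensions are simplified assumptions, not Vera 1.6's exact formulation.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step for a single head. S has shape (d_v, d_k)."""
    S = alpha * S                          # forget gate decays the old state
    pred = S @ k                           # current prediction for key k
    S = S + beta * np.outer(v - pred, k)   # delta-rule correction toward v
    return S, S @ q                        # updated state and output o_t

rng = np.random.default_rng(0)
d_k, d_v, seq_len = 4, 8, 16               # tiny illustrative dimensions
S = np.zeros((d_v, d_k))
for _ in range(seq_len):                   # O(L): fixed-size state per step
    q, k, v = rng.normal(size=d_k), rng.normal(size=d_k), rng.normal(size=d_v)
    S, o = gated_delta_step(S, q, k, v, alpha=0.9, beta=0.5)
```

Because the state S never grows with sequence length, memory and per-token compute stay constant, in contrast to the KV cache of quadratic attention.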
Gated Attention
Full quadratic attention layers appear once per macro-block and employ Grouped-Query Attention (GQA) with a 16:2 query-to-KV head ratio, reducing KV-cache memory by 8× relative to multi-head attention while preserving model quality. Rotary Position Embeddings (RoPE) are applied with a compressed dimension of 64, optimised for long-range dependency modelling. This GQA configuration makes single-node deployment practical for enterprise customers.
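The 8× reduction follows directly from the head ratio, as a back-of-the-envelope calculation shows. The fp16 dtype (2 bytes) and 128 head dimension are illustrative assumptions, as is the premise that only the ten full-attention layers maintain a KV cache.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache size: 2 tensors (K and V) per layer, per cached head."""
    return 2 * seq_len * layers * kv_heads * head_dim * bytes_per_elem

# Only the 10 full-attention layers (one per macro-block) cache K/V.
mha = kv_cache_bytes(seq_len=1_000_000, layers=10, kv_heads=16, head_dim=128)
gqa = kv_cache_bytes(seq_len=1_000_000, layers=10, kv_heads=2, head_dim=128)
assert mha // gqa == 8   # the 16:2 ratio yields the 8x memory saving
```

Under these assumptions, a full 1M-token cache shrinks from roughly 82 GB to roughly 10 GB, which is what brings single-node serving within reach.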
Sparse Mixture-of-Experts
Every attention layer is followed by a Sparse Mixture-of-Experts (SMoE) feed-forward block. With 256 total experts and only 9 activated per token (8 routed + 1 shared), the model achieves a ~28× capacity multiplier over a dense model of equivalent activated parameter count. The shared expert is always active and provides a general-purpose pathway, while the learned router selects 8 task-specialised experts dynamically per token, enabling fine-grained input-conditioned computation.
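The routing scheme can be sketched as below. This is a generic top-k MoE router with a shared expert, not Vera 1.6's production router; the softmax-over-selected-experts gating and all dimensions are illustrative assumptions.

```python
import numpy as np

def moe_forward(x, router_w, shared_expert, experts, k=8):
    """Route token state x through 1 shared + k of len(experts) routed experts."""
    logits = router_w @ x                       # router scores, one per expert
    top_k = np.argsort(logits)[-k:]             # indices of the k routed experts
    gates = np.exp(logits[top_k] - logits[top_k].max())
    gates /= gates.sum()                        # softmax over selected experts
    out = shared_expert(x)                      # shared expert is always active
    for g, idx in zip(gates, top_k):
        out = out + g * experts[idx](x)         # sparse, input-conditioned sum
    return out

rng = np.random.default_rng(0)
d, num_experts = 16, 256                        # tiny hidden dim for illustration
router_w = rng.normal(size=(num_experts, d))
experts = [lambda x, W=rng.normal(size=(d, d)) * 0.1: W @ x
           for _ in range(num_experts)]         # stand-in expert FFNs
y = moe_forward(rng.normal(size=d), router_w, shared_expert=lambda x: x,
                experts=experts)
```

Only 9 of the 256 expert FFNs execute per token, which is the source of the capacity multiplier over a dense model with the same activated parameter count.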
Vision Encoder
Vera 1.6 incorporates a Vision Encoder that projects image and video feature maps into the language model embedding space, forming a unified Vision-Language model without separate specialist modules. The encoder supports still images, multi-page documents, and video sequences, with architectural compatibility inherited from the Qwen3.5 visual stack.
| Property | Value |
|---|---|
| Foundation | Proprietary MoE Base |
| Architecture | Hybrid: Gated DeltaNet + SMoE |
| Context Window | 1,000,000 tokens |
| Languages | 201+ |
| Modalities | Text, Image, Video |
| Hidden Dimension | 2,048 |
| Layers | 40 (10 macro-blocks) |
| Training Dataset | 150B-token synthetic corpus |
| Hardware | NVIDIA DGX B200 (8× B200, 1,440 GB) |
Training Methodology
Pre-training Foundation
Vera 1.6 is built on a state-of-the-art MoE base model providing strong multilingual understanding, mathematical reasoning, and code comprehension as a foundation for task-specific alignment stages.
Agentic Supervised Fine-tuning
The primary training signal is a proprietary 150B-token synthetic dataset constructed for agentic applications using the NVIDIA NeMo Data Designer framework. Coverage includes:

- Multi-step agentic task planning and execution
- API and tool calling with complex function schemas
- Agentic web browsing and information retrieval
- Terminal and shell command execution sequences
- Long-context document processing and synthesis
- Multimodal instruction following with interleaved vision inputs
- Cross-lingual generalisation across 201+ languages
Reinforcement Learning Alignment
Following SFT, Vera 1.6 undergoes a Reinforcement Learning stage to optimise behaviour for agentic task completion. Reward signals are derived from task success metrics across code execution outcomes, tool-call accuracy, instruction adherence, and multimodal comprehension quality. This stage improves self-correction, long-horizon planning, and reliable external tool invocation within agentic pipelines.
Training Infrastructure
All training was conducted on a single NVIDIA DGX B200 node, comprising 8× Blackwell B200 GPUs providing 1,440 GB of total GPU memory, 2× Intel Xeon Platinum 8570 CPUs, 2 TB DDR5 system memory, and NVLink GPU-to-GPU interconnect.
Benchmark Evaluation
Vera 1.6 was evaluated across twelve diverse benchmarks spanning instruction following, scientific reasoning, mathematics, multilingual understanding, agentic tool use, and multimodal comprehension.

| Benchmark | Score | Category |
|---|---|---|
| HMMT Feb 2025 | 92.0% | Mathematics |
| OmniDocBench v1.5 | 89.3% | Document Understanding |
| Video-MME | 87.3% | Video Reasoning |
| MMMLU | 86.2% | Multilingual |
| GPQA Diamond | 85.9% | Graduate Science |
| IFBench | 76.5% | Instruction Following |
| MMMU-Pro | 75.1% | Visual Reasoning |
| SWE-bench Verified | 72.4% | Agentic Coding |
| BFCL V4 | 69.1% | Tool Use |
| ERQA | 64.7% | Embodied Reasoning |
| BrowseComp | 61.0% | Agentic Search |
| Terminal-Bench 2 | 41.6% | Terminal Coding |
Vera 1.6 achieves particularly strong performance in mathematical reasoning with 92.0% on HMMT Feb 2025, document understanding at 89.3% on OmniDocBench v1.5, and video comprehension at 87.3% on Video-MME. An MMMLU score of 86.2% confirms broad multilingual capability across the 201+ supported languages. Graduate-level scientific reasoning at 85.9% on GPQA Diamond reflects the effectiveness of the RL alignment stage.
Agentic benchmarks reveal expected task-dependent variance. SWE-bench Verified (72.4%) and BFCL V4 (69.1%) demonstrate strong capability in code-based and tool-use agentic tasks. Terminal-Bench 2 (41.6%) and BrowseComp (61.0%) highlight active development areas in low-level terminal execution and autonomous web navigation — both domains prioritised in subsequent Vera releases.
Discussion
Architectural Trade-offs
The hybrid Gated DeltaNet + SMoE design represents a deliberate trade-off optimised for production agentic deployment. The 3:1 linear-to-full-attention ratio substantially reduces per-token FLOPs for long-context inference. GQA (16Q/2KV) in the quadratic attention layers reduces KV-cache memory by 8× compared to multi-head attention, making single-node deployment practical for enterprise customers. The 256-expert SMoE with 9 active experts per token achieves a ~28× capacity multiplier over a dense model of equivalent activated parameter count.
Limitations
Terminal-Bench 2 performance (41.6%) indicates that low-level shell execution remains challenging, attributable to limited terminal-execution trajectories in training data and the compounding error sensitivity of sequential shell commands. BrowseComp (61.0%) similarly reflects the difficulty of long-horizon autonomous web navigation. Both areas are being addressed through targeted data generation and RL reward shaping.
Future Work
Planned improvements include expanded terminal and shell execution training data, improved tool-use generalisation across novel API schemas, enhanced video understanding for extended clips, extended multimodal modalities, and further RL alignment for complex multi-agent coordination and collaborative agentic scenarios.
Conclusion
Vera 1.6 is a multimodal language model with Gated Delta Networks and Sparse Mixture-of-Experts, purpose-built for agentic applications. The model achieves strong performance across mathematical reasoning, document understanding, multilingual knowledge, and graduate-level scientific Q&A.
With a 1M-token context window, 201+ language support, and native vision and video capabilities, Vera 1.6 is designed to serve as a capable and efficient backbone for production-grade agentic AI systems. Vera 1.6 represents Cortex Research's ongoing commitment to building models that balance frontier performance with practical deployment efficiency — so that powerful AI remains accessible to every team, not just the most well-resourced.



