Dual Sparsity
10x × 10x = 100x

Our hardware can achieve multiplicative benefits in speed and efficiency when both forms of sparsity are present.

Sparse Weights

  • Supports sparsely connected models

  • Only stores and computes on weights that matter

  • 10x improvement in speed, efficiency, and memory
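The idea behind "only stores and computes on weights that matter" can be sketched in plain Python. This is an illustrative CSR-style layout, not Femtosense's actual on-chip memory format: only nonzero weights are kept, so storage and compute scale with the number of meaningful connections rather than the dense matrix size.

```python
# Hypothetical sketch of sparse weight storage (not the SPU's real format):
# keep (column, value) pairs for nonzero weights only.

def compress_rows(dense):
    """Store only the nonzero weights of each row."""
    return [[(j, w) for j, w in enumerate(row) if w != 0.0] for row in dense]

def sparse_matvec(rows, x):
    """Compute y = W @ x touching only the stored weights."""
    return [sum(w * x[j] for j, w in row) for row in rows]

dense = [
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, -3.0],
]
rows = compress_rows(dense)
stored = sum(len(r) for r in rows)              # 3 weights stored instead of 12
y = sparse_matvec(rows, [1.0, 1.0, 1.0, 1.0])   # [2.0, 0.0, -2.0]
```

With 10% weight density, this kind of layout stores roughly a tenth of the weights, which is where the claimed 10x memory benefit comes from (minus index overhead).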


Sparse Activations

  • Supports sparse activations

  • Skips computation when a neuron outputs zero

  • 10x increase in speed and efficiency
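The multiplicative benefit of combining both sparsity forms can be shown with toy numbers (an illustration, not a model of the silicon): a zero activation lets the hardware skip an entire column of work, and within each surviving column only the stored nonzero weights are touched.

```python
# Toy illustration of dual sparsity: 10% nonzero activations combined with
# 10% stored weights leaves ~1% of the dense multiply-accumulates.

def dual_sparse_macs(weight_cols, x):
    """Count multiply-accumulates when zero activations are skipped
    entirely and only stored (nonzero) weights are visited."""
    macs = 0
    for j, xj in enumerate(x):
        if xj == 0.0:
            continue                    # activation sparsity: skip whole column
        macs += len(weight_cols[j])     # weight sparsity: only stored weights
    return macs

n = 10
weight_cols = [[(i % n, 0.5)] for i in range(n)]  # 1 of 10 weights stored per column
x = [0.0] * n
x[3] = 1.0                                        # 1 of 10 activations nonzero

dense_macs = n * n                                # 100 for a dense matvec
sparse_macs = dual_sparse_macs(weight_cols, x)    # 1: a ~100x reduction
```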


Sparsity-First Approach


SPU Architecture


Sparse Acceleration

  • Native hardware accelerates sparse math and compresses sparse data
  • Custom instructions maximize efficiency and speed for algorithms with sparsity
  • Custom memory formats maximize packing density for sparse data

Near-Memory Compute

  • Disaggregate on-chip memory and disperse it near processing elements to reduce data motion and parallelize data access
  • Retain workloads on-chip to eliminate energy and throughput bottlenecks of off-chip memory access
  • Sparse acceleration maximizes effective on-chip memory capacity

Scalable Core

  • Tile or divide cores to match the needs and constraints of any deployment
  • Cover a wide range of applications and form factors with the same architecture
  • Pure digital design can be easily ported across process nodes to optimize performance vs. cost


Build with sparsity

  • Design new neural networks that maximize available memory, power, and bandwidth.
  • Optimize existing neural networks for speed, efficiency, and memory footprint with minimal impact on performance.

Tune with ease

  • Optimize for different scenarios with intuitive, flexible, and powerful fine-tuning utilities 
  • Use custom layers, sparsity regularization, sparsity loss functions, and quantization aware training to meet your objectives
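The SDK's actual fine-tuning utilities are not shown here, but the standard L1 "sparsity" regularizer such tools build on is simple to sketch: adding a penalty proportional to the sum of absolute weight values pushes weights toward exactly zero, so they can later be pruned and skipped in hardware. Names and values below are illustrative.

```python
# Illustrative L1 sparsity regularizer (not the Femtosense SDK API):
# total loss = task loss + lam * sum(|w|).

def l1_penalty(weights, lam=1e-3):
    """L1 penalty that encourages weights to become exactly zero."""
    return lam * sum(abs(w) for w in weights)

w = [0.5, -0.25, 0.0, 1.0]
task_loss = 0.42                               # placeholder task loss
loss = task_loss + l1_penalty(w, lam=0.01)     # 0.42 + 0.0175 = 0.4375
```

In a real PyTorch, TensorFlow, or JAX training loop, the same penalty term would simply be added to the loss before backpropagation.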

Deploy seamlessly

  • Deploy models to hardware simulation for rapid iteration or to actual hardware with minimal firmware.
  • Build prototypes using one of our pre-built hardware integrations.

Sparsity in Action

Femtosense’s dual sparsity design consumes far less energy than existing approaches by computing with only a small fraction of neurons active at any time.



Introducing SPU-001

  • 1.52 mm × 2.2 mm WLCSP15 package
  • 22nm ULL Process
  • 1 MB of on-chip SRAM
    • 10 MB effective memory with sparsity
    • Unused SRAM can be repurposed
  • SPI for interfacing with host processor
  • Power gating with sleep mode
  • Sub-mW inference for speech, audio, and other 1-D data
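The "10 MB effective memory" figure follows from the weight-sparsity compression above; a back-of-envelope check (assuming ~10% weight density, and ignoring index overhead) looks like this:

```python
# Rough arithmetic behind "1 MB SRAM -> 10 MB effective memory":
# if only ~10% of weights are nonzero and stored, 1 MB of physical SRAM
# covers a model that would need ~10 MB if stored densely.

SRAM_BYTES = 1 * 1024 * 1024   # 1 MB physical on-chip SRAM
DENSITY = 0.10                 # assumed fraction of nonzero weights

dense_equivalent = SRAM_BYTES / DENSITY   # ~10 MB of dense-model weights
```

In practice the index metadata for sparse formats eats into this, which is why the effective figure depends on the compression scheme as well as the density.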

Core Design


4-Core Configuration

  • 512 kB on-chip SRAM per core
  • 5 MB effective SRAM with weight sparsity
  • 1.3 mm² single core (22 nm process)
  • AXI interface

Want to learn more about SPU-001 and our IP design?

Specification Sheet


Femtosense SDK

Our SDK supports PyTorch, TensorFlow, and JAX, so developers can get started with minimal barriers. The SDK provides advanced sparse optimization tools, a model performance simulator, and the Femto compiler. 


Deploying AI Algorithms to the SPU

Start with existing models: Deploy TensorFlow, PyTorch, and JAX models to the SPU without fine-tuning. Achieve high efficiency for dense models and even higher efficiency for sparse models.


Optimize models for the highest performance with sparsity regularization and quantization aware training. Fine-tune existing models or train from scratch.


Simulate energy, latency, throughput, and footprint of your model on the SPU.


Use the Femto compiler to deploy to the SPU for verification, testing, and production.

Get updates

Be the first to know about new products and features