fovi.arch.vit

fovi.arch.vit.apply_2d_rotary_pos_emb(q, k, cos_x, sin_x, cos_y, sin_y)[source]

Apply 2D rotary position embeddings to query and key tensors.

Parameters:
  • q – Query tensor [batch_size, num_heads, seq_len, head_dim]

  • k – Key tensor [batch_size, num_heads, seq_len, head_dim]

  • cos_x – Cosine embeddings for x-coordinate [seq_len, half_head_dim]

  • sin_x – Sine embeddings for x-coordinate [seq_len, half_head_dim]

  • cos_y – Cosine embeddings for y-coordinate [seq_len, half_head_dim]

  • sin_y – Sine embeddings for y-coordinate [seq_len, half_head_dim]

Returns:

Query and key tensors with 2D rotary embeddings applied

Return type:

Tuple[Tensor, Tensor] of (q_rot, k_rot)
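As a rough illustration of what this function computes, here is a minimal self-contained sketch. It assumes the head dimension is split into an x-half and a y-half, each rotated with the usual rotate-half RoPE scheme; the actual channel layout in fovi may differ:

```python
import torch

def rotate_half(x):
    # Split the last dim into two halves (a, b) and return (-b, a).
    a, b = x.chunk(2, dim=-1)
    return torch.cat((-b, a), dim=-1)

def apply_2d_rotary_pos_emb(q, k, cos_x, sin_x, cos_y, sin_y):
    # Assumption: first half of head_dim encodes x, second half encodes y.
    qx, qy = q.chunk(2, dim=-1)
    kx, ky = k.chunk(2, dim=-1)
    # cos_*/sin_* are [seq_len, half_head_dim] and broadcast over batch/heads.
    qx = qx * cos_x + rotate_half(qx) * sin_x
    qy = qy * cos_y + rotate_half(qy) * sin_y
    kx = kx * cos_x + rotate_half(kx) * sin_x
    ky = ky * cos_y + rotate_half(ky) * sin_y
    return torch.cat((qx, qy), dim=-1), torch.cat((kx, ky), dim=-1)
```

With all-ones cosines and all-zero sines the rotation is the identity, which is a quick sanity check on the shapes.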

class fovi.arch.vit.PositionalEncoding(embed_dim: int, coords: Tensor | None = None, num_patches_h: int | None = None, num_patches_w: int | None = None, device: str = 'cuda')[source]

Bases: Module

Positional encoding based on xy coordinates.

This creates positional encodings using the xy coordinates of each patch, allowing the model to understand spatial relationships in the original image space.

When coords=None, grid coordinates are computed from num_patches_h and num_patches_w.

__init__(embed_dim: int, coords: Tensor | None = None, num_patches_h: int | None = None, num_patches_w: int | None = None, device: str = 'cuda')[source]

Initialize xy positional encoding.

Parameters:
  • embed_dim – Embedding dimension

  • coords – xy coordinates of patches [num_patches, 2]. If None, computed from grid dims.

  • num_patches_h – Number of patches in height (required if coords is None)

  • num_patches_w – Number of patches in width (required if coords is None)

  • device – Device to run on
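The coords=None fallback can be pictured as building a flat [num_patches, 2] grid from the patch counts. A hypothetical helper (grid_coords is not part of the fovi API, and the x-before-y column order is an assumption) might look like:

```python
import torch

def grid_coords(num_patches_h, num_patches_w):
    # One (x, y) pair per patch, row-major, shape [num_patches, 2].
    ys, xs = torch.meshgrid(
        torch.arange(num_patches_h, dtype=torch.float32),
        torch.arange(num_patches_w, dtype=torch.float32),
        indexing="ij",
    )
    return torch.stack((xs.flatten(), ys.flatten()), dim=-1)
```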

forward(x: Tensor) → Tensor[source]

Add positional encoding to input tokens.

Parameters:

x – Input tokens [batch_size, num_patches, embed_dim]

Returns:

Tokens with positional encoding added
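One plausible way to turn xy coordinates into an additive encoding is a sinusoidal feature map per axis, with half the channels devoted to x and half to y. This sketch is an assumption about the scheme, not the fovi implementation:

```python
import torch

def xy_sincos_encoding(coords, embed_dim):
    # coords: [num_patches, 2]; half the channels encode x, half encode y
    # (an assumption). Requires embed_dim divisible by 4.
    half = embed_dim // 2
    freqs = 1.0 / (10000 ** (torch.arange(0, half, 2, dtype=torch.float32) / half))
    parts = []
    for axis in range(2):
        angles = coords[:, axis:axis + 1] * freqs  # [num_patches, half // 2]
        parts.append(torch.cat((angles.sin(), angles.cos()), dim=-1))
    return torch.cat(parts, dim=-1)  # [num_patches, embed_dim]
```

The forward pass would then reduce to broadcasting this table over the batch: `x + enc`.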

class fovi.arch.vit.RoPEPositionalEncoding(embed_dim: int, num_heads: int, coords: Tensor | None = None, num_patches_h: int | None = None, num_patches_w: int | None = None, device: str = 'cuda')[source]

Bases: Module

2D Rotary Position Embeddings (RoPE) based on xy coordinates.

This applies 2D rotary embeddings using the spatial coordinates of each patch, allowing the model to understand 2D spatial relationships in the original image space.

Note: This is designed to work with patch tokens only. The CLS token should be handled separately in the attention layer.

When coords=None, grid coordinates are computed from num_patches_h and num_patches_w.

__init__(embed_dim: int, num_heads: int, coords: Tensor | None = None, num_patches_h: int | None = None, num_patches_w: int | None = None, device: str = 'cuda')[source]

Initialize 2D RoPE positional encoding.

Parameters:
  • embed_dim – Embedding dimension

  • num_heads – Number of attention heads

  • coords – xy coordinates of patches [num_patches, 2]. If None, computed from grid dims.

  • num_patches_h – Number of patches in height (required if coords is None)

  • num_patches_w – Number of patches in width (required if coords is None)

  • device – Device to run on

forward(q: Tensor, k: Tensor) → Tuple[Tensor, Tensor][source]

Apply 2D RoPE to query and key tensors.

Parameters:
  • q – Query tensor [batch_size, num_heads, seq_len, head_dim] (patch tokens only)

  • k – Key tensor [batch_size, num_heads, seq_len, head_dim] (patch tokens only)

Returns:

Query and key tensors with 2D RoPE applied

Return type:

Tuple[Tensor, Tensor] of (q_rot, k_rot)
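The cos/sin tables consumed by apply_2d_rotary_pos_emb have shape [seq_len, half_head_dim] per axis. A hypothetical table builder consistent with those shapes (the frequency schedule and channel pairing are assumptions for illustration):

```python
import torch

def rope_tables(coords, head_dim):
    # Per-axis frequency bands; adjacent channel pairs share a frequency
    # (an assumption). coords: [seq_len, 2].
    half = head_dim // 2
    freqs = 1.0 / (10000 ** (torch.arange(0, half, 2, dtype=torch.float32) / half))
    ang_x = coords[:, 0:1] * freqs  # [seq_len, half // 2]
    ang_y = coords[:, 1:2] * freqs
    rep = lambda a: torch.cat((a, a), dim=-1)  # duplicate to [seq_len, half]
    return rep(ang_x.cos()), rep(ang_x.sin()), rep(ang_y.cos()), rep(ang_y.sin())
```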

class fovi.arch.vit.PatchEmbedding(img_size: int = 224, patch_size: int = 16, in_channels: int = 3, embed_dim: int = 768, bias: bool = False)[source]

Bases: Module

Standard patch embedding layer for Vision Transformers.

Uses a strided Conv2d to divide the image into fixed-size patches and project each patch to a token. Positional encoding is handled separately by the transformer, not here.

__init__(img_size: int = 224, patch_size: int = 16, in_channels: int = 3, embed_dim: int = 768, bias: bool = False)[source]

Initialize patch embedding layer.

Parameters:
  • img_size – Size of input image (assumed square)

  • patch_size – Size of each patch (assumed square)

  • in_channels – Number of input channels

  • embed_dim – Embedding dimension for tokens

  • bias – Whether to use bias in convolution

forward(x: Tensor) → Tensor[source]

Forward pass: convert image to patch tokens.

Parameters:

x – Input image [batch_size, in_channels, height, width]

Returns:

Token embeddings [batch_size, num_patches, embed_dim]

Return type:

tokens
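The strided-Conv2d patchify described above can be reproduced with plain PyTorch. With the default settings, a 224×224 image yields (224/16)² = 196 tokens:

```python
import torch
import torch.nn as nn

# Conv2d with kernel_size == stride == patch_size carves the image into
# non-overlapping patches and projects each one to embed_dim.
proj = nn.Conv2d(3, 768, kernel_size=16, stride=16, bias=False)
x = torch.randn(2, 3, 224, 224)
tokens = proj(x).flatten(2).transpose(1, 2)  # [2, 196, 768]
```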

class fovi.arch.vit.MultiHeadSelfAttention(embed_dim: int, num_heads: int, dropout: float = 0.0, attn_backend: str = 'flash')[source]

Bases: Module

Standard multi-head self-attention layer for Vision Transformer with selectable backend.

__init__(embed_dim: int, num_heads: int, dropout: float = 0.0, attn_backend: str = 'flash')[source]

Initialize multi-head self-attention.

Parameters:
  • embed_dim – Embedding dimension

  • num_heads – Number of attention heads

  • dropout – Dropout probability

  • attn_backend – Attention backend (‘flash’ for Flash Attention 2, ‘standard’ for standard implementation)

forward(x: Tensor) → Tensor[source]

Forward pass through multi-head self-attention.

Parameters:

x – Input tokens [batch_size, seq_len, embed_dim]

Returns:

Attention output [batch_size, seq_len, embed_dim]
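For intuition, the two backends plausibly correspond to PyTorch's fused scaled_dot_product_attention versus an explicit softmax implementation. This standalone sketch (not the fovi code) shows that the two paths agree numerically:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, backend="standard"):
    # q, k, v: [batch, heads, seq_len, head_dim]
    if backend == "flash":
        # Dispatches to PyTorch's fused SDPA kernels when available.
        return F.scaled_dot_product_attention(q, k, v)
    # Explicit reference implementation.
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v
```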

class fovi.arch.vit.RoPEMultiHeadSelfAttention(embed_dim: int, num_heads: int, coords: Tensor | None = None, num_patches_h: int | None = None, num_patches_w: int | None = None, dropout: float = 0.0, attn_backend: str = 'flash')[source]

Bases: MultiHeadSelfAttention

Multi-head self-attention with 2D RoPE positional embeddings based on xy coordinates. Applies RoPE only to patch tokens, leaving the CLS token unchanged.

When coords=None, grid coordinates are computed from num_patches_h and num_patches_w.

__init__(embed_dim: int, num_heads: int, coords: Tensor | None = None, num_patches_h: int | None = None, num_patches_w: int | None = None, dropout: float = 0.0, attn_backend: str = 'flash')[source]

Initialize RoPE multi-head self-attention.

Parameters:
  • embed_dim – Embedding dimension

  • num_heads – Number of attention heads

  • coords – Coordinate tensor for RoPE. If None, computed from grid dims.

  • num_patches_h – Number of patches in height (required if coords is None)

  • num_patches_w – Number of patches in width (required if coords is None)

  • dropout – Dropout probability

  • attn_backend – Attention backend (‘flash’ for Flash Attention 2, ‘standard’ for standard implementation)

forward(x: Tensor) → Tensor[source]

Forward pass applying 2D RoPE to the patch-token queries and keys; the CLS token is passed through without rotation.

Parameters:

x – Input tokens [batch_size, seq_len, embed_dim]

Returns:

Attention output [batch_size, seq_len, embed_dim]
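The "RoPE on patch tokens only" behavior can be sketched as slicing off the first (CLS) token before rotation and re-attaching it afterwards. The CLS-first token order and the rope_fn callback are assumptions for illustration:

```python
import torch

def rope_skip_cls(q, k, rope_fn):
    # q, k: [batch, heads, 1 + num_patches, head_dim]; token 0 is CLS.
    q_cls, q_patch = q[:, :, :1], q[:, :, 1:]
    k_cls, k_patch = k[:, :, :1], k[:, :, 1:]
    # Rotate only the patch tokens, then put the untouched CLS token back.
    q_patch, k_patch = rope_fn(q_patch, k_patch)
    return torch.cat((q_cls, q_patch), dim=2), torch.cat((k_cls, k_patch), dim=2)
```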

class fovi.arch.vit.TransformerBlock(embed_dim: int, num_heads: int, mlp_ratio: float = 4.0, dropout: float = 0.0, attn_backend: str = 'flash')[source]

Bases: Module

Standard transformer block with self-attention and MLP.

__init__(embed_dim: int, num_heads: int, mlp_ratio: float = 4.0, dropout: float = 0.0, attn_backend: str = 'flash')[source]

Initialize transformer block.

Parameters:
  • embed_dim – Embedding dimension

  • num_heads – Number of attention heads

  • mlp_ratio – Ratio of MLP hidden dim to embed dim

  • dropout – Dropout probability

  • attn_backend – Attention backend (‘flash’ for Flash Attention 2, ‘standard’ for standard implementation)

forward(x: Tensor) → Tensor[source]

Forward pass through the transformer block: self-attention followed by the MLP, each with a residual connection.

Parameters:

x – Input tokens [batch_size, seq_len, embed_dim]

Returns:

Output tokens [batch_size, seq_len, embed_dim]
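A conventional pre-norm block with residual connections matches this description; whether fovi uses pre-norm or post-norm (and its exact MLP activation) is an assumption in this sketch:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim, heads, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):
        # Pre-norm: normalize, attend, add residual; then the same for the MLP.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))
```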

class fovi.arch.vit.RoPETransformerBlock(embed_dim: int, num_heads: int, mlp_ratio: float = 4.0, dropout: float = 0.0, coords: Tensor | None = None, num_patches_h: int | None = None, num_patches_w: int | None = None, attn_backend: str = 'flash')[source]

Bases: TransformerBlock

Transformer block with 2D RoPE-enabled self-attention based on xy coordinates.

When coords=None, grid coordinates are computed from num_patches_h and num_patches_w.

__init__(embed_dim: int, num_heads: int, mlp_ratio: float = 4.0, dropout: float = 0.0, coords: Tensor | None = None, num_patches_h: int | None = None, num_patches_w: int | None = None, attn_backend: str = 'flash')[source]

Initialize RoPE transformer block.

Parameters:
  • embed_dim – Embedding dimension

  • num_heads – Number of attention heads

  • mlp_ratio – Ratio of MLP hidden dim to embed dim

  • dropout – Dropout probability

  • coords – Coordinate tensor for RoPE. If None, computed from grid dims.

  • num_patches_h – Number of patches in height (required if coords is None)

  • num_patches_w – Number of patches in width (required if coords is None)

  • attn_backend – Attention backend (‘flash’ for Flash Attention 2, ‘standard’ for standard implementation)

class fovi.arch.vit.VisionTransformer(img_size: int = 224, patch_size: int = 16, in_channels: int = 3, embed_dim: int = 768, num_heads: int = 12, num_layers: int = 12, mlp_ratio: float = 4.0, dropout: float = 0.0, num_outputs: int = 1000, pos_emb_type: str = 'absolute', coords: Tensor | None = None, patch_embed: Module | None = None, attn_backend: str = 'standard', aggregation: str = 'cls_token')[source]

Bases: Module

Standard Vision Transformer implementation.

This is the classic ViT architecture with patch embedding and transformer blocks.

When coords=None, grid coordinates are computed from img_size and patch_size. When coords is provided (e.g., from KNNViT), those coordinates are used directly.

__init__(img_size: int = 224, patch_size: int = 16, in_channels: int = 3, embed_dim: int = 768, num_heads: int = 12, num_layers: int = 12, mlp_ratio: float = 4.0, dropout: float = 0.0, num_outputs: int = 1000, pos_emb_type: str = 'absolute', coords: Tensor | None = None, patch_embed: Module | None = None, attn_backend: str = 'standard', aggregation: str = 'cls_token')[source]

Initialize Vision Transformer.

Parameters:
  • img_size – Size of input image (assumed square)

  • patch_size – Size of each patch (assumed square)

  • in_channels – Number of input channels

  • embed_dim – Embedding dimension

  • num_heads – Number of attention heads

  • num_layers – Number of transformer layers

  • mlp_ratio – Ratio of MLP hidden dim to embed dim

  • dropout – Dropout rate

  • num_outputs – Number of output classes

  • pos_emb_type – Type of positional embedding (‘absolute’ or ‘rope’)

  • coords – Cartesian coordinates tensor [num_patches, 2]. If None, grid coords computed from img_size/patch_size.

  • patch_embed – Optional pre-specified patch embedding module

  • attn_backend – Attention backend (‘flash’ for Flash Attention 2, ‘standard’ for standard implementation)

  • aggregation – Token aggregation strategy for the model output (default ‘cls_token’)

_init_weights()[source]

Initialize model weights.

forward(x: Tensor) → Tensor[source]

Forward pass through Vision Transformer.

Parameters:

x – Input image [batch_size, in_channels, height, width]

Returns:

Model output [batch_size, num_outputs] or [batch_size, embed_dim]

Return type:

output
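The aggregation parameter defaults to ‘cls_token’. A hedged sketch of how token aggregation might work before the output head (the mean-pooling alternative here is hypothetical, not taken from the fovi source):

```python
import torch

def aggregate(tokens, mode="cls_token"):
    # tokens: [batch, 1 + num_patches, embed_dim]; token 0 is assumed CLS.
    if mode == "cls_token":
        return tokens[:, 0]
    # Hypothetical alternative: mean-pool over the patch tokens.
    return tokens[:, 1:].mean(dim=1)
```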