fovi.arch.knnvit

class fovi.arch.knnvit.KNNPatchEmbedding(in_channels: int, embed_dim: int, in_res: int, fov: float, cmf_a: float, style: str = 'isotropic', auto_match_cart_resources: bool = True, in_cart_res: int = 224, cart_patch_size=16, patch_overlap_factor=1, device='cuda', force_patches_less_than_matched=True, new_parameterization=False, transposed=False, max_coord_val=1, sample_cortex='geodesic', ref_frame_side_length=None, **kwargs)[source]

Bases: KNNConvLayer

KNN-based patch embedding layer that replaces standard patch embedding in Vision Transformers.

Instead of dividing the image into uniform non-overlapping patches, this layer divides a foveated manifold into nearly non-overlapping KNNs to create patches.

It then performs a standard KNNConv operation for the patch embedding.

In practice, KNNPartitioningPatchEmbedding, which builds on this class, is usually preferred: it provides an optimal tiling of patches without requiring any visual inspection.
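The patch-formation idea can be sketched in plain Python. This is an illustrative sketch only; `foveated_points` and `knn_patches` are hypothetical helpers, not the fovi API. The point set is denser near the center (foveated), and each patch is the k nearest neighbors of a patch center, so adjacent patches may overlap slightly ("nearly non-overlapping"):

```python
import math

# Hypothetical helpers, not the fovi API: build a foveated point set and
# group it into KNN patches around a set of centers.

def foveated_points(n_rings=6, n_angles=8, a=0.5):
    """Log-polar-style sampling: ring radius grows with eccentricity,
    so points are denser near the fovea (center)."""
    pts = []
    for r in range(1, n_rings + 1):
        radius = a * (math.exp(r / n_rings) - 1)
        for t in range(n_angles):
            theta = 2 * math.pi * t / n_angles
            pts.append((radius * math.cos(theta), radius * math.sin(theta)))
    return pts

def knn_patches(points, centers, k):
    """Each patch = indices of the k points nearest to one center."""
    patches = []
    for cx, cy in centers:
        order = sorted(range(len(points)),
                       key=lambda i: (points[i][0] - cx) ** 2
                                   + (points[i][1] - cy) ** 2)
        patches.append(order[:k])
    return patches

pts = foveated_points()                   # 48 sample points
centers = pts[::8]                        # every 8th point as a patch center
patches = knn_patches(pts, centers, k=8)  # 6 patches of 8 point indices each
```

Unlike a cartesian grid, nothing guarantees these KNN patches tile the manifold exactly, which is what the partitioning variants below address.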

__init__(in_channels: int, embed_dim: int, in_res: int, fov: float, cmf_a: float, style: str = 'isotropic', auto_match_cart_resources: bool = True, in_cart_res: int = 224, cart_patch_size=16, patch_overlap_factor=1, device='cuda', force_patches_less_than_matched=True, new_parameterization=False, transposed=False, max_coord_val=1, sample_cortex='geodesic', ref_frame_side_length=None, **kwargs)[source]

Initialize KNN tokenization layer.

Parameters:
  • in_channels – Number of input channels

  • embed_dim – Embedding dimension for tokens

  • in_res – Input resolution

  • fov – Field of view parameter for foveated sampling

  • cmf_a – a parameter controlling foveated sampling via the CMF

  • style – Sampling style (‘isotropic’, etc.)

  • auto_match_cart_resources – Whether to automatically match cartesian resources

  • in_cart_res – Resolution of input cartesian grid

  • cart_patch_size – Size of cartesian patches

  • patch_overlap_factor – Factor for patch overlap

  • device – Device to run on

  • force_patches_less_than_matched – Whether to force the number of patches to be less than that of a matched cartesian model

  • new_parameterization – Whether to use new parameterization

  • transposed – Whether to transpose output

  • max_coord_val – Maximum coordinate value

  • sample_cortex – Cortex sampling method

  • ref_frame_side_length – Side length of the reference frame for the KNN convolution in the patch embedding (None defaults to the patch size)

  • **kwargs – Additional arguments passed to parent class

forward(x)[source]

Apply convolution using k-nearest neighbors.

Parameters:

x (torch.Tensor) – Node features from layer l [batch, d_l, N_l]

Returns:

Node features from layer l+1 [batch, d_l+1, N_l+1]

Return type:

torch.Tensor
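In spirit, the operation above gathers each patch's node features and applies one shared linear projection, much as a Conv2d patch embedding flattens each pixel patch before projecting it. A minimal pure-Python sketch, not the fovi implementation (`embed_patches` and its arguments are illustrative):

```python
# Illustrative sketch: a KNN patch embedding as gather + flatten + shared
# linear map. Real code operates on [batch, d_l, N_l] tensors; here each
# node feature is a plain list of floats.

def embed_patches(features, patches, weight):
    """features: N_l vectors of length d_l.
    patches: index lists of length k (one per patch).
    weight: embed_dim rows, each of length k * d_l.
    Returns one embed_dim token per patch."""
    tokens = []
    for idx in patches:
        flat = [v for i in idx for v in features[i]]          # gather + flatten
        tokens.append([sum(w * x for w, x in zip(row, flat))  # linear projection
                       for row in weight])
    return tokens

d_l, k, embed_dim = 3, 2, 4
features = [[float(i), 0.0, 1.0] for i in range(5)]
patches = [[0, 1], [2, 3]]
weight = [[1.0] * (k * d_l) for _ in range(embed_dim)]
tokens = embed_patches(features, patches, weight)  # 2 tokens of dim 4
```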

class fovi.arch.knnvit.PartitioningPatchEmbedding(in_channels: int, embed_dim: int, in_res: int, fov: float, cmf_a: float, style: str = 'isotropic', auto_match_cart_resources: bool = True, force_patches_less_than_matched: bool = True, in_cart_res: int = 224, cart_patch_size=16, device='cuda', transposed=False, max_coord_val=1, ref_frame_side_length=None, sample_cortex='geodesic', bias=False, arch_flag='', in_coords=None, out_coords=None)[source]

Bases: KNNPatchEmbedding

Partitioning patch embedding layer that replaces standard patch embedding in Vision Transformers.

This layer divides a foveated manifold into non-overlapping neighborhoods to create patches.

It turns these neighborhoods into KNNs with padding and then performs a standard KNNConv operation for the patch embedding.
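The partition-then-pad step can be sketched as follows (illustrative names, not the fovi API): assign every point to its nearest center, which yields non-overlapping neighborhoods, then pad each neighborhood to a common size so a fixed-shape KNNConv can run over all of them:

```python
# Hypothetical sketch: nearest-center partition of a point set, padded to a
# common neighborhood size with a sentinel index (a real implementation
# would mask the padded slots).

def partition_and_pad(points, centers, pad_index=-1):
    groups = [[] for _ in centers]
    for i, (x, y) in enumerate(points):
        nearest = min(range(len(centers)),
                      key=lambda c: (centers[c][0] - x) ** 2
                                  + (centers[c][1] - y) ** 2)
        groups[nearest].append(i)              # non-overlapping by construction
    k = max(len(g) for g in groups)
    return [g + [pad_index] * (k - len(g)) for g in groups]

points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.1), (1.2, 0.0)]
centers = [(0.0, 0.0), (1.0, 0.0)]
padded = partition_and_pad(points, centers)
```

Because the neighborhoods partition the point set, every input point contributes to exactly one patch, unlike the nearly non-overlapping patches of KNNPatchEmbedding.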

__init__(in_channels: int, embed_dim: int, in_res: int, fov: float, cmf_a: float, style: str = 'isotropic', auto_match_cart_resources: bool = True, force_patches_less_than_matched: bool = True, in_cart_res: int = 224, cart_patch_size=16, device='cuda', transposed=False, max_coord_val=1, ref_frame_side_length=None, sample_cortex='geodesic', bias=False, arch_flag='', in_coords=None, out_coords=None)[source]

Initialize partitioning patch embedding layer.

Parameters:
  • in_channels – Number of input channels

  • embed_dim – Embedding dimension for tokens

  • in_res – Input resolution

  • fov – Field of view parameter for foveated sampling

  • cmf_a – a parameter controlling foveated sampling via the CMF

  • style – Sampling style (‘isotropic’, etc.)

  • auto_match_cart_resources – Whether to automatically match cartesian resources

  • force_patches_less_than_matched – Whether to force patches to be less than matched

  • in_cart_res – Resolution of input cartesian grid

  • cart_patch_size – Size of cartesian patches

  • device – Device to run on

  • transposed – Whether to transpose output

  • max_coord_val – Maximum coordinate value

  • ref_frame_side_length – Reference frame side length

  • sample_cortex – Cortex sampling method

  • bias – Whether to use bias in linear layer

  • arch_flag – Architecture flag

  • in_coords – Precomputed input coordinates (optional)

  • out_coords – Precomputed output coordinates (optional)

class fovi.arch.knnvit.KNNPartitioningPatchEmbedding(in_channels: int, embed_dim: int, in_res: int, fov: float, cmf_a: float, style: str = 'isotropic', auto_match_cart_resources: bool = True, in_cart_res: int = 224, cart_patch_size=16, device='cuda', force_patches_less_than_matched=True, transposed=False, max_coord_val='auto', sample_cortex='geodesic', **kwargs)[source]

Bases: KNNPatchEmbedding

KNN-based patch embedding that builds on KNNPatchEmbedding to provide an optimal tiling of patches without any visual inspection.

__init__(in_channels: int, embed_dim: int, in_res: int, fov: float, cmf_a: float, style: str = 'isotropic', auto_match_cart_resources: bool = True, in_cart_res: int = 224, cart_patch_size=16, device='cuda', force_patches_less_than_matched=True, transposed=False, max_coord_val='auto', sample_cortex='geodesic', **kwargs)[source]

Initialize KNN partitioning patch embedding layer.

Parameters:
  • in_channels – Number of input channels

  • embed_dim – Embedding dimension for tokens

  • in_res – Input resolution

  • fov – Field of view parameter for foveated sampling

  • cmf_a – a parameter controlling foveated sampling via the CMF

  • style – Sampling style (‘isotropic’, etc.)

  • auto_match_cart_resources – Whether to automatically match cartesian resources

  • in_cart_res – Resolution of input cartesian grid

  • cart_patch_size – Size of cartesian patches

  • device – Device to run on

  • force_patches_less_than_matched – Whether to force the number of patches to be less than that of a matched cartesian model

  • transposed – Whether to transpose output

  • max_coord_val – Maximum coordinate value

  • sample_cortex – Cortex sampling method

  • **kwargs – Additional arguments passed to parent class

class fovi.arch.knnvit.KNNViT(fov: float, cmf_a: float, style: str, img_size: int = 224, patch_size: int = 16, patch_overlap_factor: float = 1, in_channels: int = 3, embed_dim: int = 768, num_heads: int = 12, num_layers: int = 12, mlp_ratio: float = 4.0, dropout: float = 0.0, num_outputs: int = 1000, device: str = 'cuda', arch_flag: str = '', sample_cortex: str = 'geodesic', pos_emb_type: str = 'absolute', force_patches_less_than_matched=True, attn_backend: str = 'flash', aggregation='cls_token', ref_frame_side_length=None)[source]

Bases: VisionTransformer

Vision Transformer that uses KNN-based tokenization instead of patch embedding.

This model inherits from VisionTransformer and only overrides the patch embedding to use KNN-based tokenization that creates tokens based on spatial relationships in the foveated coordinate system.

__init__(fov: float, cmf_a: float, style: str, img_size: int = 224, patch_size: int = 16, patch_overlap_factor: float = 1, in_channels: int = 3, embed_dim: int = 768, num_heads: int = 12, num_layers: int = 12, mlp_ratio: float = 4.0, dropout: float = 0.0, num_outputs: int = 1000, device: str = 'cuda', arch_flag: str = '', sample_cortex: str = 'geodesic', pos_emb_type: str = 'absolute', force_patches_less_than_matched=True, attn_backend: str = 'flash', aggregation='cls_token', ref_frame_side_length=None)[source]

Initialize KNNViT model.

Parameters:
  • fov – Field of view parameter for foveated sampling

  • cmf_a – a parameter controlling foveated sampling via the CMF; smaller = stronger foveation

  • style – Sampling style (‘isotropic’, etc.)

  • img_size – Size of input image

  • patch_size – Size of each patch

  • patch_overlap_factor – Factor for patch overlap

  • in_channels – Number of input channels

  • embed_dim – Embedding dimension

  • num_heads – Number of attention heads

  • num_layers – Number of transformer layers

  • mlp_ratio – Ratio of MLP hidden dim to embed dim

  • dropout – Dropout rate

  • num_outputs – Number of output classes

  • device – Device to run on

  • arch_flag – Architecture flag

  • sample_cortex – Cortex sampling method (‘geodesic’, etc.)

  • pos_emb_type – Type of positional embedding (‘absolute’ or ‘rope’)

  • force_patches_less_than_matched – Whether to force the number of patches to be less than that of a matched cartesian model, rather than only matching it as closely as possible

  • attn_backend – Attention backend (‘flash’ for Flash Attention 2, ‘standard’ for standard implementation)

  • aggregation – Method for aggregating tokens into the model output (‘cls_token’ by default)

  • ref_frame_side_length – side length of reference frame for KNN-convolution in the patch embedding (None defaults to patch_size)

forward(x: Tensor) Tensor[source]

Forward pass through KNNViT.

Parameters:

x – Input features [batch_size, in_channels, in_coords]

Returns:

Model output [batch_size, 1, embed_dim]

Return type:

torch.Tensor

class fovi.arch.knnvit.FoviDinoV3RoPE(base: int, head_dim: int, coords: Tensor, device: str = 'cuda')[source]

Bases: Module

inv_freq: Tensor
__init__(base: int, head_dim: int, coords: Tensor, device: str = 'cuda')[source]

Initialize DinoV3RoPE positional encoding.

Parameters:
  • base – Base frequency for RoPE

  • head_dim – Dimension of attention head

  • coords – Coordinate tensor

  • device – Device to run on

forward(pixel_values: Tensor) tuple[Tensor, Tensor][source]

Forward pass for RoPE positional encoding.

Parameters:

pixel_values – Input pixel values

Returns:

Tuple of cosine and sine values for RoPE
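The shape of these tables can be sketched in plain Python. This is a hedged sketch of coordinate-based RoPE in general, not the fovi or DinoV3 code: inverse frequencies decay geometrically with the (paired) dimension index, and each coordinate is multiplied by each inverse frequency to produce rotation angles whose cosines and sines are returned (`rope_tables` and its `base` default are illustrative):

```python
import math

# Illustrative RoPE tables for scalar coordinates: for head_dim d, there are
# d // 2 rotation frequencies; angle[n][i] = coords[n] * base**(-2i / d).

def rope_tables(coords, head_dim, base=100):
    """coords: list of scalar positions.
    Returns (cos, sin), each len(coords) x (head_dim // 2)."""
    inv_freq = [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]
    cos = [[math.cos(c * f) for f in inv_freq] for c in coords]
    sin = [[math.sin(c * f) for f in inv_freq] for c in coords]
    return cos, sin

cos, sin = rope_tables([0.0, 1.0, 2.0], head_dim=8)
```

In the foveated setting, the coordinates come from the (non-uniform) sampling grid rather than integer pixel positions, which is why the layer takes a `coords` tensor at construction time.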

fovi.arch.knnvit.resample_patch_embed_conv(conv: Conv2d, target_hw=(8, 8), mode: str = 'bicubic', align_corners: bool = True, preserve_kernel_norm: bool = False) Conv2d[source]

Resample a patch-embedding Conv2d’s kernels to target size.

Resamples a patch-embedding Conv2d’s kernels from (kH, kW) -> target_hw and returns a NEW Conv2d with kernel_size=stride=target_hw. Supports both upsampling and downsampling.

Parameters:
  • conv (nn.Conv2d) – The patch embedding convolution to resample.

  • target_hw (tuple, optional) – Target height and width. Defaults to (8, 8).

  • mode (str, optional) – Interpolation mode. Defaults to “bicubic”.

  • align_corners (bool, optional) – Whether to align corners. Defaults to True.

  • preserve_kernel_norm (bool, optional) – Whether to preserve kernel norm. Defaults to False.

Returns:

A new Conv2d layer with resampled kernels.

Return type:

nn.Conv2d

Note

Assumes stride == kernel_size (patch embedding), padding == 0, groups == 1, dilation == 1.
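The core of the resampling step is 2D interpolation of each kernel. The following toy sketch resamples a single 2D kernel with bilinear interpolation under the align_corners=True convention the docstring mentions; the real function operates on every (out_channel, in_channel) kernel of the Conv2d at once and supports bicubic mode (`resize_kernel` is a hypothetical helper, and it assumes the input kernel is at least 2x2):

```python
# Toy bilinear kernel resampling with align_corners=True: the corner samples
# of the input and output grids coincide, and interior samples interpolate
# between the four surrounding input values. Assumes kernel is >= 2x2.

def resize_kernel(kernel, out_h, out_w):
    in_h, in_w = len(kernel), len(kernel[0])
    sy = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    sx = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    out = []
    for oy in range(out_h):
        y = oy * sy
        y0 = min(int(y), in_h - 2)   # top row of the 2x2 source window
        wy = y - y0
        row = []
        for ox in range(out_w):
            x = ox * sx
            x0 = min(int(x), in_w - 2)
            wx = x - x0
            row.append(kernel[y0][x0] * (1 - wy) * (1 - wx)
                       + kernel[y0][x0 + 1] * (1 - wy) * wx
                       + kernel[y0 + 1][x0] * wy * (1 - wx)
                       + kernel[y0 + 1][x0 + 1] * wy * wx)
        out.append(row)
    return out

small = resize_kernel([[0.0, 1.0], [2.0, 3.0]], 3, 3)  # 2x2 -> 3x3, corners preserved
```

With preserve_kernel_norm enabled, the resampled kernels would additionally be rescaled so each filter keeps its original norm, compensating for the change in the number of taps.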