fovi.arch.knnvit

class fovi.arch.knnvit.KNNPatchEmbedding(in_channels: int, embed_dim: int, in_res: int, fov: float, cmf_a: float, style: str = 'isotropic', auto_match_cart_resources: bool = True, in_cart_res: int = 224, cart_patch_size=16, patch_overlap_factor=1, device='cuda', force_patches_less_than_matched=True, new_parameterization=False, transposed=False, max_coord_val=1, sample_cortex='geodesic', ref_frame_side_length=None, **kwargs)[source]

Bases: KNNConvLayer

KNN-based patch embedding layer that replaces standard patch embedding in Vision Transformers.

Instead of dividing the image into uniform non-overlapping patches, this layer divides a foveated manifold into nearly non-overlapping KNNs to create patches.

It then performs a standard KNNConv operation for the patch embedding.

In practice, KNNPartitioningPatchEmbedding, which builds on this class, is usually preferred: it provides an optimal tiling of patches without requiring any visual inspection.
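The patch-formation idea can be sketched in plain Python. This is an illustrative sketch only; `foveated_points` and `knn_patches` are hypothetical helpers, not the fovi API. The point set is denser near the center (foveated), and each patch is the k nearest neighbors of a patch center, so adjacent patches may overlap slightly ("nearly non-overlapping"):

```python
import math

# Hypothetical helpers, not the fovi API: build a foveated point set and
# group it into KNN patches around a set of centers.

def foveated_points(n_rings=6, n_angles=8, a=0.5):
    """Log-polar-style sampling: ring radius grows with eccentricity,
    so points are denser near the fovea (center)."""
    pts = []
    for r in range(1, n_rings + 1):
        radius = a * (math.exp(r / n_rings) - 1)
        for t in range(n_angles):
            theta = 2 * math.pi * t / n_angles
            pts.append((radius * math.cos(theta), radius * math.sin(theta)))
    return pts

def knn_patches(points, centers, k):
    """Each patch = indices of the k points nearest to one center."""
    patches = []
    for cx, cy in centers:
        order = sorted(range(len(points)),
                       key=lambda i: (points[i][0] - cx) ** 2
                                   + (points[i][1] - cy) ** 2)
        patches.append(order[:k])
    return patches

pts = foveated_points()                   # 48 sample points
centers = pts[::8]                        # every 8th point as a patch center
patches = knn_patches(pts, centers, k=8)  # 6 patches of 8 point indices each
```

Unlike a cartesian grid, nothing guarantees these KNN patches tile the manifold exactly, which is what the partitioning variants below address.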

__init__(in_channels: int, embed_dim: int, in_res: int, fov: float, cmf_a: float, style: str = 'isotropic', auto_match_cart_resources: bool = True, in_cart_res: int = 224, cart_patch_size=16, patch_overlap_factor=1, device='cuda', force_patches_less_than_matched=True, new_parameterization=False, transposed=False, max_coord_val=1, sample_cortex='geodesic', ref_frame_side_length=None, **kwargs)[source]

Initialize KNN tokenization layer.

Parameters:
  • in_channels – Number of input channels

  • embed_dim – Embedding dimension for tokens

  • in_res – Input resolution

  • fov – Field of view parameter for foveated sampling

  • cmf_a – a parameter controlling foveated sampling via the CMF

  • style – Sampling style (‘isotropic’, etc.)

  • auto_match_cart_resources – Whether to automatically match cartesian resources

  • in_cart_res – Resolution of input cartesian grid

  • cart_patch_size – Size of cartesian patches

  • patch_overlap_factor – Factor for patch overlap

  • device – Device to run on

  • force_patches_less_than_matched – Whether to force the number of patches to be less than that of a matched cartesian model

  • new_parameterization – Whether to use new parameterization

  • transposed – Whether to transpose output

  • max_coord_val – Maximum coordinate value

  • sample_cortex – Cortex sampling method

  • ref_frame_side_length – Side length of the reference frame for the KNN convolution in the patch embedding (None defaults to the patch size)

  • **kwargs – Additional arguments passed to parent class

forward(x)[source]

Apply convolution using k-nearest neighbors.

Parameters:

x (torch.Tensor) – Node features from layer l [batch, d_l, N_l]

Returns:

Node features from layer l+1 [batch, d_l+1, N_l+1]

Return type:

torch.Tensor
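In spirit, the operation above gathers each patch's node features and applies one shared linear projection, much as a Conv2d patch embedding flattens each pixel patch before projecting it. A minimal pure-Python sketch, not the fovi implementation (`embed_patches` and its arguments are illustrative):

```python
# Illustrative sketch: a KNN patch embedding as gather + flatten + shared
# linear map. Real code operates on [batch, d_l, N_l] tensors; here each
# node feature is a plain list of floats.

def embed_patches(features, patches, weight):
    """features: N_l vectors of length d_l.
    patches: index lists of length k (one per patch).
    weight: embed_dim rows, each of length k * d_l.
    Returns one embed_dim token per patch."""
    tokens = []
    for idx in patches:
        flat = [v for i in idx for v in features[i]]          # gather + flatten
        tokens.append([sum(w * x for w, x in zip(row, flat))  # linear projection
                       for row in weight])
    return tokens

d_l, k, embed_dim = 3, 2, 4
features = [[float(i), 0.0, 1.0] for i in range(5)]
patches = [[0, 1], [2, 3]]
weight = [[1.0] * (k * d_l) for _ in range(embed_dim)]
tokens = embed_patches(features, patches, weight)  # 2 tokens of dim 4
```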

class fovi.arch.knnvit.PartitioningPatchEmbedding(in_channels: int, embed_dim: int, in_res: int, fov: float, cmf_a: float, style: str = 'isotropic', auto_match_cart_resources: bool = True, force_patches_less_than_matched: bool = True, in_cart_res: int = 224, cart_patch_size=16, device='cuda', transposed=False, max_coord_val=1, ref_frame_side_length=None, sample_cortex='geodesic', bias=False, arch_flag='', in_coords=None, out_coords=None)[source]

Bases: KNNPatchEmbedding

Partitioning patch embedding layer that replaces standard patch embedding in Vision Transformers.

This layer divides a foveated manifold into non-overlapping neighborhoods to create patches.

It turns these neighborhoods into KNNs with padding and then performs a standard KNNConv operation for the patch embedding.
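The partition-then-pad step can be sketched as follows (illustrative names, not the fovi API): assign every point to its nearest center, which yields non-overlapping neighborhoods, then pad each neighborhood to a common size so a fixed-shape KNNConv can run over all of them:

```python
# Hypothetical sketch: nearest-center partition of a point set, padded to a
# common neighborhood size with a sentinel index (a real implementation
# would mask the padded slots).

def partition_and_pad(points, centers, pad_index=-1):
    groups = [[] for _ in centers]
    for i, (x, y) in enumerate(points):
        nearest = min(range(len(centers)),
                      key=lambda c: (centers[c][0] - x) ** 2
                                  + (centers[c][1] - y) ** 2)
        groups[nearest].append(i)              # non-overlapping by construction
    k = max(len(g) for g in groups)
    return [g + [pad_index] * (k - len(g)) for g in groups]

points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.1), (1.2, 0.0)]
centers = [(0.0, 0.0), (1.0, 0.0)]
padded = partition_and_pad(points, centers)
```

Because the neighborhoods partition the point set, every input point contributes to exactly one patch, unlike the nearly non-overlapping patches of KNNPatchEmbedding.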

__init__(in_channels: int, embed_dim: int, in_res: int, fov: float, cmf_a: float, style: str = 'isotropic', auto_match_cart_resources: bool = True, force_patches_less_than_matched: bool = True, in_cart_res: int = 224, cart_patch_size=16, device='cuda', transposed=False, max_coord_val=1, ref_frame_side_length=None, sample_cortex='geodesic', bias=False, arch_flag='', in_coords=None, out_coords=None)[source]

Initialize partitioning patch embedding layer.

Parameters:
  • in_channels – Number of input channels

  • embed_dim – Embedding dimension for tokens

  • in_res – Input resolution

  • fov – Field of view parameter for foveated sampling

  • cmf_a – a parameter controlling foveated sampling via the CMF

  • style – Sampling style (‘isotropic’, etc.)

  • auto_match_cart_resources – Whether to automatically match cartesian resources

  • force_patches_less_than_matched – Whether to force patches to be less than matched

  • in_cart_res – Resolution of input cartesian grid

  • cart_patch_size – Size of cartesian patches

  • device – Device to run on

  • transposed – Whether to transpose output

  • max_coord_val – Maximum coordinate value

  • ref_frame_side_length – Reference frame side length

  • sample_cortex – Cortex sampling method

  • bias – Whether to use bias in linear layer

  • arch_flag – Architecture flag

  • in_coords – Precomputed input coordinates (optional)

  • out_coords – Precomputed output coordinates (optional)

class fovi.arch.knnvit.KNNPartitioningPatchEmbedding(in_channels: int, embed_dim: int, in_res: int, fov: float, cmf_a: float, style: str = 'isotropic', auto_match_cart_resources: bool = True, in_cart_res: int = 224, cart_patch_size=16, device='cuda', force_patches_less_than_matched=True, transposed=False, max_coord_val='auto', sample_cortex='geodesic', **kwargs)[source]

Bases: KNNPatchEmbedding

KNN-based patch embedding that builds on KNNPatchEmbedding to provide an optimal tiling of patches without any visual inspection.

__init__(in_channels: int, embed_dim: int, in_res: int, fov: float, cmf_a: float, style: str = 'isotropic', auto_match_cart_resources: bool = True, in_cart_res: int = 224, cart_patch_size=16, device='cuda', force_patches_less_than_matched=True, transposed=False, max_coord_val='auto', sample_cortex='geodesic', **kwargs)[source]

Initialize KNN partitioning patch embedding layer.

Parameters:
  • in_channels – Number of input channels

  • embed_dim – Embedding dimension for tokens

  • in_res – Input resolution

  • fov – Field of view parameter for foveated sampling

  • cmf_a – a parameter controlling foveated sampling via the CMF

  • style – Sampling style (‘isotropic’, etc.)

  • auto_match_cart_resources – Whether to automatically match cartesian resources

  • in_cart_res – Resolution of input cartesian grid

  • cart_patch_size – Size of cartesian patches

  • device – Device to run on

  • force_patches_less_than_matched – Whether to force the number of patches to be less than that of a matched cartesian model

  • transposed – Whether to transpose output

  • max_coord_val – Maximum coordinate value

  • sample_cortex – Cortex sampling method

  • **kwargs – Additional arguments passed to parent class

class fovi.arch.knnvit.KNNViT(fov: float, cmf_a: float, style: str, img_size: int = 224, patch_size: int = 16, patch_overlap_factor: float = 1, in_channels: int = 3, embed_dim: int = 768, num_heads: int = 12, num_layers: int = 12, mlp_ratio: float = 4.0, dropout: float = 0.0, num_outputs: int = 1000, device: str = 'cuda', arch_flag: str = '', sample_cortex: str = 'geodesic', pos_emb_type: str = 'absolute', force_patches_less_than_matched=True, attn_backend: str = 'flash', aggregation='cls_token', ref_frame_side_length=None)[source]

Bases: VisionTransformer

Vision Transformer that uses KNN-based tokenization instead of patch embedding.

This model inherits from VisionTransformer and only overrides the patch embedding to use KNN-based tokenization that creates tokens based on spatial relationships in the foveated coordinate system.

__init__(fov: float, cmf_a: float, style: str, img_size: int = 224, patch_size: int = 16, patch_overlap_factor: float = 1, in_channels: int = 3, embed_dim: int = 768, num_heads: int = 12, num_layers: int = 12, mlp_ratio: float = 4.0, dropout: float = 0.0, num_outputs: int = 1000, device: str = 'cuda', arch_flag: str = '', sample_cortex: str = 'geodesic', pos_emb_type: str = 'absolute', force_patches_less_than_matched=True, attn_backend: str = 'flash', aggregation='cls_token', ref_frame_side_length=None)[source]

Initialize KNNViT model.

Parameters:
  • fov – Field of view parameter for foveated sampling

  • cmf_a – a parameter controlling foveated sampling via the CMF; smaller = stronger foveation

  • style – Sampling style (‘isotropic’, etc.)

  • img_size – Size of input image

  • patch_size – Size of each patch

  • patch_overlap_factor – Factor for patch overlap

  • in_channels – Number of input channels

  • embed_dim – Embedding dimension

  • num_heads – Number of attention heads

  • num_layers – Number of transformer layers

  • mlp_ratio – Ratio of MLP hidden dim to embed dim

  • dropout – Dropout rate

  • num_outputs – Number of output classes

  • device – Device to run on

  • arch_flag – Architecture flag

  • sample_cortex – Cortex sampling method (‘geodesic’, etc.)

  • pos_emb_type – Type of positional embedding (‘absolute’ or ‘rope’)

  • force_patches_less_than_matched – Whether to force the number of patches to be less than that of a matched cartesian model, rather than only matching it as closely as possible

  • attn_backend – Attention backend (‘flash’ for Flash Attention 2, ‘standard’ for standard implementation)

  • aggregation – Method for aggregating tokens into the model output (‘cls_token’ by default)

  • ref_frame_side_length – side length of reference frame for KNN-convolution in the patch embedding (None defaults to patch_size)

forward(x: Tensor) Tensor[source]

Forward pass through KNNViT.

Parameters:

x – Input features [batch_size, in_channels, in_coords]

Returns:

Model output [batch_size, 1, embed_dim]

Return type:

torch.Tensor

class fovi.arch.knnvit.FoviDinoV3RoPE(base: int, head_dim: int, coords: Tensor, device: str = 'cuda')[source]

Bases: Module

inv_freq: Tensor
__init__(base: int, head_dim: int, coords: Tensor, device: str = 'cuda')[source]

Initialize DinoV3RoPE positional encoding.

Parameters:
  • base – Base frequency for RoPE

  • head_dim – Dimension of attention head

  • coords – Coordinate tensor

  • device – Device to run on

forward(pixel_values: Tensor) tuple[Tensor, Tensor][source]

Forward pass for RoPE positional encoding.

Parameters:

pixel_values – Input pixel values

Returns:

Tuple of cosine and sine values for RoPE
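The shape of these tables can be sketched in plain Python. This is a hedged sketch of coordinate-based RoPE in general, not the fovi or DinoV3 code: inverse frequencies decay geometrically with the (paired) dimension index, and each coordinate is multiplied by each inverse frequency to produce rotation angles whose cosines and sines are returned (`rope_tables` and its `base` default are illustrative):

```python
import math

# Illustrative RoPE tables for scalar coordinates: for head_dim d, there are
# d // 2 rotation frequencies; angle[n][i] = coords[n] * base**(-2i / d).

def rope_tables(coords, head_dim, base=100):
    """coords: list of scalar positions.
    Returns (cos, sin), each len(coords) x (head_dim // 2)."""
    inv_freq = [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]
    cos = [[math.cos(c * f) for f in inv_freq] for c in coords]
    sin = [[math.sin(c * f) for f in inv_freq] for c in coords]
    return cos, sin

cos, sin = rope_tables([0.0, 1.0, 2.0], head_dim=8)
```

In the foveated setting, the coordinates come from the (non-uniform) sampling grid rather than integer pixel positions, which is why the layer takes a `coords` tensor at construction time.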

fovi.arch.knnvit.resample_patch_embed_conv(conv: Conv2d, target_hw=(8, 8), mode: str = 'bicubic', align_corners: bool = True, preserve_kernel_norm: bool = False) Conv2d[source]

Resample a patch-embedding Conv2d’s kernels to target size.

Resamples a patch-embedding Conv2d’s kernels from (kH, kW) -> target_hw and returns a NEW Conv2d with kernel_size=stride=target_hw. Supports both upsampling and downsampling.

Parameters:
  • conv (nn.Conv2d) – The patch embedding convolution to resample.

  • target_hw (tuple, optional) – Target height and width. Defaults to (8, 8).

  • mode (str, optional) – Interpolation mode. Defaults to “bicubic”.

  • align_corners (bool, optional) – Whether to align corners. Defaults to True.

  • preserve_kernel_norm (bool, optional) – Whether to preserve kernel norm. Defaults to False.

Returns:

A new Conv2d layer with resampled kernels.

Return type:

nn.Conv2d

Note

Assumes stride == kernel_size (patch embedding), padding == 0, groups == 1, dilation == 1.
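The core of the resampling step is 2D interpolation of each kernel. The following toy sketch resamples a single 2D kernel with bilinear interpolation under the align_corners=True convention the docstring mentions; the real function operates on every (out_channel, in_channel) kernel of the Conv2d at once and supports bicubic mode (`resize_kernel` is a hypothetical helper, and it assumes the input kernel is at least 2x2):

```python
# Toy bilinear kernel resampling with align_corners=True: the corner samples
# of the input and output grids coincide, and interior samples interpolate
# between the four surrounding input values. Assumes kernel is >= 2x2.

def resize_kernel(kernel, out_h, out_w):
    in_h, in_w = len(kernel), len(kernel[0])
    sy = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    sx = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    out = []
    for oy in range(out_h):
        y = oy * sy
        y0 = min(int(y), in_h - 2)   # top row of the 2x2 source window
        wy = y - y0
        row = []
        for ox in range(out_w):
            x = ox * sx
            x0 = min(int(x), in_w - 2)
            wx = x - x0
            row.append(kernel[y0][x0] * (1 - wy) * (1 - wx)
                       + kernel[y0][x0 + 1] * (1 - wy) * wx
                       + kernel[y0 + 1][x0] * wy * (1 - wx)
                       + kernel[y0 + 1][x0 + 1] * wy * wx)
        out.append(row)
    return out

small = resize_kernel([[0.0, 1.0], [2.0, 3.0]], 3, 3)  # 2x2 -> 3x3, corners preserved
```

With preserve_kernel_norm enabled, the resampled kernels would additionally be rescaled so each filter keeps its original norm, compensating for the change in the number of taps.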