fovi.arch.knnvit
- class fovi.arch.knnvit.KNNPatchEmbedding(in_channels: int, embed_dim: int, in_res: int, fov: float, cmf_a: float, style: str = 'isotropic', auto_match_cart_resources: bool = True, in_cart_res: int = 224, cart_patch_size=16, patch_overlap_factor=1, device='cuda', force_patches_less_than_matched=True, new_parameterization=False, transposed=False, max_coord_val=1, sample_cortex='geodesic', ref_frame_side_length=None, **kwargs)[source]
Bases: KNNConvLayer
KNN-based patch embedding layer that replaces standard patch embedding in Vision Transformers.
Instead of dividing the image into uniform non-overlapping patches, this layer divides a foveated manifold into nearly non-overlapping KNNs to create patches.
It then performs a standard KNNConv operation for the patch embedding.
We typically prefer KNNPartitioningPatchEmbedding, which builds on this class and provides an optimal tiling of patches without requiring visual inspection.
- __init__(in_channels: int, embed_dim: int, in_res: int, fov: float, cmf_a: float, style: str = 'isotropic', auto_match_cart_resources: bool = True, in_cart_res: int = 224, cart_patch_size=16, patch_overlap_factor=1, device='cuda', force_patches_less_than_matched=True, new_parameterization=False, transposed=False, max_coord_val=1, sample_cortex='geodesic', ref_frame_side_length=None, **kwargs)[source]
Initialize KNN tokenization layer.
- Parameters:
in_channels – Number of input channels
embed_dim – Embedding dimension for tokens
in_res – Input resolution
fov – Field of view parameter for foveated sampling
cmf_a – Cortical magnification function (CMF) parameter controlling foveated sampling; smaller values give stronger foveation
style – Sampling style (‘isotropic’, etc.)
auto_match_cart_resources – Whether to automatically match cartesian resources
in_cart_res – Resolution of input cartesian grid
cart_patch_size – Size of cartesian patches
patch_overlap_factor – Factor for patch overlap
device – Device to run on
force_patches_less_than_matched – Whether to force the number of patches to be less than that of a matched cartesian model, rather than just matching as closely as possible
new_parameterization – Whether to use new parameterization
transposed – Whether to transpose output
max_coord_val – Maximum coordinate value
sample_cortex – Cortex sampling method
ref_frame_side_length – Side length of the reference frame for the KNN convolution (None defaults to the patch size)
**kwargs – Additional arguments passed to parent class
- forward(x)[source]
Apply convolution using k-nearest neighbors.
- Parameters:
x (torch.Tensor) – Node features from layer l, shape [batch, d_l, N_l]
- Returns:
Node features from layer l+1 [batch, d_l+1, N_l+1]
- Return type:
torch.Tensor
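The tokenization step can be illustrated with a minimal PyTorch sketch; this is not fovi's implementation, and `knn_patch_embed`, the toy coordinates, and shapes are all illustrative. Each patch gathers the k nearest input nodes to a patch center and projects the flattened neighborhood to an embedding:

```python
import torch

def knn_patch_embed(x, coords, centers, k, proj):
    # x: [batch, C, N] node features; coords: [N, 2]; centers: [M, 2]
    # Find the k nearest input nodes to each patch center.
    d = torch.cdist(centers, coords)             # [M, N]
    idx = d.topk(k, largest=False).indices       # [M, k]
    # Gather neighbor features: [batch, C, M, k]
    patches = x[:, :, idx]
    # Flatten each neighborhood and apply a shared linear projection.
    batch, m = x.shape[0], centers.shape[0]
    return proj(patches.permute(0, 2, 1, 3).reshape(batch, m, -1))  # [batch, M, embed_dim]

# Toy layout: 256 irregularly placed nodes, 16 patch centers, k=16 neighbors.
coords = torch.rand(256, 2)
centers = torch.rand(16, 2)
proj = torch.nn.Linear(3 * 16, 32)
x = torch.randn(2, 3, 256)
tokens = knn_patch_embed(x, coords, centers, k=16, proj=proj)
print(tokens.shape)  # torch.Size([2, 16, 32])
```

Because neighborhoods are found by distance rather than by a regular grid, adjacent patches may share nodes, hence the "nearly non-overlapping" phrasing above.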
- class fovi.arch.knnvit.PartitioningPatchEmbedding(in_channels: int, embed_dim: int, in_res: int, fov: float, cmf_a: float, style: str = 'isotropic', auto_match_cart_resources: bool = True, force_patches_less_than_matched: bool = True, in_cart_res: int = 224, cart_patch_size=16, device='cuda', transposed=False, max_coord_val=1, ref_frame_side_length=None, sample_cortex='geodesic', bias=False, arch_flag='', in_coords=None, out_coords=None)[source]
Bases: KNNPatchEmbedding
Partitioning patch embedding layer that replaces standard patch embedding in Vision Transformers.
This layer divides a foveated manifold into non-overlapping neighborhoods to create patches.
It turns these neighborhoods into KNNs with padding and then performs a standard KNNConv operation for the patch embedding.
- __init__(in_channels: int, embed_dim: int, in_res: int, fov: float, cmf_a: float, style: str = 'isotropic', auto_match_cart_resources: bool = True, force_patches_less_than_matched: bool = True, in_cart_res: int = 224, cart_patch_size=16, device='cuda', transposed=False, max_coord_val=1, ref_frame_side_length=None, sample_cortex='geodesic', bias=False, arch_flag='', in_coords=None, out_coords=None)[source]
Initialize partitioning patch embedding layer.
- Parameters:
in_channels – Number of input channels
embed_dim – Embedding dimension for tokens
in_res – Input resolution
fov – Field of view parameter for foveated sampling
cmf_a – Cortical magnification function (CMF) parameter controlling foveated sampling; smaller values give stronger foveation
style – Sampling style (‘isotropic’, etc.)
auto_match_cart_resources – Whether to automatically match cartesian resources
force_patches_less_than_matched – Whether to force the number of patches to be less than that of a matched cartesian model, rather than just matching as closely as possible
in_cart_res – Resolution of input cartesian grid
cart_patch_size – Size of cartesian patches
device – Device to run on
transposed – Whether to transpose output
max_coord_val – Maximum coordinate value
ref_frame_side_length – Side length of the reference frame for the KNN convolution (None defaults to the patch size)
sample_cortex – Cortex sampling method
bias – Whether to use bias in linear layer
arch_flag – Architecture flag
in_coords – Optional precomputed input coordinates
out_coords – Optional precomputed output coordinates
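The partition-then-pad step can be sketched in plain PyTorch; `partition_with_padding` is illustrative, not fovi's code, and it assumes every patch center owns at least one node. Each node is assigned to its single nearest center (a true partition, so patches do not overlap), and smaller groups are padded so every patch has the same number of indices, enabling a dense KNNConv:

```python
import torch

def partition_with_padding(coords, centers):
    # Assign each node to its nearest patch center (non-overlapping partition).
    owner = torch.cdist(coords, centers).argmin(dim=1)       # [N]
    groups = [(owner == m).nonzero(as_tuple=True)[0] for m in range(len(centers))]
    k = max(len(g) for g in groups)
    # Pad smaller groups by repeating their last member so every
    # patch has exactly k indices (assumes no group is empty).
    return torch.stack([
        torch.cat([g, g[-1:].repeat(k - len(g))]) for g in groups
    ])                                                       # [M, k]

# Toy example: a 4x4 grid of nodes partitioned by 4 quadrant centers.
xs = torch.linspace(0, 1, 4)
coords = torch.stack(torch.meshgrid(xs, xs, indexing="ij"), dim=-1).reshape(-1, 2)
centers = torch.tensor([[0.25, 0.25], [0.25, 0.75], [0.75, 0.25], [0.75, 0.75]])
idx = partition_with_padding(coords, centers)
print(idx.shape)  # torch.Size([4, 4])
```

In this symmetric toy case every quadrant owns exactly four nodes, so no padding occurs and the 16 node indices are split evenly across the 4 patches.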
- class fovi.arch.knnvit.KNNPartitioningPatchEmbedding(in_channels: int, embed_dim: int, in_res: int, fov: float, cmf_a: float, style: str = 'isotropic', auto_match_cart_resources: bool = True, in_cart_res: int = 224, cart_patch_size=16, device='cuda', force_patches_less_than_matched=True, transposed=False, max_coord_val='auto', sample_cortex='geodesic', **kwargs)[source]
Bases: KNNPatchEmbedding
- __init__(in_channels: int, embed_dim: int, in_res: int, fov: float, cmf_a: float, style: str = 'isotropic', auto_match_cart_resources: bool = True, in_cart_res: int = 224, cart_patch_size=16, device='cuda', force_patches_less_than_matched=True, transposed=False, max_coord_val='auto', sample_cortex='geodesic', **kwargs)[source]
Initialize KNN partitioning patch embedding layer.
- Parameters:
in_channels – Number of input channels
embed_dim – Embedding dimension for tokens
in_res – Input resolution
fov – Field of view parameter for foveated sampling
cmf_a – Cortical magnification function (CMF) parameter controlling foveated sampling; smaller values give stronger foveation
style – Sampling style (‘isotropic’, etc.)
auto_match_cart_resources – Whether to automatically match cartesian resources
in_cart_res – Resolution of input cartesian grid
cart_patch_size – Size of cartesian patches
device – Device to run on
force_patches_less_than_matched – Whether to force the number of patches to be less than that of a matched cartesian model, rather than just matching as closely as possible
transposed – Whether to transpose output
max_coord_val – Maximum coordinate value
sample_cortex – Cortex sampling method
**kwargs – Additional arguments passed to parent class
- class fovi.arch.knnvit.KNNViT(fov: float, cmf_a: float, style: str, img_size: int = 224, patch_size: int = 16, patch_overlap_factor: float = 1, in_channels: int = 3, embed_dim: int = 768, num_heads: int = 12, num_layers: int = 12, mlp_ratio: float = 4.0, dropout: float = 0.0, num_outputs: int = 1000, device: str = 'cuda', arch_flag: str = '', sample_cortex: str = 'geodesic', pos_emb_type: str = 'absolute', force_patches_less_than_matched=True, attn_backend: str = 'flash', aggregation='cls_token', ref_frame_side_length=None)[source]
Bases: VisionTransformer
Vision Transformer that uses KNN-based tokenization instead of patch embedding.
This model inherits from VisionTransformer and only overrides the patch embedding to use KNN-based tokenization that creates tokens based on spatial relationships in the foveated coordinate system.
- __init__(fov: float, cmf_a: float, style: str, img_size: int = 224, patch_size: int = 16, patch_overlap_factor: float = 1, in_channels: int = 3, embed_dim: int = 768, num_heads: int = 12, num_layers: int = 12, mlp_ratio: float = 4.0, dropout: float = 0.0, num_outputs: int = 1000, device: str = 'cuda', arch_flag: str = '', sample_cortex: str = 'geodesic', pos_emb_type: str = 'absolute', force_patches_less_than_matched=True, attn_backend: str = 'flash', aggregation='cls_token', ref_frame_side_length=None)[source]
Initialize KNNViT model.
- Parameters:
fov – Field of view parameter for foveated sampling
cmf_a – Cortical magnification function (CMF) parameter controlling foveated sampling; smaller values give stronger foveation
style – Sampling style (‘isotropic’, etc.)
img_size – Size of input image
patch_size – Size of each patch
patch_overlap_factor – Factor for patch overlap
in_channels – Number of input channels
embed_dim – Embedding dimension
num_heads – Number of attention heads
num_layers – Number of transformer layers
mlp_ratio – Ratio of MLP hidden dim to embed dim
dropout – Dropout rate
num_outputs – Number of output classes
device – Device to run on
arch_flag – Architecture flag
sample_cortex – Cortex sampling method ('geodesic', etc.)
pos_emb_type – Type of positional embedding (‘absolute’ or ‘rope’)
force_patches_less_than_matched – Whether to force the number of patches to be less than that of a matched cartesian model, rather than just matching as closely as possible
attn_backend – Attention backend (‘flash’ for Flash Attention 2, ‘standard’ for standard implementation)
aggregation – Token aggregation method ('cls_token', etc.)
ref_frame_side_length – Side length of the reference frame for the KNN convolution in the patch embedding (None defaults to patch_size)
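The attn_backend option selects between a fused attention kernel and an explicit implementation; the two agree numerically. A plain-PyTorch sketch of that equivalence (illustrative, not the model's code):

```python
import torch
import torch.nn.functional as F

def standard_attention(q, k, v):
    # Explicit 'standard' path: softmax(Q K^T / sqrt(d)) V
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1) * scale).softmax(dim=-1)
    return attn @ v

# [batch, heads, tokens, head_dim]
q, k, v = (torch.randn(2, 12, 50, 64) for _ in range(3))
fused = F.scaled_dot_product_attention(q, k, v)  # dispatches to a fused/flash kernel where available
ref = standard_attention(q, k, v)
print(fused.shape)  # torch.Size([2, 12, 50, 64])
```

The fused path is faster and more memory-efficient on supported hardware, while the explicit path is useful when attention weights must be inspected.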
- class fovi.arch.knnvit.FoviDinoV3RoPE(base: int, head_dim: int, coords: Tensor, device: str = 'cuda')[source]
Bases: Module
- fovi.arch.knnvit.resample_patch_embed_conv(conv: Conv2d, target_hw=(8, 8), mode: str = 'bicubic', align_corners: bool = True, preserve_kernel_norm: bool = False) Conv2d[source]
Resample a patch-embedding Conv2d's kernels to a target size.
Resamples the kernels from (kH, kW) to target_hw and returns a NEW Conv2d with kernel_size=stride=target_hw. Supports both upsampling and downsampling.
- Parameters:
conv (nn.Conv2d) – The patch embedding convolution to resample.
target_hw (tuple, optional) – Target height and width. Defaults to (8, 8).
mode (str, optional) – Interpolation mode. Defaults to “bicubic”.
align_corners (bool, optional) – Whether to align corners. Defaults to True.
preserve_kernel_norm (bool, optional) – Whether to preserve kernel norm. Defaults to False.
- Returns:
A new Conv2d layer with resampled kernels.
- Return type:
nn.Conv2d
Note
Assumes stride == kernel_size (patch embedding), padding == 0, groups == 1, dilation == 1.
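The resampling can be sketched with plain PyTorch; `resample_conv_sketch` is illustrative rather than the library function, and it omits the preserve_kernel_norm option:

```python
import torch
import torch.nn.functional as F

def resample_conv_sketch(conv, target_hw=(8, 8), mode="bicubic"):
    # Interpolate kernels from (kH, kW) to target_hw and build a new
    # patch-embedding conv with kernel_size == stride == target_hw.
    w = F.interpolate(conv.weight, size=target_hw, mode=mode, align_corners=True)
    new = torch.nn.Conv2d(conv.in_channels, conv.out_channels,
                          kernel_size=target_hw, stride=target_hw,
                          bias=conv.bias is not None)
    with torch.no_grad():
        new.weight.copy_(w)
        if conv.bias is not None:
            new.bias.copy_(conv.bias)
    return new

# Shrink a standard ViT patch embedding from 16x16 to 8x8 patches.
conv = torch.nn.Conv2d(3, 768, kernel_size=16, stride=16)
new = resample_conv_sketch(conv)
print(new.weight.shape)  # torch.Size([768, 3, 8, 8])
```

After resampling, a 224x224 input yields a 28x28 grid of patch tokens instead of 14x14, since the stride halves along each dimension.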