Data Mining: Projects List
October 1st, 2025
1 Problem 1: Relative Positional Encoding for ViTs with STRING
Consider the class of relative positional encoding (RPE) mechanisms from [Schenck et al., 2025]. The goal of this project is to compare the regular Cayley-STRING variant from that paper (with the skew-symmetric matrix S trained from scratch) against several customized versions of it. Build the following customized versions: (1) a reflection variant, where the encoding rotation matrices P are constructed via reflections acting in disjoint subspaces (rather than the Givens rotations acting in disjoint subspaces used in the regular parameterization of the rotation matrices in the RoPE mechanism); (2) a sparse-S variant with a sparse matrix S, where the sparsity is obtained by sampling (once) a fraction f of the entries to be nonzero and learning those entries. Try different values of the hyperparameter f. Use the following datasets to conduct the comparison: (a) MNIST, (b) CIFAR-10. Provide the code of your implementation and also compare the training and inference times of the different variants. For the sparse-S variants, apply solvers for sparse linear systems to make the computation efficient. For each variant, the comparison should be conducted by building a small Vision Transformer (ViT) leveraging that particular variant.
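The sparse-S variant can be prototyped along the following lines. This is only a minimal sketch, assuming that the learned sparse skew-symmetric S enters through the Cayley transform (I - S)^{-1}(I + S) applied to query/key vectors, as in the Cayley-STRING construction of [Schenck et al., 2025]; the function names (make_sparse_skew, cayley_apply), the toy shapes, and the fixed random values standing in for learned parameters are illustrative only:

# Sketch: sparse skew-symmetric S and its Cayley transform applied via a sparse LU solve.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def make_sparse_skew(d, f, rng):
    """Sample (once) a fraction f of the strictly upper-triangular positions;
    those entries are the learnable parameters, mirrored so that S stays skew-symmetric."""
    rows, cols = np.triu_indices(d, k=1)
    mask = rng.random(rows.size) < f
    r, c = rows[mask], cols[mask]
    vals = rng.standard_normal(r.size) * 0.1          # stand-in for learned weights
    S = sp.coo_matrix((np.concatenate([vals, -vals]),
                       (np.concatenate([r, c]), np.concatenate([c, r]))),
                      shape=(d, d)).tocsc()
    return S

def cayley_apply(S, X):
    """Return (I - S)^{-1} (I + S) X via a sparse LU solve, never forming the dense inverse."""
    d = S.shape[0]
    I = sp.identity(d, format="csc")
    rhs = (I + S) @ X                                  # dense (d, n) right-hand side
    return spla.splu((I - S).tocsc()).solve(rhs)

rng = np.random.default_rng(0)
d, f = 64, 0.05
S = make_sparse_skew(d, f, rng)
X = rng.standard_normal((d, 8))                        # 8 query vectors of dimension d
Y = cayley_apply(S, X)
# The Cayley transform of a skew-symmetric matrix is orthogonal, so column norms are preserved.
print(np.allclose(np.linalg.norm(Y, axis=0), np.linalg.norm(X, axis=0)))

In a full (e.g. PyTorch) implementation the same sparse solve would additionally have to be made differentiable with respect to the sampled entries of S.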
2 Problem 2: Beyond STRING: achieving rotation invariance
Note that the class of STRING relative positional encoding mechanisms (RPEs), defined in [Schenck et al., 2025] and containing RoPE [Su et al., 2024] as a very special instantiation, is still not rotation invariant. The goal of this project is to propose an efficient RPE that is rotation invariant. Design a mechanism where the modulated entries of the attention matrix A are of the form
A_{i,j} = q_i^T k_j · p(||c_i − c_j||),    (1)
where p : R → R is an arbitrary polynomial and c_i, c_j ∈ R^d are the coordinate vectors corresponding to the ith and jth tokens. To achieve particularly good computational efficiency, you can assume that the equality above is satisfied only on average (e.g. E[A_{i,j}] is equal to the RHS); thus the proposed mechanism can be randomized. For the randomized mechanism, present the corresponding concentration results. As in the case of STRING, the mechanism should work via parallel post-processing of queries and keys.
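As a starting point, the rotation-invariance requirement can be checked numerically. The snippet below is only a sanity-check sketch, assuming a candidate modulation that depends on the coordinates solely through ||c_i − c_j|| (as in Eq. (1)); the polynomial p and the function modulated_attention are illustrative placeholders, not a proposed solution:

# Sketch: a modulation depending only on ||c_i - c_j|| is unchanged under a common rotation of coordinates.
import numpy as np

def p(x):                                   # an arbitrary fixed polynomial R -> R
    return 1.0 + 0.5 * x - 0.1 * x**2

def modulated_attention(Q, K, C):
    """A[i, j] = <q_i, k_j> * p(||c_i - c_j||)."""
    diffs = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=-1)
    return (Q @ K.T) * p(diffs)

rng = np.random.default_rng(0)
n, d_model, d_coord = 16, 32, 3
Q = rng.standard_normal((n, d_model))
K = rng.standard_normal((n, d_model))
C = rng.standard_normal((n, d_coord))        # token coordinate vectors

R, _ = np.linalg.qr(rng.standard_normal((d_coord, d_coord)))   # random orthogonal map (rotation/reflection)
A, A_rot = modulated_attention(Q, K, C), modulated_attention(Q, K, C @ R.T)
print(np.allclose(A, A_rot))                 # True: the modulation is rotation invariant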
3 Problem 3: Improving efficient Vision Transformers with RPEs
The goal of this project is to improve the downstream accuracy of ViTs leveraging Performers by combining them with various RPE methods. Consider a small ViT model with a Performer backbone for attention computation (consider two Performer variants: (a) one leveraging positive random features for the unbiased approximation of the softmax kernel, (b) Performer-ReLU). Enrich your Performer-ViT with the following RPE mechanisms: (1) the most general RPE [Luo et al., 2021], (2) circulant-STRING from [Schenck et al., 2025], (3) the regular RoPE mechanism from [Su et al., 2024]. For all three RPE variants, provide efficient implementations of the corresponding enriched Performers. Compare the Performer variants with a regular brute-force-attention ViT. Can you close the accuracy gap between the regular ViT and the Performer variants by leveraging RPEs? Use the following datasets to conduct the comparison: (a) MNIST, (b) CIFAR-10. Provide the code of your implementation and also compare the training and inference times of the different variants.
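For reference, both feature maps and the associated linear-time attention can be prototyped in a few lines of NumPy. The sketch below follows the standard Performer-style formulation of positive random features (unbiased for the softmax kernel exp(q·k)) and of ReLU features; the function names, toy dimensions, and the final brute-force comparison are illustrative, not the implementation required for the project:

# Sketch: positive and ReLU random-feature maps, plus kernelized (linear-time) attention.
import numpy as np

def positive_features(X, W):
    """phi(x) = exp(w^T x - ||x||^2 / 2) / sqrt(m); then E[phi(q) . phi(k)] = exp(q . k)."""
    m = W.shape[0]
    return np.exp(X @ W.T - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(m)

def relu_features(X, W):
    """Performer-ReLU features: phi(x) = ReLU(W x) / sqrt(m)."""
    return np.maximum(X @ W.T, 0.0) / np.sqrt(W.shape[0])

def linear_attention(Q, K, V, feature_map, W):
    """Row-normalized attention computed via the kernel trick, without the n x n matrix."""
    Qp, Kp = feature_map(Q, W), feature_map(K, W)
    numer = Qp @ (Kp.T @ V)                      # (n, d_v)
    denom = Qp @ Kp.sum(axis=0)                  # row-normalization constants
    return numer / denom[:, None]

rng = np.random.default_rng(0)
n, d, m = 128, 16, 256
Q, K, V = (rng.standard_normal((n, d)) / d**0.25 for _ in range(3))
W = rng.standard_normal((m, d))                  # Gaussian projections

approx = linear_attention(Q, K, V, positive_features, W)
A_exact = np.exp(Q @ K.T)                        # brute-force softmax-kernel attention
exact = (A_exact / A_exact.sum(axis=1, keepdims=True)) @ V
print(np.abs(approx - exact).max())              # approximation error; decreases as m grows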
4 Problem 4: Approximating softmax kernel with pseudo-Gaussian projections
Consider a variation of the positive random feature map mechanism, as well as of the mechanism leveraging trigonometric functions (both discussed in class), where the Gaussian projections are replaced with Rademacher vectors, i.e. vectors with entries taken independently from the two-element set {−1, +1}. Compute the mean squared errors (MSEs) of the corresponding estimators of the softmax kernel. Can you characterize the pairs of inputs to the softmax kernel for which those MSEs are maximized/minimized? What can you say about the asymptotic properties of both MSEs as the number of applied projections goes to infinity? Propose a modification of the above mechanism where different Rademacher vectors are exactly orthogonal (i.e. the entries are no longer sampled independently across different Rademacher vectors, but are still independent within a given Rademacher vector). Test empirically whether your proposed modification further reduces the MSE. For the modified variant, you should of course assume that the number of Rademacher vectors is upper-bounded by the dimensionality of the queries/keys.
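A small Monte Carlo harness of the following kind can serve as the empirical part of the comparison. It is a sketch only: the trigonometric and positive random-feature estimators of the softmax kernel exp(q·k) are written in their standard form, the Gaussian projections are swapped for i.i.d. Rademacher projections, and the MSE is measured against the exact kernel value; all names and toy dimensions are illustrative, and the exactly-orthogonal modification is left to the project:

# Sketch: empirical MSE of softmax-kernel estimators with Gaussian vs. Rademacher projections.
import numpy as np

def trig_estimate(q, k, W):
    """exp(q . k) ~= exp((|q|^2 + |k|^2) / 2) * mean_i cos(w_i . (q - k))."""
    scale = np.exp(0.5 * (q @ q + k @ k))
    return scale * np.mean(np.cos(W @ (q - k)))

def positive_estimate(q, k, W):
    """exp(q . k) ~= exp(-(|q|^2 + |k|^2) / 2) * mean_i exp(w_i . (q + k))."""
    scale = np.exp(-0.5 * (q @ q + k @ k))
    return scale * np.mean(np.exp(W @ (q + k)))

rng = np.random.default_rng(0)
d, m, trials = 16, 64, 2000
q, k = rng.standard_normal(d) / d**0.25, rng.standard_normal(d) / d**0.25
exact = np.exp(q @ k)

mse = {"gauss-trig": 0.0, "rademacher-trig": 0.0, "gauss-pos": 0.0, "rademacher-pos": 0.0}
for _ in range(trials):
    W_gauss = rng.standard_normal((m, d))
    W_rade = rng.choice([-1.0, 1.0], size=(m, d))      # i.i.d. +/-1 entries
    mse["gauss-trig"] += (trig_estimate(q, k, W_gauss) - exact) ** 2
    mse["rademacher-trig"] += (trig_estimate(q, k, W_rade) - exact) ** 2
    mse["gauss-pos"] += (positive_estimate(q, k, W_gauss) - exact) ** 2
    mse["rademacher-pos"] += (positive_estimate(q, k, W_rade) - exact) ** 2

for name, err in mse.items():
    print(f"{name}: {err / trials:.6f}")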
References
[Luo et al., 2021] Luo, S., Li, S., Cai, T., He, D., Peng, D., Zheng, S., Ke, G., Wang, L., and Liu, T. (2021). Stable, fast and accurate: Kernelized attention with relative positional encoding. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W., editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 22795-22807.
[Schenck et al., 2025] Schenck, C., Reid, I., Jacob, M. G., Bewley, A., Ainslie, J., Rendleman, D., Jain, D., Sharma, M., Dubey, A., Wahid, A., Singh, S., Wagner, R., Ding, T., Fu, C., Byravan, A., Varley, J., Gritsenko, A. A., Minderer, M., Kalashnikov, D., Tompson, J., Sindhwani, V., and Choromanski, K. (2025). Learning the RoPEs: Better 2D and 3D position encodings with STRING. ICML 2025. arXiv:2502.02562.
[Su et al., 2024] Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. (2024). Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063.