Abstract

Parametric methods aim to explain data with a finite number of learnable parameters. These models are typically applied in settings where the number of data points exceeds the number of parameters. Nonparametric methods, on the other hand, model data using infinite-dimensional function spaces and/or allow the number of parameters to grow beyond the number of data points (the overparameterized regime). Many classical methods in data science fit into this latter framework, including kernel methods and wavelet methods. Modern methods based on overparameterized neural networks also fit into this framework. The common theme is that these methods minimize an objective over a function space. This tutorial will provide a tour of nonparametric methods in data science through the lens of function spaces, starting with classical methods such as kernel methods (reproducing kernel Hilbert spaces) and wavelet methods (bounded variation spaces, Besov spaces) and ending with modern, high-dimensional methods such as overparameterized neural networks (variation spaces, Barron spaces). Remarkably, all of these methods can be viewed through the lens of abstract representer theorems (beyond Hilbert spaces). Particular emphasis will be placed on the difference between ℓ2-regularization (kernel methods) and sparsity-promoting ℓ1-regularization (wavelet methods, neural networks) through the concept of adaptivity. For each method and function space, topics such as generalization bounds, metric entropy, and minimax rates will be covered.
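
As a minimal, self-contained illustration of the ℓ2-versus-ℓ1 distinction above, the following NumPy sketch compares kernel ridge regression (an RKHS method with a quadratic penalty) against soft-thresholding of Haar wavelet coefficients (an ℓ1 method in the spirit of wavelet shrinkage) on a toy one-dimensional signal. The signal, Gaussian kernel width, regularization weight, and threshold are illustrative choices made for this sketch and are not taken from the tutorial.

import numpy as np

rng = np.random.default_rng(0)
n = 128
x = np.linspace(0.0, 1.0, n)
# Spatially inhomogeneous target: oscillatory on the left, flat with a jump at x = 0.5.
f_true = np.where(x < 0.5, np.sin(4.0 * np.pi * x), 0.25)
y = f_true + 0.1 * rng.standard_normal(n)

# -- l2 side: kernel ridge regression with a Gaussian kernel. --
# The RKHS representer theorem says the minimizer of
#   sum_i (y_i - f(x_i))^2 + lam * ||f||_H^2
# has the finite form f(x) = sum_i alpha_i k(x, x_i), so fitting reduces to a linear solve.
def gauss_kernel(a, b, width=0.05):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * width ** 2))

lam = 1e-2
K = gauss_kernel(x, x)
alpha = np.linalg.solve(K + lam * np.eye(n), y)
f_ridge = K @ alpha  # shrinks all coefficients; smooths by the same amount everywhere

# -- l1 side: soft-thresholding of orthonormal Haar wavelet coefficients. --
# Minimizing ||y - W^T c||^2 + 2 t ||c||_1 over coefficients c of an orthonormal transform
# decouples coordinatewise into soft-thresholding, the proximal step of the l1 penalty.
def haar_matrix(m):
    """Orthonormal Haar transform matrix of size m x m (m must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < m:
        top = np.kron(H, [1.0, 1.0])
        bottom = np.kron(np.eye(H.shape[0]), [1.0, -1.0])
        H = np.vstack([top, bottom]) / np.sqrt(2.0)
    return H

W = haar_matrix(n)                    # rows are orthonormal Haar functions
c = W @ y                             # wavelet coefficients of the noisy samples
t = 0.1 * np.sqrt(2.0 * np.log(n))    # universal threshold, sigma * sqrt(2 log n)
c_sparse = np.sign(c) * np.maximum(np.abs(c) - t, 0.0)   # soft-threshold
f_wave = W.T @ c_sparse               # sparse, spatially adaptive reconstruction

print("kernel ridge   MSE:", np.mean((f_ridge - f_true) ** 2))
print("wavelet shrink MSE:", np.mean((f_wave - f_true) ** 2))
print("nonzero wavelet coefficients:", np.count_nonzero(c_sparse), "of", n)

The comparison is meant qualitatively: the quadratic penalty shrinks every kernel coefficient and smooths the estimate by the same amount everywhere, whereas the ℓ1 penalty zeroes out most wavelet coefficients while retaining the large ones near the discontinuity, which is the spatial adaptivity the abstract emphasizes.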

Related Links