# Data Science Article
It is hard to define data science, but easy to notice its pervasive influence on scientific, economic, and social life in 2019. UC Berkeley is one of the nerve centers of this field, in both teaching and research, and our department plays a crucial role on both fronts.
The mathematical foundations of data science are linear algebra, calculus, and probability. With the Division of Data Science's DS Major and Minor among the most popular on campus (XXX percent of Cal freshmen pre-declared the major this fall), the most immediate effect on Evans Hall has been vastly increased enrollments in Calculus 1A/B and Math 54. For instance, it is now standard for over 4000 students to take Math 54 every year, up from YYY ten years ago. Beyond the sheer size of the course, the curriculum is also being revamped to better serve these students: for example, the Singular Value Decomposition, once a fringe topic, is now taught in detail, often with demonstrations of its applications to sound and image compression.
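For readers curious what such a demonstration looks like, here is a minimal NumPy sketch of rank-k image compression via the SVD. The synthetic image, its size, and the choice k = 20 are purely illustrative assumptions, not taken from the course materials.

```python
# Minimal sketch: rank-k "image" compression via the SVD.
# The image here is synthetic so the snippet is self-contained.
import numpy as np

# A 256x256 grayscale image with some low-rank structure plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 256)
image = np.outer(np.sin(4 * np.pi * x), np.cos(2 * np.pi * x))
image += 0.05 * rng.standard_normal((256, 256))

# Full SVD: image = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(image, full_matrices=False)

k = 20  # keep only the top-k singular values and vectors
compressed = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Storage drops from 256*256 numbers to roughly k*(256 + 256 + 1),
# while the relative reconstruction error stays small.
rel_error = np.linalg.norm(image - compressed) / np.linalg.norm(image)
print(f"rank-{k} approximation, relative error {rel_error:.3f}")
```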
Within the applied math major, we have offered since ZZZ a "Data Science Cluster", with choices drawn from several upper division CS and Statistics courses. This was by far the most popular choice among our majors last year, accounting for ZZZ percent of them. At the graduation ceremony this spring, a full XXX fraction of our undergraduates went on to first jobs in data science.
In research, mathematics offers a perspective on data science different from those of Statistics and CS. One concrete example is the research of Prof. Lin Lin, which uses deep neural networks to efficiently find solutions of nonlinear partial differential equations arising in quantum chemistry. It has been understood for some time how to represent the "solution map" efficiently in the linear case, using products of certain "hierarchical" matrices that decompose the relevant space by recursively bisecting it into smaller and smaller grids; how to do this in the nonlinear case, however, was a mystery. Lin and his colleagues generalized the approach to the nonlinear setting with a neural network architecture that is essentially a product of the same kinds of matrices, but with nonlinearities interleaved between the factors. The architecture is trained using known solutions of the PDE as examples. The surprise is that these nets seem to learn the highly structured solution maps without overfitting!
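As a rough illustration of the idea, and emphatically not Lin's actual architecture or data, the following PyTorch sketch alternates linear layers with pointwise nonlinearities and trains on example input/solution pairs. The dense layers, the random stand-in "solution map", and all sizes are assumptions chosen only to make the snippet self-contained; in the real hierarchical setting the linear factors would be structured matrices tied to a recursively bisected grid.

```python
# Illustrative sketch only: a toy network that alternates linear layers with
# pointwise nonlinearities, mimicking "a product of matrices with
# nonlinearities interleaved", trained on example (input, solution) pairs.
import torch
import torch.nn as nn

class InterleavedNet(nn.Module):
    def __init__(self, n, depth=4):
        super().__init__()
        # Plain dense layers stand in for the structured (hierarchical)
        # matrices of the real construction.
        self.layers = nn.ModuleList([nn.Linear(n, n) for _ in range(depth)])
        self.act = nn.Tanh()

    def forward(self, v):
        for layer in self.layers[:-1]:
            v = self.act(layer(v))   # linear map followed by a nonlinearity
        return self.layers[-1](v)    # final linear map, no nonlinearity

# Hypothetical training data: inputs (e.g., discretized problem data) and the
# corresponding solutions, as if precomputed by a conventional solver.
n, n_samples = 64, 512
inputs = torch.randn(n_samples, n)
solutions = torch.tanh(inputs @ (0.1 * torch.randn(n, n)))  # stand-in map

model = InterleavedNet(n)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(model(inputs), solutions)
    loss.backward()
    opt.step()
```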
Another example is the research of Prof. James Sethian. Deep learning is famously good at classifying images given a huge number of examples, which are cheaply available on the internet. Sethian and his group study the problem of designing classifiers for very high resolution images of cells. The issue is that images of cells are far more expensive to obtain than images of cats, so the number of examples is tiny (perhaps ten). To train neural nets in this "small data" setting, they severely restrict the structure of the nets to have very few parameters, in a way informed by multiscale analysis. The result is better classifiers for distinguishing healthy from diseased cells in various contexts, which are being used, for instance, in a Chan Zuckerberg initiative at UCSF aimed at understanding parasite invasion.
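The following PyTorch sketch gives a flavor of the "few parameters, multiple scales" idea; it is not the group's published architecture, and the layer sizes, dilation factors, and random stand-in images are assumptions made purely for illustration.

```python
# Illustrative sketch only: a tiny classifier whose convolutional filters act
# at several scales (via dilation), keeping the parameter count very small so
# it can plausibly be trained on a handful of labeled cell images.
import torch
import torch.nn as nn

class TinyMultiscaleClassifier(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        # One small filter bank per scale; dilation widens the receptive
        # field without adding parameters.
        self.scales = nn.ModuleList([
            nn.Conv2d(1, 4, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 4, 8)
        ])
        self.act = nn.ReLU()
        self.head = nn.Linear(4 * len(self.scales), n_classes)

    def forward(self, x):
        # Concatenate features from all scales, then global-average-pool.
        feats = torch.cat([self.act(conv(x)) for conv in self.scales], dim=1)
        pooled = feats.mean(dim=(2, 3))
        return self.head(pooled)

model = TinyMultiscaleClassifier()
n_params = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {n_params}")  # a few hundred, versus millions
                                            # for a standard deep classifier

# Hypothetical "small data" setting: ten labeled high-resolution images.
images = torch.randn(10, 1, 256, 256)
labels = torch.randint(0, 2, (10,))
loss = nn.functional.cross_entropy(model(images), labels)
```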