This introduction to data science will cover tools and techniques for acquiring, cleaning, and using real-world data for research purposes. In contrast to traditional coursework, where one is typically handed a prepackaged dataset obtained by a third party and prepared for a specific exercise, research projects usually involve not only cleaning and preparing “messy” data, but also acquiring that data oneself (e.g., through an API). The initial phase of these projects involves a good deal of exploratory analysis to gain a preliminary understanding of the dataset. Students will be introduced to scripting (on the command line and with Python and R) for these purposes, and will gain direct experience in acquiring and modeling data from online sources.
The course also serves as an introduction to problems in applied statistics and machine learning. We will cover the theory behind simple but effective methods for supervised and unsupervised learning. Emphasis will be on formulating real-world modeling and prediction tasks as optimization problems and comparing methods in terms of practical efficacy and scalability. Students will learn to fit and evaluate such models, with applications including spam filtering and recommendation systems.
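As a rough illustration of what it means to formulate a prediction task as an optimization problem, the sketch below treats spam filtering as minimizing the average log loss of a logistic regression model with gradient descent. It is not taken from the course materials: the toy dataset, the two word-count features, and the names `DATA`, `predict_proba`, `log_loss`, and `fit` are all assumptions made for this example.

```python
import math

# Made-up toy data (an assumption for illustration only).
# Features: [count of "free", count of "meeting"]; label 1 = spam, 0 = not spam.
DATA = [
    ([3.0, 0.0], 1),
    ([2.0, 0.0], 1),
    ([1.0, 0.0], 1),
    ([0.0, 2.0], 0),
    ([0.0, 1.0], 0),
    ([0.0, 3.0], 0),
]


def predict_proba(weights, bias, x):
    """Estimated probability that a message with features x is spam."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, bias, data):
    """Objective to minimize: average negative log-likelihood of the labels."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in data:
        p = predict_proba(weights, bias, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def fit(data, learning_rate=0.5, steps=200):
    """Minimize log_loss with plain batch gradient descent."""
    n_features = len(data[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in data:
            # Gradient of the log loss w.r.t. the linear score is (p - y).
            error = predict_proba(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - learning_rate * g / len(data)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(data)
    return weights, bias


weights, bias = fit(DATA)
```

After fitting, `predict_proba(weights, bias, [2.0, 0.0])` should be above 0.5 and `predict_proba(weights, bias, [0.0, 2.0])` below it. In practice one would evaluate on held-out messages rather than the training data, and would likely use an established library instead of a hand-written loop.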
Important update for 2020: Due to COVID-19, this year’s summer school will be held virtually and shortened to four weeks (down from eight).
The Data Science Summer School (DS3) is an intensive, four-week hands-on introduction to data science for college students in the New York City area. As we are committed to increasing diversity in computer science, we strongly encourage women, minorities, and individuals with disabilities to apply.
Each student receives a $5,000 stipend for participating in the program, as well as a laptop.
DS3 includes both coursework in data science and group research projects. The summer school is taught by leading scientists at Microsoft Research, and is held at the new Microsoft Research office in the heart of New York City.