for Data Science and Machine Learning

This course serves as an introduction to the basic statistical principles that data scientists and applied statisticians use most often. Many of the concepts will be reinforced using the statistical programming language R, one of the two most popular languages for data science.

The intent of this course is to expose students to common statistical issues and teach them how to avoid statistical fallacies. We begin with a high-level overview of probability and common statistical estimates, then proceed to more advanced topics such as multiple hypothesis testing, independence, sample size and power calculations, and bootstrapping.

By the end of the course, students will have a fundamental understanding of many of the statistical principles that underlie machine learning and data science.

This course is open to beginners, but students should have some experience with coding (Python or R preferred but not required) and a basic understanding of calculus, linear algebra and probability. A brief review will be provided, but prior experience would be very helpful.

Students may opt to skip the pre-work if they:

- Have taken an introductory course to statistics or probability in college
- Are familiar with Linear Algebra (either coursework or work experience)
- Are able to:
  - Run a hypothesis test to determine whether a coin is fair given 100 flips
  - Calculate a confidence interval for the mean height given 100 observations
  - Explain how to test whether events are independent
  - Use Bayes' rule to find the probability of an event given another event
  - Fit a linear model in R
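
As a rough self-check against the checklist above, the sketch below works through three of the items: the fair-coin test, the confidence interval for a mean, and Bayes' rule. It is written in Python using only the standard library, although the course itself uses R. The function names and the example numbers are illustrative assumptions, not course material. The confidence interval uses a normal approximation, which is a simplification that is reasonable at around 100 observations.

```python
from math import comb, sqrt
from statistics import NormalDist, mean, stdev


def coin_fairness_p_value(heads, flips, p=0.5):
    """Exact two-sided binomial p-value for H0: P(heads) = p.

    Sums the probability of every outcome that is no more likely
    than the observed number of heads.
    """
    pmf = [comb(flips, k) * p**k * (1 - p) ** (flips - k) for k in range(flips + 1)]
    observed = pmf[heads]
    # Small tolerance so outcomes tied with the observed one are included.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


def mean_confidence_interval(values, level=0.95):
    """Normal-approximation confidence interval for the mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    center = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return center - half_width, center + half_width


def bayes_posterior(prior, p_evidence_given_event, p_evidence_given_not_event):
    """P(event | evidence) from Bayes' rule."""
    evidence = (
        p_evidence_given_event * prior
        + p_evidence_given_not_event * (1 - prior)
    )
    return p_evidence_given_event * prior / evidence


if __name__ == "__main__":
    # Hypothetical data: 60 heads in 100 flips.
    print("p-value:", coin_fairness_p_value(60, 100))

    # Hypothetical heights in cm for 100 people.
    heights = [170.0] * 50 + [172.0] * 50
    print("95% CI:", mean_confidence_interval(heights))

    # Hypothetical test: 1% prevalence, 90% sensitivity, 5% false positives.
    print("posterior:", bayes_posterior(0.01, 0.90, 0.05))
```

Students who can read this code and explain why each step is there, for example why the p-value sums over both tails, are probably comfortable skipping the pre-work.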

Otherwise, students should familiarize themselves with Chapters 1-6 of CK-12 Foundation’s Basic Probability and Statistics – A Short Course. Each chapter should take between 1 and 2 hours.

Upon completion of the course, students will have:

An understanding of basic statistical hypothesis testing and confidence intervals.

The ability to model data using well-known statistical distributions and to handle both continuous and categorical data.

The ability to perform linear regression and adjust for multiple hypothesis testing.

An understanding of how to calculate the number of samples needed to achieve required sensitivity and specificity.

An understanding of bootstrapping and Monte Carlo simulation.
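
To give a flavor of the bootstrapping outcome, here is a minimal sketch of a percentile bootstrap confidence interval. It is written in Python with the standard library only, whereas the course works in R. The function name `bootstrap_ci`, its parameters, and the sample data are assumptions made for illustration, and the percentile method is just one of several bootstrap interval constructions.

```python
import random
from statistics import mean


def bootstrap_ci(values, stat=mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic.

    Resamples the data with replacement, recomputes the statistic on
    each resample, and reads the interval off the sorted estimates.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    estimates = sorted(
        stat(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lower_index], estimates[upper_index]


if __name__ == "__main__":
    # Hypothetical sample: the integers 1 through 100.
    sample = list(range(1, 101))
    print("bootstrap 95% CI for the mean:", bootstrap_ci(sample))
```

Because the procedure relies on simulated resampling rather than a closed-form formula, the same function works for statistics such as the median by passing a different `stat` argument.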


Paul Trowbridge received advanced training in statistics, demography and sociology from the University of Washington and Rutgers University. He has worked in applied fields such as fMRI, epidemiology and public health, international relations, urban planning and micro-simulation modeling. He has taught statistics, data science and data visualization through New York University's School of Professional Studies.