On Thursday, January 19th, we're hosting a talk in Chicago by Daniel Whitenack, Lead Developer Advocate at Pachyderm. He'll discuss Distributed Analysis of the 2016 Chess Championship, drawing on his recent analysis of the games.
In short, the analysis involved a multi-language data pipeline that attempted to answer two questions:
- For each game in the Championship, what were the crucial moments that turned the tide for one player or the other?
- Did the players noticeably fatigue throughout the Championship, as evidenced by blunders?
After running all of the games of the championship through the pipeline, he concluded that one of the players had a better classical game performance and the other player had the better rapid game performance. The championship was eventually decided in rapid games, and thus the player having that particular advantage came out on top.
You can read more details about the analysis here, and, if you're in the Chicago area, be sure to attend his talk, where he'll present an expanded version of the analysis.
We had the chance for a brief Q&A session with Daniel recently. Read on to learn about his transition from academia to data science, his focus on effectively communicating data science results, and his ongoing work with Pachyderm.
Was the transition from academia to data science natural for you?
Not immediately. When I was doing research in academia, the only stories I heard about theoretical physicists going into industry were about algorithmic trading. There was something like an urban myth amongst the grad students that you could make a fortune in finance, but I didn’t really hear anything about “data science.”
What challenges did the transition present?
Given my lack of exposure to relevant opportunities in industry, I basically just tried to find anyone who would hire me. I ended up doing some work for an IP firm for a while. This is where I started working with “data scientists” and learning about what they were doing. However, I still didn’t fully make the connection that my background was extremely relevant to the field.
The jargon was a little weird for me, and I was used to thinking about electrons, not users. Eventually, I started to pick up on the hints. For example, I figured out that these fancy “regressions” that they were referring to were just ordinary least squares fits (or similar), which I had done a million times. In other cases, I found out that the probability distributions and statistics I used to describe atoms and molecules were being used in industry to detect fraud or run tests on users. Once I made these connections, I started actively pursuing a data science position and homing in on the relevant roles.
- What advantages did you have based on your background? I had the foundational mathematics and statistics knowledge to quickly pick up on the different types of analysis being used in data science, often with hands-on experience from my computational research activities.
- What disadvantages did you have based on your background? I don’t have a CS degree, and, prior to working in industry, most of my programming experience was in Fortran or Matlab. In fact, even git and unit tests were totally foreign concepts to me and hadn’t been used in any of my academic research groups. I definitely had a lot of catching up to do on the software engineering side.
What are you most excited by in your current role?
I’m a true believer in Pachyderm, and that makes every day exciting. I’m not exaggerating when I say that Pachyderm has the potential to fundamentally change the data science landscape. In my opinion, data science without data versioning and provenance is like software engineering before git. Further, I believe that making distributed data analysis language-agnostic and portable (which is one of the things Pachyderm does) will bring harmony between data scientists and engineers while, at the same time, giving data scientists autonomy and flexibility. Plus, Pachyderm is open source. Basically, I’m living the dream of getting paid to work on an open source project that I’m truly passionate about. What could be better!?
How important would you say it is to be able to speak and write about data science work?
Something I learned very quickly during my first attempts at “data science” was: analyses that don’t result in intelligent decision making aren’t valuable in a business context. If the results you are producing don’t motivate people to make well-informed decisions, your results are just numbers. Motivating people to make well-informed decisions has almost everything to do with how you present data, results, and analyses and almost nothing to do with the actual results, confusion matrices, efficiency, etc. Even automated processes, like some fraud detection process, have to get buy-in from people to get put into place (hopefully). Thus, well-communicated and well-visualized data science workflows are essential. That’s not to say that you should abandon all efforts to produce good results, but maybe that day you spent getting 0.001% better accuracy could have been better spent improving your presentation.
- If you were giving advice to someone new to data science, how important would you tell them this sort of communication is? I would tell them to focus on communication, visualization, and reliability of their results as a key part of any project. This should not be forsaken. For those new to data science, learning these components should take priority over learning flashy new things like deep learning.
Follow Daniel on Twitter at @dwhitena, and RSVP to attend his talk here!