Binary Classifier Decision Boundary

Why Read This Data Science / ML Blog?

Filling in some gaps for both DS/ML experts and beginners!

It seems like everyone these days has a DS/ML blog on Medium, or is posting on the topic to LinkedIn. So who am I and why should you read my posts? First a little background.

After decades of hardcore software engineering and engineering leadership at Silicon Valley startups, I found Data Science / Machine Learning and fell in love. I joined H2O.ai in February 2014 to run the engineering team and contribute to the architecture and codebase, and I am still learning new things about DS/ML almost every day. I have never been more excited about my work days than I am now. And I’ve found that my Electrical Engineering education, including Numerical Methods, Statistics, Calculus, and Linear Algebra, was excellent preparation!

At the three DS/ML automation companies I’ve been a part of, I’ve worked on projects for two different kinds of target personas:

⦿ experienced, professional Data Scientists who spend their working days focused on ML projects, and

⦿ data workers who don’t have much Data Science background, but are eager to make use of DS/ML in their work.

ML Automation for Business Analysts, Domain Experts, and Software Engineers

My last role was to guide the engineering team in building a new ML automation product aimed at data analysts and domain experts. These users are familiar with their data, but don’t have years to dedicate to becoming great Data Scientists. I believe it’s possible to create ML automation software that makes statistically correct Machine Learning available to software engineers and to data workers who want to add ML to their toolkits.

This isn’t an easy task, because DS/ML projects are full of potholes and landmines. The software should help as much as possible to detect these, to suggest possible courses of action, and to guide the user in ways that actual humans can understand. We made good progress and the product is on the right track, but no company has yet achieved this goal. We need to bridge the gap with clear, easy-to-understand educational materials.

I spent a good deal of time in that position explaining these difficult topics to the engineering, product, and go-to-market teams; developing courseware; writing almost all of the in-product text; and speaking at global conferences. I learned a lot about how to write and talk about these topics in a way that’s clear to actual humans and mathematically correct.

ML and ML Automation for Data Scientists

Professional Data Scientists have various degrees of education in the field, along the spectrum from self-taught practitioners who jumped into scikit-learn on their own, to PhDs from top universities. In my experience, most DSs have a deep understanding of the mathematics and statistics behind their work, but have often missed some useful techniques that are not widely known outside of Kaggle discussions or the academic literature.

The Gap

There’s a huge gap in the educational material available for these two categories of learners. The blogosphere (Medium, Stack Overflow, etc.) is full of redundant newbie articles whose superficial information helps neither of these groups to succeed in their ML projects. These posts almost never cover the underlying mathematical and statistical intuitions that are necessary, or the pitfalls to avoid, in order to have successful Data Science projects.

The deeper material, available in books like The Elements of Statistical Learning and in academic papers, or deep within discussions and solution kernels on Kaggle, is very technically dense and, more importantly, usually covers information that is aimed at implementors of the algorithms rather than the users.

My blog is intended to bridge that gap:

⦿ For data workers I’ll cover the practical statistical intuitions and concepts that are necessary for successful use of ML algorithms.

⦿ And for practicing Data Scientists I’ll cover topics that are often missed in their “book-learning” education.

Here are a few topics I have queued up:

⦿ Using Data Science / Machine Learning to Help Understand Your Data

⦿ The Best Ways to Automatically Find Relationships In Your Data

⦿ The Kinds of Relationship “Shapes” Various Correlation Measures Detect

⦿ Model and Prediction Understanding and “Multicollinearity”

⦿ Correlation Clusters: Thinning Out Multicollinearity Properly

⦿ How the Subtleties of Feature Selection Can Hijack Your ML Project

⦿ Missing Values: How to Deal with Them for Correlation Measures

⦿ Missing Values and Imputation: Best Practices

⦿ Building a 2500+ Dataset ML Collaborative Experimentation System

⦿ Multitable Feature Engineering, MLOps, and Performance

⦿ Eliminating Thousands of Lines of Boilerplate Code Through Reflection

⦿ Hyperparameter Search: What Are Local Minima Anyway?

If this sounds interesting to you, please follow my blog, comment, and share with your contacts. Thanks!

Raymond E Peck III

Doing software for ML since 2014: H2O.ai, Quantifind, dotData, and Alteryx. Lots of different kinds of startups since co-founding Pure Software (1995 IPO).