Getting started with machine learning


With the world’s biggest collection of open source data, GitHub’s Data Science Team has just started exploring how we can use machine learning to make the developer experience better. I see machine learning shaping experiences around me every day, and I’m excited about what’s to come in applying it to create more useful, predictive technologies.

In this collection, I'll share the basics of machine learning, along with some related resources and projects for people who are getting started with it.

What is machine learning?

Machine learning is the study of algorithms that use data to learn, generalize, and predict. What makes machine learning exciting is that its predictions improve as it sees more data. For example, I remember when my family started using voice to search instead of typing. At first, it took a while for the machine to recognize our words, but within a week of working with it, the algorithm’s speech recognition improved enough that voice is now my family’s primary mode of search.

At its core, machine learning isn’t a new concept. The term was coined in 1959 by Arthur Samuel, a computer scientist at IBM, and it’s been widely used in software since the 1980s.

As people move from the physical to digital realm, we can learn from the trail of data they’ve left behind.

Dating myself here, I remember building neural networks in the early 2000s as part of my academic training. While it was informative to learn and build these algorithms, they lacked a real commercial application. What was missing was access to vast amounts of data. As people move from the physical to the digital realm, they leave digital footprints that we can learn from. With about three billion people on the planet with access to the internet, these footprints make for a staggering amount of data.


These data stores are what we refer to when we use the phrase “big data”. With the emergence of big data, machine learning algorithms were finally able to transition from academia into industry, powering products that deliver real value to consumers. However, collecting and gaining access to that data is only one piece of the puzzle in building machine learning products like search engines and recommender systems. Until recently, software programmers, data scientists, and statisticians lacked the tools to harness, clean, and package these massive datasets so that they could be used by other applications.

Now, with tools like Amazon Web Services and Hadoop, we have better, more cost-effective ways to manage information. Access to these tools opens a new realm of possibilities for gaining value from big data sets.

In recent years, machine learning has expanded to include new applications and endeavors of all kinds. We’ve trained algorithms to do everything from pattern recognition to mastering games to “dreaming”.

Even with all of the exciting developments in machine learning today, we’re only at the beginning of what’s possible.


How does machine learning work?

To understand what goes into machine learning, it’s helpful to break down the process into three components: inputs, algorithms, and outputs.

Inputs: the data that powers machine learning

Inputs are the data sets you need to train an algorithm. From source code to statistics, data sets can contain just about anything.

Because we need these inputs to train machine learning algorithms, finding and producing high-quality data sets is one of the biggest challenges in machine learning today.
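Whatever the source, inputs generally end up as a numeric feature matrix plus, for supervised learning, a vector of labels. Here’s a minimal sketch of that preparation step in Python; the repository records and fields are entirely made up for illustration:

```python
# Turn raw records into machine-learning inputs: a numeric feature
# matrix X and a label vector y. All records and fields are hypothetical.
raw = [
    {"stars": 120, "forks": 30,   "language": "Python", "active": True},
    {"stars": 5,   "forks": None, "language": "Go",     "active": False},
    {"stars": 800, "forks": 150,  "language": "Python", "active": True},
]

# Categorical values become one-hot columns; sort for a stable column order.
languages = sorted({r["language"] for r in raw})

def to_features(r):
    forks = r["forks"] if r["forks"] is not None else 0  # fill missing values
    one_hot = [1 if r["language"] == lang else 0 for lang in languages]
    return [r["stars"], forks] + one_hot

X = [to_features(r) for r in raw]
y = [int(r["active"]) for r in raw]

print(X[1], y)  # [5, 0, 1, 0] [1, 0, 1]
```

Real pipelines do the same basic things — fill missing values, encode categories, collect labels — just at much larger scale.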

Algorithms: how data is processed and analyzed

Algorithms are what turn data into insights.

A machine learning algorithm uses data to perform a specific task. The most common types of algorithms are:

  1. Supervised learning uses training data that has already been labeled and structured. By specifying a set of inputs and desired outputs, a machine learns how to successfully recognize and map one to the other.

For example, in decision tree learning, values are predicted by applying a set of decision rules to the input data.
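As a sketch of that idea, here is a tiny decision tree trained with scikit-learn; the study-habits data and labels are invented for illustration:

```python
# Supervised learning sketch: fit a decision tree on labeled examples,
# then predict labels for new inputs. The data is made up for illustration.
from sklearn.tree import DecisionTreeClassifier

# Features: [hours_studied, hours_slept]; label: 1 = passed, 0 = failed.
X = [[8, 7], [7, 8], [9, 6], [1, 3], [2, 4], [0, 5]]
y = [1, 1, 1, 0, 0, 0]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)  # the tree learns decision rules mapping features to labels

print(model.predict([[6, 7], [1, 2]]))  # predictions for two unseen students
```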

  2. Unsupervised learning uses unstructured data to discover patterns and structure. Whereas supervised learning might use a spreadsheet as its data input, unsupervised learning might be used to make sense of a book or blog.

For example, unsupervised learning is a popular approach in natural language processing (NLP).
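One common NLP use is grouping documents by topic without any labels. Here is a minimal sketch with scikit-learn, clustering a few made-up sentences via TF-IDF features and k-means; the texts, cluster count, and random seed are arbitrary choices:

```python
# Unsupervised learning sketch: no labels are given; k-means discovers
# that the sentences fall into two topical groups. The texts are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the match ended with a late goal from the striker",
    "the goalkeeper saved a penalty late in the match",
    "simmer the tomato sauce and season the soup",
    "season the soup and stir the tomato sauce slowly",
]

features = TfidfVectorizer().fit_transform(docs)  # weighted word counts
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

print(labels)  # the two football sentences share one cluster, cooking the other
```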

  3. Reinforcement learning requires the algorithm to achieve a goal. As the algorithm performs tasks towards that goal, it learns the correct approach through rewards and punishments.

For example, reinforcement learning might be used to develop self-driving cars or teach a robot how to manufacture an item.
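Here is a toy sketch of that reward loop: tabular Q-learning on a five-cell corridor, where the agent is rewarded only for reaching the rightmost cell. The environment and hyperparameters are made up for illustration:

```python
# Reinforcement learning sketch: the agent learns, from rewards alone,
# that moving right is the correct approach. Everything here is a toy.
import random

N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                 # step left or right
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
random.seed(0)

def greedy(s):
    # Highest Q-value wins; ties are broken randomly.
    return max(ACTIONS, key=lambda a: (Q[(s, a)], random.random()))

for episode in range(200):
    s = 0
    while s != GOAL:
        # Epsilon-greedy: mostly exploit what we know, occasionally explore.
        a = random.choice(ACTIONS) if random.random() < epsilon else greedy(s)
        s_next = min(max(s + a, 0), N_STATES - 1)
        reward = 1.0 if s_next == GOAL else 0.0
        # Q-update: reward now plus discounted best value of the next state.
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s_next

policy = [greedy(s) for s in range(GOAL)]
print(policy)  # the learned policy moves right from every cell: [1, 1, 1, 1]
```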

Some of the libraries and tools you’ll find to perform these analyses include:

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic …

What is deep learning?

Deep learning is a subset of machine learning that uses neural networks to find connections in data. Deep learning may use supervised, unsupervised, or reinforcement learning to achieve its goal.
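To make the idea concrete, here is a tiny neural network written with plain NumPy that learns the XOR function; the architecture, seed, and learning rate are arbitrary choices for illustration:

```python
# Deep learning sketch: a two-layer neural network trained by
# backpropagation to compute XOR. All hyperparameters are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)   # hidden layer: 8 units
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # output layer: 1 unit

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # Forward pass: inputs -> hidden activations -> output probability.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # Backward pass: gradients of the cross-entropy loss.
    dp = p - y
    dW2, db2 = h.T @ dp, dp.sum(axis=0)
    dh = (dp @ W2.T) * (1 - h ** 2)
    dW1, db1 = X.T @ dh, dh.sum(axis=0)
    # Gradient-descent update.
    lr = 0.1
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

pred = (sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2) > 0.5).astype(int).ravel()
print(pred)  # the network has learned XOR
```

Frameworks like Theano and TensorFlow automate exactly this forward/backward machinery, at much larger scale and on GPUs.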

In this great visualization, you can actually play with neural networks right in your browser. Go ahead and give it a try.

While neural networks have existed for decades, deep learning has only become practical since the mid-2000s thanks to graphics processing unit (GPU) innovation. GPUs were originally developed to render pixels in 3D game environments, but they’ve since found a new purpose in training neural network algorithms.

Outputs: the results of machine learning

Outputs are the final results of your hard work. They might be a pattern recognizer that detects when a sign is red, a sentiment analysis that classifies the tone of a webpage as positive or negative, or a predictive score with a confidence interval.

In machine learning, outputs can be just about anything. A few approaches to finding outputs include:

  • Classification: assign a category label to each item in a data set
  • Regression: given the data, predict the most likely value for the variable under consideration
  • Clustering: group the data into sets of similar patterns
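One way to see the difference between the three is to run each on the same kind of toy data. Here is a sketch using scikit-learn, where every number is made up for illustration:

```python
# Output sketches: classification, regression, and clustering on toy data.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Classification: a label for each item.
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit([[0], [1], [10], [11]], ["low", "low", "high", "high"])
print(clf.predict([[2]]))                    # ['low']

# Regression: the most likely value of a continuous variable.
reg = LinearRegression().fit([[1], [2], [3]], [2.0, 4.0, 6.0])
print(reg.predict([[4]]))                    # approximately 8.0

# Clustering: group similar items together, no labels needed.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit([[0], [1], [10], [11]])
print(km.labels_)                            # two groups: {0, 1} and {10, 11}
```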

Here are a few real-life examples of what people do with machine learning:

  • DeepMind used reinforcement learning to play StarCraft II
  • Computational biologists use deep learning to understand DNA
  • A French-to-English translator built with TensorFlow

Ready to get started?

Dive into machine learning resources curated by people on GitHub, or add your own resources to these lists.

Machine learning:

Deep learning: