
The missing guide to AzureML, Part 1: Setting up your AzureML workspace


by Robert Gibboni

Welcome to our series on using Azure Machine Learning. We've been using these tools in our machine learning projects, and we're excited to share everything we've learned with you!

A few months ago, I found myself needing to ramp up a machine learning project using the Azure Machine Learning platform. So I set off on a journey that's familiar to most data scientists who work in a world of ever-changing tools and standards — I googled "<new thing> tutorial", took a deep breath, and dove in. Luckily, there is an amazing wealth of AzureML learning resources out there. Unluckily, the quality (coverage, discoverability, correctness, etc.) is... spotty.

Now, after dozens of blue links turned purple, I wanted to sit down and write the guide that I wish had appeared on the first page of that initial search. A guide for data scientists new to AzureML (and likely new to the Azure ecosystem, period). A resource somewhere between a goal-oriented how-to guide (we want to get something done after all) and an explanation that grows our understanding of the tool. One that synthesized official documentation and linked to that documentation, so I could go deeper on my own.

This is that guide! In it, we will work through setting up your AzureML workspace and interacting with it from Python to compose and run a machine learning workflow on AzureML cloud resources. My recommendation is to read through once to get a general idea of the concepts and tools, then read through a second time, following any of the links for concepts you'd like to understand better. If, by the end of that, you have a better understanding of the wide world of Azure and feel like you know where to go to answer the next question that comes up, my mission is complete.

AzureML learning resources

One of the more useful things this post can do is direct newcomers to the best existing resources on AzureML, so I wanted to do it early on. Here are the four I found most helpful ― they're not perfect, but they're probably still the best places to start with any questions that come up:

  • Azure Machine Learning documentation: The entry point for most of the official AzureML documentation. I found the "How-to guides" section to be especially useful when first starting out.
  • Azure Machine Learning service example notebooks GitHub repo: A great next step once you've learned all you can from this guide. Many of the examples here were adapted from these notebooks, and there is lots more too.
  • AzureML Python API Browser: The API reference for the azureml Python package. A good place to start when you already know which object or function you want to use. Sometimes it has your back, other times it just gives you an explanation like, "Parameter (str): The parameter that needs to go here".
  • MLOpsPython GitHub repo: A kitchen-sink learning resource if I ever saw one. Especially useful when you want to start integrating AzureML with the rich set of CI/CD tools in Azure DevOps. The thing you are trying to do is probably in here, but so is every bell and whistle in the entire AzureML/DevOps library, so good luck getting in and out quickly with the info you are looking for.

In addition, I've sprinkled in links to resources throughout the post to make it easier to follow up.

Why AzureML?

A reasonable first question you might have is, should I use AzureML for my machine learning project? AzureML provides structure that many machine learning projects could benefit from. To me, the benefits that AzureML provides can be summed up as follows:

  • Organization: AzureML is built around a pretty darn reasonable conceptualization of a machine learning project. A machine learning project is more than just code and data. There are runs, in which specific code is run on specific data. Runs typically result in a model, and models have metrics, like accuracy, precision, recall. There is a notion of an experiment, a set of conceptually related runs. AzureML incorporates all of these concepts into a single framework, and working within that framework can make it easier to bring a sense of order to your machine learning project.
  • Collaboration: AzureML makes it easy to collaborate. It provides a centralized location for data, models, and code. Collaboration also benefits from the strong organizational principles mentioned above. You can point your colleague to a model and they can track its entire lifecycle ― the code, data, and hardware used to train it.
  • Automation: AzureML plays nicely with the rest of the Azure ecosystem, enabling some neat continuous integration/continuous deployment workflows. You can create pipelines to train and evaluate a model and deploy to a web service whenever you push new code to your repository.

Azure basics

AzureML is just one of the many (and I mean many) services available through Microsoft Azure. I often found myself stuck for hours because I didn't understand some of the Azure basics. Thankfully, a complete AzureML workflow only involves a few of the Azure services and concepts:

  • Subscription: This is the top of the hierarchy, the container for everything else. Not coincidentally, this is the thing that tells Microsoft where to take your money from. Each subscription has a name and ID. The name is something human readable (mine is called "Pay-As-You-Go", the default name for the pay-as-you-go plan). The subscription ID looks like 01234567-890a-bcde-f012-3456789abcde. As you might have guessed, the subscription name is not unique across all Azure subscriptions ― you'll mostly need the subscription ID when trying to connect to your resources.

  • Resource group: It's a group for your resources, a.k.a. services. Probably more profound than that, but for now it's all we need to know.

  • Storage: What is data science without data? If you are familiar with Amazon Web Services, think of Azure Storage as S3, except S3 "buckets" are "containers" and S3 "objects" are "blobs." Or think of it as a folder on your computer. When you set up a Machine Learning service, it will automatically create and link a storage service.
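Since the subscription ID is the value you'll be pasting into config files, it can be worth a quick sanity check before you use it. Azure subscription IDs are formatted as GUIDs, so a minimal sketch using only the standard library could look like the following. The helper name `looks_like_subscription_id` is hypothetical (it is not part of any Azure SDK), and it only checks the format, not whether the subscription exists.

```python
import uuid


def looks_like_subscription_id(value):
    """Return True if value is in canonical 8-4-4-4-12 hex GUID form.

    Hypothetical helper for illustration: a format check only.
    """
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # uuid.UUID also accepts braces and un-hyphenated input, so compare
    # against the canonical string to insist on the hyphenated form.
    return str(parsed) == value.lower()
```

A check like this would accept the example ID above and reject a subscription name such as "Pay-As-You-Go".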


AzureML basics

Now we can turn to our main focus, Azure Machine Learning (a.k.a. AzureML and AML). Let's start by introducing a few of the key AzureML concepts:

  • Workspace: The Workspace is your instance of an AzureML service. It contains everything related to your ML project: datasets, models, compute resources, run environments, and more.
  • Compute cluster: These are the compute resources you have available. Think of them as computers that you can call upon to run your code.
  • Pipeline: A Pipeline is a plan for some process you want to run in AzureML. A typical Pipeline for machine learning involves something like "Load some data, train a model, write some logs, save the model." To be clear, a Pipeline does not do anything ― it is just a plan for what to do. For that reason, there is no notion of the "accuracy of a pipeline" or "pipeline logs." Which brings us to...
  • Run: A Run is what you get when you take a Pipeline and... well... you run it. That is, a Run executes a Pipeline on a specified computational resource. This is the thing that actually "Loads some data, trains a model, writes some logs, saves the model."
  • Experiment: Experiments are a way to organize your Runs. Each Run takes place in one (and only one) Experiment. Organize however you like, but keep in mind that the AzureML studio dashboard only supports comparisons between Runs in the same Experiment, so if you'd like to, say, look at a graph comparing the accuracy of a bunch of Runs, you should run them in the same Experiment.
  • Environment: An Environment fully specifies the runtime for your code. This includes the base Docker image, Python version, conda and pip requirements, and environment variables.
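To make the Pipeline/Run/Experiment distinction concrete, here is a toy model in plain Python. To be loud about it: this is not the azureml SDK. The class names (`ToyPipeline`, `ToyRun`, `ToyExperiment`), the step functions, and the accuracy value are all made up for illustration. The point is only that the plan holds no logs or metrics, the run does, and the experiment is where runs get grouped.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyRun:
    """The record produced by executing a pipeline on some compute."""
    compute: str
    metrics: Dict[str, float] = field(default_factory=dict)
    logs: List[str] = field(default_factory=list)


@dataclass
class ToyPipeline:
    """A plan: an ordered list of steps. It has no logs or metrics."""
    steps: List[Callable[[ToyRun], None]]


@dataclass
class ToyExperiment:
    """A named group of runs, so their metrics can be compared."""
    name: str
    runs: List[ToyRun] = field(default_factory=list)

    def submit(self, pipeline: ToyPipeline, compute: str) -> ToyRun:
        # Executing the plan is what creates a run.
        run = ToyRun(compute=compute)
        for step in pipeline.steps:
            step(run)
        self.runs.append(run)
        return run


def load_data(run: ToyRun) -> None:
    run.logs.append("loaded data")


def train_model(run: ToyRun) -> None:
    run.logs.append("trained model")
    run.metrics["accuracy"] = 0.9  # placeholder value, not a real result


pipeline = ToyPipeline(steps=[load_data, train_model])
experiment = ToyExperiment(name="toy-experiment")
first_run = experiment.submit(pipeline, compute="toy-cluster")
```

Submitting the same `pipeline` again would add a second `ToyRun` to `experiment.runs`, which mirrors why Runs you want to compare belong in the same Experiment.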

Take a minute to familiarize yourself with how these concepts are laid out in the AzureML studio.


Creating your AzureML workspace

Start by creating an Azure account and logging in. Then visit the Azure portal. From here you can access and manage all of your Azure services ― in our case it's just a subscription and AzureML. If your organization already has an Azure subscription set up, you'll probably need to talk to whoever administers the account to get authenticated. We'll wait. Then click the enormous "Create a resource" plus sign, and locate and click on the Machine Learning service within the Azure Marketplace.

Azure Portal

Give your workspace a memorable name like my-azureml-workspace, select your existing subscription, and either select an existing resource group or create a new one. Select a location (typically the location nearest to you), select the "Basic" Workspace edition, then "Review + Create". Review, then "Create!" 🎉

Create your AzureML workspace
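If you'd rather script this step than click through the portal, the azureml SDK exposes `Workspace.create`. The sketch below is a rough equivalent of the portal flow, under a few assumptions: the azureml-sdk package is already installed (installation is covered later in this guide), the wrapper name `create_workspace` is hypothetical, and the location string you pass (for example "eastus") is a region your subscription can use. I have not tried to reproduce the "Basic" edition choice here.

```python
def create_workspace(name, subscription_id, resource_group, location):
    """Create an AzureML workspace, or return it if it already exists.

    Hypothetical wrapper around azureml.core.Workspace.create.
    """
    # Imported lazily: needs `pip install azureml-sdk`, and the call
    # below will prompt for an interactive login the first time.
    from azureml.core import Workspace

    return Workspace.create(
        name=name,
        subscription_id=subscription_id,
        resource_group=resource_group,
        location=location,
        create_resource_group=True,  # make the resource group if needed
        exist_ok=True,               # return the workspace if it exists
    )
```

Calling `create_workspace("my-azureml-workspace", "<your subscription ID>", "my-resource-group", "eastus")` should leave you in the same place as the portal steps above.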

Now back to your Azure portal, which should show your new AzureML workspace (if not you can use the search bar to locate it). Open it up.

AzureML workspace

Here you'll see the basic Azure-style portal for your AzureML workspace. There is a view much like this for every Azure resource, always with tabs like "Overview", "Activity log", "Access control (IAM)", etc. The "Overview" tab is particularly important since it contains the subscription ID and resource group name that you will need in order to connect to your workspace.

This view is good and you'll want to be familiar with it, but you should know that Azure is rolling out a new interface to AzureML called "AzureML studio" that will probably be where most of the improvements to the AzureML web experience end up. So to prevent this guide from becoming instantly obsolete, we'll mostly be using the AzureML studio interface from here on out. Click "Launch now" in the banner to join us there.

AzureML studio


Connecting to your AzureML workspace

There are lots of ways to work with AzureML. In this guide, we'll focus on interaction via Python, using the azureml SDK (a Python package) to connect to your AzureML workspace from your local computer.

You'll need three pieces of information to connect to your workspace: your subscription ID, resource group name, and AzureML workspace name. The basic AzureML portal has a nice "Download config.json" button you can use. AzureML studio has one too, but you'll never find it. That's fine: since you already know what these pieces of information are (or at least where to find them), you can create a config.json yourself.

{
    "subscription_id": "01234567-890a-bcde-f012-3456789abcde",
    "resource_group": "my-resource-group",
    "workspace_name": "my-azureml-workspace"
}
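If you'd rather generate the file than type it by hand, a few lines of standard-library Python will do it. This is a minimal sketch; the function name `write_workspace_config` is hypothetical, and it writes to a `.azureml` folder under whatever project root you pass in.

```python
import json
from pathlib import Path


def write_workspace_config(project_root, subscription_id, resource_group,
                           workspace_name):
    """Write .azureml/config.json under project_root and return its path."""
    config_dir = Path(project_root) / ".azureml"
    config_dir.mkdir(parents=True, exist_ok=True)
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    config_path = config_dir / "config.json"
    config_path.write_text(json.dumps(config, indent=4) + "\n",
                           encoding="utf-8")
    return config_path
```

For example, `write_workspace_config(".", "01234567-890a-bcde-f012-3456789abcde", "my-resource-group", "my-azureml-workspace")` would produce the file shown above in the current directory's `.azureml` folder.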

Put it in a folder named .azureml in the root directory of your project so that azureml can automatically detect it. The config file only points to your AzureML resources ― you still need to authenticate. First, set up your virtual environment and pip install azureml-sdk. In a Python prompt, type:

from azureml.core import Workspace
workspace = Workspace.from_config()  # loads from your .azureml/config.json

The first time you do this, you'll see a message: "Performing interactive authentication. Please follow the instructions on the terminal." A browser window will pop up asking you to log into your Azure account. When you've done that, you'll see a new message: "Interactive authentication successfully completed."

This saves an authentication token in ~/.azureml/auth/accessTokens.json (take a look just for fun) so you won't have to authenticate again until the token expires.
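`Workspace.from_config()` is the convenient route, but it can help to see that nothing magic is happening: the same three values can be passed to the `Workspace` constructor directly. The sketch below assumes your config lives at `.azureml/config.json` under the project root; the function names `load_workspace_config` and `connect` are hypothetical, and the azureml import is done lazily so the config check works even before the SDK is installed.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"subscription_id", "resource_group", "workspace_name"}


def load_workspace_config(project_root="."):
    """Read .azureml/config.json and check it has the three required keys."""
    path = Path(project_root) / ".azureml" / "config.json"
    with path.open(encoding="utf-8") as f:
        config = json.load(f)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"config.json is missing: {sorted(missing)}")
    return config


def connect(project_root="."):
    """Connect to the workspace described by the config file."""
    # Imported lazily: requires `pip install azureml-sdk`.
    from azureml.core import Workspace

    config = load_workspace_config(project_root)
    return Workspace(
        subscription_id=config["subscription_id"],
        resource_group=config["resource_group"],
        workspace_name=config["workspace_name"],
    )
```

After `workspace = connect()`, printing `workspace.name` and `workspace.resource_group` is a quick way to confirm you landed in the workspace you expected.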


So there you have it! You have created an AzureML workspace and are poised to interact with it from Python. Head over to Part 2 where we'll show you how to create machine learning pipelines that you can run in AzureML.