In this post I’m going to revisit one of my older projects using the new Azure ML Workspace. At the beginning of November, I had the chance to present this awesome new interface at the Azure AI Days at Microsoft Copenhagen, and now I’d like to walk you through the available services as well.

Wine tasting

This year I mentored a girl who decided to get into the world of Data Science and Neural Networks. For this reason, we worked on a project together; the aim was to predict the quality of wines based on some measured properties, namely:

  • fixed acidity
  • volatile acidity
  • citric acid
  • residual sugar
  • chlorides
  • free sulfur dioxide
  • total sulfur dioxide
  • density
  • pH
  • sulphates
  • alcohol

Given that we both have incomplete knowledge about wines, we thought it would be a good idea to let machines decide whether a wine is good or not.

We had two datasets to work with. You can download these to your computer as well, so you can try out the practical part.
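If you prefer fetching the files with code, here is a minimal sketch, assuming you take them from the UCI Machine Learning Repository, where this wine quality dataset is hosted:

import urllib.request

base = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/"
for name in ("winequality-red.csv", "winequality-white.csv"):
    urllib.request.urlretrieve(base + name, name)  # saves the file next to your notebook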

The data requires some cleanup and preparation before it can be used for training. With the help of this post, you will get some ideas about how to transform and clean a rather messy dataset, and then about how to build a predictive model with different cloud services provided by Microsoft Azure.

The workspace

You don’t need to fully understand the mathematical and statistical background to get started with machine learning; you can pick up the basics just by working through this post. Microsoft now provides a cloud service that acts as a centralized resource for your artifacts: Notebooks, Designer (formerly Machine Learning Studio or Visual Interface) and Automated ML. You can easily deploy your trained models to Container Instances, Kubernetes, FPGA or even to an Azure IoT Edge device as a module. If you want to learn more about the services of the Azure Machine Learning Workspace, please read through this documentation.

In this post we are going to interact with the workspace through its online services, but you can also use the available SDK for Python or R, depending on your preferred environment.

To get started with the workspace, you need an active Microsoft Azure subscription, which you can create once you have a Windows Live ID. When your subscription is ready, log in to the Azure Portal and create a new Workspace.
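If you go with the Python SDK mentioned above instead of the portal, creating a workspace looks roughly like this (a sketch, assuming the azureml-core package and placeholder names for your resources):

from azureml.core import Workspace

ws = Workspace.create(
    name="wine-workspace",                     # placeholder workspace name
    subscription_id="<your-subscription-id>",
    resource_group="wine-rg",                  # placeholder resource group
    location="westeurope",
    create_resource_group=True,
)
ws.write_config()  # saves config.json, so later you can reconnect with Workspace.from_config()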

When your workspace is ready, open up its overview. The service includes the usual possibilities: alerts, metrics, diagnostics and so on. The interesting part for you is the artifacts you can see in the following image:

When you create your workspace, you can choose between the Basic and the Enterprise edition of the service. Note that some of the artifacts are only available in the Enterprise edition, so if you want to try, for example, the Designer, you need to upgrade to Enterprise. Find more details about pricing and availability at this link.

Click on Launch the new Azure Machine Learning studio to get started!

Notebooks

To find out the details of the dataset, we are going to work with Notebooks. This service is very similar to Azure Notebooks or the Jupyter Notebooks you might have used previously while coding in Python. You can even export your notebook to the original Azure Notebooks, from where you can download your files. So when you start up the new studio, you should see a screen like this:

Click Start now under Notebooks and let’s investigate our data. When it starts up, on the left panel you should see a folder with your subscription name under User files. Click on this folder and create a new folder in it; the buttons for this are at the top of the window.

Name the folder Winetasting, and make sure that it ends up in the place you intended.

Let’s upload the datasets to the Winetasting folder:

Create a new Python file, where we are going to write the code to prepare our data. Make sure you put it in the correct folder.

Now you have to create a VM too; this is where your code will run. You can define the size of this VM, and you can find details of the pricing for these machines at this link.

If you don’t want to pay for this resource after you are done with your training and investigations, remember to remove the instance. You can do this by clicking on the Compute menu in the left sidebar of the studio.

When your VM is up and running, test your notebook with the following code, and verify that it says Hello back to you! 🙂
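The exact snippet does not matter; any cell that produces output proves the kernel is alive. A minimal sketch:

greeting = "Hello"
print(greeting + " back to you!")  # if you see this line printed, the VM and the kernel work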

Before we start writing code, please, verify that your environment looks like this:

Let’s read in the datasets with the help of the Pandas package. When you print out the resulting dataframes, you can see that the red wine dataset includes 1599 rows and 12 columns, and the white wine dataset includes 4898 rows and 12 columns.

import pandas as pd
import numpy as np

# the dataset files use ';' as the field separator
redWine = pd.read_csv("winequality-red.csv", sep=';')
whiteWine = pd.read_csv("winequality-white.csv", sep=';')

redWine  # or whiteWine

Let’s see some details about our columns with the help of the describe function.
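For example:

redWine.describe()  # count, mean, std, min, quartiles and max for every numeric column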

Next to the basic statistical information, we can also assure ourselves that there are no missing values. You might also have noticed that one column is missing from this result: alcohol. That is because this column is not of a numeric type. I’d like to convert this column to numeric, but the conversion throws an error. Apparently, next to the familiar float values, there is some malformed data in this column too:
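A quick way to surface these entries is to look for values containing more than one dot (a sketch; the dot is escaped because pandas treats the pattern as a regular expression):

redWine[redWine['alcohol'].str.count(r'\.').gt(1)]['alcohol'].unique()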

Let’s fix the alcohol column so that it can be converted, keeping in mind that noisy data should not stay in the dataset, as it can hurt the training results. We do this separately for the red and the white wines, because their average alcohol levels are not necessarily the same. If a value contains more than one dot (.), we replace it; the replacement value is the average of this column in the given dataset. Note that str.count interprets its pattern as a regular expression, so the dot has to be escaped, otherwise it would match every character. After this update we can convert the whole column to the float type.

redWine['alcohol'] = np.where(redWine['alcohol'].str.count(r'\.').gt(1), 10.5, redWine['alcohol'])
whiteWine['alcohol'] = np.where(whiteWine['alcohol'].str.count(r'\.').gt(1), 10.38, whiteWine['alcohol'])

redWine.alcohol = redWine.alcohol.astype(dtype=np.float64)
whiteWine.alcohol = whiteWine.alcohol.astype(dtype=np.float64)

Let’s put the red and white wines together to make later processing easier. I add a type column, which tells us whether a wine is red or white.

redWine["type"] = 1
whiteWine["type"] = 0
wineDF = redWine.append(whiteWine, ignore_index=True)

One extra small fix on the dataset is to replace the spaces in the column names with underscores:

wineDF.columns = wineDF.columns.str.replace(' ', '_')

Now our dataset includes 12 numeric feature columns and one label column (quality). The dataset is ready to be used for training, and then we can let the machine decide how high the quality of a wine is, based on its features.
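You can verify this quickly:

wineDF.dtypes  # every feature column should now be a numeric type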

Before starting the training, let’s make a plot to see some relationships in the dataset. For this purpose, I’m going to use the Seaborn package. We change the quality column’s type to object so that the plot treats quality as a categorical value, and we pick a few variables to see whether they affect the wine’s quality.

import matplotlib.pyplot as plt
import seaborn as sns

wineDF["quality"] = wineDF["quality"].astype(object)
i = sns.pairplot(wineDF, vars=["density", "pH", "alcohol"], hue="quality")
plt.show()

When the code finishes running, you will get a result like this:

Now that we have some idea of our dataset and it is cleaned up, let’s write some code and do the training in the Notebook. Save the cleaned and processed dataset into a new csv file, because we are going to use it in other services later.

wineDF.to_csv("wines.csv",index=False,sep=',')

To build a basic neural network that is able to decide which wine we should drink, we need to import some packages: Keras and Scikit-learn. These packages include all the necessary functions we will use to build a deep learning model.

from keras import layers, optimizers, regularizers
from keras.layers import Dense, Dropout, BatchNormalization, Activation
from keras.models import Sequential
from keras.utils import plot_model

import keras.backend as K

from sklearn import preprocessing, model_selection 

Before building the model, we should process our data a bit further. One-hot encoding lets us transform the label: the result is a separate column for each quality level, holding 0 or 1. This way, during training each sample has a vector of labels instead of a single value.

wineDF["quality"] =wineDF["quality"].astype(int)
wineDF = pd.get_dummies(wineDF, columns=["quality"])
wineDF.head(5)

We just write out the first 5 rows, and the new quality labels look like this:

Now we have to tell the algorithm which columns are features and which ones are labels. Additionally, we apply normalization to the dataset, which rescales the values of each column to a comparable range. Finally, we have to split the dataset into a training and a validation (or test) part. You can choose a different ratio; in this code I specified that 25% of the dataset should be used for testing.

X = wineDF.iloc[:, 0:12].values  # features: the 11 measurements plus the type column
Y = wineDF.iloc[:, 12:].values   # labels: the one-hot quality columns start right after type
X = preprocessing.normalize(X, axis=0)

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.25)

print(X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)

We can also write out the shapes of the training and test datasets. The print statement returns four shapes: the training features, the training labels, the test features and finally the test labels. Now we are ready to build the deep neural network. We specify that the model is a sequential model, and then add an input layer, two hidden layers and an output layer.

winemod1 = Sequential()

winemod1.add(Dense(30, input_dim=12, activation='relu', name='input',kernel_regularizer=regularizers.l2(0.01)))
winemod1.add(BatchNormalization(momentum=0.99, epsilon=0.001))

winemod1.add(Dense(50, name='hidden1',bias_initializer='zeros'))
winemod1.add(BatchNormalization(momentum=0.99, epsilon=0.001))
winemod1.add(Activation('tanh'))
winemod1.add(Dropout(0.5))

winemod1.add(Dense(100, name='hidden2',bias_initializer='zeros'))
winemod1.add(BatchNormalization(momentum=0.99, epsilon=0.001))
winemod1.add(Activation('tanh'))
winemod1.add(Dropout(0.5))

winemod1.add(Dense(Y_train.shape[1], name='output', bias_initializer='zeros'))  # one output neuron per quality class
winemod1.add(BatchNormalization(momentum=0.99, epsilon=0.001))
winemod1.add(Activation('softmax'))

On each layer we apply an activation function, which transforms the values the layer passes on. When you build and improve a neural network, you need to make sure you use these different building blocks correctly. While you train your model, some metrics will be reported, and these help you improve the network; for example, you can tune the bias initialization or the normalization of each layer. Let’s see the summary of this network with the use of the summary function:
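winemod1.summary()  # lists every layer with its output shape and parameter count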

Before we start training, we have to compile the model, for which we need to use an optimizer. You can also specify different metrics and losses for your model, based on what you want to measure.

SGD = optimizers.SGD(lr=0.01, nesterov=True)
winemod1.compile(optimizer = SGD, loss = "mean_squared_error", metrics = ["accuracy"])

Now it is time to start the training with the fit function, in which you specify the training and validation datasets, the number of epochs, how many samples should be used in each batch, and so on.

winemod1.fit(x = X_train, y = Y_train, epochs = 200, verbose=1, batch_size = 64, validation_data=(X_test, Y_test))

After training, the model is evaluated on the test dataset and we print out the relevant results; based on these you can always go back and improve your model.

preds = winemod1.evaluate(x=X_test, y=Y_test)
print()
print("Loss = " + str(preds[0]))
print("Accuracy = " + str(preds[1]))

The model I trained returned an accuracy of around 0.55, which means that it predicts the exact quality of a wine correctly about 55% of the time. It is not a superintelligence, but let’s move on to the next services and see whether those can return a better accuracy.

All code and the new dataset can be found on GitHub.

Designer

The Designer is basically the well-known Azure Machine Learning Studio integrated into the Azure Machine Learning Workspace. On the left side panel of your workspace, click on Datasets and create a new dataset by uploading the file you created in the previous section. If you don’t have it, you can get it from my GitHub repository.

Note: I might not explain all the steps in detail; please take a look at one of my previous posts to get more information about the different modules I’m going to use.

So create a dataset from local files, browse to the wines.csv file, specify a name for it, and click Next. All settings and the schema should be fine, so simply click Create in the final window.

Now go to the Designer, which can also be found on the left side, and click on the plus sign to create a new experiment. You can rename your experiment, and if you have worked with Azure Machine Learning Studio before, you might find this environment familiar. On the left side you can find the available modules; this is not a complete list yet, but the engineering team is working hard to bring over all the functionality we are used to. So this is how your experiment looks before you start working in it:

Now we want to build an experiment that trains a model to tell us which wine has higher quality, based on the features and labels of the training dataset we prepared in the previous section. Let’s pull in the dataset from the Datasets folder on the left side; simply drag and drop it onto the experiment. When you try to run the experiment, you are prompted to choose a compute target. You can create a new one too, but again, remember to remove it if you won’t use it afterwards, otherwise it can get costly.

At the bottom of this window, click Create, and then your experiment is ready to run.

Let’s pull in some modules too before we actually run the experiment. Let’s use Edit metadata, which is really useful when you want to specify, for example, which columns hold categorical data.

Connect the data to the Edit metadata module, then click on the module. Choose two columns, type and quality; these should be changed to categorical columns. Save and run the experiment. Running might take a while, because it has to start up your target machine too. In the Designer, you also have to create a Pipeline for your experiment, so simply create a new one when prompted and click Run.

Now it is time to train our model with the dataset we prepared. Pull in the Split data module, because we are going to use 75% of our data for training, and the rest goes to validation.

Now we choose an algorithm to work with. Note that some of the algorithms are not yet available in the new interface, but they will be added soon. So pull in the Train model module (training on the quality column) and the Multiclass Neural Network for this project.

Your experiment should now look like the one in the picture. Save your experiment. Now it’s time to predict how good the different wines in our dataset are, using the Score model module. And to get an almost perfect overview of the results, also pull in the Evaluate model module. Connect all of these to your trained model, just like in the picture:

Save and run your experiment. Visualize the result of the Evaluate model module, which can tell you how to improve your model. In the Classic Studio, next to the accuracy results we could also observe the confusion matrix, but the new version lacks this functionality for now. This model is only 54% sure about the quality of the wines; so far my own code returned a better accuracy!

After the pipeline has finished running, you can see some logs of the execution. On the left side panel, choose Pipelines, click on the Run you are interested in and choose the Logs tab to read the details.

If you are happy with the results you got, you can register your trained model and then deploy it. This will be discussed in a later section of this post. If you want to improve your model, you can find some inspiration in one of my previous posts! 🙂

Automated ML

Automated ML enables you to improve your models by letting the service decide which algorithm should be used for training on the specific dataset you work with. So click on the Automated ML option on the left side panel and create a new run.

First, either add a new dataset or choose a previously uploaded one. Now choose the wines dataset, then click the Next button at the bottom of the window. In the next window you have to specify an experiment name; create a new one, so the service can suggest which algorithms perform the best, otherwise it would reuse the algorithms you chose in the experiment you made earlier.

The target column is quality, because we want to train our model to predict the quality of the wine based on the other columns we provide. As a compute target, choose the one we used in the previous section: BigTester. Click Next, and then you need to choose the algorithm type; given that the quality column holds categorical values, we choose Classification with deep learning allowed.

Click Finish, and wait until it is done running.

While you are waiting, you can monitor the models that have already returned some results (Models tab), and use this data to decide which trained model could work best on your data. The higher the accuracy after training, the more reliably the model can tell which wine you should drink. You should also observe the messages generated on the Data guardrails tab, because this is where you will get a notification if the service finds a problem with your dataset: missing values, high-cardinality features, or unbalanced samples.

On the Models tab you can download the trained model, and if you click on the name of the algorithm you are interested in, you can find more details: metrics, visualizations, logs and all the outputs generated by this model. At this point, your model is also ready to be deployed, simply from the Model details tab.

So from the algorithm list let’s choose the one with the highest accuracy: RobustScaler, KNN (accuracy: 0.6394453004622496). The best model therefore predicts the exact quality of the different wines correctly only about 64% of the time.

You can download the model, which saves a .pkl file to your PC that you can load and use in any Python environment, even in Notebooks.
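A minimal sketch of using it, assuming the downloaded file is named model.pkl and you kept the wines.csv file from earlier (Automated ML models are serialized with standard Python pickling, so joblib can load them):

import pandas as pd
import joblib

model = joblib.load("model.pkl")  # hypothetical file name; use the one you downloaded

wines = pd.read_csv("wines.csv")
features = wines.drop(columns=["quality"])  # score on the same columns the model was trained with
print(model.predict(features.head()))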

Before we deploy this model, let me suggest taking a deeper look at the visualizations, so you can make an even better decision when you try to choose the perfect model. You can find lots of useful information about these results at this link.

Another interesting tab is Outputs; this is where you can take a look at the files generated by the Automated ML run. You can find the scoring Python file, the dependencies required by your model, a yaml file to support the deployment, the accuracy and confusion results, and the .pkl file that you downloaded previously.

Support and next steps

The section about deployment has to wait for further support from the Azure community, given that there are some unresolved issues in the pipeline. If you find problems too, or you want the engineers to look into your wishes as well, feel free to reach out to them on the MSDN Forum.

I will continue with deploying the models we just prepared and trained as soon as possible! 🙂

Conclusion

These services give everyone a great opportunity when it comes to machine learning. If you are just getting started, you can try the Designer and become familiar with the steps that have to be taken when you build a model. You can try out the algorithms and get started with the mathematics behind them. You can run and test your own code in Notebooks and train your models very quickly. Automated ML gives you a great idea of which algorithm you should use on the dataset you provide.

I remember when I started to work with deep learning: we had to create different virtual machines, install the tools and packages on them, and wait long hours until the training finally finished. The workspace provides an environment to train and test your models without worrying too much about installation and setup.

I hope I could give you some value with this post. Feel free to reach out if you have any questions, or you want to get deeper in this beautiful world! 🙂

