In this post I would like to suggest a few new ways of improving an Azure Machine Learning (AML) solution, namely the Nepal Earthquake damage project. You might have seen my previous posts on the topic (Improve an Azure ML Studio experiment, Earthquake in the Visual Interface); now I’m back with the same problem but a brand-new solution. Not long after I wrote those posts, Microsoft released the new design of the Preview version of the Azure Machine Learning Workspace. I would like to walk you through some of the possibilities you get when you create a new workspace using an old, familiar project, and give you some ideas about how to improve your own solutions.
What we want to do is build a predictive model that learns from the data we provide and answers our question: which buildings are in danger when the next earthquake comes?
Predictive analytics deals with designing statistical or machine learning models that make predictions: they are trained on past data (features) and learn from it to predict future outcomes (labels).
So we build and train our model by performing supervised machine learning with the help of the Azure Machine Learning Workspace, a fully managed cloud service that lets you easily build, deploy, and share your machine learning solutions.
Prepare the environment
Create a new Azure Machine Learning Workspace and, when the deployment is done, go to the resource and click the Launch now button on the Overview page. You should see a panel with the Notebooks, AutoML and Designer options.
Notebooks – overview of the dataset
Let’s start by understanding the data we are going to use while building the predictive model. The aim is to see which values are useful features for us, and which ones we could leave out because they just add noise to the observations. Click the Start now button under Notebooks, and create a new folder for this project.
Create a new file of Notebook type inside the newly created folder; the button for that can be found next to the New folder button. Make sure you specify which folder you want to put it in; you can do that by scrolling down in this window.
The Central Bureau of Statistics collected this large dataset through a survey; it contains valuable information about numerous properties (area, age, demographic statistics and more) of buildings in Nepal. Download the data from my GitHub repo, and upload the “train_values.csv” and “train_labels.csv” files into the same folder where you placed your Notebook. To run the code here, you need to create a new compute, so click the plus button on the right side of the panel.
A new window opens up, where you have to give your compute a name and choose a virtual machine (VM) type and size. For this tutorial a CPU is enough; for deep learning training you would rather go with a GPU. Click Create.
Make sure your compute is running; you can start and stop it from the top panel of your notebook too. It is good practice to stop the compute when you are not using it, because it can get expensive otherwise. To enable writing code within this online environment, click the Edit button on the left side of the top panel, then choose the Edit Inline (preview) option.
We can use Python’s pandas package. Pandas (the name derives from “panel data”) is a popular data analysis library with a lot of great functions for objects called dataframes. Dataframes are table-like data structures that you can easily manipulate and transform.
```python
import pandas as pd

# Load the feature values and the labels, then join them on building_id
dfValues = pd.read_csv("train_values.csv")
dfLabels = pd.read_csv("train_labels.csv")
df = dfValues.set_index('building_id').join(dfLabels.set_index('building_id'))
display(df)
```
If you display the dataframe, you can see many columns describing each building: number of floors, the age of the building, and so on. These columns will be used as features for training the predictive model. The label column is damage_grade, which indicates how high the risk of damage is for each building in case of another earthquake (1 – low, 2 – medium, 3 – high). So our model is going to predict the risk by learning from “experience” data. We can investigate the label column further by running the following code.
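One way to do this inspection is pandas’ value_counts(); the tiny synthetic frame below stands in for the joined df, just so the snippet runs on its own:

```python
# Count how many examples each damage_grade label has.
# Synthetic stand-in for the joined dataframe from the previous cell.
import pandas as pd

df = pd.DataFrame({"damage_grade": [2, 2, 2, 1, 3, 2, 1, 2, 3, 2]})

label_counts = df["damage_grade"].value_counts().sort_index()
print(label_counts)
```

On the real dataset, the same call reveals the imbalance between the three grades discussed below.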
You can see that there are 3 different values in the damage_grade column, and also how many examples are available for each label. Ideally these numbers would be quite close to each other, but it is visible that label 2 has far more examples than labels 1 and 3. This is a problem: if the model is trained on more examples with label 2, it will return label 2 more often, even when the risk is supposed to be high or low. We shouldn’t train the model on unbalanced data, so we are going to deal with this issue when we start transforming and preparing the data before training.
The field of statistics is often misunderstood, but it plays an essential role in our everyday lives. Statistics, done correctly, allows us to extract knowledge from the complex, difficult real world. When we have a set of observations, it is useful to summarize features of our data in a single statement called a descriptive statistic. Instead of scrolling through and trying to understand all your data just by looking at it, we can use some handy Python functions.
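One such function is pandas’ describe(). A minimal, self-contained sketch, with made-up values for two of the dataset’s columns:

```python
# Summarize the numeric columns with descriptive statistics.
# The values here are synthetic; on the real df you would call df.describe().
import pandas as pd

df = pd.DataFrame({
    "count_floors_pre_eq": [1, 2, 2, 3, 1],
    "age": [10, 30, 995, 25, 0],
})

# Returns count, mean, std, min, quartiles, and max for each numeric column.
summary = df.describe()
print(summary)
```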
The first row shows how many non-null values each column has (pandas’ describe function counts only the non-empty fields); from this we can be sure that every observed column holds non-null values in each row. This is important, because missing values add a lot of noise to any calculation or prediction.
Another important observation concerns the columns starting with has_: their minimum is 0 and their maximum is 1. I could assume that these are probably true/false flags, but I cannot be sure about it yet. We can check one of these columns with the unique function.
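A minimal sketch of such a check; the column name comes from the dataset, the values here are a synthetic stand-in:

```python
import pandas as pd

# Synthetic stand-in for one of the has_ columns of the real dataframe.
df = pd.DataFrame({"has_superstructure_adobe_mud": [0, 1, 0, 0, 1]})

# unique() returns the distinct values that occur in the column.
distinct = df["has_superstructure_adobe_mud"].unique()
print(distinct)
```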
This shows the distinct values of the chosen column, so now we are sure that these columns hold true and false values.
Another observation you can make here concerns the min and max values of each column. For example, the buildings have at least 1 and at most 9 floors, and the age of a building starts at 0 and can go up to 995. We can also use a visualization tool to get a graphical overview of the data.
In the case of age, most of the buildings are between 0 and 200 years old, close to the mean, and there are some outliers around 995 years, which are probably memorial buildings.
Also, keep in mind that the describe() function returns 8 rows and 31 columns: it only observes the numerical columns and ignores the rest. We can look at the frequency distribution of the non-numerical columns to understand whether they make sense or not.
This tells me that there are 3 different land surface conditions, and most of the time the value is t. This doesn’t tell me much; I could assume this is a categorical column, but assigning numerical categories is out of scope for this session, especially because we cannot be sure that it will always have only this many distinct values.
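The frequency distribution above can be produced with value_counts() on the categorical column; the values below are a synthetic stand-in:

```python
import pandas as pd

# Synthetic stand-in for the land_surface_condition column.
df = pd.DataFrame({"land_surface_condition": ["t", "t", "n", "o", "t", "n", "t"]})

# value_counts() lists each distinct value with its frequency, most common first.
freq = df["land_surface_condition"].value_counts()
print(freq)
```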
Let’s prepare the data for training!
Designer – data preparation and training
Let’s upload the same datasets that we used in the previous section. Go to Datasets on the left side panel of the Workspace, and click Create dataset, choose the local files option.
A new window opens up, where you first have to give your dataset a name (I named them train_values_nepal and train_labels_nepal); you can keep the rest of the settings as default. When you click Next, you also have to specify a datastore. Choose the default storage, then pick the file from your local computer by clicking Browse.
When you have the file, click Next. Your data is validated, and then you should see the Settings and preview page. You can leave everything on its default setting except Column headers, which should be set to Use headers from the first file.
Scroll down to review your dataset, and click Next. On the Schema step leave everything as default and just click Next, then confirm that the details are correct and finish the process by clicking Create.
On the left side panel of the Workspace, find the Designer option, and then click on the plus sign under New pipeline. When the new Experiment is ready, you should see a window like the following.
For running the experiment, you need a compute target. On the right side panel, click Select compute target, then Create new. You can use the predefined one; you only need to give it a name. Otherwise, follow the link given in that window.
Now we are going to pull in the data we added previously. On the left side panel there are three separate menus: Datasets, Modules, Models. Click on Datasets, choose the data you uploaded, then simply drag and drop it onto the experiment. Let’s do the same join on these datasets as we did in the Notebooks: on the left side panel of the Designer window, choose the Modules option and start typing “join” in the search bar.
Put the Join Data module on the working panel, and connect the two datasets to it. A panel shows up on the right, where you have to set the columns to join by. Select building_id for each, using the Edit column link.
Save your changes, then click the Submit button on the right side of the Designer window. Set a new experiment name and, optionally, give it a description. Submit it and let it run. While the experiment is running, it should look like the picture.
If you want, you can visualize the dataset by right-clicking the Join Data module. For example, if you choose the damage_grade column while looking at the visualization, you can see some statistics about it. And just as we figured out in the Notebooks, this column has 3 different values: 1 (low risk of damage), 2 (medium risk of damage), 3 (high risk of damage).
Now we have a good understanding of the dataset, and it is clear that we have to deal with the following problems before training:
- unbalanced values
- non-numerical columns: strings and categorical
- outlier values: normalization
During the statistics part of the session we observed all the columns in the dataset, so we have a good idea of which features are useful and which we can exclude. To exclude noisy values (such as string columns), we can use the Select Columns in Dataset module. Just search for it at the top of the left side of the screen, pull the module in, and connect it to the output of the Join Data module.
Click on the Select columns in Dataset module, and find the Edit column button on the right side panel, and in the opening window, paste the following column names:
After submitting and running your experiment, let’s move on to the next transformation step. The has_ columns are verified to be true/false columns, so we can change them to categorical columns with the Edit Metadata module. Choose the Edit column option and add the following columns to include in the transformation:
Leave all settings as default, but set the Categorical option for your chosen columns. Your experiment should look like this now:
Another important step of data preparation is normalization, which helps handle outlier values like the ones we saw in the age column. The goal of normalization is to change the values of the numeric columns in the dataset to a common scale without distorting differences or losing information. This is useful since most of the columns use different metrics with different scales; after normalization, all the columns will be on the same range of values.
So, I will normalize all the columns except damage_grade and the categorical ones with a simple mathematical function called ZScore. All of this can be set on the right side panel with the Edit column option of the Normalize Data module.
ZScore uses the mean and the standard deviation of the column to transform each value: it takes the value, subtracts the column’s mean, and divides the result by the column’s standard deviation.
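In pandas the same calculation is a one-liner; the ages below are made-up sample values, just to show the transformation:

```python
import pandas as pd

# Made-up sample ages, standing in for one numeric column of the dataset.
ages = pd.Series([10.0, 30.0, 995.0, 25.0, 0.0])

# ZScore: subtract the column mean, divide by the standard deviation.
z = (ages - ages.mean()) / ages.std()
print(z)
```

After the transformation the column has mean 0 and standard deviation 1, so columns with very different scales become comparable.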
The challenge with imbalanced datasets is that most machine learning techniques ignore the minority class, and in turn perform poorly on it, even though the performance on the minority class is often very important. In our case, high risk of damage is such a minority class. Take a look at the damage_grade column by visualizing the output of the Normalize Data module after running the experiment. Note that there are visibly fewer high- and low-risk examples than medium-risk ones.
One approach to address imbalanced datasets is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, but that wouldn’t add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is the approach we are going to use: Synthetic Minority Oversampling Technique (SMOTE).
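SMOTE itself runs inside the Designer module, but the core idea can be sketched in plain Python. This is a simplified illustration, not the module’s actual implementation; the smote_sketch helper and its data are made up for the example:

```python
import numpy as np

def smote_sketch(X_minority, n_synthetic, k=3, seed=0):
    """Generate synthetic minority examples by interpolating between a
    point and one of its k nearest minority-class neighbors (the SMOTE idea)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_minority))
        x = X_minority[i]
        # Distances from x to every minority point (including itself).
        d = np.linalg.norm(X_minority - x, axis=1)
        neighbors = np.argsort(d)[1 : k + 1]  # skip the point itself
        j = rng.choice(neighbors)
        # New point somewhere on the segment between x and its neighbor.
        gap = rng.random()
        synthetic.append(x + gap * (X_minority[j] - x))
    return np.array(synthetic)

# Four made-up minority-class points in 2D; synthesize eight more like them.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.2]])
X_new = smote_sketch(X_min, n_synthetic=8)
print(X_new.shape)
```

Because each synthetic point lies between two real minority points, the new examples stay inside the region the minority class already occupies.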
The idea is that we compare the class 1 and class 3 damage_grade values to class 2 in two separate steps. To do that, we can use the Split Data module: pull this module in twice and connect both copies to the output of the Normalize Data module. Click the left Split Data module and, on the right side panel, set the splitting mode to Relative Expression with the expression: "damage_grade" < 3
In this way, on the left side we will be able to compare the number of examples for label 1 and label 2. We want to achieve the same on the other side, but comparing label 2 with label 3, so on the right side choose Relative Expression as well, with the expression: "damage_grade" > 1
Look for the SMOTE module on the left side panel and connect one to the Split Data module on each side. Set the label column using Edit column, choosing the damage_grade column. On the left side, set the SMOTE percentage to 400, which will oversample the label 1 examples.
We choose damage_grade as the label column on the right side similarly, but set the SMOTE percentage to 40. These percentages can be tuned further; you are welcome to play with them. If you visualize the output of each SMOTE module, you can see that the number of examples is now almost the same for both labels.
Now we have label 2 examples on both sides, so on the right side let’s add another Split Data module in Relative Expression mode with the expression: "damage_grade" > 2
By using the Add Rows module, you can collect all the transformed data. Submit and visualize your data to review your changes. Your experiment should now look something like this:
Now, finally, with the help of the Split Data module, we will have data for training and data for validating the model. Set the fraction of splitting to 0.7, in which case 70% of the rows become training data and 30% will be used for validation.
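The same 70/30 split can be sketched in pandas, in case you want to reproduce it in a notebook (the frame below is synthetic):

```python
import pandas as pd

# Synthetic 100-row frame standing in for the prepared dataset.
df = pd.DataFrame({"x": range(100), "damage_grade": [1, 2, 3, 2] * 25})

# Randomly take 70% of the rows for training; the remainder is for validation.
train = df.sample(frac=0.7, random_state=42)
valid = df.drop(train.index)
print(len(train), len(valid))
```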
Now, we are ready to train our model. For this we need to choose the best fitting algorithm. You do not need to fully understand all the mathematical functions behind these algorithms, as Microsoft provides a cheat sheet for us that helps a lot to make a choice.
There are several specific types of supervised learning represented within the Designer, for example classification, regression, and anomaly detection. When a value is being predicted, for example car prices next year, the supervised learning is called regression. Anomaly detection, in contrast, simply learns what normal activity looks like and identifies anything significantly different.
When the data is being used to predict a category, supervised learning is also called classification. This is the case we need, as we want to predict the damage grade which – as you already know – has 3 different values, or categories. When there are only two choices, it’s called two-class or binomial classification. When there are more categories, as we have now, this problem is known as multi-class classification.
So, we are getting closer to the chosen algorithm, now we can decide from only a few different algorithms.
When you are not entirely sure which algorithm works best for your specific dataset, you can investigate by pulling in two or more algorithms at the same time. After testing different algorithms on my dataset, I found that the MultiClass Boosted Decision Tree works best for me.
A decision tree (as a predictive model) is used to go from observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves).
A boosted decision tree is a method in which the second tree corrects the errors of the first tree, the third tree corrects the errors of the first and second trees, and so forth. Predictions are based on the entire ensemble of trees together.
Generally, when properly configured, boosted decision trees are the easiest methods with which to get top performance on a wide variety of machine learning tasks. However, they are also very memory-intensive learners.
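The Designer’s MultiClass Boosted Decision Tree is a managed module, but a related idea can be tried locally with scikit-learn’s gradient-boosted trees. A rough sketch on a synthetic 3-class problem (not the module’s actual implementation; the dataset here is made up):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the damage_grade prediction task.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# Each new tree in the ensemble corrects the errors of the previous ones.
model = GradientBoostingClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)
accuracy = model.score(X_va, y_va)
print(accuracy)
```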
Instead of trying random values for your algorithm’s hyperparameters, use the Tune Model Hyperparameters module, which can return the best configuration it finds for the available dataset.
This module also trains your model, and when the run is finished the sweep result can be visualized (right-click the module), showing the different trials made during the training.
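Outside the Designer, the same kind of sweep can be sketched with scikit-learn’s GridSearchCV. This is only an analogy to the Tune Model Hyperparameters module, with a made-up dataset and parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic multi-class data standing in for the prepared earthquake dataset.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

# Try a small grid of hyperparameters; cross-validation picks the best combo.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```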
We need to use the Score Model module to perform the prediction. We connect the data for validation (which is the second result dataset of the Split data module), and the trained model.
You can see the predicted values and the probability for each damage grade category. You can also see the original damage grade coming from the label dataset.
You are welcome to compare the original damage grades with the predicted ones, but you can also just look at the evaluation result. Connect the Evaluate Model module, Submit your experiment, and let it run. Your experiment should look like this:
Visualize the evaluation result and review how confident your model is about its decisions.
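If you prefer checking the numbers in a notebook, the same kind of evaluation can be sketched with scikit-learn; the true and predicted grades below are hypothetical:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true vs. predicted damage grades for a handful of buildings.
y_true = np.array([1, 2, 2, 3, 2, 1, 3, 2])
y_pred = np.array([1, 2, 2, 2, 2, 1, 3, 2])

# Overall accuracy plus a per-class breakdown of hits and confusions.
acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
print(acc)
print(cm)
```

The confusion matrix shows where the model errs per class, which matters more than raw accuracy on an imbalanced problem like this one.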
When you are satisfied with the results of your model, it is time to use it for something good. You can hand over a report as a CSV file (USE MODEL section), or you can deploy your model so it can be used by your customers. This model is now able to decide whether a building needs strengthening before the next earthquake.
This way, not only memorial buildings and homes could be saved, but also thousands of lives. Get started with machine learning today, and save more lives with AI!
Follow me 🙂