I find it really important to talk a bit about the difference between tutorials and real life. I focus mostly on data science tutorials in this post, but I’m sure that similar issues can be spotted in other fields as well.
My motivation is all those conversations I’ve had with people who just got into the field of Data Science, or who have just read two articles about it. I have been working actively in this field for around 5 years, in what was called data mining 10 years ago. A big bang has happened in the world of data, and everyone would like to get a piece of it. I looked into how Data Science and Machine Learning tutorials have spread in the past years: people searched for these articles 6-8 times more in the past 5 years than before. I don’t have actual numbers on how many people study data science today, but that number is also way higher than it was 5 years ago.
I meet a lot of people at conferences and webinars where I speak, or when I consult at a customer, and I often have the feeling that some people are very confused about what the everyday work of a data scientist involves. Sometimes I ask these people where they heard all that strange information, and they claim to have read it in some tutorial. It turns out that some tutorials include information that is not necessarily a lie or stupid, but it can easily be misleading for someone who would just like to get started in this field.
I collected some of the things that I, and other data scientists, have found in various tutorials…
Pretty datasets
We call a dataset pretty because everything just works on it. Maybe you have heard about the Iris dataset, which is our forever favorite. It contains 150 rows altogether and provides data about some flowers. For each row it includes 4 measurements: the length and the width of the sepal and of the petal. Based on these, the flowers can be categorized by botanists into 3 species: setosa, versicolor and virginica. The data is so clean you could almost believe it is artificial and that not a single botanist has ever looked at it.
When you look at the dataset or at a plot of it, it is already visible that it won’t be a big problem to put the elements into categories; it’s very easy to identify the differences. But in these tutorials, people apply machine learning algorithms to the data, which then return 96-99% accuracy. In such tutorials I would suggest you just focus on the code and ignore the effectiveness of the solution, because otherwise they would only teach you that you need some data and a cool model, and these will work together like a dream forever, already at the first iteration. Believe me, that’s not the case with real data, not even close…
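A minimal sketch of what such a tutorial typically does, assuming scikit-learn is installed (the split ratio and model choice here are my own illustration, not taken from any particular tutorial):

```python
# Sketch of a typical Iris tutorial: load, split, fit, score.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # 150 rows, 4 measurements, 3 species
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy: {accuracy:.2f}")  # well above 0.9 on this clean dataset
```

On a dataset this clean, almost any classifier scores this well on the first try, which is exactly the misleading part.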
The Titanic dataset
This dataset is really popular, since it is the one typically given to beginners to learn about data science. You might know the site Kaggle: it provides datasets and challenges for nerds like me, who find it fun to spend their free time on coding competitions…
This dataset includes attributes of the people who boarded the Titanic, and we are asked to predict, based on this data, who would survive the tragedy. This is really… practical. I mean, the story is close to us and has an effect on our feelings, which may make us more motivated to learn and understand the data science process. The reason we don’t like this dataset is that it teaches us the wrong thing.
It tells us to divide the dataset into train and test data. It is true that we use the training dataset for teaching the machine and the test dataset for validation at the end of each iteration, but we would not necessarily “divide” a dataset for a production version of a model.
Especially when you want to predict the future, you should first look at the past for samples and experience. In this case, a data scientist would investigate similar catastrophes from previous years and build up a model based on that information. So instead of dividing one dataset into two halves, you should predict the outcome of a similar situation in the future based on past data (or experience).
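The idea above can be sketched as a chronological split: train on the past, evaluate on the “future”. The records and field names below are hypothetical, just to show the mechanics:

```python
# Sketch: splitting chronologically instead of randomly.
# The yearly records and their fields are invented for illustration.
records = [
    {"year": 2015, "passengers": 1200, "incidents": 3},
    {"year": 2016, "passengers": 1350, "incidents": 2},
    {"year": 2017, "passengers": 1500, "incidents": 4},
    {"year": 2018, "passengers": 1650, "incidents": 1},
    {"year": 2019, "passengers": 1800, "incidents": 2},
]

cutoff = 2018  # train on everything before this year, evaluate on the rest
train = [r for r in records if r["year"] < cutoff]
test = [r for r in records if r["year"] >= cutoff]

print(len(train), len(test))  # 3 2
```

This way the model never sees data from “after” the events it is asked to predict, which is closer to how a production model is used.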
Divide dataset randomly
I know someone who would actually forbid dividing a dataset randomly. All or most tutorials state – since it’s only one line of code – that when you create training and test datasets from your data, you should divide it randomly. Some of the data is then used for training a machine learning model, and the rest is used to validate it.
The tutorials say, for example: we have red and green balls in a hat, and we want to predict which color we will pick, so we divide this dataset randomly to generate a training and a test dataset. The machine learns from the training dataset, and then on the test dataset we let it decide whether each ball is going to be red or green. A machine learns from experience, which is why it is important to provide balanced data; in tutorials we see that it is balanced, and the prediction returns a good answer with high accuracy.
But in real life, if you divide red and green balls randomly, there is a high probability that the training dataset will contain, for example, more red balls than green ones, and a machine trained on this data will most of the time answer that a ball is red, even when it is actually green.
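One common way to reduce this risk is a stratified split, which splits each class separately so both halves keep the original proportions. A minimal pure-Python sketch (the ball counts are invented for illustration):

```python
import random

# Hypothetical imbalanced data: 80 red balls and 20 green balls.
balls = ["red"] * 80 + ["green"] * 20

def stratified_split(items, test_ratio, seed=0):
    """Split each class separately so both halves keep the class proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for label in set(items):
        group = [x for x in items if x == label]
        rng.shuffle(group)
        n_test = int(len(group) * test_ratio)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

train, test = stratified_split(balls, test_ratio=0.25)
print(train.count("green") / len(train))  # 0.2, same ratio as the full data
print(test.count("green") / len(test))    # 0.2 here too
```

Libraries usually hide this behind an option (for example a `stratify` argument), but the idea is exactly the loop above.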
Every step takes the same time
In tutorials you often see a lifecycle diagram suggesting that each element of the cycle takes the same length of time, and that this time is easy to define. I heard a story from a data science consultant who met the project owner of a company: they had a 3-hour session where the consultant showed how a specific problem could be solved. Then the project owner asked how long the work would take, and the data scientist answered: “Like 2-3 months.”
Then the project owner went: “But how?? You just did it now in 3 hours…” That’s because for a meeting, a lecture, a session or even a tutorial, you prepare the data so it’s ready to use, you implement the model, get the resources ready and squeeze the whole thing into those 3 hours. In such cases you act like a producer making a new movie: you organize the steps in a way that gets you somewhere before the session’s time is up.
In real life this is not how it works. The dataset is never available without spending a long time on fixing issues and doing some featurization; you know, preparing it for training. Then you build the machine learning model and it returns a result, but you might go back and do some more fixes on the data, on the parameters of the algorithm, and so on. So it takes a lot of time in real life, while in a tutorial it seems that you can easily get to a good result in just a few hours.
If you get a task like this from your boss, ask for a few days to investigate, so that you can come back with a legitimate estimate of the project’s length.
Business to ML
Many tutorials use a business scenario that seems like the simplest thing to turn into a machine learning problem. My favorite was the one with a bank – in the scenario a pretty huge one, with a lot of customers and transactions – where the data science consultant had to predict whether customers would stay or leave. The data provided was fairly balanced: it contained a similar amount of data about users who were still customers of the bank and about users who had left. From this, the tutorial’s data scientist could build a simple classification model that returned whether a user would stay or leave the bank, with very good accuracy.
In real life, banks don’t have such a balanced dataset. Most of the time, 90% of the users are still customers of the bank, and only 10% have left. This data is so skewed that, without proper fixes and featurization, the model will most of the time return that a user is going to stay a customer in the future too, because it doesn’t have enough examples of people who left.
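A quick sketch of why this is dangerous, using invented 90/10 labels: a “model” that always predicts “stays” looks accurate but is useless for the business question:

```python
# Sketch: why accuracy is misleading on skewed churn data (labels invented).
# 1 = customer stayed, 0 = customer left; a 90/10 split like the one above.
labels = [1] * 90 + [0] * 10

# A lazy "model" that always predicts the majority class ("stays").
predictions = [1] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
leavers_caught = sum(p == 0 and y == 0 for p, y in zip(predictions, labels))

print(accuracy)        # 0.9 -- looks great on paper
print(leavers_caught)  # 0  -- but not a single leaver was identified
```

This is why, on skewed data, you look at metrics per class (or rebalance the training data) instead of trusting the overall accuracy.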
Don’t include outlier values
Tutorials often state that you should not include outlier values in your dataset when you want to train a model. Let me tell you a secret: you should not – under any circumstances – do that in real life!! Ever!!!
Let’s say you have to predict, based on a dataset about buildings and their properties, whether they would collapse in an earthquake. Most of the buildings are 5-10 years old, but there are a few that are more than 50 years old. If we followed the tutorial, we would only keep the buildings that are under 10 years old, because the older ones count as outlier values. We train our model on fairly new buildings, for which we know there is quite a small risk of damage. We also validate on this dataset – without outlier values – and it returns 90% accuracy, stating that the buildings in the test dataset have a low risk of damage.
But then you get a request from your customer to see the prediction on new data which includes historical buildings as well. If you think about them using your own intelligence, not an artificial one, these should have a high risk of damage, since they are old buildings that have gone through a lot of wear. Your artificial intelligence model, though, was trained only on quite strong buildings: new ones, built with stronger materials. So your model will return a low risk of damage for all the buildings. Then an earthquake comes, and no one understands how that building collapsed; no one expected it.
Never throw out outlier values; apply normalization or other algorithms to your dataset instead, and include every possible scenario!
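One such technique is robust scaling with the median and interquartile range, which tames the scale of extreme values without dropping any rows. A sketch with hypothetical building ages:

```python
import statistics

# Hypothetical building ages: mostly new, plus a few genuinely old buildings.
ages = [5, 6, 6, 7, 7, 8, 8, 9, 10, 55, 80]

# Robust scaling: center on the median, scale by the interquartile range,
# so extreme values are tamed instead of removed.
median = statistics.median(ages)
quartiles = statistics.quantiles(ages, n=4)
iqr = quartiles[2] - quartiles[0]

scaled = [(a - median) / iqr for a in ages]
print(len(scaled) == len(ages))  # True: the old buildings are still in the data
```

The old buildings remain visible to the model as large (but bounded) values, so it can actually learn that age matters.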
Models work with numbers better
Many tutorials focus a lot on making sure that the dataset given to the machine learning model only includes numerical data. For this, the writers turn string categories into numbers from 1 to n, depending on how many categories there are. But my favorite is when they turn information into numbers that actually cannot be measured by numbers.
For example, let’s go back to the bank scenario, where we wanted to define how reliable the customers are. How would you define reliability with numbers? Is there a range of reliability; can it be defined between 1 and 10? Or with just two values: yes or no?
Some parameters cannot be turned into numerical data in real life, because doing so might cause more problems than it solves, especially if you don’t define the range well enough.
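When a category has no natural order, one-hot encoding is usually safer than mapping categories to 1…n, because it does not invent a ranking. A pure-Python sketch with made-up categories:

```python
# Hypothetical customer occupations -- there is no natural ordering among them.
occupations = ["teacher", "engineer", "nurse"]
categories = sorted(set(occupations))  # ['engineer', 'nurse', 'teacher']

# Risky: ordinal encoding implies teacher (2) > nurse (1) > engineer (0),
# an ordering with no real-world meaning.
ordinal = {cat: i for i, cat in enumerate(categories)}

# Safer: one indicator column per category, so all categories are equidistant.
one_hot = {cat: [1 if cat == c else 0 for c in categories] for cat in categories}

print(one_hot["teacher"])  # [0, 0, 1]
```

For something like “reliability”, though, the real problem is upstream: before any encoding, you have to decide whether the concept can be measured at all, and on what scale.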
One model to rule them all
Many tutorials tell you that you only do these steps once, or that you just have to implement your machine learning model well and then it can be reused for other similar problems. I suggest you don’t believe this too much. A model is usually trained on a dataset that it works together with very well. As soon as the data changes, or gets more rows since the last training, the model might need some fixes too. You can try to use your previously built models as a schema, but keep in mind that they won’t return good results for all kinds of different scenarios.
You must have enough data to build a good model
Tutorials usually suggest that if you have enough data, your model will work better. Again, every time your dataset gets updated, you might need to retrain your model. Also, there is no such thing as enough data. What is enough? 1 million rows? 10 million? In real life your dataset will never include all possible situations, and you often meet new issues during the evaluation step of each iteration. I’d rather say that you should always know your data as well as possible, which allows you to build a model that fits your dataset better.
You need only 2-3 iterations
The last point I see very often in tutorials – even in my own – is that they show your project being finished in only 2 or 3 iterations. But remember, in a tutorial you want to show some progress, the big changes in the result after an iteration. Spoiler alert: in real life you will spend a lot more iterations on a problem. The reason behind the numerous cycles is that you prepare the data so it can be used for a model, you build the model, define the required parameters, and train, score and evaluate. After each evaluation you go back to the beginning, make an improvement, and see how the results look after that change. And you never want to make two different fixes in one cycle; otherwise how would you know which one improved or messed up your evaluation results? The safest way is to make your fixes step by step.
Also, there is no such thing as a perfectly working model. You may spend long months on these cycles of a few minutes or hours each, and even after delivery you can keep improving it.
I hope I could give you some idea of how the real life of a data scientist looks compared to what you see in tutorials. After all this, I still highly recommend getting started with these tutorials, but remember to focus on the relevant information, and don’t let yourself be led the wrong way.
Follow me 🙂