“A cool data scientist gets paid like the CEO of an average company.” Yandex.Taxi machine learning expert on how data predicts the future and shapes the world
Almost a year has passed since an unusual course, an innovation workshop, started at FIVT. Its essence is the creation of IT startups by student teams under the guidance of experienced mentors. It turned out well: thanks to the course, one student spent part of the summer in Silicon Valley, another received a grant of 800,000 rubles to develop his project, and ABBYY is ready to buy one project outright. And that is not all the workshop's results!
At the beginning of 2011, third-year FIVT students were gathered in the Assembly Hall and told: over the next year you will need to create your own startup. The students received the idea with mixed feelings: it was not at all clear how to go about it, and the responsibility was unfamiliar, since it meant building a technology business rather than just another coursework project. Here is what Viktor Kantor, winner of the MIPT student Olympiad in physics and a student of the Yandex department, thinks about it:
When I chose FIVT upon admission, I hoped we would have something like this, so I am glad my hopes were not in vain. Throughout the year you could feel that the course was still taking shape: much of it was new, and many questions were contentious not only for the students but for the organizers as well. But on the whole I think the trend is positive, and I liked the course.
To make the students' work easier, various curators were invited to propose their ideas for building innovative businesses. They were a diverse crowd: from MIPT undergraduates and graduate students to Yuri Pavlovich Ammosov, an innovation adviser at Ernst & Young (and the leader of the course as a whole), and Mikhail Batin, who works on regenerative medicine and life extension. In the end the students chose the ideas that interested them most, curators joined the teams, and the hard but exciting work began.
In the almost year that has passed since then, the teams have run into many problems, some of which they have managed to solve. Now we can judge the results: despite the difficulties, they coped. MIPT students (besides FIVT, students from other faculties also joined in) managed to prepare several quite interesting and viable projects:
Askeroid (formerly Ask Droid) – search on smartphones (Anastasia Uryasheva)
An Android application for convenient searching across a large number of search engines. Experts took an interest in the development, and as a result Anastasia spent the whole of last summer at Plug&Play, one of the best-known incubators in Silicon Valley, learning the basics of technology entrepreneurship and talking with international venture experts.
1minute.ru – one minute for good (Lev Grunin)
This project lets anyone do charity simply, quickly and completely free of charge. The model is straightforward: advertisers post a set of activities on the site, users take part in them voluntarily, and all the advertising money goes to a charitable foundation. Within a week of launch the project had gathered more than 6,500 users, and it does not intend to stop there. As a result, thanks to Lev and his team, 600 children from orphanages will receive the gifts they wished for from Santa Claus this New Year. Have you spent your minute on a good deed yet?
Embedded Desktop – a computer in your phone (Alexey Vukolov)
An application that combines the capabilities of a computer with the mobility of a phone, an extremely useful product for busy people who travel a lot on business. It is enough to install it on a smartphone, and the user can “get” his own computer in any hotel or office, indeed anywhere there is a monitor (a TV also works), a keyboard and a mouse. The project received a grant to develop the idea and was presented at the Technovation Cup exhibition; with the money, the team is already actively buying equipment. MIPS, the American processor manufacturer, is extremely interested in the development.
Smart Tagger – semantic search through documents (Victor Kantor)
What do you do if you remember that somewhere in your mailbox there was a very important letter about the latest episode of The Big Bang Theory, but you cannot recall a single keyword from its text? Yandex and Google search are powerless. Smart Tagger comes to the rescue: a “smart” program that uses semantic search to return every text whose meaning is connected with the popular TV series. The project won a grant of 400,000 rubles at the U.M.N.I.K. competition!
MathOcr – formula recognition (Viktor Prun)
ABBYY proposed an interesting task: create a program that recognizes mathematical formulas of any complexity. FIVT students, joined by other interested students, completed the task; the module really does recognize formulas scanned from mathematics and physics textbooks. The result: ABBYY is ready to buy the product for good money.
As part of the joint “ABC of AI” project with MIPT, we have already written about so-called genetic algorithms, which allow you to “grow” programs according to the principles and laws of Darwinian evolution. For now, however, that approach to artificial intelligence remains a “guest from the future.” But how are artificial intelligence systems created today? How are they trained? Viktor Kantor, senior lecturer at the Department of Algorithms and Programming Technologies at MIPT and head of the user behavior analysis group at Yandex Data Factory, helped us figure this out.
According to a recent report from the research firm Gartner, which regularly updates its “hype cycle” of technology maturity, machine learning currently sits at the peak of inflated expectations in all of IT. This is not surprising: over the past few years machine learning has moved out of the sphere of interest of a narrow circle of mathematicians and algorithm theorists and penetrated first the vocabulary of IT businesspeople and then the world of ordinary people. Now anyone who has used the Prisma app, searched for songs with Shazam, or seen DeepDream images knows that there is such a thing as neural networks, with their special “magic.”
However, it is one thing to use a technology and another to understand how it works. General phrases like “a computer can learn if you give it a hint” or “a neural network consists of digital neurons and is structured like the human brain” may help some people, but more often they only confuse matters. Those who intend to study the mathematics seriously do not need popular texts: there are textbooks and excellent online courses for them. We will try to take the middle route: explain how learning actually happens on a very simple problem, and then show how the same approach works on real, interesting problems.
How machines learn
To begin with, let us define the concepts. In the words of Arthur Samuel, one of the pioneers of the field, machine learning covers methods that “allow computers to learn without being explicitly programmed.” There are two broad classes of machine learning methods: supervised learning and unsupervised learning. The first is used when, for example, we need to teach a computer to find photos of cats; the second when we need the machine to group news into stories on its own, as happens in services like Yandex.News or Google News. That is, in the first case we are dealing with a problem that presumes a correct answer (the cat in the photo is either there or not), while in the second there is no single correct answer, only different ways of solving the problem. We will focus on the first class of problems as the most interesting.
So, we need to teach the computer to make predictions, and as accurately as possible. Predictions come in two kinds: either you choose between several answer options (whether there is a cat in the picture is a choice of one option out of two; recognizing letters in images is a choice of one option out of several dozen; and so on), or you make a numerical prediction, for example predicting a person's weight from his height, age, shoe size, and so on. The two kinds of problems only look different; in fact they are solved almost identically. Let us try to understand exactly how.
The first thing we need to build a prediction system is to collect a so-called training sample, that is, data on the heights and weights of people in a population. The second is to decide on a set of features from which we will draw conclusions about weight. Clearly, one of the “strongest” such features is a person's height, so as a first approximation it is enough to take it alone. If weight depends on height linearly, our prediction will be very simple: a person's weight equals his height multiplied by some coefficient, plus some constant, as written by the simplest formula y = kx + b. All we have to do to train the machine to predict a person's weight is somehow find the correct values of k and b.
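Here is what this looks like in practice: a minimal sketch (with invented height and weight numbers) that finds k and b by ordinary least squares using NumPy.

```python
# A minimal sketch with made-up data: fitting weight = k * height + b.
import numpy as np

heights = np.array([150, 160, 170, 180, 190], dtype=float)  # cm, invented
weights = np.array([55, 62, 70, 78, 88], dtype=float)       # kg, invented

# np.polyfit returns the best-fit coefficients, highest power first: [k, b].
k, b = np.polyfit(heights, weights, deg=1)
print(f"weight ≈ {k:.2f} * height {b:+.1f}")
print("prediction for 175 cm:", k * 175 + b)
```

Least squares here plays the role of “finding the correct values of k and b”; below we will do the same job with gradient descent.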
The beauty of machine learning is that even if the relationship we are studying is very complex, essentially nothing will change in our approach. We'll still be dealing with the same regression.
Suppose a person's weight depends on his height not linearly but as the third power (which is generally to be expected, since weight depends on body volume). To take this dependence into account, we simply introduce another term into our equation, namely the third power of height with its own coefficient, obtaining y = k₁x + k₂x³ + b. Now, to train the machine, we will need to find not two but three quantities (k₁, k₂ and b). Suppose that in our prediction we also want to take into account a person's shoe size, his age, the time he spends watching TV, and the distance from his apartment to the nearest fast-food outlet. No problem: we simply add these features as separate terms to the same equation.
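In code this really is mechanical. A sketch continuing the toy example above, with height cubed and an invented “age” column added as extra features:

```python
# Sketch: the same least-squares machinery with extra features.
import numpy as np

heights = np.array([150, 160, 170, 180, 190], dtype=float)  # invented data
ages    = np.array([30, 22, 41, 35, 28], dtype=float)       # invented feature
weights = np.array([55, 62, 70, 78, 88], dtype=float)

# Each feature is just another column; the constant b gets a column of ones.
X = np.column_stack([heights, heights**3, ages, np.ones_like(heights)])
k1, k2, k3, b = np.linalg.lstsq(X, weights, rcond=None)[0]
print("prediction for 175 cm, 33 years:", k1*175 + k2*175**3 + k3*33 + b)
```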
The most important thing is to have a universal way of finding the required coefficients (k₁, k₂, …, kₙ). If one exists, it hardly matters which features we use for prediction, because the machine itself will learn to assign large weights to important features and small weights to unimportant ones. Fortunately, such a method has already been invented, and almost all of machine learning runs on it, from the simplest linear models to face recognition systems and speech analyzers. This method is called gradient descent. But before explaining how it works, we need to make a small digression and talk about neural networks.
Neural networks
In 2016, neural networks entered the news agenda so firmly that they became almost synonymous with machine learning and advanced IT in general. Formally speaking, this is not so: neural networks are not always used in machine learning; there are other technologies too. But the association is understandable, because it is systems based on neural networks that now deliver the most “magical” results, such as the ability to search for a person by photo, apps that transfer the style of one image onto another, or systems that generate text in the manner of a particular person's speech.
We have already written about how neural networks are structured. Here I just want to emphasize that the strength of neural networks, compared with other machine learning systems, lies in their multilayered nature, but this does not make them fundamentally different in the way they work. Multilayering really does make it possible to find very abstract, general features and dependencies in complex feature sets, such as the pixels of a picture. But it is important to understand that, in terms of learning principles, a neural network does not differ radically from a set of ordinary linear regression formulas, so the same gradient descent method works fine here too.
The “power” of a neural network lies in the presence of an intermediate layer of neurons, which combine the values of the input layer in weighted sums. Because of this, neural networks can find very abstract features in the data, ones that are hard to reduce to simple formulas like a linear or quadratic dependence.
Let's explain with an example. We settled on a prediction in which a person's weight depends on his height and on his height cubed, expressed by the formula y = k₁x + k₂x³ + b. With some stretch, even this formula can be called a neural network. In it, as in a regular neural network, there is a first layer of “neurons,” which is also the feature layer: x and x³ (plus the “unit neuron,” which we keep in mind and for which the coefficient b is responsible). The upper, or output, layer is represented by a single “neuron,” y, the predicted weight of the person. And between the first and last layers of “neurons” there are connections whose strength, or weight, is determined by the coefficients k₁, k₂ and b. Training this “neural network” simply means finding those coefficients.
The only difference from “real” neural networks is that here we have no intermediate (or hidden) layer of neurons whose job is to combine the input features. Introducing such layers lets us avoid inventing possible dependencies between features “out of our heads,” relying instead on the combinations that already exist in the network. For example, age and average time in front of the TV may have a synergistic effect on a person's weight, but with a neural network we are not required to know this in advance and write their product into the formula. A neural network is sure to contain a neuron that combines the influence of any two features, and if this influence is really noticeable in the sample, then after training that neuron will automatically receive a large weight.
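To make the idea of a hidden layer concrete, here is a toy forward pass (our own illustration, with random, untrained weights): each hidden neuron mixes both input features before the single output neuron sees them.

```python
# Toy forward pass through one hidden layer; the weights are random here,
# training is what would actually set them.
import numpy as np

rng = np.random.default_rng(0)
x = np.array([175.0, 40.0])      # invented features: height (cm), age (years)

W1 = rng.normal(size=(3, 2))     # 3 hidden neurons, each mixing both inputs
b1 = rng.normal(size=3)
W2 = rng.normal(size=3)          # the output neuron weighs the hidden ones
b2 = 0.0

hidden = np.tanh(W1 @ x + b1)    # each hidden value is a combination of features
y = W2 @ hidden + b2             # predicted weight (meaningless until trained)
print("hidden layer:", hidden, "output:", y)
```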
Gradient Descent
So, we have a training sample with known data, that is, a table of accurately measured heights and weights, and a hypothesis about the dependence, in this case the linear regression y = kx + b. Our task is to find the correct values of k and b, not by hand but automatically, and preferably by a universal method that does not depend on the number of parameters in the formula.
In general, this is not hard to do. The main idea is to define a function that measures the current total error and then “tweak” the coefficients so that this error gradually falls. How do we make the error fall? By adjusting our parameters in the right direction.
Imagine the two parameters we are looking for, those same k and b, as two directions on a plane, like the north-south and west-east axes. Each point on this plane corresponds to a particular pair of coefficient values, a particular dependence of weight on height. And for each such point we can compute the total error that this prediction makes over all the examples in our sample.
It turns out to be something like an altitude above each point of the plane, and the surrounding space starts to resemble a mountain landscape. Mountains are points where the error is high; valleys are places where it is lower. Clearly, training our system means finding the lowest point of this terrain, the point where the error is minimal.
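This landscape is easy to compute directly. A sketch that evaluates the total squared error on a grid of (k, b) points for the toy data used above and reports the lowest point found:

```python
# Sketch: the error "landscape" of y = kx + b over a grid of (k, b) values.
import numpy as np

heights = np.array([150, 160, 170, 180, 190], dtype=float)  # invented data
weights = np.array([55, 62, 70, 78, 88], dtype=float)

ks = np.linspace(0.0, 1.5, 61)     # one axis of the "plane"
bs = np.linspace(-100, 100, 61)    # the other axis
K, B = np.meshgrid(ks, bs)

# Total squared error at every point of the (k, b) plane.
E = ((K[..., None] * heights + B[..., None] - weights) ** 2).sum(axis=-1)

i, j = np.unravel_index(E.argmin(), E.shape)
print(f"lowest grid point: k={K[i, j]:.2f}, b={B[i, j]:.1f}, error={E[i, j]:.0f}")
```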
How do you find this point? The right way is to keep moving downhill from the point where we initially found ourselves. Sooner or later we will reach a local minimum, a point below which there is nothing in the immediate vicinity. Moreover, it is wise to take steps of different sizes: where the slope is steep, we can take wide steps; where it is gentle, it is better to creep up on the local minimum “on tiptoe,” otherwise we may overshoot.
This is exactly how the gradient descent method works: we change the weights of the features in the direction of the steepest decrease of the error function. We change them iteratively, that is, with a certain step, whose size is proportional to the steepness of the slope. Interestingly, when the number of features grows (adding the cube of a person's height, his age, shoe size, and so on), essentially nothing changes; it is just that our landscape becomes not two-dimensional but multidimensional.
The error function can be defined as the sum of the squares of the deviations that the current formula makes for the people whose weight we already know exactly. Take some random values of k and b, say 0 and 50. The system will then predict that every person in the sample always weighs 50 kilograms (y = 0·x + 50); on a graph, this dependence looks like a horizontal straight line. Clearly, this is not a very good prediction. Now take each person's deviation from this predicted weight, square it (so that negative deviations count too) and sum the results: that is the error at this point. If you are familiar with the basics of calculus, you can make this precise: the direction of steepest decrease is given by the partial derivatives of the error function with respect to k and b, and the step size is chosen for practical reasons: small steps take a long time to compute, while large ones can make us slip past the minimum.
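Putting the pieces together, here is a minimal gradient descent loop for y = kx + b on the same toy data, starting from the deliberately bad point k = 0, b = 50 (a sketch, not production code):

```python
# Sketch: gradient descent using the partial derivatives described above.
import numpy as np

heights = np.array([150, 160, 170, 180, 190], dtype=float)  # invented data
weights = np.array([55, 62, 70, 78, 88], dtype=float)
x = (heights - heights.mean()) / heights.std()  # rescale so one step size fits both

k, b = 0.0, 50.0   # the deliberately bad starting point from the text
lr = 0.1           # the step size

for _ in range(200):
    err = k * x + b - weights        # deviations of the current formula
    grad_k = 2 * (err * x).sum()     # d/dk of sum(err**2)
    grad_b = 2 * err.sum()           # d/db of sum(err**2)
    k -= lr * grad_k                 # move against the gradient
    b -= lr * grad_b

final_error = ((k * x + b - weights) ** 2).sum()
print(f"k={k:.2f}, b={b:.2f}, total squared error={final_error:.2f}")
```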
Okay, but what if we have not just a complex regression with many features but a real neural network? How do we apply gradient descent then? It turns out that gradient descent works with a neural network in exactly the same way, only training proceeds (1) step by step, from layer to layer, and (2) gradually, from one example in the sample to the next. The method used here is called backpropagation and was described independently in 1974 by the Soviet mathematician Alexander Galushkin and by Paul Werbos, then at Harvard University.
Although a rigorous presentation of the algorithm requires writing out partial derivatives, at the intuitive level everything is quite simple: for each example in the sample we have a certain prediction at the output of the network. Having the correct answer, we can subtract it from the prediction and obtain the error (more precisely, a set of errors, one for each neuron of the output layer). Now we need to pass this error back to the previous layer of neurons, and the greater the contribution a particular neuron of that layer made to the error, the more we reduce its weight (in effect, we are again taking partial derivatives, moving along the steepest slope of our imaginary landscape). Having done this, we repeat the same procedure for the next layer, moving in the opposite direction, from the output of the network to its input.
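For the curious, here is a toy backpropagation loop (our own sketch, not a fragment of any real system): a network with one hidden layer of four neurons, trained example by example exactly as just described.

```python
# Minimal backpropagation on the invented height/weight data:
# forward pass, error at the output, error pushed back through the hidden layer.
import numpy as np

heights = np.array([150, 160, 170, 180, 190], dtype=float)
targets = np.array([55, 62, 70, 78, 88], dtype=float)
xs = (heights - heights.mean()) / heights.std()   # normalized inputs

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=4) * 0.5, np.zeros(4)    # 4 hidden neurons, 1 input
W2, b2 = rng.normal(size=4) * 0.5, 0.0
lr = 0.01

for _ in range(2000):                  # many passes over the sample
    for x, y_true in zip(xs, targets): # one example at a time
        h = np.tanh(W1 * x + b1)       # forward pass: hidden layer
        y = W2 @ h + b2                # forward pass: output neuron
        d_y = 2 * (y - y_true)         # error at the output layer
        d_h = d_y * W2 * (1 - h**2)    # error passed back through tanh
        W2 -= lr * d_y * h;  b2 -= lr * d_y
        W1 -= lr * d_h * x;  b1 -= lr * d_h

print("prediction for 170 cm:", W2 @ np.tanh(W1 * 0.0 + b1) + b2)  # x=0 is 170 cm
```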
By walking through the network this way with every example of the training sample and “twisting” the neurons' weights in the right direction, we should eventually get a trained neural network. Backpropagation is a simple modification of gradient descent for multilayer networks and should therefore work for networks of any complexity. We say “should” because there are cases when gradient descent fails, preventing a good regression or a well-trained network. It is useful to know why such difficulties arise.
The Difficulties of Gradient Descent
Getting stuck in a local minimum. The gradient descent method finds a local extremum, but it cannot always reach the global minimum (or maximum) of the function. This happens because, moving against the gradient, we stop at the first local minimum we reach, and there the algorithm halts.
Imagine you are standing on a mountainside. If you want to descend to the lowest point in the area, gradient descent will not always help: the first low point on your way is not necessarily the lowest one. And where in real life you can see that by climbing up a little you could then descend even lower, the algorithm in such a situation simply stops. Often this can be avoided by choosing the right step size.
Incorrect step selection. Gradient descent is an iterative method, so we ourselves must choose the step size, the speed of our descent. If the step is too large, we can fly past the extremum we need and never find the minimum; this happens near a very steep descent, where we can simply fly over the bottom. If the step is too small, the algorithm becomes extremely slow on relatively flat terrain.
Network paralysis. Sometimes gradient descent fails to find the minimum at all. This can happen when there are flat regions on both sides of it: on hitting a flat region, the algorithm keeps reducing the step and eventually stops. If, standing on a mountaintop, you decide to walk home to the lowlands, the journey may take far too long if you wander onto a very flat plain. And if the flat regions are bordered by almost vertical “slopes,” the algorithm, having chosen a very large step, will jump from one slope to the other, barely moving toward the minimum.
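The step-size problem is easy to reproduce on the simplest possible landscape, f(w) = w². A tiny sketch:

```python
# Gradient descent on f(w) = w**2, whose gradient is 2*w.
def descend(lr, steps=20, w=10.0):
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print("lr=0.1:", descend(0.1))   # creeps steadily down toward the minimum at 0
print("lr=0.9:", descend(0.9))   # overshoots back and forth but still converges
print("lr=1.1:", descend(1.1))   # each jump lands higher up the opposite slope
```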
All these complications must be taken into account when designing a machine learning system. For example, it is always useful to track exactly how the error function changes over time: does it fall with each new cycle or stall, and how does the character of the decrease change with the step size? To avoid landing in a bad local minimum, it helps to start from several different randomly chosen points of the landscape; then the chance of getting stuck is much lower. There are many more secrets, big and small, to using gradient descent, as well as more exotic learning methods that only faintly resemble it. That, however, is a topic for another conversation and a separate article within the ABC of AI project.
Prepared by Alexander Ershov
- Can you use a completely primitive example to tell us how machine learning works?
Yes. There is a machine learning technique called a decision tree, one of the oldest methods. Let's build one right now. Say some abstract man asks you out on a date. What matters to you?
- First of all, whether I know him or not...
(Victor writes this on the board.)
...If I don’t know, then I need to answer the question of whether he’s attractive or not.
And if you know him, it doesn't matter? I think I get it, this is the friend-zone branch! Okay, I'm writing: if you don't know him and he's unattractive, then the answer is “probably not.” If you know him, the answer is “yes.”
- If I know, that’s also important!
No, this will be a friend zone branch.
Okay, then let's mark here whether he's interesting or not. After all, when you don't know a person, the first reaction is to looks; with an acquaintance, we already look at what he thinks and how.
Let's do it differently. Whether he is ambitious or not. If he is ambitious, it will be difficult to friendzone him, because he will want more. But the unambitious will endure.
(Victor finishes drawing the decision tree.)
Done. Now you can predict which man you are most likely to go on a date with. By the way, some dating services predict exactly such things. By analogy, you can predict how much of a product customers will buy, or where people will be at a given time of day.
The answers need not be just “yes” and “no”; they can also be numbers. If you want a more accurate forecast, you can build several such trees and average them. With such a simple thing you can actually predict the future.
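A minimal sketch of the tree from this conversation, assuming scikit-learn is available; the features and answers are invented to mirror the dialogue:

```python
# Decision tree for "will she go on a date with him?" (made-up toy data).
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [knows_him, attractive, ambitious] (1 = yes, 0 = no).
X = [
    [0, 0, 0],   # stranger, unattractive              -> no
    [0, 1, 0],   # stranger, attractive                -> yes
    [1, 0, 0],   # acquaintance, not ambitious         -> no (friend zone)
    [1, 0, 1],   # acquaintance, ambitious             -> yes
    [1, 1, 1],   # acquaintance, attractive, ambitious -> yes
]
y = ["no", "yes", "no", "yes", "yes"]

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["knows_him", "attractive", "ambitious"]))
print(tree.predict([[0, 1, 1]]))   # an ambitious, attractive stranger

# Averaging many such trees is exactly what a random forest does.
```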
Now imagine: was it hard for people to come up with such a scheme two hundred years ago? Not at all! There is no rocket science in it. Machine learning as a phenomenon has existed for about half a century. Ronald Fisher was making predictions from data back in the first half of the 20th century: he took irises and classified them by the length and width of their sepals and petals, determining the species of the plant from these parameters.
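Fisher's iris measurements actually ship with scikit-learn, so this historical example can be reproduced in a few lines (a sketch, assuming scikit-learn is installed):

```python
# Classifying Fisher's irises by sepal and petal measurements.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)
print("accuracy on held-out irises:", clf.score(X_test, y_test))
```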
In industry, machine learning has been used actively for the past couple of decades: the powerful and relatively inexpensive machines needed to process large amounts of data, for such decision trees for example, appeared not so long ago. But it is still thrilling: we draw these things for every problem and use them to predict the future.
- Well, surely no better than some octopus oracle of football matches...
What do we care about octopuses? Though we do have more variability. Right now, machine learning saves time and money and makes life more comfortable. A few years ago it overtook humans at image classification: a computer, for example, can tell apart 20 terrier breeds, while an ordinary person cannot.
- And when you analyze users, is each person a set of numbers for you?
Roughly speaking, yes. When we work with data, all objects, including user behavior, are described by a certain set of numbers. And these numbers reflect the characteristics of people’s behavior: how often they take a taxi, what class of taxi they use, what places they usually go to.
We are now actively building look-alike models to identify groups of people with similar behavior. When we launch a new service or want to promote an old one, we offer it to those who will be interested.
For example, we now have a service: two child seats in a taxi. We could spam everyone with this news, or we could inform only a certain circle of people. Over the year we accumulated a number of users who wrote in the comments that they needed two child seats. We found them, and people similar to them. Conventionally, these are people over 30 who travel regularly and love Mediterranean cuisine. Although of course there are many more features; this is just an example.
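A hypothetical look-alike sketch (not Yandex's actual pipeline; the features and numbers are invented): describe each user with a vector of behavioral numbers and find the nearest neighbours of a “seed” user who asked for two child seats.

```python
# Toy look-alike search with nearest neighbours.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Invented features: [trips per month, avg. trip cost, share of airport trips].
# A real system would normalize the features first.
users = np.array([
    [20, 450, 0.05],
    [18, 430, 0.04],
    [ 2, 150, 0.50],
    [22, 500, 0.03],
    [ 3, 120, 0.45],
], dtype=float)
seed = np.array([[19, 440, 0.05]])   # a user known to need two child seats

nn = NearestNeighbors(n_neighbors=3).fit(users)
_, idx = nn.kneighbors(seed)
print("users to tell about the new service:", idx[0])
```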
- Even such subtleties?
This is a simple matter. Everything is calculated using search queries.
- Could this somehow work in the app? Say, you know that I'm broke and subscribe to groups like “How to survive on 500 rubles a month,” so you offer me only beat-up cheap cars; I subscribe to SpaceX news, and from time to time you try to sell me a Tesla?
It may work that way, but such things are not approved of at Yandex, because it is discrimination. When you personalize a service, it is better to offer not the merely acceptable but the best of what is available, whatever the person likes. Sorting by the logic of “this one needs a better car and that one a worse one” is evil.
- Everyone has their perverse desires, and sometimes you need to find not a Mediterranean recipe but, say, pictures about coprophilia. Will personalization still work then?
There is always a private mode.
If I don’t want anyone to know about my interests or, let’s say, friends come to me and want to watch some trash, then it’s better to use incognito mode.
You can also decide which company’s service to use, for example Yandex or Google.
- Is there a difference?
It's a difficult question. I don’t know about others, but Yandex is strict with the protection of personal data. Employees are especially monitored.
- That is, if I broke up with a guy, I won’t be able to find out whether he went to this dacha or not?
Even if you work at Yandex. This is, of course, sad, but yes, there is no way to find out. Most employees don't even have access to this data. Everything is encrypted. It's simple: you can't spy on people, this is personal information.
By the way, we had an interesting case on the topic of breaking up with guys. When we made a forecast for point “B” - the destination point in the taxi, we introduced hints. Here, look.
(Victor logs into the Yandex.Taxi application.)
For example, the taxi thinks I'm at home. It suggests that I go either to work or to RUDN University (I lecture there as part of the Data Mining in Action machine learning course). At some point, while developing these hints, we realized that we must not compromise the user: anyone can see point B on the screen. For that reason we decided against suggesting places based on similarity. Otherwise you would be sitting in a decent place with decent people, order a taxi, and it writes to you: “Look, you haven't been to this bar yet!”
- What are those blue dots blinking on your map?
These are pickup points: they show where it is most convenient to call the car to. After all, you can order a taxi to a spot it would be quite inconvenient to reach. But in general, you can call one anywhere.
- Yes, anywhere. Once it threw me two blocks away.
Recently there have been various problems with GPS, and they led to some funny situations: people on Tverskaya, for example, were placed by the navigation in the middle of the Pacific Ocean. As you can see, the misses are sometimes much bigger than two blocks.
- And if you restart the application and click again, the price changes by several rubles. Why?
If demand exceeds supply, the algorithm automatically applies an increasing coefficient; this helps those who need to leave as quickly as possible get a taxi even during periods of high demand. By the way, machine learning can predict where demand will be higher in, say, an hour. This helps us tell drivers where there will be more orders, so that supply matches demand.
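As a deliberately simplified illustration of the idea (our invention, not Yandex's actual formula), a surge coefficient might be computed like this:

```python
# Toy surge pricing: the coefficient grows when demand outstrips supply.
def surge_coefficient(expected_orders: float, free_cars: float) -> float:
    ratio = expected_orders / max(free_cars, 1.0)
    return max(1.0, round(ratio, 1))   # never below 1.0, i.e. the normal price

print(surge_coefficient(80, 100))   # quiet evening -> 1.0
print(surge_coefficient(150, 60))   # rush hour     -> 2.5
```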
- Don’t you think that Yandex.Taxi will soon kill the entire taxi market?
I think not. We are for healthy competition and are not afraid of it.
For example, I myself use different taxi services. Waiting time is important to me, so I look at several apps to see which taxi will arrive faster.
- You teamed up with Uber. For what?
It is not my place to comment. I think uniting is a deeply sensible decision.
- In Germany, one guy attached a bathtub to drones and flew off for a burger. Have you thought that it might be time to master the airspace?
I don't know about airspace. We are following news like “Uber has launched taxis on boats,” but I can’t say anything about the air.
- What about self-driving taxis?
There's an interesting point here. We are developing them, but we need to think about how exactly they should be used. It is too early to predict in what form and when they will appear on the streets, but we are doing everything to develop the technology for a fully autonomous car, where a human driver will not be needed at all.
- Are there fears that the drone software will be hacked in order to control the car remotely?
There are risks always and everywhere where there are technologies and gadgets. But along with the development of technology, another direction is also developing - their protection and safety. Everyone who is in one way or another involved in technology development is working on security systems.
- What user data do you collect and how do you protect it?
We collect anonymized usage data, such as from where, when, and to where a trip was made. Everything important is hashed.
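The idea of hashing identifiers can be sketched in a few lines (an illustration of the general technique, not a description of Yandex's internal scheme):

```python
# Replacing identifiers with salted digests before analysis.
import hashlib

SALT = b"secret-kept-away-from-analysts"   # hypothetical salt

def anonymize(user_id: str) -> str:
    return hashlib.sha256(SALT + user_id.encode("utf-8")).hexdigest()

print(anonymize("+7-900-123-45-67"))   # the analyst sees only this digest
```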
- Do you think the number of jobs will decrease because of drones?
I think it will only grow. After all, those drones will also need to be maintained somehow. Changing one's specialty is, of course, a somewhat stressful situation, but what can you do?
- At every one of his lectures, Gref says that a person will have to change profession radically at least three times.
I cannot name a single specialty that will last forever. A developer does not work in the same language and with the same technologies all his life. Everywhere you have to retool. With machine learning, I can clearly feel how guys six years younger than me think much faster than I do. And people at 40 or 45 feel this even more strongly.
- Experience no longer plays a role?
It does. But methods change. You may come to an area where, say, deep learning has not yet been used; you work there for a while, then deep learning methods are introduced everywhere, and you understand nothing about them. That's it: your experience is then useful only for planning the team's work, and not always even for that.
- And your profession is data scientist, is it in demand?
The demand for data science specialists is simply off the charts. Obviously, this is a period of crazy hype. Thank God, the blockchain helped this hype subside a little. Blockchain specialists get picked up even faster.
But many companies now think that if they invest money in machine learning, their gardens will immediately bloom. This is wrong. Machine learning should solve specific problems, not just exist.
There are times when a bank wants to make a recommendation system for services for users. We ask: “Do you think this will be economically justified?” They answer: “We don’t care. Do it. Everyone has recommendation systems, we will be in trend.”
The painful part is that something genuinely useful for a business cannot be made in a day. You have to watch how the system learns. It always makes mistakes at first; it may lack some data during training. You fix the mistakes, fix them again, and sometimes redo everything. After that, you need to set the system up to run in production, stably and scalably, which also takes time. As a result, one project takes six months, a year, or more.
If you treat machine learning methods as a black box, you can easily miss the moment when crazy things start happening. There is an old story about this. The military asked for an algorithm to analyze whether or not there is a tank in a picture. The researchers built it, tested it, the quality was excellent, everything was great, and they handed it over to the military. The military came back and said nothing was working. The scientists began nervously digging into it. It turned out that in every photograph with a tank that the military had supplied, someone had put a check mark in the corner with a pen. The algorithm had flawlessly learned to find the check mark; it knew nothing about tanks. Naturally, there were no check marks on the new pictures.
- I have met schoolchildren who develop their own dialogue systems. Have you thought about collaborating with children?
I have been going to all sorts of events for schoolchildren for quite a while now, giving lectures about machine learning. And, by the way, one of the topics was taught to me by a tenth-grader. I was absolutely sure my story would be good and interesting, I was proud of myself, I began holding forth, and the girl said: “Oh, and we want to minimize this thing.” I looked and thought: indeed, why not, it really can be minimized, and there is nothing special to prove here. Several years have passed, and now she attends our lectures as a student at Phystech. Yandex, by the way, runs Yandex.Lyceum, where schoolchildren can learn the basics of programming for free.
- Recommend universities and faculties where machine learning is currently taught.
There is MIPT, with its FIVT and FUPM faculties. HSE has a wonderful computer science faculty, and at Moscow State University machine learning is taught at the Faculty of Computational Mathematics and Cybernetics. Well, and now you can attend our course at RUDN University.
As I already said, the profession is in demand. For a very long time, people with a technical education did completely different things. Machine learning is a wonderful example of a time when everything that people with a technical education were taught is directly needed, useful, and well paid.
- How good?
Name the amount.
- 500 thousand per month.
You can, just not as an ordinary data scientist. In some companies an intern can earn 50 thousand for simple work; the range is very wide. In general, the salary of a cool data scientist can be compared with the salary of the CEO of some medium-sized company. In many companies the employee gets many perks on top of the salary, and if it is clear that the person came not to add a nice brand to his resume but to actually work, everything will be fine for him.