Why?

Kaggle is the largest Data Science community, with over 2 million users. It provides a whole Data Science ecosystem: competitions, kernels, discussions, a blog and courses. Whatever you need that is connected with Data Science or Machine Learning, you can probably find some clue about it on Kaggle. Most importantly, it gathers the best Data Scientists and Machine Learning practitioners from all over the world. And this is exactly why you should care about Kaggle: it gives you a unique opportunity to learn from the best.

Competitions differ from real-life DS/ML projects because the problem is already defined and the data is already gathered and structured. Even so, many skills transfer from competitions to projects perfectly. Whether it is about experimenting efficiently, creating a proper validation scheme for your model or simply extending the set of data processing and modeling methods you know, you can learn it on Kaggle.

Some people also like to compete. For them, leaderboard position is very important, and they are willing to spend a lot of time and effort to improve it. And, to be honest, this is one of the best motivators!

How?

  • If you have just started with Data Science or Machine Learning, it may be a good idea to take a look at the courses on Kaggle Learn. There, you can start with Python basics and finish with Deep Learning.
  • If you have experience with Python/R and basic data analysis/machine learning skills, I recommend diving straight into a competition. Pick one that you find interesting and just start applying your skills.

Competing

Participating in a competition requires a lot of time and effort, especially if you want to either place high on the leaderboard (LB) or learn a lot of new things. But this effort is always worth it (IMO :)! Some people like to start early in the competition to have time for experimentation and exploration. Others enter later, make use of what has already been discussed and discovered, and work intensively towards the end.

This brings us to the topic of using information that has already been shared. It is very important to make use of what has been posted in both the kernels and the discussion sections. What is more, solutions from previous, similar competitions are an invaluable resource!

If you don’t like working alone, try finding a team. It is always much nicer to work on a problem together and you also learn more!

One more thing to remember: don’t be discouraged by the fact that everything seems difficult when you begin. After some time spent on a particular competition and with a particular set of methods, you’ll get used to them. During the competition itself, it is sometimes very hard to improve your score, and this is equally true for beginners and experienced competitors. There are competitions where improvement seems to come easily with each new idea that you try, and there are those where no improvement can be seen even after a few days of trials. Persistence is the key!

The same applies to hidden bugs in your pipeline. You think you have written a perfect pipeline where everything is supposed to work smoothly: you process the data, train a model and prepare a submission, and then you get a significantly lower score than you should. This happens many times. Sometimes you can spend even a few days trying to debug a part of your code because it does not work as intended (it happened to me last week), and sometimes even this may not resolve a certain issue. But one thing is guaranteed: you will learn a lot during the process.
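One cheap safeguard against this kind of hidden bug is to sanity-check the submission file before uploading it. The snippet below is only a minimal sketch, not a complete validator: the function name `check_submission` and the column names `id` and `target` are hypothetical, and it assumes the competition provides a sample submission whose ids your file should match.

```python
import csv
import io
import math


def check_submission(sample_csv, submission_csv, id_col="id", target_col="target"):
    """Return a list of human-readable problems found in a submission.

    Both arguments are CSV text. The column names are hypothetical defaults;
    use whatever the competition's sample submission defines.
    """
    sample = list(csv.DictReader(io.StringIO(sample_csv)))
    submission = list(csv.DictReader(io.StringIO(submission_csv)))
    problems = []

    # 1. Same number of rows as the sample submission.
    if len(submission) != len(sample):
        problems.append(f"expected {len(sample)} rows, got {len(submission)}")

    # 2. Same set of ids, with no duplicates.
    sample_ids = [row[id_col] for row in sample]
    sub_ids = [row.get(id_col) for row in submission]
    if len(set(sub_ids)) != len(sub_ids):
        problems.append("duplicate ids")
    if set(sub_ids) != set(sample_ids):
        problems.append("ids differ from the sample submission")

    # 3. Every prediction is a finite number (no blanks, no NaN, no inf).
    for row in submission:
        value = row.get(target_col)
        try:
            ok = math.isfinite(float(value))
        except (TypeError, ValueError):
            ok = False
        if not ok:
            problems.append(f"bad prediction for id {row.get(id_col)}: {value!r}")

    return problems
```

An empty list means none of these particular checks failed. It does not prove the pipeline is correct (predictions that are finite but assigned to the wrong ids will pass), but it rules out a few of the most common mistakes quickly.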

And this is the best thing about Kaggle: you really learn a lot, and with every new competition you learn something new ;).

A few resources:
