THE LEARNING CURVE

A MAGAZINE DEVOTED TO GAINING SKILLS AND KNOWLEDGE

THE LEARNING AGENCY LAB’S LEARNING CURVE COVERS THE FIELD OF EDUCATION AND THE SCIENCE OF LEARNING. READ ABOUT METACOGNITIVE THINKING OR DISCOVER HOW TO LEARN BETTER THROUGH OUR ARTICLES, MANY OF WHICH HAVE INSIGHTS FROM EXPERTS WHO STUDY LEARNING. 

How I Joined a Kaggle Competition as a High Schooler

There are many remarkable things about the field of computer science, from the rise of social media to the power of machine learning. But one overlooked aspect is the degree to which young people can jump into computer science and compete at the highest level.

For context, I am 15 years old and a sophomore in high school. In my free time, I compete in data science competitions against people of all ages, even computer science “grandmasters.” To help others join the field, I wrote up why I entered these competitions and how others can do the same.

But let me start at the beginning. As 2017 was coming to a close, Google again found a way to shock the world – well, the chess world at least. The tech behemoth’s DeepMind lab developed AlphaZero, a chess engine that broke from tradition. Rather than hand-coding evaluation functions and move searches, as engines like IBM’s Deep Blue had done, DeepMind let AlphaZero take the reins, having it play millions of games against itself to learn chess – they had developed a chess engine powered by Artificial Intelligence (AI). What amazed chess fans like myself was the way AlphaZero played – it played like a human, with a fiery and captivating style. After admiring AlphaZero’s games, I was immediately drawn to the intriguing field of AI.

If you do a quick search on chess-playing AI, many articles pop up that share a common theme – Python. Almost all work related to AI is written in Python, which just so happened to be the only programming language I knew. One of these articles recommended looking through GitHub repositories and trying to contribute to open-source code. Among the most popular repositories was LeelaChessZero, an engine similar to AlphaZero that was taking the chess world by storm. Although it happened a few years ago, I distinctly remember going through Leela’s GitHub repository and feeling lost. It felt like the code was written in a foreign language – and to some extent, it was! It was immediately evident that whatever little knowledge of programming I had acquired was not enough.

Luckily, YouTube made learning a programming language easier than ever. I watched countless videos explaining Python in a friendly and easy-to-understand manner. Although there are many great Python YouTubers, the one who taught me the most was Corey Schafer. While improving one’s coding skills is necessary for learning AI, it was also clear that mathematical subjects such as statistics and calculus were just as relevant. After browsing the library shelves, I decided to pick up a book on statistics – Statistics for Dummies (which seemed fitting). One nice part about statistics is that it is easier to pick up than calculus or linear algebra.

After going through the book, which taught me a great deal, I tried to find Python code that used statistics, so I could get a feel for how the two are used together. That was when I discovered Kaggle, a platform that almost seemed too good to be true. With discussions on every topic imaginable, notebooks that pushed the limits of our imagination, and competitions catering to novices and masters alike, Kaggle seemed like a utopia. As if this were not good enough, Kaggle also provided users with free GPU and TPU access, which is essential for programmers running more intensive scripts. After looking through countless statistical notebooks on the site, I started looking for competitions. And it seemed that I had started at the perfect time. In January 2021, Kaggle launched its Playground Series, a monthly competition designed for beginners. This opportunity was my golden ticket to learning AI. I quickly signed up for the inaugural January Playground competition.
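As a small taste of what statistics meeting Python looks like, the standard library’s statistics module covers the basics from a book like Statistics for Dummies. This is just an illustrative sketch; the scores below are made up.

```python
import statistics

# Hypothetical exam scores, invented purely for illustration
scores = [72, 85, 90, 66, 78, 95, 88]

mean = statistics.mean(scores)      # average score
median = statistics.median(scores)  # middle value when sorted
spread = statistics.stdev(scores)   # sample standard deviation

print(f"mean={mean:.2f} median={median} stdev={spread:.2f}")
```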

What is a CSV file? What is linear regression? What is a loss function? Are neural networks robots? These were among the hundreds of questions that surrounded me as I typed one line of code and promptly gave up. The main problem was that for loops, if-else logic, and basic functions were not sufficient. After the competition ended, the top-place finishers published their solutions, with detailed summaries of what they did and code explaining how they did it.

Thanks to the amazing people who made it possible for beginners like myself to learn, I understood what algorithms they were using and the type of code they were writing. It was obvious that I needed to learn some of these algorithms to even submit, much less do well. After poring over articles detailing the inner workings of these algorithms, I figured the best way to improve my coding skills, and to understand these methods better, was to implement them from scratch in Python. That was extremely important in my becoming a better programmer, because I learned far more by coding the algorithms myself than by relying on a library implementation, which masks the true complexity. Both my parents are programmers and were able to offer me great advice the internet could never provide. Although I struggled immensely in the competition, by the end of January I had learned more about programming and AI than I could have imagined.
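To give a concrete flavor of what “from scratch” can mean, here is a minimal sketch of one of the simplest such algorithms: linear regression fit by gradient descent on a mean-squared-error loss. The data points and learning rate are made up for illustration.

```python
# Toy data, roughly y = 2x, invented for illustration
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

w, b = 0.0, 0.0   # model: y_hat = w*x + b
lr = 0.01         # learning rate

for _ in range(5000):
    n = len(xs)
    # Gradients of the MSE loss L = mean((w*x + b - y)^2)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Step downhill on the loss surface
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w={w:.2f} b={b:.2f}")  # should land near w ≈ 2, b ≈ 0
```

Writing out the gradient by hand, instead of calling a library’s `fit`, is exactly the kind of exercise that exposes what the algorithm is really doing.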

After continually participating in these monthly competitions, I decided to try deep learning, a more complicated, yet more interesting, approach than traditional machine learning methods. I had become quite confident in my skills with tabular data, but I wanted more exposure to the exciting field of Computer Vision. I found a few great books on the subject (Introduction to Statistics and Data Analysis, for example), and before I knew it, I was tackling the MNIST dataset on Kaggle.

One beautiful part of computer vision is that you can see and measure your model’s performance. For example, the MNIST dataset tasks the model with determining the digit shown in a handwritten drawing. To test the legitimacy of my model, I inspected a side-by-side comparison of each image and its prediction. Sure enough, the model was right over 98% of the time, but unlike with tabular supervised learning, you can actually visualize the results. Furthermore, for the 2% of images the model classified incorrectly, I often couldn’t figure out the intended digit either. Sometimes the writing was like a frantic high school procrastinator submitting his essay on the last day – unreadable (good thing I am typing this out). After exploring the basics of Computer Vision, I started looking for such competitions on Kaggle. I soon realized, however, that Computer Vision tasks tended to require a large amount of domain knowledge just to understand the competition and its purpose. AI tasks that could change the world would naturally be field-specific (typically science-related). It’s not as if digit classification was revolutionary, on par with vaccines or other scientific breakthroughs. I was beginning to get discouraged, worrying that almost all Kaggle competitions would be impossible, or incomprehensible. Fortunately, as the warm air of summer set upon me, I saw a different type of competition – something called Natural Language Processing (NLP).
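The side-by-side check described above can be sketched in a few lines. This hypothetical example uses scikit-learn’s small bundled digits dataset (8×8 images) as a stand-in for MNIST and collects the misclassified images for inspection; it is not my original competition code.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small 8x8 handwritten-digit dataset bundled with scikit-learn
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

preds = model.predict(X_test)
accuracy = (preds == y_test).mean()
print(f"accuracy: {accuracy:.3f}")

# Pull out the examples the model got wrong so they can be eyeballed,
# e.g. by plotting the corresponding digits.images with matplotlib
wrong = [(p, t) for p, t in zip(preds, y_test) if p != t]
print(f"{len(wrong)} of {len(y_test)} misclassified")
```

Plotting those misclassified images next to their predicted labels is exactly the “is the model or the handwriting at fault?” check described above.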

After sifting through past NLP competitions and typical tasks in the NLP sphere, I soon realized that this field made sense to me. Classifying text into different categories, determining the part of speech of words, generating text from a prompt, and assessing the grammatical correctness of student essays were all tasks whose importance and relevance I could understand. The practical applications were clear – text classification, for example, is used in many real-world situations, the most obvious being sentiment analysis.
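As a toy illustration of the text-classification task just mentioned, here is a minimal sentiment-analysis sketch assuming scikit-learn is available. The tiny training sentences are invented purely for illustration; real sentiment models train on far more data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy examples, three positive and three negative
train_texts = [
    "I loved this movie", "what a great film", "really enjoyed it",
    "I hated this movie", "what a terrible film", "really boring plot",
]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# Turn each sentence into a vector of word counts (bag of words)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)

# Naive Bayes is a classic statistical baseline for text classification
clf = MultinomialNB()
clf.fit(X, train_labels)

prediction = clf.predict(vectorizer.transform(["a great movie"]))[0]
print(prediction)
```

The same pipeline shape (vectorize, then classify) underlies many of the simpler NLP competition baselines.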

Now I could understand what the competitions were asking for, but I still had no clue how one could go from a text to a prediction. I slowly worked my way up to the current state-of-the-art methods, starting with statistics-based models and ending with the transformer. By the end of the summer, I was beginning to feel quite confident in my skills, although I was still not great at coding the complex functions and classes needed to train such models. Over the next school year, I worked a lot on becoming a better programmer and on understanding what I was coding. Rather than blindly importing some transformer from a library, I tried to understand the internals of how it worked. While I may not have achieved much competitively during that school year, the knowledge I gained allowed me to start coding at a higher level.
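As an example of the internals worth understanding, the transformer’s core operation, scaled dot-product attention, can be written out in a few lines of NumPy. This is a simplified single-head sketch with random toy inputs, not any particular library’s implementation.

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights                   # weighted mix of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))  # 3 query positions, dimension 4
K = rng.normal(size=(5, 4))  # 5 key/value positions
V = rng.normal(size=(5, 4))

out, attn_weights = attention(Q, K, V)
print(out.shape)  # (3, 4): each output row attends over all 5 values
```

Seeing that attention is just a softmax-weighted average of value vectors demystifies a lot of transformer code.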

The following summer, a seemingly perfect competition opened up on Kaggle – the second Feedback Prize competition. It was a choice competition for many reasons, including an easy-to-understand prompt and the fact that you did not need many computing resources to compete. While I initially enjoyed some success with the techniques I had learned, it was quite obvious that those at the top knew more: many of the top finishers used extremely advanced techniques in their solutions. However, this was beneficial to every other participant, because it allowed us to learn new methods and, hopefully, try them out in upcoming competitions. While I hit a roadblock towards the end of this competition, I think I learned enough to avoid such roadblocks in the future.

Having just completed the second edition of the Feedback Prize competition, I now look forward to participating in the last leg. That competition tasks participants with evaluating student writing on several metrics, including grammar, conventions, and vocabulary. The maximum score a student can receive is a 5 in each category, and after passing this essay through a model I trained on the dataset, I am happy to report that it scored mostly 4.5s and 5s.

As for other young people who want to join Kaggle, I would advise them to keep an open mind and never get discouraged. There are many difficult competitions on the site, but exposure to such challenging topics will only strengthen your knowledge of, and appreciation for, AI. Luckily, there are also many competitions catered specifically to beginners, and these are the perfect way to get your feet wet. Additionally, Kaggle offers an amazing community that is not only knowledgeable but also very friendly. It truly is the perfect environment for anyone interested in AI.

As for me, I am still trying to learn more about the ever-growing field of Natural Language Processing. The NLP field will never stop advancing, and I hope to be a speck in its revolution. The most remarkable part of AI and computer science, in general, is that anyone and everyone can play a role if they simply put in the time. In addition to all the online resources available, countless books and free material are written every day, and the field as a whole is growing exponentially. I look forward to seeing you all on Kaggle.

-Ryan Barretto
Ryan is a U.S. high schooler with a passion for data science, especially the field of Natural Language Processing. He placed 78th out of 1,557 teams in the Feedback Prize – Predicting Effective Arguments Kaggle competition, in which the majority of competitors were adults.
