Competition: CAFA 6 Protein Function Prediction (Kaggle)
Competition link: https://www.kaggle.com/competitions/cafa-6-protein-function-prediction/data
Role: Data Analyst / Machine Learning Engineer
Tools: Python, Scikit-Learn, Biopython, Pandas, Google Colab, PyTorch, Hugging Face (ESM-2)
Project Title: Bio-Analytics & NLP: Predicting Protein Function using Machine Learning | Protein Function Prediction using Meta's ESM-2 Transformer
Github: https://github.com/nihalunfc/cafa-protein-deep-learning/
Introduction: From the Cosmos to the Cell
Coming from an academic background in Physics and Cosmology, I am used to mapping high-dimensional spaces. But as I enter the final term of my Master of Data Analytics program, I wanted to apply these mathematical modeling skills to a different kind of universe: the human cell.
For my latest project, I tackled the CAFA 6 Protein Function Prediction challenge on Kaggle. The goal was to predict the biological functions of over 142,000 proteins.
This is a massive multi-label classification problem. However, the real challenge was not the biology; it was the Big Data bottleneck.
The Challenge: The Memory Crash
Google Colab offers a standard limit of 12.7 GB of System RAM.
When dealing with a 60+ million row biological graph dataset, traditional methods fail. At 7:51 PM during my first attempt, the system hit the RAM ceiling and crashed completely while trying to merge a 1.5 GB predictions array.
Here is how I engineered a 4-Phase memory-safe pipeline to bypass cloud limitations, combine Deep Learning with NLP, and improve my baseline score by over 220%.
Phase 1: The NLP Baseline ("Proteins as Language")
Score: 0.050
Instead of traditional sequence-alignment tools, I treated protein amino acid sequences as text. I used a TF-IDF Vectorizer with 3-mer (trigram) tokenization to break each sequence into "words," then trained a One-vs-Rest Logistic Regression model. It proved that NLP techniques work on biological sequences, but the score was far too low.
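The Phase 1 idea can be sketched in a few lines. The sequences and GO-term labels below are invented toy data purely for illustration; the real pipeline ran over the full CAFA training set.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy amino acid sequences (the real data has 142,000+ proteins).
sequences = ["MKTAYIAKQR", "GAVLIPFYWS", "MKTGAVLIAK", "PFYWSTCMNQ"]

# Toy multi-label targets: one column per GO term (invented here).
Y = np.array([[1, 0],
              [0, 1],
              [1, 1],
              [0, 0]])

# Treat each sequence as text: character trigrams act as the
# 3-mer "words," weighted by TF-IDF.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
X = vectorizer.fit_transform(sequences)

# One independent logistic regression per GO term.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

probs = clf.predict_proba(X)  # shape: (n_proteins, n_go_terms)
```

The key trick is `analyzer="char"` with `ngram_range=(3, 3)`, which turns each raw sequence string into overlapping trigrams without any manual tokenization step.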
Phase 2: Biological Graph Propagation (Fixing the Logic)
Score: 0.134
Machine learning models are dumb: my Phase 1 model was predicting that a protein was a "Dog" without realizing it also had to be a "Mammal." I integrated the Gene Ontology DAG (Directed Acyclic Graph) using Python dictionaries. By propagating each prediction up to its biological "parents," I instantly boosted the score to 0.134.
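The hierarchy fix above can be sketched as follows. The parent map and scores here are invented stand-ins; the real pipeline builds the child-to-parent dictionary from the Gene Ontology release files.

```python
# Hypothetical child -> parents map derived from the GO DAG.
go_parents = {
    "GO:dog": ["GO:mammal"],
    "GO:mammal": ["GO:animal"],
    "GO:animal": [],
}

def propagate(scores, parents):
    """Push each predicted score up to every ancestor: a parent's
    score must be at least as high as any of its children's."""
    result = dict(scores)
    # Relax repeatedly until no parent score changes; the graph is
    # acyclic, so this terminates.
    changed = True
    while changed:
        changed = False
        for term, score in list(result.items()):
            for parent in parents.get(term, []):
                if result.get(parent, 0.0) < score:
                    result[parent] = score
                    changed = True
    return result

raw = {"GO:dog": 0.9}
propagated = propagate(raw, go_parents)
# "Dog" now implies "Mammal" and "Animal" at the same confidence.
```

Taking the maximum over children is the standard way to keep propagated predictions consistent with the DAG: a protein can never be more likely a "Dog" than it is a "Mammal."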
Phase 3: The Deep Learning Upgrade (Meta’s ESM-2)
Score: 0.158
Text models cannot "see" the chemistry behind a sequence. I upgraded the pipeline with Meta's ESM-2 Transformer (the 8-million-parameter variant) via PyTorch on an NVIDIA T4 GPU. By encoding each protein into a 320-dimensional embedding that captures structure-aware patterns learned during pre-training, the Deep Learning model crushed the text baseline.
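Reducing each protein to a single 320-dimensional vector typically means pooling the per-residue hidden states. Below is a minimal NumPy sketch of masked mean pooling, one common choice; in the actual pipeline the hidden states would come from the ESM-2 8M checkpoint via Hugging Face, whereas here they are random stand-ins.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average per-residue embeddings into one vector per protein,
    ignoring padding positions marked 0 in the attention mask."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = mask.sum(axis=1)
    return summed / counts

# Stand-in for ESM-2 output: 2 proteins, up to 4 residues, 320 dims.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((2, 4, 320))
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 1, 0]])  # second protein is padded

embeddings = mean_pool(hidden, mask)  # shape: (2, 320)
```

The resulting fixed-length vectors are what a downstream classifier consumes, regardless of how long each original sequence was.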
Phase 4: The Ultimate Solution (Low-RAM Streaming)
Final Score: 0.160
The final challenge was combining the 3.5 GB of Text-model predictions with the 1 GB of Deep Learning predictions. Loading both into RAM caused an immediate system crash.
To solve this, I built a hard-drive I/O streaming pipeline. Instead of loading the massive TF-IDF file into memory, the script read it line-by-line from disk, blended each score with the Deep Learning model's score on the fly (a 60/40 weighted split), and freed the data from RAM the moment it had been used.
RAM usage stayed below 10.5 GB, and the pipeline's memory footprint no longer grew with the size of the data.
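A simplified sketch of the streaming ensemble, assuming both prediction sets use a tab-separated `protein<TAB>GO_term<TAB>score` layout. The file layout, function name, and the direction of the 60/40 weighting are illustrative assumptions; the source only states that a 60/40 split was used.

```python
# The smaller deep-learning predictions fit in RAM as a dict;
# the large TF-IDF file is streamed line-by-line from disk.
def stream_ensemble(tfidf_path, esm_scores, out_path,
                    w_text=0.6, w_esm=0.4):
    with open(tfidf_path) as fin, open(out_path, "w") as fout:
        for line in fin:  # only one line held in memory at a time
            protein, term, score = line.rstrip("\n").split("\t")
            text_score = float(score)
            esm_score = esm_scores.get((protein, term), 0.0)
            blended = w_text * text_score + w_esm * esm_score
            fout.write(f"{protein}\t{term}\t{blended:.3f}\n")
            # The previous line is released on the next iteration,
            # so memory use stays flat no matter how big the file is.
```

Because results are written straight back to disk, the peak memory cost is just the in-RAM dictionary plus one line of text, which is what keeps the pipeline under Colab's RAM ceiling.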
Final Results & Takeaways
The final Ensemble pipeline generated a 3.7 GB submission file containing millions of biological predictions, achieving a final Kaggle score of 0.160: a massive leap from the 0.050 baseline.
This project proved that you do not need a million-dollar supercomputer to run Big Data pipelines. With strategic I/O engineering, chunking, and model ensembling, massive datasets can be processed on restricted hardware.
