SafeSteps
Overview
SafeSteps came out of trying to make messy, user-generated data genuinely useful. The goal was to classify hazard reports, not in a clean benchmark setting, but in a way that could realistically support decision-making.
It sits somewhere between a simple text classifier and a system that could support real decisions about safety.
What made it difficult
The data was all over the place: short texts, typos, inconsistent phrasing, and heavy class imbalance. The rare classes were often the most important, but also the hardest to model.
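One common way to keep rare classes from being drowned out is inverse-frequency class weighting, so that scarce hazards contribute more to the training loss. A minimal sketch; the label names here are illustrative, not the project's actual taxonomy:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class inversely to its frequency so rare hazards
    contribute more to the training loss than common ones."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# Toy labels: "slip" dominates, "gas_leak" is rare but critical.
labels = ["slip"] * 8 + ["gas_leak"] * 2
weights = inverse_frequency_weights(labels)
print(weights)  # {'slip': 0.625, 'gas_leak': 2.5}
```

The resulting dictionary can be passed to most classifiers that accept per-class weights; many libraries also compute this automatically from the training labels.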
Another issue was that categories weren’t fixed. As I worked more with the data, I kept running into edge cases that didn’t fit cleanly into existing labels.
Approach
I built a simple but flexible pipeline using MongoDB for storage and lightweight NLP models for classification. Instead of jumping into complex architectures, I focused on iteration speed and understanding errors.
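A pipeline like this could be sketched roughly as follows. This is an assumption-laden illustration, not the project's actual code: documents would normally come from MongoDB (e.g. via pymongo's `collection.find()`), but a plain list stands in here, and the field names, labels, and model choice are all hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in for documents fetched from MongoDB; in practice each dict
# would be a record returned by a pymongo query.
reports = [
    {"text": "wet floor near entrance", "label": "slip"},
    {"text": "floor slippery by the door", "label": "slip"},
    {"text": "smell of gas in basement", "label": "gas_leak"},
    {"text": "strong gas odour downstairs", "label": "gas_leak"},
]

texts = [r["text"] for r in reports]
labels = [r["label"] for r in reports]

# Character n-grams cope better with typos than word tokens alone,
# which matters for short, messy user-generated text.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(texts, labels)
print(model.predict(["gas smell in the basment"]))  # typo on purpose
```

A simple, transparent model like this keeps iteration fast and makes individual errors easy to trace back to features, which matches the stay-close-to-the-data approach described above.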
The goal was to stay close to the data and make improvements based on actual failure cases, not just metrics.
Iteration and debugging
A lot of progress came from grouping errors. I looked at which phrases consistently confused the model and tried to understand whether it was a data issue or a modeling issue.
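Grouping errors by (true, predicted) label pair is one simple way to surface the most frequent confusions first. A sketch with made-up records; the labels and texts are illustrative:

```python
from collections import defaultdict

def group_errors(records):
    """Group misclassified examples by (true, predicted) pair so the
    most frequent confusions surface first."""
    buckets = defaultdict(list)
    for text, true, pred in records:
        if true != pred:
            buckets[(true, pred)].append(text)
    return sorted(buckets.items(), key=lambda kv: -len(kv[1]))

# Hypothetical (text, true_label, predicted_label) triples.
records = [
    ("loose railing on stairs", "fall", "fall"),
    ("broken step", "fall", "trip"),
    ("cable across walkway", "trip", "trip"),
    ("uneven pavement slab", "trip", "fall"),
    ("cracked stair edge", "fall", "trip"),
]
for (true, pred), texts in group_errors(records):
    print(f"{true} -> {pred}: {len(texts)} cases")
```

Reading the texts inside the largest bucket is usually enough to tell whether the confusion stems from ambiguous labels (a data issue) or from features the model cannot capture (a modeling issue).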
Some augmentations didn’t work as expected because they didn’t match how users actually write. I had to adjust both the data and the labeling strategy as I went.
Evaluation
I paid more attention to per-class performance, especially recall for rare hazards. Missing those matters more than getting common cases right.
I also reviewed false positives manually, since too many of those could make the system less useful in practice.
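Both checks can be done in a few lines: per-class recall to catch missed rare hazards, and a filter to build a manual review queue of false positives. The labels and texts below are illustrative placeholders:

```python
from sklearn.metrics import recall_score

# Hypothetical true vs. predicted labels for six reports.
y_true = ["slip", "slip", "gas_leak", "gas_leak", "slip", "gas_leak"]
y_pred = ["slip", "slip", "gas_leak", "slip", "gas_leak", "gas_leak"]
texts = ["wet tiles", "mopped floor", "gas smell",
         "strange odour", "puddle", "hissing pipe"]

# Per-class recall: averaged metrics would hide the rare class.
labels = ["slip", "gas_leak"]
per_class = recall_score(y_true, y_pred, labels=labels, average=None)
print(dict(zip(labels, per_class)))

# Manual review queue: reports flagged as the rare hazard that weren't.
false_positives = [t for t, yt, yp in zip(texts, y_true, y_pred)
                   if yp == "gas_leak" and yt != "gas_leak"]
print(false_positives)  # ['puddle']
```

Keeping the false-positive list short is what makes the flagged reports worth a human's time, which is the usefulness concern raised above.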
Takeaways
This project showed me how important the data and the labeling process are. In many cases, improving the dataset gave better results than changing the model.
It also reinforced that real-world NLP is messy, and you have to design systems that can handle that.