Applied research failure modes
Applied research teams slow down not because they lack good ideas but because of simpler issues: unclear metrics, confusing experiment design, poor iteration loops, and vague takeaways. I've been working on applied research in some form for my entire career so far. These are notes for early-career quants, ML researchers, and other applied computer scientists trying to improve their research process inside an organization.
Continuously estimate value
Before attempting an idea, estimate both the probability of it working and the upside if it does. Be wary of attempting high risk, low return work, even if it sounds cool. One important tool here is to upper bound the value of the project in some way if possible ("What if I had an oracle for X? What would happen? Is it even worth predicting X?").
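One way to make the oracle check concrete: score your current baseline against a hypothetical perfect predictor and look at the gap. A minimal sketch (the function and variable names here are mine, not from the post):

```python
import numpy as np

def oracle_upper_bound(y_true, baseline_pred, metric):
    """Score a baseline and a perfect 'oracle' that knows y_true.

    The gap between the two scores upper-bounds how much predicting
    this target could ever be worth on this metric.
    """
    return metric(y_true, baseline_pred), metric(y_true, y_true)

# Illustrative setup: a noisy target and a naive "predict zero" baseline.
rng = np.random.default_rng(0)
y = rng.normal(size=1000)
mse = lambda a, b: float(np.mean((a - b) ** 2))
base_score, oracle_score = oracle_upper_bound(y, np.zeros_like(y), mse)
```

If even the oracle barely moves the metric you care about, predicting X isn't worth a project.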
Obviously you have to do some cool stuff too, but there's little worse than working on a hard project for a long time, succeeding, and realizing that no one cares.
Do not make every project high risk
There is one thing worse. Research is almost by definition high risk. If you only choose high risk, high return projects, and you continuously get burned (with a 5% hit rate you will fail on average 19 projects before you get a hit), you are likely to burn out. Some people are able to deal with this, but most people cannot.
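The failure-count arithmetic comes from the geometric distribution: with per-project success probability p, the expected number of failures before the first hit is (1 - p) / p. A quick check:

```python
def expected_failures(p: float) -> float:
    # Failures before the first success follow a geometric
    # distribution with mean (1 - p) / p.
    return (1 - p) / p

def dry_streak_prob(p: float, n: int) -> float:
    # Probability of n consecutive failures.
    return (1 - p) ** n

avg_failures = expected_failures(0.05)      # 19.0
twenty_misses = dry_streak_prob(0.05, 20)   # ~0.36: even 20 straight
                                            # failures are unremarkable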
Keep some low risk, low reward tasks around to keep your momentum up. I like to keep some optimization tasks in my queue for this reason, since it creates visible progress even when we're stuck as a team on the main research metrics.
Aggressively chunk your work
Research is full of ideas that feel interlocked. Figuring out how to split these ideas into separate chunks that can be parallelized or landed independently is hugely valuable. This is traditionally the job of a project manager, but researchers often need to be their own project manager.
Here's an example: let's say you think that incorporating some new training dataset will improve your model. Ingesting, cleaning, and organizing that dataset is a useful chunk that could be repurposed even if your model doesn't improve on the goal task!
Design convincing experiments
In applied research, you know your audience: it's your peers and internal decision makers. You can ask them questions before you do work. Examples:
- Before running an experiment, check whether other people would draw the same conclusion from the hypothetical results. If they wouldn't be convinced, design a different experiment.
- Pre-register their hypotheses if possible. Treat any surprise relative to that prior as a useful takeaway.
Agree on metrics
It's very frustrating to realize that you're working on metric A, but your coworker only cares about metric B. Agree on what you're measuring, and how. It's easiest if everyone uses the same metric calculation code. Reality is rarely simple enough for one metric, but running an experiment and cherry-picking whichever metric improved will meet a lot of skepticism.
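In practice that usually means one shared metric module that everyone imports, rather than per-notebook re-implementations. A sketch with an illustrative quant metric; the definitional details (ddof, annualization factor) are exactly the things to agree on up front:

```python
import numpy as np

# metrics.py - the single agreed-upon implementation everyone imports.
def sharpe_ratio(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio; population std (ddof=0) by team convention."""
    r = np.asarray(daily_returns, dtype=float)
    return float(np.sqrt(periods_per_year) * r.mean() / r.std())

score = sharpe_ratio([0.01, -0.01, 0.02, 0.0])
```

When two people report a Sharpe, they should mean the same function call.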
Build a reputation for calibrated and justified claims
It's very frustrating to realize that a core assumption you'd been making is based on a claim someone else made that is just inaccurate. This can seriously derail research projects. In applied research, precision is a form of respect for other people’s time.
Here's an example: if you say features X and Y are correlated, and you drew that conclusion from some dataset D, state it precisely: "I think X and Y are correlated when looking at dataset D". D might be data from the year 2024, which doesn't generalize to 2025. Maintain high skepticism of generalization. Narrow claims are not weaker; they are easier to trust.
Negative results can be useful, but false negatives are worse than false positives
Depending on what you work on, your true hit rate on projects might be less than, say, 10%. Record the negative outcomes so that you can build intuition, and, where possible, communicate that intuition to others.
A false positive usually burns a bounded amount of time: someone tries to productionize the idea, the live or more realistic evaluation often catches it (in a good org), people are annoyed, and you might lose some credibility.
A false negative can be more expensive because it can remove an idea from the team’s search space. Paradoxically, the more influential or trusted you are, the longer it will be before someone attempts it again. This isn't necessarily a disaster, but keep this in mind when you run a "quick test" of an idea. If you don't give it a serious attempt, don't overclaim that the whole set of ideas doesn't work.
A common example is "I tried that feature before, it doesn't do anything". Be specific! What did you try?
Look at your data, not just summary statistics
Many research mistakes are hidden by aggregation, especially in machine learning.
Look at your features. Look at the response. Plot some things. Look at a specific example that is being passed through your network, and look at embeddings. Look at the loss.
Examples I've caught before: two hand-engineered features were computed by two different functions but were actually identical. Another feature stopped being available after a certain date and was silently forward-filled for years.
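Both of those bugs are cheap to screen for mechanically. A sketch with made-up column names (`feat_b` secretly duplicates `feat_a`; `feat_c` is stuck on a stale, forward-filled value):

```python
import numpy as np
import pandas as pd

# Hypothetical feature matrix; column names are made up for illustration.
df = pd.DataFrame({
    "feat_a": [1.0, 2.0, 3.0, 4.0],
    "feat_b": [1.0, 2.0, 3.0, 4.0],   # identical to feat_a
    "feat_c": [0.5, 0.5, 0.5, 0.5],   # stuck on a forward-filled value
})

# Duplicated features: off-diagonal correlations of exactly 1.0 are suspicious.
corr = df.corr()
dupes = [(a, b) for a in df for b in df
         if a < b and np.isclose(corr.loc[a, b], 1.0)]

# Stale features: zero variance (or, on real data, long runs of repeats).
stale = [c for c in df if df[c].nunique() <= 1]
```

Checks like these don't replace actually looking at the data, but they make the easy mistakes hard to ship.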
Know if you are compute-bound or thinking-bound
All of my work so far is in machine learning or quant finance. Running experiments is automated, but can sometimes have a long runtime. Are you compute-bound (not enough GPUs?) or thinking/analysis bound (can't set up experiments quickly enough)? Tooling that automates experiment setup, babysitting, and analysis can move you from thinking-bound to compute-bound surprisingly quickly.
If you're compute-bound, take note of when resources are free. Keep a notebook (or set of git branches) ready so that you can launch an experiment quickly. Even in resource-constrained organizations there's a lot of wasted compute, e.g. overnight or over the weekend. Good queue systems help a lot. Make your work easily preemptible and resumable, and robust to being restarted on another machine.
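A minimal checkpoint-and-resume pattern looks something like this (the file path and the "work" loop are placeholders, not from the post):

```python
import json
import os
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "experiment_ckpt.json")

def load_state():
    # Resume from the last checkpoint if one exists.
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"step": 0, "partial_sum": 0.0}

def save_state(state):
    # Write atomically so a preemption mid-write can't corrupt the file.
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CKPT)

state = load_state()
for step in range(state["step"], 10):
    state["partial_sum"] += step        # stand-in for real work
    state["step"] = step + 1
    save_state(state)                   # safe to kill at any point
```

A job written this way can be killed at any moment and restarted anywhere the checkpoint is visible, which is what makes opportunistic overnight compute actually usable.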
Decrease iteration time
If your experiments need to convert a lot of data from disk into an in-memory format to start, can you decrease that? Can you use faster disks or parallelize harder? Can you get good at generalizing from small changes (small dataset, small model) to the "full" set of changes?
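One common fix for the disk-to-memory cost is to cache the converted format so only the first run pays for it. A sketch (the paths and the "expensive" parsing step are illustrative):

```python
import os
import tempfile

import numpy as np

CACHE = os.path.join(tempfile.gettempdir(), "features_cache.npy")

def expensive_parse():
    # Stand-in for slowly parsing raw files from disk.
    return np.arange(1_000_000, dtype=np.float32).reshape(1000, 1000)

def load_features():
    if os.path.exists(CACHE):
        # Near-instant on repeat runs; memory-mapped, so pages load lazily.
        return np.load(CACHE, mmap_mode="r")
    feats = expensive_parse()           # slow path, runs once
    np.save(CACHE, feats)
    return feats

x = load_features()
```

The same idea applies to small-scale proxies: keep a cached small dataset and small model config around so a "does this even run" check takes seconds, not hours.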
There's a serious slowness introduced into an organization when a basic experiment takes longer than a few hours (or god forbid a day).
Standardize your plots and evals
Making charts via custom notebooks is slow, surprisingly error prone, and (most critically) forces your reviewer to do extra work understanding what you're showing. Often they'll request another evaluation or plot. Better to agree on these ahead of time and just run them every time on every experiment. I like papermill.
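papermill covers the notebook-execution side; independent of that, a single standardized report function run on every experiment keeps reviewers on familiar ground. A sketch (the metric choices here are illustrative, not the post's):

```python
import numpy as np

def standard_report(y_true, y_pred):
    """The fixed evaluation run identically on every experiment,
    so reviewers always see the same numbers in the same format."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    return {
        "mse": float(np.mean(err ** 2)),
        "mae": float(np.mean(np.abs(err))),
        "corr": float(np.corrcoef(y_true, y_pred)[0, 1]),
    }

report = standard_report([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])
```

Because every experiment emits the same report, comparing two runs is a diff of two dicts rather than a request for one more custom notebook.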