ML In Startups: Some Observations
In my role at AI2 incubator as a technical advisor to entrepreneurs building AI-first companies, I often observe the challenge of building minimum viable products (MVPs) in which AI plays an important role. In this blog post I attempt to articulate and share a few key learnings. While the context is NLP-focused pre-seed startups with technical teams of three or fewer engineers and scientists, this post may be relevant to a broader audience as well. The TLDRs, which form the outline of this post, are:
- (Bootstrap Quickly) Take advantage of pre-trained models from spaCy, Hugging Face, OpenAI, etc. to quickly build the first model.
- (MAP) Understand the concept of minimum algorithmic performance. Resist the temptation to get the accuracy “just a bit higher” or to fix “just one more failed test case”.
- (Keep It Real) In measuring performance, the evaluation set that matters the most is the one based on product use. Instrument the product to collect data from product use as soon as possible.
- (Lower Expectation) If getting to MAP is uncertain, consider a lesser AI feature. Using the framework of five levels of autonomy in self-driving cars, aim for a level below.
- (Weak Is Good) Use Snorkel Flow as a data-centric, weak supervision-based NLP framework to scale up the models (or build something similar in-house if necessary). Do this in the medium term, once the product has seen some initial traction that generates a meaningful volume of data and user feedback.
Bootstrap Quickly
During the early days of building an AI-first product in a startup, a technical team’s mental roadmap typically consists of two phases: bootstrap and scaling. In the bootstrap phase, we need to build a model (such as a text categorizer) that works reasonably well (e.g. having sufficiently high accuracy). We need to do this with limited labeled data, under an aggressive deadline, and with limited resources (e.g. the team may consist of a single person who also functions as a full-stack engineer). In the scaling phase, as the product is used by more users, we can take advantage of users’ data to improve the model’s performance, creating a virtuous cycle and building a competitive advantage with a unique proprietary dataset (the data moat).
The past decade has seen tremendous progress in AI, particularly in speech, vision, and language. This rising tide has lifted many boats, startups included. Need a production-grade NLP toolkit? Use spaCy. A text categorizer in a hurry? Check out Hugging Face’s zero-shot models. A customer service product? Try OpenAI’s API. If your NLP-focused product needs common capabilities such as named entity recognition, text categorization, sentiment analysis, information extraction, or question answering, you could have the first versions of those features within the first week. In the second week, you can aim for an improved V2 with some light labeling, experimenting with different hypotheses (for natural language inference-based zero-shot categorizers) or prompts (for OpenAI’s API). Check out the Appendix for an example of a zero-shot categorizer we recently used.
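As a concrete illustration of that hypothesis experimentation, here is a minimal sketch using Hugging Face’s zero-shot classification pipeline; the example text, label sets, and hypothesis template are made up for illustration:

from transformers import pipeline

# Zero-shot classification pipeline backed by an MNLI model (see Appendix for details)
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = "Let's huddle to discuss next quarter's deliverables."

# Try different phrasings of the candidate labels (hypotheses);
# small wording changes can move the scores noticeably.
for labels in (["meeting request", "status update"],
               ["scheduling", "reporting"]):
    result = classifier(text, candidate_labels=labels,
                        hypothesis_template="This message is a {}.")
    print(result["labels"], [round(s, 3) for s in result["scores"]])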
What should we then focus on from the third week and beyond? To answer this question, we need to discuss minimum algorithmic performance (MAP), a term coined by the folks at Zetta Ventures.
Minimum Algorithmic Performance
As we zoom into the bootstrap phase, a natural question to ask is what constitutes a “model that works reasonably well”, i.e. a model that satisfies some minimum performance requirement. In searching the Web with the term minimum viable AI products, I’ve come across a handful of articles. The majority of them are not relevant, discussing the adaptation of Eric Ries’ lean startup principles to avoid common, expensive failures in enterprise AI projects, where a team of 10 engineers/scientists may be involved. The two articles that are relevant for startups are the Zetta Ventures piece mentioned above and one by Davenport & Seseri.
The former introduced the term minimum algorithmic performance (MAP), while the latter used the term minimum viable performance (MVP) to define essentially the same thing: given the target task, how well does the product have to perform in order to be useful. Davenport & Seseri wrote:
The minimum bar is problem specific rather than a simple accuracy number. In some applications, being 80% successful on Day Zero might represent a large and valuable improvement in productivity or cost savings. But in other applications, 80% on Day Zero might be entirely inadequate, such as for a speech recognition system. A good standard may be to simply ask, “How can a minimum viable AI product improve upon the status quo?”
The idea of aiming for MAP (the term I use in this post) is thus self-explanatory. However, knowing whether an AI product has achieved MAP is not always easy. One notable exception is WellSaid Labs. When Michael Petrochuk, CTO and co-founder of WellSaid Labs, first showed us his synthetic voice demo, we were blown away by its life-like quality, a marked improvement over what was available on the market. We knew Michael was on to something (and more importantly, WellSaid Labs’ customers agreed).
More often than not, the answer to whether a product has reached MAP is less clear. In NLP-focused startups, a common goal is to glean information from textual data using techniques such as information extraction, entity linking, categorization, sentiment analysis, etc. It’s about extracting structure and insight from unstructured textual data. While these NLP tasks have commonly accepted and clear metrics, the challenge for the AI practitioner at an early stage startup is not only in determining the performance threshold, the proverbial 80%, but also which evaluation dataset to use for performance measurement.
Keep It Real: Measure With Data From Real Product Use
The best way to measure the performance of an AI feature is to use data from real-world use of the product. For a pre-seed startup, however, the product may only be used by a small number of early adopters or pilot customers. As a consequence, we sometimes rely on proxy datasets that are indicative of the true performance to varying degrees. A named entity recognizer trained on a news article corpus may not perform well on Slack messages. A text categorizer trained on the Enron email corpus and applied to today’s tech company emails may have issues as well, given the difference between the industries (energy vs tech) and the eras (Boomers vs Gen Z). Transfer learning is just fine and dandy, isn’t it?
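To make the gap concrete, here is a toy sketch of measuring the same (deliberately naive) classifier on a proxy test set and on a small hand-checked sample of product data; the keyword model and all of the data below are invented for illustration:

from sklearn.metrics import classification_report

def keyword_model(text):
    # Toy stand-in for a real classifier: flags meeting requests by keyword
    keywords = ("meet", "schedule", "coffee")
    return "meeting" if any(w in text.lower() for w in keywords) else "other"

def evaluate(name, texts, labels):
    preds = [keyword_model(t) for t in texts]
    print(f"--- {name} ---")
    print(classification_report(labels, preds, digits=3, zero_division=0))

# Hypothetical proxy data (public-corpus style) vs. product data (Slack style).
# The gap between the two reports tells you how much to trust the proxy number.
proxy_texts = ["The board will meet on Tuesday.", "Quarterly earnings rose 3%."]
proxy_labels = ["meeting", "other"]
product_texts = ["wanna sync re: launch?", "lgtm, shipping it"]
product_labels = ["meeting", "other"]

evaluate("proxy test set", proxy_texts, proxy_labels)
evaluate("product-usage sample", product_texts, product_labels)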
Without a strong understanding of how good a given proxy evaluation dataset is, startups should carefully consider whether to spend cycles on improving the performance of an AI feature. The danger here is premature optimization. While we engineers know it is “the root of all evil”, we may still feel tempted. I did. For folks with research backgrounds, we often seek comfort in the ritual of setting up train/test (or even train/dev/test) data splits and experimenting with different models to climb the performance curve. We do this without fully internalizing that we may be using a poor proxy dataset. For folks with engineering backgrounds, we enjoy writing unit tests and making sure they are all green (pass), and thus may be tempted to fix “just one more failed test case”.
Why is sinking large efforts into these local minima evil? In startups, our understanding of what the customer wants can change quickly. It’s wasteful for a young company to fine-tune models that do not turn out to solve the customer’s problems. In the early days of customer discovery, consider taking the MAP concept to another level by doing just enough feasibility work to assess whether a given AI feature can be built with the current state of the art, without actually getting to MAP. This is the AI equivalent of wireframes/mockups: we know we can build the real thing, it’s just not time yet. It’s OK to cherry-pick examples for customer demos and investor pitches.
Lower Expectation: When Getting to MAP is Uncertain
During customer discovery, entrepreneurs may home in on an AI problem that matters a lot to the customer, yet getting the AI to the minimum performance level is uncertain. We may get 80% of the way there but would need product-generated data to complete the last 20%. This is a tricky catch-22 that tends to occur when there’s a large difference between public data (e.g. news articles, Wikipedia pages, tweets) and problem-specific data (e.g. Slack messages, emails, meeting transcripts, HIPAA-compliant health data). The proxies are poor. There are two main ideas for dealing with such situations:
- Have humans in the loop to validate cases where the AI has low confidence (a minimal sketch follows this list)
- Aim a bit lower, following best UX design practices from human-AI interaction (HAI) to work around the MAP gap
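The first idea can start as a simple confidence threshold that routes uncertain predictions to a human review queue. Below is a rough sketch; the threshold value and the queue abstraction are purely illustrative:

from dataclasses import dataclass, field

REVIEW_THRESHOLD = 0.80  # illustrative; tune against your own precision/recall needs

@dataclass
class ReviewQueue:
    # Minimal stand-in for wherever low-confidence items go (a UI inbox, a task system, etc.)
    items: list = field(default_factory=list)

    def add(self, text, label, score):
        self.items.append((text, label, score))

def route(text, label, score, queue):
    # Auto-accept confident predictions; defer the rest to a human reviewer
    if score >= REVIEW_THRESHOLD:
        return label
    queue.add(text, label, score)
    return None

queue = ReviewQueue()
print(route("Let's grab coffee next week", "meeting request", 0.95, queue))  # auto-accepted
print(route("Thanks for the update", "meeting request", 0.55, queue))        # queued for review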
The second idea may sound familiar to those exposed to the concept of the five levels of autonomy in self-driving cars (see, for example, this post for an overview). Full autonomy is hard, but we can provide Day Zero value with features such as parking/lane assist, automated braking, etc. Similarly, an in-training virtual assistant may not be very good initially at anticipating the user’s needs, but would still be helpful if provided with a hint/prompt.
Great recommendations are hard, but search is easier. Let the user drive/pull if the AI is not good enough yet to push. Graduate to the next level when sufficient data is captured to effectively train the AI. Designed and built properly, an HAI system will double as the golden picks and shovels that mine data labels toward a powerful data flywheel/moat.
There is another reason why HAI is a really important idea in certain scenarios: privacy. Nearly everyone nowadays knows the GDPR acronym and routinely accepts/rejects cookies on the Web, dozens of times a day. It is unacceptable to ship health data to Mechanical Turk workers, or to give your data science team full access to users’ emails. An interaction design that allows the user to give the system labeled data in a zero-friction way on a regular basis can, in the long run, help the AI get to the next level.
Weak Supervision & Snorkel Flow
You’ve gotten this far in the post? Congratulations on getting your product into the hands of early adopters, and kudos for instrumenting the product to capture user feedback seamlessly. It’s now time to figure out how to get the data flywheel going.
The concept of a data flywheel is straightforward: with more data generated by product use, AI will get better. Google, for example, has had a huge head start against Bing in collecting click data in the battle of search relevance. Search engine click data is what ideal product-generated data looks like from the perspective of an AI practitioner: it is plentiful and generated by the user using the product with zero friction. Click data is however not quite hand-labeled data in the usual sense: the user may click through to a link, realize it’s not what they want, backtrack, and try again. The supervisory signal is weak, which means we may need to use weakly supervised learning (WSL) techniques.
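As a toy illustration of how weak such a signal is, here is a sketch of turning click logs into noisy relevance labels; the log format and the dwell-time heuristic are invented for the example:

# Hypothetical click-log records: (query, clicked_doc, dwell_seconds, bounced)
click_log = [
    ("nlp startup tooling", "doc_17", 95, False),
    ("nlp startup tooling", "doc_03", 4, True),   # quick bounce: likely not relevant
    ("zero shot classifier", "doc_41", 60, False),
]

def weak_relevance_label(dwell_seconds, bounced):
    # Heuristic weak label: a long dwell without a bounce counts as relevant.
    # Individually noisy, but useful in aggregate for weakly supervised training.
    if bounced or dwell_seconds < 10:
        return 0  # probably not relevant
    return 1      # probably relevant

weak_labels = [(q, d, weak_relevance_label(t, b)) for q, d, t, b in click_log]
print(weak_labels)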
In a recent AI2 incubator newsletter, I briefly covered the trend of weaning ourselves from relying on large amounts of hand-labeled data, highlighting techniques such as semi-supervised learning, self-training, data augmentation, regularization, etc. If hand-labeled data is processed meat, then data that supplies weak supervision is the organic, good stuff that can be cooked with WSL. One of the incubator’s graduates, Lexion, uses WSL extensively. Lexion’s co-founder and CTO, Emad Elwany, recently shared:
Weak Supervision has enabled us to create massive high-quality training sets for hundreds of different concepts with a small team.
Our investment in NLU tooling for weak supervision, model training, and deployment would also make it viable for us to build training sets and develop models targeted at other languages at a low cost.
Lexion built its WSL tooling in-house, because Emad is a star. For other startups, I recommend taking a look at Snorkel.ai, perhaps the best-known proponent of WSL in the AI tooling space. It just reached unicorn status, a pretty solid indication that it is rapidly gaining traction in the enterprise. A core WSL innovation from Snorkel is combining noisy labeling functions (e.g. heuristics) with a generative model that learns their accuracies. See this post for an overview.
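To give a feel for the workflow, here is a small sketch using the open-source snorkel library (Snorkel Flow is the commercial platform built around the same core idea); the labeling functions and toy data are made up for illustration:

import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, OTHER, MEETING = -1, 0, 1

# Labeling functions: cheap, noisy heuristics that vote on a label or abstain
@labeling_function()
def lf_meeting_keywords(x):
    keywords = ("meet", "schedule", "coffee", "sync")
    return MEETING if any(w in x.text.lower() for w in keywords) else ABSTAIN

@labeling_function()
def lf_when_question(x):
    return MEETING if "?" in x.text and "when" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_newsletter(x):
    return OTHER if "unsubscribe" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": [
    "Can we meet Thursday to review the deck?",
    "When works for a quick sync?",
    "Click here to unsubscribe from this newsletter.",
]})

lfs = [lf_meeting_keywords, lf_when_question, lf_newsletter]
L_train = PandasLFApplier(lfs=lfs).apply(df=df_train)

# The label model learns the accuracies/correlations of the labeling functions
# and combines their noisy votes into probabilistic training labels
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=200, seed=123)
print(label_model.predict(L=L_train))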
Weak is good.
Closing Thoughts
Building AI products is hard, and it is harder still in a startup environment where our understanding of the customer is a moving target. What I’ve found to work well is to strive for maximum agility so we can adapt as that understanding evolves. For example, too many hand labels and too much model fine-tuning introduce AI debt if those labels and fine-tuning efforts need significant rework during customer discovery. Weak supervision, especially with Snorkel's labeling functions, is an important tool, since the approach relies on large quantities of noisy labels that can be obtained (or discarded) quickly.
Finally, getting to the right MAP for the right AI feature requires threading a needle through a maze of candidates. Navigating this requires a strong team with both AI and product/business experience. The product lead and the AI lead need to be in close communication to iterate and find that sweet-spot beachhead. As the representative of the customer, the product lead can sometimes have inflated expectations of what AI can do, given all the hype. The AI lead should provide the counterbalance to shorten the trough of disillusionment as much as possible.
Appendix
In the section on Bootstrap Quickly, I briefly mentioned new NLP capabilities from Hugging Face and OpenAI. The latter has received lots of attention, so here I share a bit of my experience with Hugging Face’s zero-shot models.
A good starting point for practical zero-shot learning is a blog post, dated May 29, 2020, by Hugging Face engineer Joe Davison. For text categorization, the key technique, proposed by Yin et al. (2019), uses a pre-trained MNLI (multi-genre natural language inference) sequence-pair classifier as an out-of-the-box zero-shot text classifier. The idea is to take the sequence we're interested in categorizing as the "premise" and to turn each candidate category into a "hypothesis." If the NLI model predicts that the premise "entails" the hypothesis, we take the label to be true.
Let’s say we want to build a prototype of a text categorizer that detects whether a given email implies the need to schedule a meeting. Examples of texts indicative of such a need include “let’s huddle to discuss next quarter’s deliverables”, “I’d love to grab coffee next week to catch up”, etc. A potential hypothesis is “meeting request”. Using Hugging Face’s transformers library, we can build such a categorizer with a few lines of code:
from transformers import BartForSequenceClassification, BartTokenizer

# Load BART-large fine-tuned on MNLI
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli')
model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')

# The text to categorize is the premise; the candidate category is the hypothesis
premise = 'I’d love to grab coffee sometime next week'
hypothesis = 'meeting request'

input_ids = tokenizer.encode(premise, hypothesis, return_tensors='pt')
logits = model(input_ids)[0]

# Drop the "neutral" logit and renormalize over contradiction (index 0) vs entailment (index 2)
entail_contradiction_logits = logits[:,[0,2]]
probs = entail_contradiction_logits.softmax(dim=1)
true_prob = probs[:,1].item() * 100
print(f'Probability that the label is true: {true_prob:0.2f}%')
The model used above is BART-large fine-tuned on the MNLI corpus, but other MNLI models can also be used, offering different accuracy vs inference cost tradeoffs. To get a (limited) sense of the accuracy of this approach, I ran this categorizer on about 364,000 sentences from the Enron email corpus, resulting in 262 sentences scoring 0.99 or higher. Below is a random sample of 12 sentences from those 262.
- (0.991): High Dan, Are you available to meet with Tracy and me tomorrow, August 7, for about 30 minutes between 11:30 and 2, in Tracy's office?
- (0.994): Let's sit down on Thursday am and talk.
- (0.995): Letters A- K 2pm Session Letters L-Z PLEASE RSVP (BY REPLYING TO THIS E-MAIL)
- (0.992): Mark -- If possible, I'd like to schedule some time with you (10 min) to discuss approval of a certificates transaction that involves ECE and Elektro.
- (0.991): Folks: Can we get together at 3 PM in ECS to talk about the D5A.
- (0.993): Let's get together on Tuesday.
- (0.991): please get with me tomorrow to discuss.
- (0.997): 06752 Greg Whalley has requested a Trader's Meeting tomorrow morning @ 7:45 a.m. (CST).
- (0.993): I have asked that ERCOT matters get 1 hour to discuss the necessary contracts that UBS will need to sign to participate in the ERCOT market.
- (0.996): > Let's get together.
- (0.996): Hi Eric, Would you and Shanna like to meet us for dinner at McCormick and Schmidts tomorrow night?
- (0.992): I wanted to see if you would be interested in meeting me for lunch or a drink afterwork!
All of the above 12 sentences contain a request to meet, with fairly diverse surface form variations. Limited sample size aside, this is pretty good precision, and pretty magical.
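For completeness, here is a rough sketch of what such a batch filtering run might look like; the two sentences below stand in for the ~364,000 extracted from the corpus, and the 0.99 threshold matches the run described above:

import torch
from transformers import BartForSequenceClassification, BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli')
model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')

def meeting_request_score(sentence):
    # Entailment probability for the hypothesis 'meeting request', as in the snippet above
    input_ids = tokenizer.encode(sentence, 'meeting request', return_tensors='pt')
    with torch.no_grad():
        logits = model(input_ids)[0]
    probs = logits[:, [0, 2]].softmax(dim=1)  # contradiction vs entailment
    return probs[:, 1].item()

# Placeholder sentences; in the actual run these came from the Enron email corpus
sentences = ["Let's get together on Tuesday.", "The report is attached."]
high_confidence = []
for sentence in sentences:
    score = meeting_request_score(sentence)
    if score >= 0.99:
        high_confidence.append((sentence, score))
print(high_confidence)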
Hugging Face built a Streamlit-based demo of a ZSL text categorizer here: https://huggingface.co/zero-shot/. In addition to BART, this demo also includes a model further fine-tuned on the Yahoo! Answers dataset, as well as the cross-lingual XLM-RoBERTa model.