Leverage AI to Supercharge Your B2B Data
by Clay Turner, Co-Founder
There is a lot of talk about how AI can offer new insights by interpreting existing “internal” data. But what about finding and appending data that exists “externally”? Or, further still, what about interpreting external data in all of its - often ugly - forms to “reason” into being, by inference, an insight that simply didn’t exist before? These are two ways that AI can now supercharge your B2B data.
Think about this for just a moment... When you do research, you have an initial set of questions and assumptions that you want to test. The first assumption is typically that the data must be out there “somewhere” and in “some” measure. It’s at least a high hope, right? If you find existing data and believe it to be credible and affordable, great! You capture it, append it, often aggregate it, and act on it accordingly. If not, then you’re left to infer insight from what does exist… You form better questions and assumptions as you iterate through your research. When you arrive at a defensible position - though usually imperfect - you act on it. And why? Because “perfection is the enemy of progress.” It is your ability to reason and to make value judgments that can create actionable insights where orthodox data doesn’t already exist. Put differently, you are able to reduce the complexity of the problem iteratively until you arrive at an answer that makes sense enough to act on. Then you judge the outcome and adjust accordingly.
We must now recognize that we are no longer alone in our ability to reason our way to a defensible and actionable position - or even alone in our ability to act on it. The models are now able to formulate and offer educated opinions when asked properly and given the right context - even opinions about your company’s reputation and brand relative to competitors’. Furthermore, these models can give you their inferred insights in the form of a structured data append, according to a schema that you define, along with a rational justification for the inference. There is a cost to using AI models for data appends in these ways that we’ll explore a bit, but the bigger question that each business will have to consider for itself is AI’s price relative to that of a full-time employee, and the revenue potential when AI finds a high-value contract that might never have been found (fast enough) otherwise. Whether for customer discovery, an engaging content strategy, or outbound prospecting, AI can enrich your data with compelling insights that might otherwise require a team of research analysts to complete.
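To make this concrete, here is a minimal sketch of a schema-bound append using OpenAI’s Python SDK (v1.x) as one example. The schema fields, the model string, and the prompt wording are illustrative assumptions, not a prescription:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical schema for the append; define whatever fields fit your dataset.
SCHEMA_HINT = """Respond with JSON only, using exactly these keys:
"estimated_employee_range": one of "1-10", "11-50", "51-200", "201-1000", "1000+"
"justification": one or two sentences explaining your reasoning
"confidence_1_to_5": an integer from 1 (low) to 5 (high)"""

def infer_append(company_name: str, context: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4-turbo-preview",  # a current model string at the time of writing
        response_format={"type": "json_object"},  # JSON mode: parseable output
        messages=[
            {"role": "system", "content": "You are a careful B2B research analyst. " + SCHEMA_HINT},
            {"role": "user", "content": f"Company: {company_name}\nWhat we know:\n{context}"},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```

The returned dictionary can be appended straight onto the matching record, with the justification and confidence stored alongside it for auditing.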
So can these models actually reason?
Any philosophical implication of the question is simply outside the scope of this post. I’m not going to touch that one with a ten-foot pole. Likewise, it’s simply not helpful in a business context to debate the mechanics of what’s happening under the hood of a machine in contrast to what’s happening between our two ears in a remarkable and irreplaceable brain. In a business context, the best question to debate is: “Can we use it, and use it responsibly?” To this end, the answer is, and simply must be, “Yes.” If - when applied to machines - a verb like “to reason” is uncomfortable, then let’s just consider it “artificial” reasoning, at best, that amplifies our own ability to get things done. These Large Language Models (LLMs) are trained on the expanse of human knowledge, and that training has given them an ability to evaluate conditions beyond their training data and arrive at a defensible position when asked the right question, in the right way, and with the right context… And “yes,” they do seem to favor a polite and formal tone when asked.
The right question, in the right way, and with the right context?
If you are asked a question, but the question is not well formulated, or lacks context, then you are apt to answer the question incorrectly. The same is true with an LLM like the “GPT-4 Turbo” model that now powers ChatGPT - among an increasing number of other things. If a poorly formed question is restated in a better manner, then you might provide a better follow-up answer. Likewise, if you are given more context, then you might offer a better response still. This also applies to conjecture, and that’s important to understand. I often hear the argument, “Well, ChatGPT is limited to past events, right? How can it give me relevant data today?” That is to think of an AI model as a tape recorder or a phonograph - 19th- and 20th-century innovations. That’s simply not what we’re dealing with now. Here’s what I mean… If you ask me an opinion about a current or future event, I can offer a valid opinion based on my past experience - even if I have no direct experience with the event itself. I can hypothesize. I might be wrong, but I can formulate an opinion nonetheless and provide some justification for my thinking. Given a bit more context, I might refine my conjecture and its justification accordingly to offer a better insight. The same is true for an AI model. Its predictive capabilities are present in its generative capacity as well. You can’t ask a tape recorder or a phonograph for conjecture. You can ask AI, and increasingly you’ll receive a highly reasonable response.
So what does all of that have to do with supercharging my B2B data?
Well, a lot. Again, think about how data is often collected. You usually start with some existing data that you want to “enrich.” Enriching B2B data with outside insight is most common in marketing, both inbound and outbound - outbound for obvious reasons of segmentation, but also inbound for nurturing captured leads. “Data Enrichment” shouldn’t be confused with “Data Augmentation,” which usually applies to Machine Learning, i.e., training models on more diverse datasets. For the most part, “enriching data” has meant obtaining data from somewhere else and appending it to our B2B datasets, perhaps in a CRM or another analytics tool. To this end, AI can (a) provide structured data for an append from its existing, pretrained knowledge; (b) help you identify sources of existing, structured data that live elsewhere, based on its pretrained awareness of such sources; and (c) parse unstructured content it evaluates into structured data you can append to your B2B data - i.e., extracting content from a web page or a PDF and giving you JSON or CSV as output, not just a natural-language “chat” response. You see some of this ability now in most of the consumer-facing models: OpenAI’s ChatGPT, Anthropic’s Claude, and Google’s Bard with Gemini can accept uploaded documents and perform “limited” web searches (web access) to evaluate and parse data from pages covering current events. When I say “parsing” data from content external to the model, it’s important to explain what I really mean… This is to say that (1) the answer already exists in the external content and just needs to be extracted, and (2) that extraction can be provided in an expected, structured format for easy appending to an existing dataset. A minimal sketch of this follows below.
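Here is a rough sketch of scenario (c) - extraction, not inference - assuming the requests, beautifulsoup4, and openai packages. The field names and model string are illustrative assumptions:

```python
import json
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()

def extract_fields(url: str) -> dict:
    html = requests.get(url, timeout=30).text
    # Reduce the page to visible text, truncated to stay under the context limit.
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)[:12000]
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # plain extraction rarely needs the priciest model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Extract only data that appears verbatim in the text. Respond in "
                "JSON with keys: headquarters_city, founding_year, industry. "
                "Use null for anything the text does not state."
            )},
            {"role": "user", "content": text},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```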
Using AI to extract empirical data from a source does rely on its “reasoning” faculties, because a direct mapping isn’t already established the way it is in traditional ETL (extract, transform, and load) routines or “scraping” methods. Can this reasoning go further? What if empirical data do “not” exist in source content for direct identification and extraction? What if an educated guess has to be made by evaluating elements across research artifacts? In other words, can an insight be inferred if not directly identified? AI can do this too, and quite impressively in fact - particularly, perhaps, with GPT-4 Turbo and Google Gemini at the time of writing. AI can do this because it is trained on so much existing knowledge. It can provide conjecture, and do so in metric form. These “inferred” insights can just as easily be provided as a structured data append in a format of your choosing.
AI Data Enrichment Scenarios
There are probably many more scientific ways to think about this, but it was helpful for me to look at it this way:
- When Answers Exist Already
- In The Model’s Training Dataset:
- This can be helpful for getting quick answers that the model has been trained on natively from public sources as of its training date. Usually, this data exists and can be obtained elsewhere, and “yes,” this scenario lends itself better to data that is not time-sensitive. Where the data might already exist elsewhere, the value of a model is that you can ask it to consider other data it’s been trained on to produce new, composite data lists that perhaps did not exist before. And, there is the convenience factor to consider.
- I’ve used this primarily for creating good, canonical lists or enumerations that help me to standardize other data that I want to query. AI is good at recommending classifications, provided that you describe well how you intend to use them. (A minimal sketch of this pattern appears after this list.)
- In External Sources:
- The sky’s the limit with this one, but it requires more work - more coding; coding that AI can help you do, too. The idea here is that the data exists in a document, a web page, a series of pages, etc. You just need to extract it. You know this, and while you could extract it yourself, you don’t want to. You’d rather be busy doing something else productive. Since models cannot consume “all” of the contents of these documents or web pages at one time, you’ll need a process that reduces them into the bite-sized chunks that AI can consume. You probably want to automate this collection and reduction of documents, web pages, etc. In essence, you have a workflow that starts with one or more artifacts and ends with the data you want. In this process, you might automate web searches, content discovery (i.e., finding and summarizing ratings and reviews), reading and capturing data in PDFs and white papers, and so on. This usually involves multiple steps and multiple calls to AI models. The right prompt is everything. In its simplest expression, though far from precise: “I’m going to give you ‘this’ and I want you to give me back ‘that’.” Again, here, the data is empirical. It exists in a hard form. You’re just collecting it, extracting it, and appending it. (A chunked-extraction sketch appears after this list.)
- When Answers Must Be Inferred Across Research Artifacts
- In The Model’s Training Dataset:
- Here it is important to ask for a “justification” in your prompt. If you ask AI to give you an opinion based on knowledge that it already has, then you need to understand how it rationalized the inferred insight. It can also be helpful to ask it for a confidence level, i.e., on a scale of 1 to 5. (The inference sketch after this list asks for both.) If you believe its justification to be rational, then perhaps you’ve already arrived at an assumption that can be appended to your B2B data and tested in practice.
- In External Sources:
- Again, here there should be no limit to the imagination. The process is very similar to finding and parsing empirical data from existing sources. The difference is that you are giving the model “food for thought.” You supply relevant, supporting research that you collect in your workflow - a workflow that, again, usually involves multiple steps and other calls to AI models that ultimately arrive at a good context. Here, you define the evaluation criteria, carefully craft a prompt with the right instructions, and ask AI to give you its informed opinion. It bases that opinion on the information you provide in context, but it intuits and justifies a response based on its knowledge of human history and relevant training data. Again, capturing this justification is important. It helps you to (a) refine your prompt to get a better response, and (b) know when you’ve arrived at the same kind of conclusion you might come to and act on yourself. (This, too, is sketched after this list.)
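To ground the first scenario above, here is a minimal sketch of the canonical-list pattern: one stronger, pricier call proposes a fixed enumeration, and a cheaper model then classifies records against it. The prompts, model strings, and category count are assumptions to adapt:

```python
from openai import OpenAI

client = OpenAI()

def propose_categories(purpose: str, n: int = 10) -> list[str]:
    # One pricier call to establish a canonical enumeration you will reuse.
    resp = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content":
            f"Propose {n} canonical industry categories for this use: {purpose}. "
            "One category per line, no numbering."}],
    )
    return [ln.strip() for ln in resp.choices[0].message.content.splitlines() if ln.strip()]

def classify(description: str, categories: list[str]) -> str:
    # Many cheap calls that must choose from the fixed list, standardizing your data.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content":
            "Classify the company below into exactly one of these categories, "
            "and reply with the category only:\n" + "\n".join(categories)
            + "\n\nCompany: " + description}],
    )
    return resp.choices[0].message.content.strip()
```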
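For the external-sources scenario, the “bite-sized chunks” step might look like this rough sketch: split a long document, extract per chunk, and merge the non-null answers. The chunk size and field names are assumptions worth tuning:

```python
import json
from openai import OpenAI

client = OpenAI()

def chunks(text: str, size: int = 8000):
    # Naive fixed-size splitting; smarter boundaries (paragraphs, pages) often help.
    for i in range(0, len(text), size):
        yield text[i:i + size]

def extract_from_document(text: str) -> dict:
    merged: dict = {}
    for chunk in chunks(text):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": (
                    "From this excerpt, respond in JSON with keys pricing_model "
                    "and customer_count; use null if the excerpt does not state them."
                )},
                {"role": "user", "content": chunk},
            ],
        )
        found = json.loads(resp.choices[0].message.content)
        merged.update({k: v for k, v in found.items() if v is not None})
    return merged
```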
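And for inference across research artifacts, a sketch of the final call: supply the research you’ve collected as context, state your evaluation criteria, and require both a justification and a confidence level so you can audit the result. The criteria and JSON keys here are hypothetical:

```python
import json
from openai import OpenAI

client = OpenAI()

CRITERIA = (
    "Evaluate whether this company is likely to invest in new logistics software "
    "within 12 months. Weigh hiring signals, stated growth plans, and complaints "
    "in reviews. Respond in JSON with keys: likely_to_invest (true/false), "
    "justification, confidence_1_to_5."
)

def infer_across_artifacts(artifacts: list[str]) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4-turbo-preview",  # the final, inferential call usually merits the strongest model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": CRITERIA},
            {"role": "user", "content": "\n\n---\n\n".join(artifacts)},
        ],
    )
    return json.loads(resp.choices[0].message.content)
```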
Some AI Data Enrichment Examples
- Predictive Analytics: When prompted correctly, AI can use historical data to forecast future trends, behaviors, and outcomes. This is particularly useful in areas like sales forecasting, market analysis, and risk assessment. Again, this data might be from its training dataset or content you provide in context, but can be appended to your existing B2B data.
- Customer Segmentation: By analyzing customer data, AI can identify and codify distinct segments based on behavior, preferences, or demographics obtained externally. This aids in targeted marketing and personalized customer experiences. Worried about PII? You probably don’t need to provide it, or you can redact it beforehand. Use an opaque ID to reconnect the append to the original customer or prospect record. (A minimal sketch of this appears after this list.)
- Sentiment Analysis: AI can evaluate the sentiment of customer feedback, reviews, or social media mentions, providing insights into public perception and customer satisfaction. In fact, it can already formulate an opinion about your brand and the brand strength of competitors. This has huge implications for companies as people begin to build trust relationships with AI and turn to it for advice. Perhaps your B2B data can be “enriched” with an understanding of how your business customers are faring with their own reputations.
- Competitive Analysis: By gathering and analyzing data from various sources, AI can provide insights into competitors' strategies, market positions, and potential areas of opportunity or threat. It can segment these for you and provide very compelling data appends, perhaps helping you to identify customers likely to churn with changes in the competitive landscape.
- Regulatory Compliance: AI can help track various industry-specific regulations and changes, appending attributes to your B2B data such as the likelihood that records in certain segments will be affected, based on their characteristics.
- Product Development Insights: AI can analyze customer feedback and market trends to suggest improvements or new product ideas, correlated to customer segments most likely to upgrade.
- Content Strategy Ideas: AI can analyze conversations, trending news, and more to answer questions like, “What are people talking about that might make a good blog post, webinar, or workshop?” When connected to your B2B customer segments, this helps you better craft and target content.
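The opaque-ID pattern from the customer segmentation example above can be as simple as the following sketch. The identifier fields and the enrichment stub are hypothetical placeholders for your own schema and AI call:

```python
import uuid

DIRECT_IDENTIFIERS = {"name", "email", "phone"}  # adjust to your schema

crm_records = [
    {"name": "Jane Doe", "email": "jane@example.com", "industry": "logistics", "employees": 40},
]

# 1. Redact each record and key it by an opaque ID that only you can map back.
id_map, outbound = {}, []
for record in crm_records:
    opaque_id = str(uuid.uuid4())
    id_map[opaque_id] = record
    safe = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    outbound.append((opaque_id, safe))

def enrich_with_model(safe_record: dict) -> dict:
    """Placeholder for the AI call; it never sees direct identifiers."""
    return {"segment": "smb-logistics"}  # hypothetical append

# 2. Enrich the redacted copies, then rejoin appends to the originals by ID.
for opaque_id, safe in outbound:
    id_map[opaque_id].update(enrich_with_model(safe))
```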
Cost Considerations
AI Data Enrichment - if I may call it that - has a cost. That cost can be very high if you’re not careful to mitigate it. “Price” as a “number” has absolute meaning, but “value” is subjective and a matter of perspective. In other words, one or more chained API calls to one or more AI models for a key insight may cost more or less than purchasing the same insight - typically per record - from a traditional data provider. We’ve been purchasing data lists for years, right? Likewise, scaling AI agents to obtain and append insights may scale better or worse than scaling headcount - likely much better if well architected. And the price of AI Data Enrichment might be quite immaterial in contrast to new revenue potential. Regardless, the “compute” requirements of AI have a cost, and those costs are likely to rise, at least for the latest and greatest model versions, as vendors come to better understand their own costs and capacity - and after they have us all hooked. That doesn’t mean you need the latest and greatest model to accomplish a defined data enrichment task. With that in mind, here are some things to consider for mitigating costs:
- Efficient Workflow Definition:
- Number of Tasks: This is like anything else. Are there extra tasks adding extra costs that simply do not justify themselves? Might removing one task and incorporating its function into another call provide the same or better result? In other words, can you consolidate tasks by having one AI query do the work of two or more with the same data context?
- Number of Tokens: Anything you pass into context for an AI model to evaluate gets “tokenized.” The more data you pass into context, the more tokens result and the higher the price you’ll pay. This is where Retrieval Augmented Generation (RAG) plays a key role. Explaining it is outside the scope of this post, but there is plenty of great documentation on it. Basically, it allows you to find relevant data within a much larger repository so that you pass less text into context for an AI query. A vanilla implementation of RAG, however, may omit important elements that “should” be passed into context. It’s not an exact science. Even if you get the implementation right for retrieving relevant data for context, you should still experiment with “how much” of that “relevant data” is required to get the answer you need. In other words, your retrieval process might give you 50 relevant documents to pass into context, but you might only need the top 10 to consistently get the insight that you need from the AI model. (A retrieval-trimming sketch appears after this list.)
- General Model Selection: Data enrichment with AI typically involves more than one call to a model to iteratively arrive at an insight for an append. You will likely have different compute requirements for each call in the workflow. In other words, GPT-3.5 Turbo might give you an equally valid answer for 3 out of 4 tasks in your workflow, at 10% of the cost for each call, per token passed into context. In our initial experimentation, it’s usually been the “final” call that requires the strongest and priciest model.
- Specific Model Selection: There are some tasks that might not require a general model like GPT-4 at all. There are plenty of free, open-source models that can accomplish such tasks, even on CPUs instead of expensive GPUs. Some of the preprocessing required to arrive at a payload appropriate for a general-model query can be accomplished without costly external API calls - by running a more specific AI model locally and privately instead. That might be all the more appealing if you need to preprocess sensitive data privately before passing redacted data to a public, general model.
- Traditional Compute Incorporation: In our experimentation, the temptation was to use AI for every task in the workflow that provides a data append. That’s not a bad default position… We need to ask, “Can AI do it better?” The answer may be “yes,” but do we need it to? There are plenty of great, traditional compute algorithms available in the libraries we already import locally that handle preprocessing tasks. For example, does it make sense to ask an AI model to “sort” data that you pass to it simply to get back a sorted list? Probably not. If you can write a simple function that removes an API call to an AI model, then consider doing it. But… Also consider letting AI write the function for you!
- Cost-Benefit Analysis: How much of your data do you need to enrich, really? If you have a database of 100,000 items and your workflow costs $0.25 to enrich a single item, then that’s $25,000! Hopefully you’ve used some of the other mitigation strategies above to get that per-item cost down considerably - some of our experimentation got as high as $0.75 per item! It might be worth that and more, but it can be prohibitive unless you carefully choose a segment of your data to enrich, or perhaps even a small, random sample, just to test the value of the append.
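As an illustration of the retrieval trimming described under “Number of Tokens” above, here is a rough sketch that ranks candidate documents by embedding similarity and passes only the top few into context. It assumes the openai and numpy packages; the embedding model string is one current at the time of writing, and top_k is the knob to experiment with:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors

def top_k_context(question: str, documents: list[str], top_k: int = 10) -> str:
    # Rank every candidate document by cosine similarity to the question...
    ranked = np.argsort(embed(documents) @ embed([question])[0])[::-1]
    # ...then pass only the top_k into context; experiment downward from there.
    return "\n\n".join(documents[i] for i in ranked[:top_k])
```

In practice you would cache document embeddings rather than recompute them per query; this sketch just shows where the top-k cut happens.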
If you would like to learn more about this topic and how you can supercharge your B2B data with any of these AI Data Enrichment techniques, please contact us.