The Segment of No Data

Submitted by Paul Legutko on Tue, 2021-09-14 00:12

There is more data about consumers available today than ever before. But data scientists should be wary of a growing subset of the population for whom little data exists. This segment with no data can skew AI-based models in ways that may be difficult to uncover.

Insurers have become more sophisticated in customer data enrichment and the AI-powered analysis of this data to build predictive models and customer segmentations. These models can be used to describe customer personas for marketing purposes, predict fraudulent claims, assess likelihood for cross-sell, determine underwriting straight-through processing, or complete a dozen other advanced analytics capabilities.

To create these scores and produce these segmentation models, AI needs to be “trained” using existing, historical data with known outcomes. When assembling training data sets, the more data you have, the better. What matters is not only the number of observations in the data set (the “n=”), but also the number, variety, and diversity of the dimensions that have been collected about the customer: demographics, psychographics, purchase behavior, financial data, etc. That way, AI algorithms have more dimensions to test, are more likely to find significant correlations or covariances between dimensions, and can produce higher goodness-of-fit coefficients.

When one considers the kinds of data sources that feed many of these customer data platforms, variety of data does not seem to be a problem. Data behemoths such as Google, Facebook, and Amazon are only the most well-known of the behavioral data brokers. Insurers routinely enrich their data using companies like Oracle, Salesforce, LexisNexis, Neustar, or Acxiom, to name a few.

But there is a significant segment of the population about whom there isn’t much data at all, or at least there is much less data than other segments. There are a few possible reasons for this lack of data:

Consumers are concerned with data privacy. In line with the same consumer trends that produced GDPR, CPRA, and CCPA, these individuals reject cookie notices, have high security settings, or may even request their data be deleted.
Consumers’ data signatures have changed. Relocating to another state, getting married, getting divorced, or even just changing jobs or buying a new phone or laptop can interrupt the flow of data and make it difficult for identity resolution algorithms to correctly match data streams. An individual effectively becomes a brand new person.
Consumers are young. Younger consumers—Gen Z or Gen Alpha—have not had as much time to develop a data footprint as Millennials or Gen X. They may be on devices quite frequently, but using their parents’ login credentials, making them invisible (and skewing their parents’ footprint in the process). They have also not yet entered public databases that are being scraped by data brokers.

It may seem that if there’s no data, there is nothing to disturb the AI algorithms producing the predictive models, segmentations, and personas. But the opposite is true: an observation with little or no data is still an observation, and it will be counted. This means that the “Segment of No Data” can skew results, producing false correlations, magnifying effect sizes, and reducing overall predictability.

Since one observation with no data looks exactly like another observation with no data, algorithms might see them as correlated, skewing look-alike models or cluster analysis. One might also end up with a significant percentage of observations being classified as “Other” because, without data, the algorithm cannot establish correlations or similarities.
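The look-alike effect can be illustrated with a minimal sketch. The field names, the sample records, and the choice to fill missing values with a constant are all assumptions made for illustration; they are not taken from any particular insurer's pipeline. The point is only that two unrelated records with no data end up at distance zero from each other, so a distance-based method would treat them as perfect matches.

```python
from math import sqrt

# Hypothetical fields and records; None marks a field with no data.
FIELDS = ["age", "income", "tenure_years", "claims"]

records = {
    "rich_a": {"age": 52, "income": 98000, "tenure_years": 12, "claims": 1},
    "rich_b": {"age": 31, "income": 54000, "tenure_years": 3, "claims": 0},
    "empty_a": {"age": None, "income": None, "tenure_years": None, "claims": None},
    "empty_b": {"age": None, "income": None, "tenure_years": None, "claims": None},
}


def to_vector(record, fill=0.0):
    """Naive encoding (an assumption): replace each missing field with a constant."""
    return [fill if record[f] is None else float(record[f]) for f in FIELDS]


def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Two different people with no data become indistinguishable.
d_empty = euclidean(to_vector(records["empty_a"]), to_vector(records["empty_b"]))
# Two people with rich data remain clearly separated.
d_rich = euclidean(to_vector(records["rich_a"]), to_vector(records["rich_b"]))

print("distance between empty records:", d_empty)  # 0.0
print("distance between rich records:", d_rich)
```

Any clustering or look-alike method built on such distances would place every empty record in the same tight group, regardless of who the underlying customers are.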

Insurer data science teams should consider establishing data quantity and diversity thresholds before running AI. They should run descriptives against identified segments or personas to validate closeness of fit. They should look out for “weird” correlations and check that these are not artifacts of sparse records in which only a handful of fields are populated. In short, they should be aware that consumer data is not all equally rich and diverse.
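One way a data quantity threshold might be applied is sketched below. The `completeness` and `split_by_threshold` helpers, the sample records, and the 50% cutoff are hypothetical; an appropriate cutoff would depend on the model and the data at hand.

```python
def completeness(record):
    """Fraction of fields in a record that are populated (not None or empty string)."""
    if not record:
        return 0.0
    filled = sum(1 for value in record.values() if value not in (None, ""))
    return filled / len(record)


def split_by_threshold(records, min_completeness=0.5):
    """Separate records that meet the completeness threshold from those that do not."""
    kept, held_out = [], []
    for record in records:
        if completeness(record) >= min_completeness:
            kept.append(record)
        else:
            held_out.append(record)
    return kept, held_out


# Hypothetical customer records for illustration only.
customers = [
    {"id": "c1", "age": 44, "state": "OH", "income": 72000},
    {"id": "c2", "age": None, "state": None, "income": None},
    {"id": "c3", "age": 29, "state": "", "income": 51000},
]

kept, held_out = split_by_threshold(customers, min_completeness=0.5)
print("modeled:", [c["id"] for c in kept])       # ['c1', 'c3']
print("held out:", [c["id"] for c in held_out])  # ['c2']
```

Holding sparse records out of training, rather than silently dropping them, keeps the size of the low-data segment visible so it can be reported alongside the model results.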

To learn more about best practices for third-party data usage in insurance, read our report Third-Party Data in Insurance: Overview and Prominent Providers or reach out to me at [email protected].
