Document Clustering Through Hybrid NLP

A Complex Use Case

An often-cited figure holds that up to 87% of data science projects never make it from Proof of Concept to production, and NLP projects in the Insurance domain are no exception. If anything, they face additional hurdles tied to the intricacies of this space.

The best-known difficulties come from:

  • the complex layout of Insurance-related documents
  • the lack of sizeable corpora with related annotations.

The complexity of the layout is so great that the same linguistic concept can greatly change its meaning and value depending on where it is placed in a document.

Let’s look at a simple example: if we try to build an engine to identify the presence or absence of “Terrorism” coverage in a policy, we will have to assign it a different value depending on whether it appears in:

  1. The Sub-limit section of the Declaration Page.
  2. The “Exclusion” chapter of the policy.
  3. An Endorsement adding a single coverage or more than one.
  4. An Endorsement adding a specific inclusion for that coverage.

The lack of good-quality, decently sized corpora of annotated insurance documents is directly connected to the inherent difficulty of annotating such complex documents, as well as to the amount of work that would be required to annotate tens of thousands of policies.

And this is only the tip of the iceberg. On top of this, we must also consider the need for the normalization of insurance concepts.

An Invisible, Yet Powerful, Force in the Insurance Language

The normalization of concepts is a well-understood process when working on databases. Still, it is also pivotal for NLP in the Insurance domain, as it is the key to applying inferences and increasing the speed of the annotation process.

Normalizing concepts means grouping under the same label linguistic elements that may look extremely different. Examples abound, but a prime one comes from insurance policies against Natural Hazards.

In this case, different sub-limits will be applied to different Flood Zones. The zones with the highest flood risk are usually called “High-Risk Flood Zones”; however, this concept can also be expressed as:

  1. Tier I Flood Zones
  2. SFHA
  3. Flood Zone A
  4. And so on…

Virtually any coverage can have many terms that can be grouped together, and the most important Natural Hazard coverages even have a two- or three-layer distinction (Tier I, II, and III) according to specific geographical zones and their inherent risk.

Multiply this by all the possible elements we can find, and the number of variants soon becomes very large. This causes both the ML annotators and NLP engines to struggle when trying to retrieve, infer, or even label the correct information.
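As a minimal sketch of what normalization means in practice, the snippet below maps the Flood Zone variants listed above to a single label. The label name HIGH_RISK_FLOOD_ZONE and the normalize helper are hypothetical, and a hand-written lookup table only stands in for the grouping that a trained model would perform on unseen variants.

```python
from typing import Optional

# Hypothetical normalization table: surface variants -> one normalized label.
# The variants come from the list above; the label name is an assumption.
FLOOD_ZONE_VARIANTS = {
    "high-risk flood zones": "HIGH_RISK_FLOOD_ZONE",
    "tier i flood zones": "HIGH_RISK_FLOOD_ZONE",
    "sfha": "HIGH_RISK_FLOOD_ZONE",
    "flood zone a": "HIGH_RISK_FLOOD_ZONE",
}


def normalize(term: str) -> Optional[str]:
    """Return the normalized label for a surface form, or None if unknown."""
    # Lowercase and collapse whitespace so trivial formatting differences match.
    key = " ".join(term.lower().split())
    return FLOOD_ZONE_VARIANTS.get(key)


for variant in ("Tier I Flood Zones", "SFHA", "Flood Zone A", "Flood Zone X"):
    print(variant, "->", normalize(variant))
```

A static table like this cannot cope with variants it has never seen, which is exactly the gap the ML step is meant to close.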

The Hybrid Approach

A better approach to complex NLP tasks is based on hybrid (ML/Symbolic) technology, which improves the results and life cycle of an insurance workflow via micro-linguistic clustering performed by Machine Learning and then inherited by a Symbolic engine.

While traditional text clustering is used in unsupervised learning approaches to infer semantic patterns and group together documents with similar topics, sentences with similar meanings, etc., a hybrid approach is substantially different. Micro-linguistic clusters are created at a granular level through ML algorithms trained on labeled data, using pre-defined normalized values. Once the micro-linguistic clusters are inferred, they can be used for further ML activities or in a Hybrid pipeline that applies inference logic through a Symbolic layer.

This follows the traditional golden rule of programming: “break down the problem.” The first step in solving a complex use case (as most in the Insurance domain are) is to break it into smaller, more manageable chunks.

Breaking Down the Problem

Symbolic engines are often labeled as extremely precise but not scalable, as they do not have the flexibility of ML when it comes to handling cases unseen during the training stage.

However, this type of linguistic clustering addresses the issue by leveraging ML to identify concepts, which are subsequently passed on to the complex (and precise) logic of the Symbolic engine that comes next in the pipeline.

Possibilities are endless: for instance, the Symbolic step can alter the intrinsic value of the ML identification according to the document segment the concept falls in.

The following is an example that uses the Symbolic process of “Segmentation” (splitting a text into its relevant zones) to understand how to use the label passed along by the ML module.

Let us imagine that our model needs to understand if certain insurance coverages are excluded from a 100-page policy.

The ML engine will first cluster together all the possible variations of the “Fine Arts” coverage:

  • “Fine Arts”
  • “Works of Art”
  • “Artistic Items”
  • “Jewelry”
  • etc.

Immediately after, the Symbolic part of the pipeline will check whether the “Fine Arts” label is mentioned in the “Exclusions” section, thus understanding if that coverage is excluded from the policy or if it is instead covered (as part of the sub-limits list).

Thanks to this, the ML annotators do not have to assign a different label to each “Fine Arts” variant according to where it appears in a policy: they only need to annotate the variants with the normalized value “Fine Arts,” which acts as a micro-linguistic cluster.
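The two-step flow described above can be sketched as follows. Everything here is illustrative: ml_labels is a hypothetical stand-in for the trained ML model (a simple variant lookup keeps the example runnable), the section names are assumed to come from a prior Segmentation step, and the rule that an exclusion takes precedence over a sub-limit is an assumption rather than something the approach prescribes.

```python
from typing import List, Set, Tuple

# Stand-in for the ML step: a real pipeline would use a trained model to map
# variants to the normalized label. This lookup is only for illustration.
FINE_ARTS_VARIANTS = ("fine arts", "artistic items", "jewelry")


def ml_labels(text: str) -> Set[str]:
    """Return the normalized labels found in a text segment."""
    lowered = text.lower()
    if any(variant in lowered for variant in FINE_ARTS_VARIANTS):
        return {"FINE_ARTS"}
    return set()


def coverage_status(segments: List[Tuple[str, str]], label: str) -> str:
    """Symbolic step: interpret a label according to the section it occurs in."""
    # Collect the names of all sections in which the label was identified.
    sections = {name for name, text in segments if label in ml_labels(text)}
    if "Exclusions" in sections:  # assumed precedence: an exclusion wins
        return "excluded"
    if "Sub-limits" in sections:
        return "covered"
    return "not mentioned"


# A toy policy already split into (section name, text) pairs by Segmentation.
policy = [
    ("Sub-limits", "Artistic Items $100,000 Per Item"),
    ("Exclusions", "This policy does not cover loss caused by flood."),
]
print(coverage_status(policy, "FINE_ARTS"))  # prints "covered"
```

The label set stays the same wherever a variant occurs; only the Symbolic rule changes when the business logic does.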

Another useful example of a complex task is the aggregation of data. If a hybrid engine aims at extracting the sub-limits for specific coverages, then in addition to the coverage normalization issue there is a further layer of complexity to handle: the order in which the linguistic items to be aggregated appear.

Let’s consider that the task at hand is to extract not only the sub-limit for a specific coverage but also its qualifier (per occurrence, in the aggregate, etc.). These three items can be placed in several different orders:

  • Fine Arts $100,000 Per Item
  • Fine Arts Per Item $100,000
  • Per Item $100,000 Fine Arts
  • $100,000 Fine Arts
  • Fine Arts $100,000

Handling all these permutations while aggregating data can considerably increase the complexity of a Machine Learning model. A hybrid approach, on the other hand, would have the ML model identify the normalized labels and then have the Symbolic reasoning identify the correct order based on the input data coming from the ML part.
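A minimal sketch of this kind of order-independent aggregation is shown below. The function extract_sublimit, the regular expression, and the small vocabularies are all hypothetical; the coverage lookup once again stands in for the labels an ML model would supply, while the aggregation into a single record plays the role of the Symbolic step.

```python
import re
from typing import Dict, Optional

# Illustrative vocabularies; a real system would receive these labels from ML.
AMOUNT = re.compile(r"\$\d[\d,]*")
QUALIFIERS = ("per item", "per occurrence", "in the aggregate")
COVERAGES = {"fine arts": "FINE_ARTS"}


def extract_sublimit(line: str) -> Dict[str, Optional[str]]:
    """Aggregate coverage, amount, and qualifier regardless of their order."""
    lowered = line.lower()
    amount = AMOUNT.search(line)
    return {
        "coverage": next(
            (label for variant, label in COVERAGES.items() if variant in lowered),
            None,
        ),
        "amount": amount.group(0) if amount else None,
        # The qualifier is optional: some lines carry only coverage and amount.
        "qualifier": next((q for q in QUALIFIERS if q in lowered), None),
    }


for line in (
    "Fine Arts $100,000 Per Item",
    "Per Item $100,000 Fine Arts",
    "$100,000 Fine Arts",
):
    print(extract_sublimit(line))
```

Because each item is identified by its label rather than by its position, the first two lines yield the same record, and the third differs only in its missing qualifier.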

Clearly, these are just two examples; countless complex Symbolic rules and inferences can be applied on top of the scalable ML algorithm that identifies normalized concepts.

In addition to scalability, symbolic reasoning brings other positives to the whole project workflow:

  • There is no need to implement different ML workflows for a complex task, with different labeling to be implemented and maintained. Also, it is quicker and less resource-intensive to retrain a single ML model than multiple ones.
  • Since the complex portion of the business logic is dealt with symbolically, adding manual annotations to the ML pipeline is much easier for data annotators.
  • For the reasons mentioned above, it is also easier for testers to provide direct feedback on the ML normalization process. Moreover, since linguistic elements are normalized by the ML portion of the workflow, users will have a smaller list of labels to tag documents with.
  • Symbolic rules do not need to be updated often: what will be more often updated is the ML part, which can also benefit from users’ feedback.
  • ML in complex projects in the Insurance domain can suffer because inference logic can hardly be condensed into simple labels; this also makes life harder for the annotators.
  • Text position and inferences can dramatically change the actual meaning of concepts that share the same linguistic form.
  • In a pure ML workflow, the more complex the logic is, the more training documents are usually needed to achieve production-grade accuracy.
  • For this reason, ML would need thousands (or even tens of thousands) of pre-tagged documents to build effective models.
  • Complexity can be reduced by adopting a Hybrid approach: ML and users’ annotations create linguistic clusters/tags, which then serve as the starting point or building blocks for a Symbolic engine that manages all the complexity of a specific use case.
  • Feedback from users, once validated, can be leveraged to retrain a model without changing the most delicate part (which can be handled by the Symbolic portion of the workflow).
