General Sessions
Length: 45 Minutes
Title: How to Super-Charge Named Entities to Find & Use Relevant Digital Assets
Time: 2:00 PM - 2:20 PM
Description: Plenty of software does a good job of identifying named entities (the names of people, organizations, events, places, and so on) that occur in text. Too often, these are used merely as keywords to find digital assets, without any further differentiation. For example, an organization could be a public company, a U.S. government agency, an institution of higher education, or something else. Being able to identify that a named entity is a specific type of organization can be important in determining whether a digital asset is relevant to a particular search. It can also add further context to the digital asset: if we know that a named entity is an institution of higher education, we can further differentiate it by size and location, and even link to a short profile of the organization in Wikipedia or to the institution’s website itself. This talk explains how to easily build up metadata around named entities so that search can accurately find and surface relevant digital assets, and shares real-world examples from client projects.
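A minimal sketch of the kind of entity typing this description alludes to, assuming spaCy for named entity recognition and the public Wikidata search API for type lookup; both tools and the example text are illustrative choices, not necessarily the speaker's actual stack:

```python
import requests
import spacy

# Illustrative pipeline: extract ORG entities with spaCy, then look each
# one up in Wikidata to learn what kind of organization it is.
nlp = spacy.load("en_core_web_sm")

def organization_entities(text):
    """Return the distinct ORG entity strings found in the text."""
    doc = nlp(text)
    return {ent.text for ent in doc.ents if ent.label_ == "ORG"}

def wikidata_description(name):
    """Fetch a short Wikidata description for an entity name, if any.

    The description often distinguishes a public company from a
    university or a government agency, and can be stored as search
    metadata alongside the digital asset.
    """
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": name,
            "language": "en",
            "format": "json",
        },
        timeout=10,
    )
    results = resp.json().get("search", [])
    return results[0].get("description") if results else None

text = "MIT and the SEC both published reports this quarter."
for org in organization_entities(text):
    print(org, "->", wikidata_description(org))
```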
Title: Tables Are Tough: Perfecting an AI Model to Automate Table-to-XML Extraction
Time: 2:25 PM - 2:45 PM
Description: Extracting and structuring content from text- or image-based tables has long been a challenge. Tabular content is particularly important in regulatory, financial, and scientific documents, where complex alphanumeric content is often presented in tabular form. Tables are tough to structure because of inconsistencies in tabular content, a high diversity of layouts, complicated elements such as straddle headings, varied alignment of contents, the presence of empty cells, and other intricacies. Transforming tabular content into a structured model such as XML or HTML is nearly always a manual or semi-manual process. This presentation explores the methods used to perfect a model that automates extracting table structure from text. Data Conversion Laboratory and Fusemachines created an AI model that finds and extracts information from all tables in a document using a combination of computer vision (CV) and natural language processing (NLP). Our speakers review how they developed and managed a hybrid approach of rules-based processes and machine learning to identify and extract tabular data, and how they augmented training data to develop an AI model that automates table-to-XML extraction.
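The speakers' CV/NLP model is not public, but the rules-based half of such a hybrid can be sketched with the standard library alone: given a grid already detected upstream (hard-coded here as an assumed, simplified data shape, not DCL's actual pipeline), emit XML while keeping straddle headings and empty cells intact.

```python
import xml.etree.ElementTree as ET

# Assumed upstream output: a CV/OCR stage has already detected the grid.
# A straddle heading is modeled as a cell spanning multiple columns;
# None marks an empty cell. This data shape is illustrative only.
rows = [
    [("Revenue by Region", 3)],            # straddle heading spanning 3 cols
    [("Region", 1), ("2022", 1), ("2023", 1)],
    [("EMEA", 1), ("4.1", 1), (None, 1)],  # empty cell in the last column
]

def table_to_xml(rows):
    """Convert detected rows into a simple XML table.

    Each cell carries a colspan attribute so straddle headings survive
    the conversion; empty cells become empty <cell/> elements rather
    than being silently dropped.
    """
    table = ET.Element("table")
    for row in rows:
        row_el = ET.SubElement(table, "row")
        for text, span in row:
            cell = ET.SubElement(row_el, "cell", colspan=str(span))
            if text is not None:
                cell.text = text
    return table

print(ET.tostring(table_to_xml(rows), encoding="unicode"))
```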