The Technical Track focuses on the latest text analytics techniques and methods – how to develop text analytics foundations, incorporate the latest advances, and explore new approaches. This track will appeal to anyone just starting to develop text analytics as well as to those looking to enhance their current efforts.
View the Text Analytics Forum 2018 Final Program PDF
Wednesday, November 7: 1:30 p.m. - 2:15 p.m.
This talk is about how we’ve found ways to clean up the mess by increasing precision and recall with a hybrid rules-based/Bayesian approach while also making a new data source meaningful and usable across the organization. We were able to dramatically increase the quality of extracted attributes by transforming raw data into a managed taxonomy. By integrating the work of engineering and taxonomy, we can ensure that changes to the taxonomy are painlessly integrated into databases and that engineering work increases the effectiveness of taxonomists. Attendees walk away with an idea of what collaboration between developers and taxonomists looks like from the taxonomist’s perspective at one company with a strong engineering culture, along with some practical tips on how to turn low-quality or unstructured data into high-quality semantic data.
Andrew Childress, Senior Taxonomy Analyst, Indeed
Shannon Hildenbrand, International Taxonomy Lead, Indeed
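The talk's actual implementation is not public, but the general shape of a hybrid rules/Bayesian extractor can be sketched in a few lines: high-precision pattern rules fire first, and a simple Bayesian scorer backstops them to recover recall. Everything below – the training snippets, labels, and patterns – is invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical labeled snippets (1 = text carries the "remote work" attribute).
TRAIN = [
    ("work from home position", 1),
    ("fully remote role", 1),
    ("telecommute friendly team", 1),
    ("onsite role in austin", 0),
    ("office based position", 0),
    ("in person retail job", 0),
]

# High-precision rules: when one fires, we trust it outright.
RULES = [re.compile(p) for p in (r"\bwork from home\b", r"\bremote\b")]

def train_bayes(data):
    """Count word occurrences per class for a naive Bayes scorer."""
    counts = {0: Counter(), 1: Counter()}
    totals = {0: 0, 1: 0}
    for text, label in data:
        for w in text.split():
            counts[label][w] += 1
            totals[label] += 1
    return counts, totals

def bayes_score(text, counts, totals):
    """Naive Bayes log-odds with add-one smoothing; > 0 favors class 1."""
    vocab = len(set(counts[0]) | set(counts[1]))
    score = 0.0
    for w in text.split():
        p1 = (counts[1][w] + 1) / (totals[1] + vocab)
        p0 = (counts[0][w] + 1) / (totals[0] + vocab)
        score += math.log(p1 / p0)
    return score

def classify(text, counts, totals):
    if any(r.search(text) for r in RULES):  # rules deliver precision
        return 1
    return 1 if bayes_score(text, counts, totals) > 0 else 0  # Bayes recovers recall
```

The division of labor is the point: the rules alone would miss "telecommute friendly company" (hurting recall), while the statistical scorer alone might misfire on borderline phrasings that the rules pin down exactly.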
DTIC acquires approximately 25,000 new research documents each year, and this number is expected to at least double in the next few years. A key challenge for DTIC is making this data useful to end users. In response, DTIC has invested in an enterprise metadata strategy to provide efficient and consistent information extraction methods across collections and to develop downstream applications that leverage this metadata, automating much of the manual effort it takes analysts to enrich the content and researchers to search through it for answers. One of these applications is the Metatagger, a text analytics tool that is applied to content to provide automatic tagging and subject categorization. The terminology for the tagging comes from the DTIC Thesaurus, which, through the use of topic files, is used to extract terms and categories.
Monica Butteriss, Analysis Division Chief, Defense Technical Information Center (DTIC)
Scott Steele, Ontologist, Defense Technical Information Center (DTIC)
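At its simplest, thesaurus-driven tagging of the kind described above amounts to matching preferred terms and their variants against document text. The tiny thesaurus below is invented for illustration; DTIC's actual topic files are far richer and handle disambiguation, not just matching.

```python
# Hypothetical thesaurus: preferred term -> variant terms (topic-file style).
THESAURUS = {
    "unmanned aerial vehicles": ["uav", "drone", "drones"],
    "cybersecurity": ["cyber security", "information security"],
}

def tag_document(text):
    """Return the preferred terms whose entry matches the text."""
    text = text.lower()
    tags = []
    for preferred, variants in THESAURUS.items():
        if preferred in text or any(v in text for v in variants):
            tags.append(preferred)
    return tags
```

Because tags are always emitted as preferred terms, every document in the collection is categorized in consistent vocabulary regardless of the wording its authors chose.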
Wednesday, November 7: 2:30 p.m. - 3:15 p.m.
Keyword research allows companies to learn the voice of their customers and tune their marketing messages for them. One of the challenges in keyword research is to find collections of keywords that are topically relevant and in demand and therefore likely to draw search traffic and customer engagement. Data sources such as search logs and search engine result pages provide valuable sources of keywords, as well as insight into audience-specific language. Additionally, cognitive technologies such as natural language processing and machine learning provide capabilities for mining those sources at scale. With a few tools and some minimal coding, an analyst can generate clusters of best-bet keywords that are not only syntactically similar but semantically related. This how-to talk presents some practical techniques for automated analysis of keyword source data using off-the-shelf APIs.
Dan Segal, Information Architect, IBM
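As a minimal illustration of grouping syntactically similar keywords (the semantic side would require an embedding or NLP API, which is elided here), a greedy token-overlap clustering might look like the following. The threshold value is an assumption an analyst would tune.

```python
def jaccard(a, b):
    """Token-set overlap between two keyword phrases, from 0 to 1."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def cluster_keywords(keywords, threshold=0.3):
    # Greedy single-pass clustering: attach each keyword to the first
    # cluster whose seed is similar enough, else start a new cluster.
    clusters = []
    for kw in keywords:
        for c in clusters:
            if jaccard(kw, c[0]) >= threshold:
                c.append(kw)
                break
        else:
            clusters.append([kw])
    return clusters
```

A real pipeline would add a second pass through an off-the-shelf NLP API so that, for example, "paris hotels" and "accommodation in paris" land in the same cluster despite sharing only one token.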
Uncovering insights and deep connections across your unstructured data using AI is challenging. You need to design for scalability and the appropriate level of sophistication at each stage of the data ingestion pipeline, as well as for post-ingestion interactions with the corpora. In this session, we discuss the top 10 considerations, including specific techniques, for designing AI-enabled discovery and exploration systems that augment knowledge workers in making good decisions. These include, but are not limited to, document cleansing and conversion, machine-learned entity extraction and resolution, knowledge graph construction, natural language queries, passage retrieval, relevancy training, relationship graphs, and anomaly detection.
Swami Chandrasekaran, Executive CTO Architect, Watson, IBM
Thursday, November 8: 10:15 a.m. - 11:00 a.m.
Most text analysis methods include the removal of stopwords, which generally overlap with the linguistic category of function words, as part of pre-processing. While this makes sense in the majority of use cases, function words can be extremely powerful. Research within the field of language psychology, largely centered on Linguistic Inquiry and Word Count (LIWC), has shown that function words are indicative of a range of cognitive, social, and psychological states. This makes an understanding of function words vital to making appropriate decisions in text analytics. In model design, differences in expected distributions of function words compared with content words have an impact on feature engineering. For instance, methods that take as their input the presence or absence of a word within a text segment will produce no usable signal when applied to function words, while those that are sensitive to deviations from expected frequency within a given language context will be highly successful. When interpreting results, differences in the way that function and content words are processed neurologically must be accounted for. As awareness of the utility of function words rises within the text analytics community, it is increasingly important to cultivate a nuanced understanding of the nature of function words.
Kiki Adams, Head of Science, Receptiviti
Shayna Gardiner, Computational Linguist & Data Scientist, Receptiviti
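The point about feature choice can be made concrete: a function word like "I" appears in nearly every text, so a binary presence feature is close to constant and carries no signal, while the deviation of its frequency from an expected baseline still discriminates between writers or states. The baseline rates below are invented for illustration, not actual LIWC norms.

```python
# Hypothetical expected per-token rates for two function words.
BASELINE = {"i": 0.04, "the": 0.06}

def presence_feature(text, word):
    # 1 if the word occurs at all -- near-constant for function words.
    return int(word in text.lower().split())

def deviation_feature(text, word):
    # Relative frequency minus the expected baseline rate.
    tokens = text.lower().split()
    return tokens.count(word) / len(tokens) - BASELINE.get(word, 0.0)
```

Two texts that both contain "I" get identical presence features, but a self-focused text scores a much larger positive deviation – exactly the kind of signal a presence-based model throws away.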
The basic premise of taxonomy and text analytics work is to impose structure on—or reveal structure in—unstructured content. Despite being called “unstructured,” much workplace information can be described as semi-structured, as there is always some level of organization in even the most basic content formats. For example, in a workplace document you will likely find titles, headers, sentences, and paragraphs, or at least a clear indicator of the beginning and end of a large block of text. Similarly, taxonomies and ontologies are artificial constructs which may reflect the information they describe or be imposed as a form of ordering on semi-structured content. In this session, attendees hear case studies about using the contextual structure of taxonomies and ontologies and the various structural indicators in text to perform taxonomy-based content auto-categorization and information extraction.
Ahren Lehnert, Principal Taxonomist, Nike Inc., USA
Thursday, November 8: 11:15 a.m. - 12:00 p.m.
This presentation serves as an overview of current issues in named entity recognition for text analytics, focusing on work done beyond the categories of people, places, organizations, and other entities that are (relatively) easily extracted through current processes. It covers areas of ongoing research, open issues, and ideas about their potential benefits to taxonomy and ontology development.
Brian Goss, Taxonomist, EBSCO Information Services
Traditional approaches to concept and relationship extraction focus either on pure statistical techniques or on detecting and extending noun phrases. This talk outlines an alternative approach that identifies multiword concepts and the relationships between them, without requiring any predefined knowledge about the text’s subject. We demonstrate a number of capabilities built using this approach, including ontology learning, intelligent browsing, semantic search, and text categorization.
Jeff Fried, Director, Platform Strategy & Innovation, InterSystems
Dirk Van Hyfte, Senior Advisor, Biomedical Informatics, InterSystems
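The speakers' actual method is proprietary, but the general idea of surfacing multiword concepts without predefined domain knowledge can be illustrated with a classic statistical baseline: score adjacent word pairs by pointwise mutual information (PMI), so that pairs co-occurring far more often than chance rise to the top.

```python
import math
from collections import Counter

def multiword_concepts(tokens, min_count=2):
    """Rank adjacent word pairs by PMI; high scores suggest fixed concepts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = {}
    for (w1, w2), c in bigrams.items():
        if c >= min_count:  # ignore pairs too rare to trust
            pmi = math.log((c / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
            scored[(w1, w2)] = pmi
    return sorted(scored, key=scored.get, reverse=True)
```

On a real corpus this baseline would be extended to longer phrases and filtered against frequent-but-uninteresting pairs; the point is that nothing about the text's subject needs to be known in advance.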
Thursday, November 8: 1:00 p.m. - 1:45 p.m.
The Inter-American Development Bank is a multilateral public sector institution committed to improving lives in Latin America and the Caribbean. Human capital may be the institution’s most important resource for realizing its vision: the knowledge of its roughly 5,000 employees is spread across offices in 29 countries throughout the Americas, Europe, and Asia. The IDB’s knowledge management division led an 8-week proof of concept that used natural language processing techniques to create explicit representations of the tacit knowledge of its employees and make those representations searchable. Attempting to identify and represent people’s knowledge is a complex task. Part of this complexity lies in the fact that the variables used to determine knowledge have ambiguous definitions. These and other considerations are what make this POC so different from a simple skills database or profile search. This presentation details our experience with this project and how the use of NLP allowed us to successfully create approximations of IDB personnel knowledge and turn them into machine-searchable knowledge entities.
Kyle Strand, Lead Knowledge Management Specialist and Head of Library, Inter-American Development Bank (IDB)
Daniela Collaguazo, Text Analytics Consultant, Knowledge Innovation Communication Department, Inter-American Development Bank
Information retrieval can be seen as matching the intellectual content represented in documents to a knowledge gap in the mental map of a searcher. For decades, most of the focus of information retrieval research, whether in academia or in commercial systems, has been on improving the representation of documents, or collections of documents. Less attention has been paid to representing the searcher’s information need, or knowledge gap. This knowledge gap was characterized by Belkin, Brooks, and Oddy as an Anomalous State of Knowledge. This talk will describe the theory and practice of this concept and how it can be utilized to enhance information retrieval.
Paul Thompson, Instructor, Geisel Medical School, Dartmouth College
Thursday, November 8: 2:00 p.m. - 2:45 p.m.
Advances in machine learning have led to an evolution in the field of text analytics. As these and other AI technologies are incorporated into business processes at organizations around the world, there’s an expectation that intelligent automation will lead to improvements like increased operational efficiency, enriched customer engagements, and faster detection of emerging issues. How will technology meet that demand? How can we combine the expertise of humans with the speed and power of machines to analyze unstructured text that’s being generated at an unprecedented rate? Find out in this talk from Mary Beth Moore, who shares stories about text analytics being used to augment regulatory analysis, improve product quality, and fight financial crimes.
Mary Beth Moore, Global Product Marketing Manager for AI, Text Analytics, SAS
The advent of unsupervised machine-learning algorithms makes it possible for content owners to index their content without a taxonomy. Publishers are therefore faced with a challenge: Do you maintain your existing taxonomies, replace them with a fully machine-learned approach, or find a way to combine the two? This talk looks at case studies that have implemented different solutions, including organizations using private in-house taxonomies and those using large-scale, public controlled vocabularies such as MeSH.
Michael Upshall, Head of Business Development, UNSILO, Denmark
Thursday, November 8: 3:00 p.m. - 3:45 p.m.
Machine learning models often depend on large amounts of training data for supervised learning tasks. This data may be expensive to collect, especially if it requires human labeling. That raises particular quality issues: for example, how do you ensure that human agreement is high, and what do you do when it is not? Also, when your data is expensive to tag, how do you ensure that you have the smallest possible set that is still representative of all your features? This talk addresses these and other issues associated with gathering hand-coded datasets for supervised machine-learning models, especially models run on textual data.
Leslie Barrett, Senior Software Engineer, Bloomberg, LP
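One standard way to quantify the human-agreement problem is Cohen's kappa, which corrects raw agreement between two annotators for the agreement expected by chance. A minimal sketch:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label lists."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick label k independently.
    expected = sum((ca[k] / n) * (cb[k] / n) for k in set(ca) | set(cb))
    return (observed - expected) / (1 - expected)
```

A kappa near 1 indicates genuine agreement; a kappa near 0 means the annotators agree no more often than chance, a signal that the labeling guidelines need revision before more expensive data is collected.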
A U.S. intelligence community researcher recently declared, “Analytics is my second priority.” We have long passed the point where even “medium data” projects exceed the capacity of human analysts to actually read the corpus. Yet “human in the loop” is essential to ensuring quality in machine analytics. Thus his, and our, first priority becomes effective triage: determining which text warrants human attention, which should be condensed by automated means, and which is best disregarded as valueless or actively malign. We model the text analytic process as a succession of tiered steps, each with its own accuracy rate. While we classically think of text analytic accuracy in favorable terms as “precision and recall,” their inverses are “false negatives and false positives.” We explore how the initial high-volume, automated processing steps can best tune their accuracy trade-offs to optimize the later, human-moderated steps.
Christopher Biow, SVP, Global Public Sector, Basis Technology
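The tiered-steps framing has a simple quantitative consequence: a relevant document reaches the human analyst only if every automated tier passes it, so end-to-end recall is the product of per-tier recalls and false negatives compound. A sketch (the tier recall values in the test are illustrative, not from the talk):

```python
def tiered_recall(recalls):
    # A relevant document survives triage only if every tier keeps it,
    # so end-to-end recall is the product of the per-tier recalls.
    result = 1.0
    for r in recalls:
        result *= r
    return result

def false_negative_rate(recalls):
    # The false-negative rate is the inverse of end-to-end recall.
    return 1.0 - tiered_recall(recalls)
```

Even two early tiers at 95% and 90% recall already cap end-to-end recall at about 85.5%, which is why early, high-volume tiers are usually tuned toward recall – accepting more false positives for the human-moderated steps to winnow out.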