Unstructured Data & Text Analytics

A significant amount of the world’s data is unstructured. This includes web pages, scientific papers, news articles, Word documents, PDF files, text files, images, videos, blogs, social media posts and so on. To complicate matters further, the volume of unstructured data produced and published every day is increasing rapidly. This puts an immense strain on organisations that need to review and keep on top of key information relevant to their business.

Separating the important signals from the vast amount of noise can be a real challenge.

Scanning, Interpreting and Structuring

A powerful system that prioritises and recommends the key documents and data you and your team need can greatly enhance your productivity and widen the breadth of research your team can cover efficiently.

Text Analytics

Our Data Foundry system uses advanced text analytics to automate the process of scanning, discovering, prioritising and combining relevant information from these sources into a structured database for analysis and reporting.

We build custom systems on our platform to suit your exact requirements. Each system is linked to the appropriate sources of documents, articles and publications, and automatically downloads and indexes all of the new relevant information it detects.

The challenges of gathering, cleaning, structuring, aggregating and prioritising these sources into a machine-readable database that is clean and structured enough for analytics and predictions are non-trivial.

The key issue for these systems is to understand the context of the information in order to recommend and process only the relevant information. Data Foundry uses automated, high-speed machine learning systems to achieve this and to carry out the steps required to automate the review, prioritisation and categorisation of vast information sources.

Index, Compare, Score and Rank Results

The underpinning technology is a leading, industry-grade information retrieval engine, currently in use in the pharmaceutical and banking industries. Our technical leads have a combined 40 years of experience researching, developing and deploying these ranking systems in multiple domains.
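
As a rough illustration of what indexing, comparing, scoring and ranking can look like, here is a minimal sketch that ranks documents against a query using TF-IDF weights and cosine similarity. This is a generic textbook technique chosen for illustration, not a description of the Data Foundry engine; the function names and sample data are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    """Return per-document term counts and a smoothed IDF weight per term."""
    term_counts = [Counter(tokenize(doc)) for doc in documents]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    n_docs = len(documents)
    # Smoothed IDF keeps weights positive even for terms found in every document.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return term_counts, idf


def tfidf_vector(counts, idf):
    """Weight raw term counts by IDF, ignoring terms missing from the index."""
    return {t: tf * idf[t] for t, tf in counts.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document_index) pairs, best match first."""
    term_counts, idf = build_index(documents)
    query_vec = tfidf_vector(Counter(tokenize(query)), idf)
    scored = [
        (cosine(query_vec, tfidf_vector(counts, idf)), i)
        for i, counts in enumerate(term_counts)
    ]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

Calling `rank("bioactive compounds health", documents)` on a small list of strings returns the document indices ordered by relevance score, which is the kind of prioritised output a reviewer would then work through.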


Notification and Manual Review

Our systems can notify users and managers by email of any new results generated by the recommendation system. The results are presented in a prioritised list to make a manual review as efficient as possible.
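
The prioritised notification can be pictured with a small sketch like the one below, which sorts scored results and formats them as an email digest using Python's standard `email` library. The result fields (`title`, `score`), the function name and the example addresses are all assumptions made for illustration; the sketch only builds the message and does not send it.

```python
from email.message import EmailMessage


def build_digest(results, recipient, sender="alerts@example.com"):
    """Build an email listing new results, highest score first.

    `results` is a list of dicts with hypothetical 'title' and 'score' keys.
    """
    ordered = sorted(results, key=lambda item: item["score"], reverse=True)
    lines = [
        f"{position}. {item['title']} (score {item['score']:.2f})"
        for position, item in enumerate(ordered, start=1)
    ]
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"{len(ordered)} new results for review"
    message.set_content("\n".join(lines))
    return message
```

The returned message could then be handed to whatever mail transport is in use, for example `smtplib.SMTP.send_message`.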


Correlations, Trends and Predictions

Once the data has been gathered and structured in our systems (using our various partner tools and advanced methods), we can run standard reporting, analytics and visualisation tools on it to produce statistics, graphs and dashboards. These dashboards can update automatically as new data from unstructured sources is scanned and aggregated into the system, either in real time or, in more complex cases, through batch processes running overnight.


Example Data Sources

Structured Sources


  • Curated databases – e.g. NHANES, ComBase
  • Open Data Portals
  • Distributed Databases


Semi-structured Sources


  • Espacenet – European Patent Office database
  • USPTO – the US Patent and Trademark Office database
  • The NCBI PubMed database


Other Sources


  • Social Media – Twitter, LinkedIn, Blogging sites
  • News outlets – web pages
  • Word documents / PDF documents


Use Case

Media & Publishing Alert System

Many organisations need to track information on certain chemicals, health effects or related topics in, for example, scientific publications, patents and news sources. An organisation might, for instance, want to keep on top of the latest scientific research and news on bioactive compounds.

This is a challenging task when scientific publications, conference proceedings, patents, social media mentions and news articles need to be taken into account.

Discover Data Foundry

Use Case

Unstructured PDF documents to Tabular Databases

We work with organisations that receive large volumes of complex documents from their partners and suppliers, full of scientific and technical information about products, suppliers, markets, regions, systems and more.

Our solutions, built on our Data Foundry platform, allow our clients to extract the key data from these documents automatically and store it in a structured database table for further review and analysis. These solutions dramatically increase the speed at which these documents are processed and reduce the human error involved in transferring data from unstructured documents to a structured database.
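
To make the idea concrete, here is a minimal sketch of moving labelled values from document text into a database table. It assumes the text has already been extracted from the PDF, and it uses simple "Label: value" pattern matching with an in-memory SQLite table. The field names, table name and functions are hypothetical, and real documents would typically need far more robust extraction than this.

```python
import re
import sqlite3

# Hypothetical field names; a real deployment would define its own schema.
FIELDS = ("product", "supplier", "region")


def extract_fields(text):
    """Pull 'Label: value' pairs for the known fields out of plain text."""
    record = {}
    for field in FIELDS:
        match = re.search(
            rf"^[ \t]*{field}[ \t]*:[ \t]*(.+?)[ \t]*$",
            text,
            flags=re.IGNORECASE | re.MULTILINE,
        )
        # Missing fields are stored as None (NULL in the database).
        record[field] = match.group(1) if match else None
    return record


def store_records(connection, records):
    """Insert extracted records into a 'documents' table, creating it if needed."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(product TEXT, supplier TEXT, region TEXT)"
    )
    connection.executemany(
        "INSERT INTO documents (product, supplier, region) "
        "VALUES (:product, :supplier, :region)",
        records,
    )
    connection.commit()
```

Keeping missing values as NULL rather than guessing them makes gaps visible for the later manual review.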

Use Case

Identifying Innovation

How do you identify hot new innovations that are about to mature on the global marketplace? Through our unique scanning of patents, publications and news, our systems combine the latest information in an intelligent way to point companies towards innovative developments in the market.


Using NLP To Find Relevant Research

For anyone working in a scientific discipline, staying up to date with the latest research is part and parcel of the job. Researchers need a simple, automated way to find out which publications are relevant to their own areas of research. To do that in an automated way, we need to make use of Natural Language Processing (NLP).
