In this article we’ll explore how Open Data can be used to improve Large Language Models (“LLMs”).
We wrote a full paper on using our Open Data Repository to train an LLM; please download it here.
First, we believe that using Open Data provides your models with a massive amount of “ground truth” for training. We can safely assume that most Open Data available today (say, from FDA filings) was written by human Subject Matter Experts (“SMEs”) using both proper language and sound scientific arguments.
Second, massive amounts of Open Data are available for almost every industry and market vertical. Since most industries operate under some level of government regulation, the agencies that regulate them in turn disclose much of their activity concerning each industry as Open Data.
Third, by training LLMs on this Open Data, your model can acquire the “lingo” of each industry whose Open Data you feed it. This is important for giving your model’s output higher credibility.
We wrote a document that guides you through setting up your own LLM development environment, running entirely on your local computer. Download the document here.
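To give a feel for what such an environment might look like, here is a minimal sketch that loads a small open model with the Hugging Face transformers library and generates text entirely on your own machine. The model name and prompt are illustrative placeholders, not part of our guide.

```python
# Minimal local LLM sketch (assumes: pip install transformers torch).
# The model name and prompt below are illustrative placeholders only.
from transformers import pipeline

# A small, freely available model keeps everything runnable on a laptop CPU.
generator = pipeline("text-generation", model="gpt2")

prompt = "The FDA approved a new drug for"
outputs = generator(prompt, max_new_tokens=40, num_return_sequences=1)

print(outputs[0]["generated_text"])
```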
Continuum of Reality
Our Open Data and Synthetic Data capabilities can help your customers build and validate ML models more quickly. We use the analogy of a “Continuum of Reality,” where real data sits at the right edge and synthetic data sits at the left edge.
Our premise is that Open Data can provide the “ground truth” for many models. For example, the drugs approved by the FDA over the last 30 years are reality, from the authoritative source, and the Medicare reimbursements per doctor in Florida are likewise real. This Open Data sits at the right edge of the continuum.
Synthetic Data provides the other extreme, the “negative ground truth”: a model’s results should never look like totally random Synthetic Data. A client wants to test their reimbursement model against 100,000 random Medicare providers? We can support that.
Your customers’ models should generate results between these two extremes.
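To make the synthetic end of the continuum concrete, the sketch below generates 100,000 totally random, reimbursement-style records using only the Python standard library. Every field name and value range is a hypothetical placeholder, not a description of our actual Synthetic Data service.

```python
# Sketch: generate 100,000 random, Medicare-provider-style records for model testing.
# All field names and value ranges are hypothetical placeholders.
import csv
import random
import uuid

def random_provider() -> dict:
    """Build one synthetic provider record with purely random values."""
    return {
        "provider_id": str(uuid.uuid4()),
        "state": random.choice(["FL", "TX", "NY", "CA", "OH"]),
        "specialty": random.choice(["cardiology", "oncology", "family_medicine"]),
        "claims_count": random.randint(1, 5000),
        "total_reimbursement_usd": round(random.uniform(1_000, 2_500_000), 2),
    }

# Write the records to a CSV file that a reimbursement model can be tested against.
with open("synthetic_providers.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(random_provider().keys()))
    writer.writeheader()
    writer.writerows(random_provider() for _ in range(100_000))
```

Because these records are pure noise, any model whose output resembles them is, by definition, far from the “ground truth” end of the continuum.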
From a business perspective, your company can both increase the value of its platform and generate ancillary revenue by offering your customers Open Data and Synthetic Data to go with each model.
Available Data
Our Open Data Repository represents the “Ground Truth” of what has happened in the pharma and medical device spaces in the US over the last 30 years, including:
* 40,000+ Protocols, SAPs, ICFs
* 100,000+ FDA application files
* 110,000 full FDA labels (“SPL”)
You can train your models on the text extracted from all of those documents, totaling 600+ million words; a minimal sketch of preparing such a corpus appears after the list below. We can also include additional Open Data from other US agencies, including:
* CMS – Medicare
* HHS – healthcare
* NLM – references
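As a hedged illustration of what “training on the extracted text” could look like, the sketch below reads plain-text files from a local folder and runs a short domain-adaptive fine-tuning pass on a small open model with the Hugging Face transformers and datasets libraries. The folder layout, model name, and hyperparameters are assumptions made for illustration, not a description of our repository.

```python
# Sketch: domain-adaptive fine-tuning on text extracted from Open Data documents.
# The folder layout ("open_data_txt/"), model name, and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # placeholder small model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load every .txt file (one extracted document per file) as raw training text.
dataset = load_dataset("text", data_files={"train": "open_data_txt/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# The causal-LM collator pads batches and builds next-token-prediction labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="domain_lm",
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)

trainer.train()
trainer.save_model("domain_lm")
```

After a pass like this, the model tends to pick up the domain’s vocabulary and phrasing, which is the “lingo” effect described above.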
Ready to train your models with Open Data?
Contact us today to evaluate how DataSDR can add value to your software.