The desire and ability to genetically engineer organisms are becoming increasingly widespread, and the barriers to using the most sophisticated means of genome editing are falling rapidly. There is a corresponding risk that actors with malicious intent may use these tools to create more dangerous strains of pathogenic organisms. The tools for designing such organisms are currently far more sophisticated than the tools for detecting genetic engineering quickly, accurately, and automatically.
In consultation with B.Next’s biodefense community partners, IQT Labs B.Next and Lab41 have explored how applying machine learning (ML) approaches to DNA sequence analysis may provide “triage” tools that enable users to quickly assess the likelihood that the genome of a suspect organism has been engineered. The Labs obtained a diverse dataset from both public (US and European) and private (a synthetic biology company) sources and used it to train, validate, and test several ML models that detect the insertion of DNA from one organism or source into another. The performance of the trained models varied with the complexity of the underlying dataset, but was sufficient to illustrate the promise of ML-based approaches to rapid DNA sequence analysis for biodefense applications. Our findings also suggested immediate changes to model training that would likely improve performance.
The results of this project were recently briefed to the same members of the US biodefense community who helped frame the problem. The participants were uniformly optimistic about the potential for ML-based approaches, and recommended that future work refine the data sources used and establish confidence metrics for trained models.
• B.Next and colleagues in biodefense identified the lack of analytical tools for detecting genetic engineering as a significant obstacle to an effective response to a biothreat.
• ML has been applied to DNA analysis, but not extensively.
• The ML models developed by the IQT Labs team detect cloning boundaries – junctions between inserted DNA and the genome of the “destination” organism.
• The team created an in silico method to generate an unlimited source of synthetic cloning boundaries for use in model training and validation.
• The ML models we generated demonstrated classification accuracies between 74% and 93%; accuracy decreased as the complexity of the datasets used in training, validation, and testing increased.
• Our work complements but does not duplicate other efforts across the US Government (USG) and within IQT. B.Next-hosted discussions were the origin of IARPA’s FELIX program, which will generate larger-scale tools for detecting genetic engineering. B.Next staff were members of the source selection board for FELIX proposals.
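
To make the idea of in silico cloning boundaries more concrete, the following is a minimal illustrative sketch in Python. It is not the team's actual method, which this summary does not describe. It assumes a simple scheme in which an insert sequence is spliced into a host sequence at a random position and a fixed-length window centered on the upstream junction is labeled as a boundary (1), while windows drawn from the unmodified host are labeled as non-boundaries (0). All function names, the window size, and the use of random sequences in place of real genomes are hypothetical choices made for illustration.

```python
import random

BASES = "ACGT"


def random_dna(length, rng):
    """Return a random DNA string; a stand-in for a real genome (assumption)."""
    return "".join(rng.choice(BASES) for _ in range(length))


def make_boundary_example(host, insert, window, rng):
    """Splice `insert` into `host` at a random cut site and return a window
    centered on the upstream host/insert junction (hypothetical scheme)."""
    half = window // 2
    if len(insert) < half or len(host) < window:
        raise ValueError("insert must cover half a window; host a full window")
    # Keep at least `half` host bases upstream of the junction.
    cut = rng.randrange(half, len(host) - half + 1)
    engineered = host[:cut] + insert + host[cut:]
    return engineered[cut - half:cut + half]


def make_negative_example(host, window, rng):
    """Return a window drawn from the unmodified host (no junction)."""
    start = rng.randrange(0, len(host) - window + 1)
    return host[start:start + window]


def generate_dataset(host, insert, n_per_class, window=100, seed=0):
    """Build a shuffled list of (sequence, label) pairs:
    label 1 = window spans a cloning boundary, label 0 = host only."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_per_class):
        data.append((make_boundary_example(host, insert, window, rng), 1))
        data.append((make_negative_example(host, window, rng), 0))
    rng.shuffle(data)
    return data


# Usage: random sequences stand in for a destination genome and an insert.
_rng = random.Random(42)
example_host = random_dna(2000, _rng)
example_insert = random_dna(300, _rng)
example_dataset = generate_dataset(example_host, example_insert, n_per_class=5)
```

Because the examples are generated programmatically, the number of labeled windows is limited only by compute time, which is the sense in which such a generator is an "unlimited" source of training data. A realistic generator would draw hosts and inserts from real genomes and vectors rather than random strings.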