An effective response to a disease outbreak requires rapid identification of the pathogen and its source.

Genomic information has become an important component of effective biothreat agent identification, characterization, and attribution. The necessary bioinformatic analyses require known genomic data against which to compare the agent’s genomic data. However, a growing share of genomic data is privately held. To truly understand where an agent came from or to characterize its important features (e.g., virulence, alternative hosts, and environmental stability), the biodefense community will likely need to leverage the genomic data that reside in these private databases. This may be especially important when a truly novel agent is discovered and near-neighbors need to be identified. Security requirements for biothreat agent information and active investigations limit the direct sharing of genomic information with outside parties, and private entities are often unable to share access to their databases due to privacy and legal issues. Fortunately, technologies exist that enable secure computation while fulfilling data privacy requirements.

We developed the Secure Interrogation of Genomic Databases (SIG-DB) algorithm to enable the interrogation of a privately held database with a sequence of interest to determine the presence of similar sequences, without compromising the query or database information. This method was confirmed to be functional and evaluated using wild-type and in silico mutated versions of Escherichia coli and Staphylococcus aureus genomic sequences obtained from the NCBI RefSeq database.
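The in silico mutation step used for evaluation could be sketched as follows. This is a minimal illustration only: the source does not describe the mutation procedure, so the substitution-only model, the function name `mutate`, and the `rate` and `seed` parameters are all assumptions.

```python
import random

def mutate(seq, rate, seed=0):
    """Return a copy of `seq` with round(rate * len(seq)) point substitutions.

    Assumption: substitutions only (no insertions or deletions), at distinct
    positions, each replaced by a base different from the original.
    """
    rng = random.Random(seed)          # seeded so mutated genomes are reproducible
    bases = list(seq)
    n_sites = round(rate * len(bases))
    for i in rng.sample(range(len(bases)), n_sites):
        bases[i] = rng.choice([b for b in "ACGT" if b != bases[i]])
    return "".join(bases)

wild_type = "ACGT" * 25
variant = mutate(wild_type, rate=0.10)
```

Querying a database with variants generated at increasing rates gives a simple way to check that a similarity score falls as the query diverges from the stored sequence.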

This is the poster that was presented at the 2018 annual biothreats meeting, hosted by the American Society for Microbiology (ASM).

Genomic data are becoming increasingly valuable as we develop methods to utilize the information at scale and gain a greater understanding of how genetic information relates to biological function.  Advances in synthetic biology and the low cost of sequencing are increasing the amount of privately held genomic data.  As the quantity and value of private genomic data grow, so does the incentive to acquire and protect such data, which in turn creates a need to store and process these data securely.

This project explores the limitations, opportunities, and capabilities of secure computation techniques applied to DNA sequence comparisons. Using homomorphic encryption (a software-based encryption approach), we developed the Secure Interrogation of Genomic Databases (SIG-DB) protocol to enable searches of databases (DB) of genomic sequences with an encrypted query sequence, without revealing the query sequence to the database owner or any of the database sequences to the querier. Our results show that the SIG-DB algorithm returns an accurate assessment of the similarity of queries to databases of interest. The computational runtime and information leakage were compared between a fully homomorphic approach using the Microsoft SEAL library and a partially homomorphic approach using the Paillier cryptosystem. To our knowledge, SIG-DB is the first application to combine locality-sensitive hashing and homomorphic encryption to allow generalized sequence-to-sequence comparisons of genomic data.
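How locality-sensitive hashing and an additively homomorphic scheme can fit together is sketched below. This is not the SIG-DB protocol, whose details are not given here. It is a toy under stated assumptions: MinHash over k-mers stands in for the locality-sensitive hash, a textbook Paillier implementation with deliberately tiny (insecure) primes stands in for the cryptosystem, and a blinded equality test stands in for the similarity scoring. All function names and the parameters `k` and `num_hashes` are hypothetical.

```python
import hashlib
import math
import secrets

# --- Locality-sensitive hashing: MinHash over k-mers (assumed LSH choice) ---
def minhash(seq, num_hashes=32, k=8):
    """Return one minimum 32-bit hash per salted hash function."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    sig = []
    for salt in range(num_hashes):
        s = salt.to_bytes(2, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(km.encode(), digest_size=4, salt=s).digest(),
                "big")
            for km in kmers))
    return sig

# --- Toy Paillier (additively homomorphic); key size is NOT secure ----------
P, Q = 2**31 - 1, 2**61 - 1              # small Mersenne primes, demo only
N, N2 = P * Q, (P * Q) ** 2
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)   # lcm(P-1, Q-1)
MU = pow(LAM, -1, N)                     # valid for generator g = N + 1

def rand_unit():
    """Random integer in [1, N) that is coprime to N."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            return r

def encrypt(m):
    return (pow(N + 1, m % N, N2) * pow(rand_unit(), N, N2)) % N2

def decrypt(c):
    return ((pow(c, LAM, N2) - 1) // N * MU) % N

# --- Database owner: works only on ciphertexts of the query -----------------
def blinded_differences(enc_query_sig, db_sig):
    out = []
    for c, d in zip(enc_query_sig, db_sig):
        diff = (c * encrypt(-d)) % N2            # Enc(q_i - d_i)
        out.append(pow(diff, rand_unit(), N2))   # Enc(r * (q_i - d_i))
    return out

# --- Querier: encrypts the signature, decrypts only blinded differences -----
def similarity(query_seq, db_seq):
    enc_q = [encrypt(v) for v in minhash(query_seq)]
    blinded = blinded_differences(enc_q, minhash(db_seq))
    matches = sum(decrypt(c) == 0 for c in blinded)   # zero means equal hashes
    return matches / len(blinded)
```

In this sketch the database owner sees only ciphertexts, and the querier learns only which signature positions matched, because every nonzero difference is multiplied by a random factor before it is returned. The fraction of matching positions estimates the Jaccard similarity of the two k-mer sets. Which positions match is itself a form of information leakage, the kind of quantity the comparison between cryptosystems above is concerned with.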

We also explored an alternative approach that uses hardware-based secure computation, specifically Intel® Software Guard Extensions (SGX). We were unable to complete a prototype at this time due to the immaturity of the technology. However, our research findings indicate that SGX has the potential to enable a cloud-based secure computation system with, theoretically, minimal information leakage and similarity scoring execution times nearly equivalent to those of plaintext comparisons. Much research remains to be done to fully understand the operational and security limitations of the system.

We briefed government stakeholders on our prototype and findings at a recent B.Next event. Attendees expressed strong support for continued work on homomorphic encryption for secure interrogation of genomic databases. The participants provided valuable feedback on the tool and described numerous use cases from their own work that could be transformed by this approach. Although the algorithm was developed specifically for microbial genomics comparisons, SIG-DB could be useful for a number of applications, including healthcare, human genomics, organizational collaborations, and more.

The desire and ability to genetically engineer organisms are becoming increasingly widespread, and the barriers to using the most sophisticated means of genome editing are falling rapidly. There is a corresponding risk that actors with malicious intent may decide to use these tools to create more dangerous strains of pathogenic organisms. The sophistication of the design tools used to make such organisms currently far outstrips the capabilities of the tools that can detect genetic engineering quickly, accurately, and automatically.

In consultation with B.Next’s biodefense community partners, IQT Labs B.Next and Lab41 have explored how applying machine learning (ML) approaches to DNA sequence analysis may provide “triage” tools that enable users to quickly assess the likelihood that the genome of a suspect organism has been engineered. The Labs obtained a diverse dataset from both public (US and European) and private (a synthetic biology company) sources that was used to train, validate and test several ML models to detect the insertion of DNA from one organism or source into another. The performance of the trained models varied with the complexity of the underlying dataset, but was sufficient to illustrate the promise of ML-based approaches to rapid DNA sequence analysis for biodefense applications. Our findings also suggested immediate changes to model training that would likely improve performance.

The results of this project were recently briefed to the same members of the US biodefense community who helped frame the problem. The participants were uniformly optimistic about the potential for ML-based approaches, and recommended that future work include refinements in the data sources used and establishing confidence metrics for trained models.


• B.Next and colleagues in biodefense identified the lack of analytical tools for detecting genetic engineering as a significant obstacle to an effective response to a biothreat.

• ML has been applied to DNA analysis, but not extensively.

• The ML models developed by the IQT Labs team detect cloning boundaries – junctions between inserted DNA and the genome of the “destination” organism.

• The team created an in silico method to generate an unlimited source of synthetic cloning boundaries for use in model training and validation.

• The ML models we generated demonstrated classification accuracies between 74% and 93%; accuracy decreased as the complexity of the datasets used in training, validation and testing increased.

• Our work complements but does not duplicate other efforts across the US government (USG) and within IQT. B.Next-hosted discussions were the origin of IARPA’s FELIX program, which will generate larger-scale tools for detecting genetic engineering. B.Next staff were members of the source selection board for FELIX proposals.
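The in silico generation of synthetic cloning boundaries mentioned in the list above can be illustrated with a short sketch. The team's actual method is not described here, so everything below is an assumption: the function names, the fixed `window` size, the use of only the left junction of each insertion, and the choice of unmodified host windows as negative examples.

```python
import random

def random_dna(length, rng):
    """Hypothetical stand-in for real genome or plasmid sequence."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def boundary_window(host, insert, window, rng):
    """Splice `insert` into `host`; return a window centred on the left junction."""
    cut = rng.randrange(window, len(host) - window)   # needs len(host) > 2 * window
    engineered = host[:cut] + insert + host[cut:]
    half = window // 2
    return engineered[cut - half:cut + half]          # half host, half insert

def make_dataset(host, insert_source, n, window=50, seed=0):
    """Return 2 * n (sequence, label) pairs: label 1 spans a junction, 0 does not."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Positive example: a random fragment of the source spliced into the host.
        start = rng.randrange(0, len(insert_source) - window)
        fragment = insert_source[start:start + window]
        data.append((boundary_window(host, fragment, window, rng), 1))
        # Negative example: an unmodified window of the host.
        s = rng.randrange(0, len(host) - window)
        data.append((host[s:s + window], 0))
    return data

rng = random.Random(42)
host_genome = random_dna(2000, rng)       # placeholder "destination" organism
donor_sequence = random_dna(2000, rng)    # placeholder inserted-DNA source
examples = make_dataset(host_genome, donor_sequence, n=100)
```

Because insertion sites and fragments are sampled at random, the generator can produce as many labeled windows as training requires, which is the sense in which such a source is unlimited. A realistic version would draw hosts and inserts from real genomes and vectors rather than random strings.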