

Tools for drug discovery and chemical genomics research


The NSF ChemGen Integrative Graduate Education and Research Traineeship (IGERT) Program sponsored a collaboration that has produced a fast method for similarity searches of compounds. Such searches are especially useful once compounds have been identified that have a drug-like effect on an organism. ChemGen IGERT associate Yiqun Cao designed a method that greatly accelerates similarity searching regardless of the underlying similarity measure and allows compound databases to grow without degrading structural similarity search performance. The lack of instant response in structural similarity searches of large compound databases has been one of the major factors limiting their usefulness. When applied to the PubChem Compound database, containing more than 19 million compounds, this method reduced the average response time of atom pair-based similarity searches from over 93 seconds to below 0.5 seconds, and that of fingerprint-based searches from about 20 seconds to below 0.5 seconds.
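To make the baseline concrete, the following minimal sketch shows an exhaustive fingerprint-based similarity search: the kind of linear scan whose response time grows with database size. It is not the accelerated method described above. It assumes fingerprints are represented as sets of "on" bit positions and uses the Tanimoto coefficient as the similarity measure; the function names and the toy compound IDs are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def linear_search(
    query: FrozenSet[int],
    database: Dict[str, FrozenSet[int]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score the query against every compound and return the top_k best hits."""
    scored = [(name, tanimoto(query, fp)) for name, fp in database.items()]
    # Sort by descending similarity, breaking ties by compound name.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Toy database with hypothetical compound IDs and made-up fingerprints.
toy_db = {
    "cpd_a": frozenset({1, 2, 3, 4}),
    "cpd_b": frozenset({1, 2, 3, 9}),
    "cpd_c": frozenset({7, 8, 9}),
}
hits = linear_search(frozenset({1, 2, 3, 4}), toy_db, top_k=2)
print(hits)
```

Every query in this sketch touches every compound, which is why an exhaustive scan over millions of compounds cannot respond instantly.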

This method has also proved useful in other important discovery tools such as cluster analysis. By adopting this method in Jarvis-Patrick clustering, it was demonstrated that the PubChem Compound database could be clustered using an atom pair- or fingerprint-based similarity measure within one day on a regular computer cluster, compared to an estimated time of over a year using traditional methods. This project involves student training in cheminformatics and computer science. The cheminformatics components of this project are supervised by bioinformatician Assistant Professor Thomas Girke. Consultation in computer science is provided by Professor Tao Jiang.
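The clustering step mentioned above can be illustrated with a small sketch of Jarvis-Patrick clustering in its common textbook form: two items are merged when each appears in the other's nearest-neighbor list and the two lists share at least a minimum number of neighbors. The sketch assumes the nearest-neighbor lists have already been computed (for example by a similarity search); the function name, the `min_shared` parameter, and the toy data are hypothetical and not taken from the published software.

```python
from typing import Dict, List, Set


def jarvis_patrick(neighbors: Dict[str, List[str]], min_shared: int) -> List[Set[str]]:
    """Cluster items from precomputed nearest-neighbor lists (self excluded)."""
    parent = {item: item for item in neighbors}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x: str, y: str) -> None:
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    neighbor_sets = {item: set(nbrs) for item, nbrs in neighbors.items()}
    for a, nbrs_a in neighbor_sets.items():
        for b in nbrs_a:
            if b not in neighbor_sets:
                continue  # neighbor without its own list: skip it
            nbrs_b = neighbor_sets[b]
            # Merge only mutual neighbors that share enough other neighbors.
            if a in nbrs_b and len(nbrs_a & nbrs_b) >= min_shared:
                union(a, b)

    clusters: Dict[str, Set[str]] = {}
    for item in neighbors:
        clusters.setdefault(find(item), set()).add(item)
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Toy neighbor lists (k = 2) for five hypothetical compounds.
toy_neighbors = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["e", "a"],
    "e": ["d", "a"],
}
toy_clusters = jarvis_patrick(toy_neighbors, min_shared=1)
print(toy_clusters)
```

In this form the expensive part is building the neighbor lists, which requires a nearest-neighbor search per compound; that is where a faster similarity search would be expected to pay off.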

Weblink for the software and an online service for searching the ~40 million compounds in PubChem with sub-second response times:

Cao, Y., Jiang, T., Girke, T. (2010) Accelerated Similarity Searching and Clustering of Large Compound Sets by Geometric Embedding and Locality Sensitive Hashing. Bioinformatics 26: 953-959.

Address Goals

ChemMine is an infrastructure resource for all researchers involved in drug discovery and chemical biology. It serves as a bridge between chemistry, biology and informatics, and it is publicly available.