Introduction to a New View of Drug Discovery Through Machine Learning Algorithms

On March 16, 2017, President Donald J. Trump released his first budget proposal to Congress. The proposal had many surprising aspects, such as a huge increase in defense spending and massive cuts to the EPA and other organizations, but the most surprising was the 20% cut in funding for the National Institutes of Health (NIH). The NIH conducts medical research itself and funds medical research at many universities. This cut should not go unnoticed: funding for the NIH constitutes only 0.5% of U.S. tax revenue, yet it pays for vital work on improving the health of the nation. To defend the role of the NIH in the nation, I want to discuss a new method that could revolutionize the field of drug discovery, and to convince people through social media to support funding for the NIH so that it can pursue research into this new method.



Before I explain the new method that could be revolutionary in the field of drug discovery, I want to first explain the conventional method of drug discovery: the evolutionary algorithm.

There are two processes that have to occur in order to conduct an evolutionary algorithm:

  • Structure to Property
  • Property to Structure

The first process is what the name implies: we take the structure of a molecule and determine the properties that the molecule has.

The properties that are important in the field of drug discovery include, but are not limited to:

  • Intermolecular forces (electrical attractions between molecules that arise from partial positive and partial negative charges spread across each molecule), including but not limited to hydrogen bonds (the attraction between a hydrogen atom bonded to an oxygen, nitrogen, or fluorine atom and another nearby oxygen, nitrogen, or fluorine atom) and dipole-dipole attractions (the attraction between the partially positive end of one polar molecule and the partially negative end of another). This property is important because many illnesses and diseases are directly caused by the overproduction of certain enzymes (enormous molecules that can speed up a chemical reaction that would otherwise take days, months, or years to happen). The right molecule with the right intermolecular forces can fit in the active site of the enzyme and inhibit it (that is, stop the enzyme from speeding up its chemical reaction).
  • Acidity or basicity of the compound, or of parts of the compound. This is crucial because the human body functions within an extremely narrow pH range. The body has strong buffers (normally a combination of two compounds that together hold the pH steady), such as the bicarbonate/carbonic acid buffer, to absorb changes in pH, but a buffer can only handle so much acidity or basicity. Drugs that are taken regularly can overwhelm it and shift the pH, causing either acidosis (the condition in which the body is experiencing too much acidity) or alkalosis (the condition in which the body is experiencing too much basicity).
  • Shape of the molecule. Because many illnesses and diseases can be linked to a malfunctioning or overactive enzyme, a drug that inhibits the enzyme effectively has to have a shape that fits in the enzyme’s active site (the area of the enzyme where the chemical reaction is sped up).

Structure to property is the first important step in the broader process known as the evolutionary algorithm. The next step goes the other way: property to structure. Again, the name directly explains the step. Once a clear model has been created that can determine the properties of a molecule from its structure, the next step is to identify the properties that would most effectively treat the illness or disease and then, from those properties, obtain the structure of a molecule that has all of them.

The second step is the most important, as it defines the evolutionary algorithm, also called the genetic algorithm.

The genetic algorithm is as follows.

  1. Start with a very simple molecule, such as methane (one carbon atom with four hydrogen atoms attached).
  2. Using the structure to property model, calculate the properties of methane and how desirable those properties are for treating the illness or disease.
  3. Add various functional groups to methane and repeat Step 2 for each new molecule. If a new molecule has more favorable properties, keep it for Step 4.
  4. Take the mutated molecules that have better properties than methane and repeat Steps 2 and 3 on them. Keep repeating until further edits to the molecules no longer give better properties than before.
  5. From the resulting list of very good candidates, select the molecules that
    1. are not toxic, according to toxicity calculations, and
    2. exhibit properties that are extremely beneficial for treating the illness or disease at hand.
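
To make the steps above concrete, here is a minimal sketch of the loop in Python. Everything chemical in it is an assumption made for illustration: the functional groups, the numbers in GROUP_SCORES, the score function standing in for the structure to property model, and the rule that chlorine makes a molecule "toxic" are all made up, and a molecule is represented only as a tuple of group names attached to one carbon. It also keeps a small pool of the best molecules each round, which is one common way to run a genetic algorithm rather than the only way.

```python
import random

# Hypothetical functional groups with made-up desirability contributions.
# These numbers are placeholders, not real chemistry.
GROUP_SCORES = {"OH": 2.0, "NH2": 1.5, "COOH": 1.0, "CH3": 0.5, "Cl": -1.0}
MAX_GROUPS = 4  # methane has four hydrogens that can be swapped out


def score(molecule):
    """Toy stand-in for the structure to property model (Step 2)."""
    total = sum(GROUP_SCORES[group] for group in molecule)
    # Made-up penalty for attaching the same group more than once.
    repeats = len(molecule) - len(set(molecule))
    return total - 1.5 * repeats


def mutate(molecule, rng):
    """Step 3: attach a new functional group or swap an existing one."""
    child = list(molecule)
    group = rng.choice(sorted(GROUP_SCORES))
    if len(child) < MAX_GROUPS and (not child or rng.random() < 0.5):
        child.append(group)
    else:
        child[rng.randrange(len(child))] = group
    return tuple(sorted(child))


def evolve(generations=30, pool_size=8, children_per_parent=4, seed=0):
    """Steps 1 to 4: start from methane and keep the best mutants each round."""
    rng = random.Random(seed)
    pool = [()]  # Step 1: plain methane, no groups attached yet
    for _ in range(generations):
        candidates = set(pool)  # parents stay in the running
        for parent in pool:
            for _ in range(children_per_parent):
                candidates.add(mutate(parent, rng))
        # Step 4: only the highest-scoring molecules survive to the next round.
        pool = sorted(candidates, key=lambda m: (-score(m), m))[:pool_size]
    return pool


def select_non_toxic(molecules):
    """Step 5: drop candidates flagged by a (made-up) toxicity rule."""
    return [m for m in molecules if "Cl" not in m]


final_pool = evolve()
safe_candidates = select_non_toxic(final_pool)
```

Because the parents always compete with their mutated children, the best score in the pool can never get worse from one round to the next, which mirrors the idea of keeping a mutation only when it improves the molecule.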

Although pharmaceutical companies use this method on a very large scale as we speak, I want to address a new algorithm for drug discovery that could

  • be used to advance the field of personalized treatment
  • be a method that the NIH could delve further into. Once the NIH has created a refined algorithm based on the one I am proposing, pharmaceutical companies could produce drugs that help curb the opioid crisis in the United States. For this reason, the new algorithm could be a convincing argument against President Trump’s dramatic cuts to the NIH.

(For more information on the opioid crisis, consult this video by the comedian John Oliver and another video from the news site Vox.)

The new algorithm is known as the machine learning algorithm for drug discovery.

Before I go further into how the machine learning algorithm for drug discovery works, I need to discuss the concept of machine learning itself.

Imagine you have a set of apples and pears. For a human being, it is not difficult to distinguish between an apple and a pear once someone has told you, “This is an apple and this is a pear.” The first time that person showed you, you noticed the features that differentiate the apple and the pear: shape, size, and color. From then on, you were able to tell the apple from the pear every single time.

Machine learning is a very similar concept, but instead of you, the computer has to learn the difference between the apple and the pear. To do this, the machine needs a large set of data with many apples and many pears. These apples and pears are divided into a training set and a test set: the training set is where the computer learns how to differentiate between an apple and a pear, and the test set is used to see how well it learned.
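
As a rough sketch of the training set and test set idea, the Python below builds a tiny apple and pear classifier. The fruit data is synthetic and the two features (a height-to-width ratio and a "greenness" score) are my own made-up assumptions, chosen so the two fruits are easy to separate. The learning method is a simple nearest-centroid rule: the computer averages the features of each fruit in the training set, then labels a new fruit by whichever average it is closest to.

```python
import random


def make_fruit(label, rng):
    """Generate one synthetic fruit as (features, label). Values are made up."""
    if label == "apple":
        ratio = rng.uniform(0.9, 1.1)      # about as tall as it is wide
        greenness = rng.uniform(0.1, 0.4)
    else:
        ratio = rng.uniform(1.5, 1.7)      # taller than it is wide
        greenness = rng.uniform(0.5, 0.8)
    return (ratio, greenness), label


def centroid(points):
    """Average each feature over a list of feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def train(examples):
    """Learn one average feature vector per label from the training set."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(points) for label, points in by_label.items()}


def predict(model, features):
    """Label a fruit by the closest learned average."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(model, key=lambda label: squared_distance(model[label]))


def accuracy(model, examples):
    """Fraction of examples the model labels correctly."""
    correct = sum(predict(model, features) == label for features, label in examples)
    return correct / len(examples)


rng = random.Random(42)
data = [make_fruit("apple", rng) for _ in range(50)]
data += [make_fruit("pear", rng) for _ in range(50)]
rng.shuffle(data)

# 80% of the fruit is used for learning, 20% is held back for testing.
split = int(0.8 * len(data))
training_set, test_set = data[:split], data[split:]

model = train(training_set)
test_accuracy = accuracy(model, test_set)
```

The important design choice is that the fruit in test_set is never shown to train, so test_accuracy measures how well the computer handles fruit it has not seen before.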


In drug discovery, the machine learning algorithm follows the same idea, but instead of apples and pears, the items are molecules that have been tailored to a specific person’s genome.

Given an extensive set of data on drugs that work and don’t work for a specific person based on his or her genome, the machine learning algorithm can become extremely effective and efficient at determining whether a new molecule could be a potential drug.

The only problem with this algorithm is this: how do we know whether a drug would work in conjunction with a person’s specific genome? Countless people at the NIH, professors at private universities, and students in the field of bioinformatics have spent countless hours researching this question. We are really close to solving it, but there’s another problem. As of right now, the bioinformatics algorithms aimed at this question can take days, months, or even years to run, and the deadline for persuading the United States Congress not to cut funding for the NIH in the next budget resolution is coming closer and closer. Because of this, the best approach is to test a thousand or two thousand drugs for a specific person and a specific disease. Once a computer has learned why some of those drugs succeed in that person and others fail, it can judge for itself whether new molecules are feasible as drugs, cutting the time of the process from days, months, or years to mere seconds or minutes.

Every change needs a basis. To convince our president that funding for the NIH is important, we need to propose a radical new method for drug discovery that the NIH can improve upon in order to demonstrate its effectiveness in dealing with the medical problems of today. If the NIH can improve on the new algorithm for drug discovery, the opioid problem could shrink drastically, and pharmaceutical companies could design more effective drugs and create a bigger market, resulting in a better economy, something that our president has promised. As my final statement, I urge the public to mobilize through social media and convince the administration that funding for the NIH is necessary, using as a solid basis a promising new drug discovery algorithm that the NIH could look further into.
