Requirement
This is the last practical exercise and will continue over the remaining weeks of the course.
In this practical you will implement a real molecular similarity method, Ultrafast Shape Recognition, to search compound databases for similar molecular shapes.
The problem involves reading one reference molecule from a file and calculating a descriptor for it, then reading a series of molecules from a second file, computing the descriptor for each molecule and quantifying the difference between it and the reference. At the end of the run the program should report the closest molecule and the magnitude of its difference from the reference. All files will be in SD format, and hydrogens should be completely ignored in the procedure.
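Reporting the closest molecule only needs a running best while the second file is read. The following is a minimal sketch, assuming the name and difference of each candidate have already been computed; the class BestMatch and its methods are hypothetical names chosen for illustration, not part of the CDK.

```java
// Hypothetical helper: remembers the candidate with the smallest difference seen so far.
final class BestMatch {
    private String name = null;                              // no candidate seen yet
    private double difference = Double.POSITIVE_INFINITY;    // any real difference beats this

    // Offer one candidate; it is kept only if it is strictly closer than the current best.
    void offer(String candidateName, double candidateDifference) {
        if (candidateDifference < difference) {
            name = candidateName;
            difference = candidateDifference;
        }
    }

    String name() {
        return name;
    }

    double difference() {
        return difference;
    }
}
```

Call offer once per molecule inside the reading loop, then print name() and difference() after the loop finishes. With this sketch, ties keep the first molecule encountered.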
The descriptor we will calculate consists of 4 triples of numbers. Each triple consists of 3 statistical measures of distances from a point.
The measures are:
- The mean distance from the point (the sum of all distances divided by the number of distances).
- The variance of this distance (the sum of the squares of (distance - mean), all divided by the number of distances minus 1).
- The skew of this distance (the sum of the cubes of ((distance - mean) / standard deviation), all divided by the number of distances). The standard deviation is the square root of the variance.
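The three measures above can be sketched as follows. This is one possible reading of the definitions, not the example methods supplied with the practical; the class name UsrStats is hypothetical, and returning 0 for a variance with fewer than two distances or a skew with zero standard deviation is an assumption made here to avoid dividing by zero.

```java
// Hypothetical helper class: the three statistical measures over an array of distances.
final class UsrStats {

    // Mean: sum of all distances divided by the number of distances.
    static double mean(double[] d) {
        double sum = 0.0;
        for (double x : d) {
            sum += x;
        }
        return sum / d.length;
    }

    // Variance: sum of squared deviations from the mean, divided by (n - 1).
    static double variance(double[] d) {
        if (d.length < 2) {
            return 0.0; // assumption: undefined case reported as 0
        }
        double m = mean(d);
        double sumSquares = 0.0;
        for (double x : d) {
            sumSquares += (x - m) * (x - m);
        }
        return sumSquares / (d.length - 1);
    }

    // Skew: sum of cubed standardised deviations, divided by n.
    static double skew(double[] d) {
        double sd = Math.sqrt(variance(d));
        if (sd == 0.0) {
            return 0.0; // assumption: all distances equal, so no skew
        }
        double m = mean(d);
        double sumCubes = 0.0;
        for (double x : d) {
            double z = (x - m) / sd;
            sumCubes += z * z * z;
        }
        return sumCubes / d.length;
    }
}
```

Each of the four reference points gives one array of distances, so calling mean, variance and skew on each array yields the 12 numbers of the descriptor.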
The four points from which we calculate these distances are:
- The centre of gravity (COG)
- The closest atom position to the COG
- The furthest atom position from the COG
- The furthest atom position from point 3 above.
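The four points above can be located as in the sketch below. To keep it self-contained it uses plain double[3] arrays in place of Point3d, and it assumes that the centre of gravity is the unweighted average of the atom positions and that hydrogens have already been filtered out of the input. The class UsrPoints and its method names are hypothetical.

```java
// Hypothetical helper class: finds the four reference points from heavy-atom coordinates.
final class UsrPoints {

    // Straight-line distance between two 3D points stored as {x, y, z}.
    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    // Assumption: centre of gravity taken as the unweighted mean position.
    static double[] centroid(double[][] atoms) {
        double[] c = new double[3];
        for (double[] atom : atoms) {
            for (int k = 0; k < 3; k++) {
                c[k] += atom[k];
            }
        }
        for (int k = 0; k < 3; k++) {
            c[k] /= atoms.length;
        }
        return c;
    }

    // Index of the atom nearest to p (first one wins on ties).
    static int closestTo(double[] p, double[][] atoms) {
        int best = 0;
        for (int i = 1; i < atoms.length; i++) {
            if (distance(p, atoms[i]) < distance(p, atoms[best])) {
                best = i;
            }
        }
        return best;
    }

    // Index of the atom furthest from p (first one wins on ties).
    static int furthestFrom(double[] p, double[][] atoms) {
        int best = 0;
        for (int i = 1; i < atoms.length; i++) {
            if (distance(p, atoms[i]) > distance(p, atoms[best])) {
                best = i;
            }
        }
        return best;
    }

    // Returns {COG, closest to COG, furthest from COG, furthest from point 3}.
    static double[][] referencePoints(double[][] atoms) {
        double[] cog = centroid(atoms);
        double[] closest = atoms[closestTo(cog, atoms)];
        double[] furthest = atoms[furthestFrom(cog, atoms)];
        double[] furthestFromFurthest = atoms[furthestFrom(furthest, atoms)];
        return new double[][] {cog, closest, furthest, furthestFromFurthest};
    }
}
```

With the CDK, the same logic applies to the Point3d objects returned by each atom, using the distance method of Point3d instead of the helper here.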
To calculate the difference between one set of 12 doubles and another, simply do the equivalent of a distance calculation, but over all 12 numbers.
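One way to read "the equivalent of a distance calculation" is the straight-line distance formula extended from 3 coordinates to 12. The sketch below assumes that reading; the class UsrDifference and the method difference are hypothetical names.

```java
// Hypothetical helper class: difference between two descriptors of equal length.
final class UsrDifference {

    // Square root of the sum of squared element-wise differences.
    static double difference(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("descriptors must have the same length");
        }
        double sumSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares);
    }
}
```

A difference of 0 means the two descriptors are identical, and smaller values mean more similar shapes.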
Remember, we know how to read SD files from a previous practical; however, here is a reminder.
In order to access the CDK library you will need some import statements:
    import java.io.FileInputStream;
    import org.openscience.cdk.CDKConstants;
    import org.openscience.cdk.Molecule;
    import org.openscience.cdk.DefaultChemObjectBuilder;
    import org.openscience.cdk.io.iterator.IteratingMDLReader;
    import org.openscience.cdk.io.MDLWriter;
    import org.openscience.cdk.interfaces.*;
To read a single molecule from an SD file you could use something like:
    IteratingMDLReader MDLReader = new IteratingMDLReader(new FileInputStream(RefFile), DefaultChemObjectBuilder.getInstance());
    if (MDLReader.hasNext()) {
        mymol = (Molecule) MDLReader.next();
    }
To read a sequence of molecules from an SD file:
    MDLReader = new IteratingMDLReader(new FileInputStream(ScrFile), DefaultChemObjectBuilder.getInstance());
    while (MDLReader.hasNext()) {
        mymol = (Molecule) MDLReader.next();
        // process mymol here before the next molecule is read
    }
    MDLReader.close();
To get the name of a Molecule object (here called m1):
    Name = String.valueOf(m1.getProperty(CDKConstants.TITLE));
To get its number of atoms:
    int natoms = m1.getAtomCount();
You can get each atom in a molecule by:
    IAtom myatom = m1.getAtom(i);
where i is the index of the atom, counting from 0.
You can get the chemical symbol from each atom:
    String s1 = myatom.getSymbol();
You can get the coordinates as a Point3d object by:
    Point3d mypoint = myatom.getPoint3d();
(To use the Point3d class you have to import javax.vecmath.Point3d.)
The Point3d class has a method called distance, which returns the distance between the calling instance and its argument, so:
    Point3d a, b;
    ...
    double d = a.distance(b);
In addition to the usual criteria of functionality, readability, comments and a readme file, I request that you prepare a document called plan.txt in which you write a simple logic plan for the program.
In order that you don’t get bogged down in the statistics I have given you a set of example methods to calculate mean, variance and skew.