BINF2010辅导、辅导Python编程设计

BINF2010 – Scripting assignment “Glue” programming and scripting are important skills in bioinformatics, allowing the automation of complex and repetitive tasks when analysing large datasets. This assignment is worth 15% of the total mark for the course and requires you to write a script to automate a “pipeline” of bioinformatics tasks. Two options for the assignment are presented. Option 1 involves writing a bash shell script while option 2 involves a scripting language such as Perl or Python. Students who are doing or have done COMP2041 Software Construction and are familiar with Perl should be doing Option 2. If you have or are taking COMP2041 and you choose Option 1 your mark will be capped at 65%.
General considerations:
? Your script should run at the command line in a CSE account, but can assume that the binftools set-up command (source
~binftools/setup-env.sh) has already been run. You can make use of any programs installed in binftools (type ls ~binftools/bin for a list) as well as remote databases and software that you can access using their RESTful API.
Your script should not crash or hang when given incorrect user input (eg, no arguments, or an incorrect argument) and instead should quit gracefully with a suitable error message.
Your script should be properly formatted and commented. Some marks will be given for code that can be understood easily.
You should try to include an additional feature to your script in addition to what is in the spec. It could be an additional analysis step, or an improvement to the suggested user interface. Indicate what that additional feature is and how to activate it in the comments of your script. This additional feature is worth 5% of the mark for the assignment (out of 100%). For example, for option 2, an extra feature could be to have your script also output a graphical representation of your annotations.
That aside the script will be marked on its ability to perform the required task correctly, and its compliance with the specification.
Your script should be your own work, however it is OK to use, as part of it, open source code that has been published as long as its source is clearly acknowledged in the comments. This is especially for option 2.
You can make use of open source libraries such as BioPerl for some tasks as long as this use is acknowledged in the comments, and the script does not require any additional special installation steps such as installing extra libraries. Any extra code should be either already installed on the CSE computers or binftools, or included in the script itself.
Your submission should consist of only one file, either .sh or .pl or .py, and should be directly executable.
Submission will be through a link on the course website.
Submission deadline will be advised on the website.
Please ask your questions on the Assignment forum on Moodle. Email questions will be redirected to the forum.
Finally: TEST, TEST and then TEST some more. Use a variety of inputs, testing locations etc, and make sure your program has no unintended consequences. One year a student submitted a script that did what was asked but then erased everything in the current directory (except for the output file it had created). This was not a good outcome.
Option 1: Bash shell script Write a bash shell script that retrieves all the reviewed mammalian protein sequences from UniprotKB that correspond to a specific gene name (specified as argument 1 when running the script) then creates a multiple sequence alignment out of these sequences then builds a phylogenetic tree from this multiple sequence alignment.
Your submission should consist of one file named uniTree.sh
The script should be invoked using the command ./uniTree.sh GENE where GENE is a gene code in UniprotKB, found in the Gene name [GN] field. Example gene names include INS and EGF. The output of the script should be a phylogenetic tree file in Newick format (nested parentheses) suitable for input into a tree visualisation program such as Dendroscope. The file should be named GENE.tree (where GENE is the gene name specified at the command line). The leaves of the tree should be labelled with only the species name as much as the program allows you to (in some cases you may get multiple proteins from the same species in which case you may need to add a number to the name in the tree). An example output for EGF is on Moodle. However, note that because the databases are constantly updated, the example output may not be exactly as your program will generate it.
The script should perform the following suggested steps:
o Retrieve from UniprotKB all the sequences with the specified gene name in their GN field that also are of Mammalian origin (field OC=
Mammalia [40674]) and have been reviewed for inclusion into SwissProt (field Reviewed set to Yes). The sequences should be retrieved in one file in fasta format. You probably will need to make use of the Uniprot RESTful API (http://www.uniprot.org/help/programmatic_access) for this, together with the UNIX command wget.
o Edit the header of the fasta sequences so that they contain only the species name. If there are multiple sequences from the same species (for example a protein with multiple isoforms) you can add a number to the species name – eg Mus musculus1 and Mus musculus2.
o Run a multiple sequence alignment program (eg clustalw) to align the sequences and create a multiple sequence alignment
o Run another program to build a phylogenetic tree out of the multiple sequence alignment. Clustalw with the –tree option is probably the easiest option for this as phylip programs are quite
tricky to script and require writing a “batch file” (but do not simply use the clustalw guide tree created during multiple sequence alignment (.dnd file) as the tree as it is not a true phylogenetic tree).
o Rename the tree file if necessary.
o Delete any intermediary files created by the script, keeping only the final tree output (and any other files that were already there when you started the program).
Option 2: Perl or Python script Select this option if you are already familiar with a scripting language, for example if you are doing or have done COMP2041 Write a Perl or Python script that 1. takes as input a bacterial genome sequence in fasta format 2. searches it for ORFs 3. selects the 3 longest ORFs of length ≥ 150bp/50aa (if there are not 3 ORFs that long, selects only the ones over the minimum length) 4. translates them into protein 5. searches the PDB database for matching proteins with at least 40% identity using the RSCB PDB Search API (https://search.rcsb.org/index.html#search-api), retrieving only the highest-scoring hit 6. Formats the results into a table in CSV format The table should have up to 3 data rows (for the 3 longest ORFs) plus one header row with the following headers:
Start: the start position of the ORF on the sequence
End: the end position of the ORF on the sequence (note: for ORFs on the reverse strand, the end position index will be smaller than the start position)
Strand: FORWARD or REVERSE
PDB_ID: the PDB identifier of the top hit matching the translated ORF in the PDB database. If there is no similar sequence with at least 40% sequence identity then this field should just contain “-“ (no quotes)
E-value: the e value for the top BLAST hit, provided it is equal or lower than 1. If it is >1 then the field should contain “-“ (no quotes) The lines in the CSV file should be sorted by start position. An example input and output (geobacter.fasta, geobacter.csv) are provided in Moodle. However, note that because the databases are constantly updated, the example output may not be exactly as your program will generate it.
Your submission should be a single file named geneannot.pl or geneannot.py
The script should be invoked at the command line using the command
./geneannot.pl filename.fasta (or ./geneannot.py
filename.fasta) where filename.fasta is the name of a nucleotide sequence file in fasta format containing a segment of a bacterial genome
Your script should work with nucleotide sequences of length of up to 20,000 nucleotides (it can work with longer sequences if you want to but you should assume a test sequence of up to 20 kbps).
Script should quit gracefully if the input sequence is invalid or not in fasta format.
Script output should be in CSV format and go to standard output so it can be redirected if necessary. For example:
o ./geneannot.pl foo.fasta will display the CSV output on the screen.
o ./geneannot.pl foo.fasta > bar.csv will save the output to a file named bar.csv
Suggested steps for the script:
o Submit the input sequence to an ORF detection program of your choice (as long as it is installed in binftools)
o For each of the 3 longest ORFs identified longer than (or equal to) 150bp, its protein translation should be submitted using the Protein Sequence Search API at RSCB PDB.
o Parse both the ORF detection and the RSCB search outputs to extract the required information into a table in the specified format
o Sort the ORFs by start position
o Delete any intermediary files created by the script, keeping only the final CSV output (and any other files that were already there when you started the program).