Homework 1: Computational Bioscience

1.  Building the gold standard data for creating a  program

Find a friend.  Both of you tag the part of speech of each word in the following ten tweets.  Then calculate the agreement between the two of you, and explain to me two sources of disagreement between the two of you.  For this question, you will turn in:

  1. The set of tags that you used
  2. The text that you tagged
  3. The tags that you assigned
  4. The tags that your friend assigned
  5. An explanation of two sources of disagreement between the two of you–“we disagreed about nouns and verbs” is a statement, not an explanation
  6. The calculated agreement.  For this, you have two choices:
  • Do it by hand.  In this case, scan the paper with your calculations and add that to your PDF…
  • …or, write a program to calculate those numbers, and send me your code and your output.  Again, this should be in your PDF.  Your code can be a program in R, Python, or the programming language of your choice–or even an Excel spreadsheet.
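
If you take the programming route, the calculation might look something like the sketch below. The assignment does not say which agreement measure to use, so this is only one reasonable reading: it computes the raw proportion of words on which the two taggers agree and, as an assumed extra, Cohen's kappa, a common chance-corrected measure. The tag names and the two toy tag lists are made up for illustration; they are not the tweets above.

```python
from collections import Counter


def observed_agreement(tags_a, tags_b):
    """Proportion of tokens that both annotators tagged identically."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must tag the same tokens")
    if not tags_a:
        raise ValueError("no tokens to compare")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


def cohens_kappa(tags_a, tags_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(tags_a)
    p_o = observed_agreement(tags_a, tags_b)
    counts_a = Counter(tags_a)
    counts_b = Counter(tags_b)
    # Chance agreement: probability that both pick the same tag if each
    # annotator tagged at random according to their own tag frequencies.
    p_e = sum(counts_a[tag] * counts_b[tag] for tag in counts_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one and the same tag everywhere;
        # kappa is undefined here, so report perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical tags for a five-word snippet (illustration only).
mine = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN"]
friend = ["NOUN", "VERB", "VERB", "ADJ", "NOUN"]

print(observed_agreement(mine, friend))      # 0.8
print(round(cohens_kappa(mine, friend), 4))  # 0.6875
```

Whichever measure you report, say which one it is, since raw agreement and kappa can differ a lot when a few tags dominate.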

The ten Tweets

  1. If you work in the gaming industry we need your help, please vote for the Child Brain Injury Trust
  2. That poor child probably has a brain injury after that celebration. Good lawd
  3. A common priority arising during treatment in women who have had a head/brain injury is their inner child; their inner young girl.
  4. in and could lead to in later
  5. Since Children have one of the HIGHEst risk of sustaining a Brain Injury, would love to help out Parents with…
  6. . was diagnosed with acquired brain injury in secondary school following surgery for a brain tumour
  7. my head hurts….but a wip! new oc i think
  8. and my head hurts so much from thinking about the same thing over and over again
  9. Courtney’s stupidity hurts my head
  10. Reading, huh. I used to read a lot when I was younger. My head hurts with how I only really read reference books these days….

These 10 tweets were selected from a number of searches.  The bolded words tell you what my search was, but are not relevant to anything else about this exercise.  I did not select the tweets randomly; neither did I edit them.

2. Given a set of correct answers and a set of answers from your program, determine the true positives, true negatives, false positives, and false negatives.

Suppose that we have a system that classifies genes as being potentially druggable, or not.  The data is in this file.  means that a gene is druggable, and means that it is not. The column labelled gold.standard specifies the correct answer.  The column labelled system.output is what our system thinks the answer is.  Note that since we have a binary classification (either yes or no) and a defined set of examples with no boundary issues, we can determine the number of true negatives, which isn’t always the case in bioinformatics.

For this question, submit the counts of true positives, true negatives, false positives, and false negatives.  So, your answer should look something like this:

  • true positives: 613
  • true negatives: 1024
  • false positives: 1789
  • false negatives: 1871
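
As a rough sketch of how those four counts can be tallied from two parallel columns: the labels "Y" and "N" and the six-row toy data below are hypothetical stand-ins, not the contents of the actual file, and the function name is my own. In practice you would read the gold.standard and system.output columns from the file and pass in whatever value marks a druggable gene.

```python
def confusion_counts(gold, system, positive):
    """Count TP, TN, FP, FN for a binary classification.

    gold     -- correct answers (e.g. the gold.standard column)
    system   -- system answers (e.g. the system.output column)
    positive -- the label that means "yes"; any other label counts as "no"
    """
    if len(gold) != len(system):
        raise ValueError("gold and system must have the same length")
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for g, s in zip(gold, system):
        if g == positive and s == positive:
            counts["tp"] += 1  # system said yes, and it was right
        elif g != positive and s != positive:
            counts["tn"] += 1  # system said no, and it was right
        elif g != positive and s == positive:
            counts["fp"] += 1  # system said yes, but the answer is no
        else:
            counts["fn"] += 1  # system said no, but the answer is yes
    return counts


# Hypothetical toy data: "Y" = druggable, "N" = not druggable.
gold = ["Y", "Y", "N", "N", "Y", "N"]
system = ["Y", "N", "N", "Y", "Y", "N"]

print(confusion_counts(gold, system, positive="Y"))
# {'tp': 2, 'tn': 2, 'fp': 1, 'fn': 1}
```

A quick sanity check on your own answer: the four counts should sum to the number of rows in the file.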

3. Modeling the effect of rarity on accuracy

One of the problems with accuracy is that it can be a terrible over-estimate of the aspects of system performance that you care about. This is especially true in the situation where what you care about the most is identifying the positive situations, and even more so, when the positive situations are rare.

To see how this works, suppose that our data set contains four phone calls to the admissions desk of an emergency room, all of them true emergencies.  Our job is to build a program that correctly classifies phone calls as emergencies when they are, in fact, emergencies.  Of the four true emergencies, our system says that it is really an emergency for only three. Also suppose that if a situation is not an emergency, the system always says, correctly, that it is not an emergency. Calculate the accuracy as the number of true negatives goes up from 0 to 100, which means that the positives become rarer and rarer.  Graph this with the number of true negatives on the x axis and the accuracy on the y axis.  As always, make the range for accuracy on the y axis be from 0 to 1.0.

To clarify: your first data point will be 3 true positives, one false negative, and no true negatives or false positives.  The next data point will be 3 true positives, one false negative, one true negative, and no false positives.  Continue until you have the data point for 3 true positives, 100 true negatives, one false negative, and no false positives.
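
The data points described above can be generated along these lines. This is a minimal sketch that assumes the usual definition of accuracy, (TP + TN) / (TP + TN + FP + FN); the function names are my own, and the plotting is left to whatever tool you prefer.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all cases that the system classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def accuracy_curve(tp=3, fn=1, fp=0, max_tn=100):
    """Return (true_negatives, accuracy) pairs for tn = 0 .. max_tn."""
    return [(tn, accuracy(tp, tn, fp, fn)) for tn in range(max_tn + 1)]


curve = accuracy_curve()

print(len(curve))               # 101
print(curve[0])                 # (0, 0.75)
print(curve[1])                 # (1, 0.8)
print(f"{curve[100][1]:.3f}")   # 0.990
```

Notice that the system still misses one of the four emergencies at every point on the curve; only the number of easy negatives changes. When you plot the pairs, remember to fix the accuracy axis to the range 0 to 1.0.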

 
