This is the homework for the first week of class. It is considerably less complicated than what we talked about during the lecture. For planning purposes, note that it took me about three hours to do this homework, including figuring out the cause of a very stupid bug in a for-loop.
We’re going to look here at the relationships between various variables that are used in the evaluation of natural language processing. Send me your answers in a single PDF by 17h00 on Monday the 23rd of January.
1. Find a sample of 80-100 words in a language of your choice. I’ve put some English-language data here and some French-language data here, but you’re free to use any language you like. Find a friend who speaks the language in question at least as well as you do. Both of you tag the part of speech of each word. Then calculate the agreement between the two of you, and explain to me two sources of disagreement between the two of you. For this question, you will turn in:
- The set of tags that you used
- The text that you tagged
- The tags that you assigned
- The tags that your friend assigned
- An explanation of two sources of disagreement between the two of you
- The calculated agreement. For this, you have two choices:
  - Do it by hand. In this case, scan the paper with your calculations and add that to your PDF…
  - …or write a program to calculate those numbers, and send me your code and your output. Again, this should be in your PDF. Your code can be a program in R, Python, or the programming language of your choice, or even an Excel spreadsheet.
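If you take the programming route, the calculation might look something like the sketch below. It assumes that "agreement" means simple observed agreement, that is, the proportion of words to which both annotators gave the same tag; the assignment does not name a specific measure, so treat that as an assumption. The function name `observed_agreement` and the five tags are made up for illustration and are not part of the course data.

```python
def observed_agreement(tags_a, tags_b):
    """Return the fraction of tokens on which two annotators chose the same tag."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must tag the same tokens")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


# Hypothetical tags for a five-word sentence, for illustration only.
mine = ["DET", "NOUN", "VERB", "ADP", "NOUN"]
friend = ["DET", "NOUN", "VERB", "ADV", "NOUN"]

print(observed_agreement(mine, friend))
```

Here the two annotators differ on one word out of five, so the observed agreement is four out of five.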
2. Given a set of correct answers and a set of answers from your program, determine the true positives, true negatives, false positives, and false negatives.
Suppose that we have a system that classifies tweets as expressing positive opinions about Quisp cereal. The data is in this file. Y means that a tweet does express a positive opinion about Quisp cereal, and n means that it does not. (This could mean that it expresses a negative opinion about Quisp cereal, or a neutral opinion about Quisp cereal, or doesn’t even mention Quisp cereal. All we know is that it doesn’t express a positive opinion about Quisp cereal.) The column labelled gold.standard specifies the correct answer. The column labelled system.output is what our system thinks the answer is. Note that since we have a binary classification (either yes or no) and a defined set of examples with no boundary issues, we can determine the number of true negatives, which isn’t always the case in language processing.
For this question, submit the counts of true positives, true negatives, false positives, and false negatives. So, your answer should look something like this:
true positives: 613
true negatives: 1024
false positives: 1789
false negatives: 1871
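One way to get these four counts is sketched below. It assumes that the two columns have already been read into two parallel Python lists and that the labels are the lowercase strings "y" and "n"; check the spelling and case of the labels in the actual file. The function name `confusion_counts` and the six-item lists are hypothetical, not taken from the data file.

```python
from collections import Counter


def confusion_counts(gold, system, positive="y"):
    """Count true/false positives and negatives for a binary classification."""
    counts = Counter(tp=0, tn=0, fp=0, fn=0)
    for g, s in zip(gold, system):
        if s == positive:
            # The system said yes: correct if the gold standard also says yes.
            counts["tp" if g == positive else "fp"] += 1
        else:
            # The system said no: correct if the gold standard also says no.
            counts["tn" if g != positive else "fn"] += 1
    return counts


# Hypothetical labels, for illustration only.
gold_standard = ["y", "n", "y", "n", "n", "y"]
system_output = ["y", "n", "n", "y", "n", "y"]

counts = confusion_counts(gold_standard, system_output)
for name in ("tp", "tn", "fp", "fn"):
    print(name, counts[name])
```

With these six made-up items, the sketch finds two true positives, two true negatives, one false positive, and one false negative.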
3. With the numbers from your answer to Question 2, calculate the precision, recall, and F-measure. You have the same two options as in Question 1.
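As a reminder of the usual definitions, precision is the share of the system's positive answers that are correct, and recall is the share of the real positives that the system found. A minimal sketch with hypothetical counts (not the counts from the Quisp data) could look like this; the function names are illustrative.

```python
def precision(tp, fp):
    """Of everything the system labeled positive, how much really was positive?"""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of everything that really was positive, how much did the system find?"""
    return tp / (tp + fn)


# Hypothetical counts, for illustration only.
tp, fp, fn = 3, 1, 2

print(precision(tp, fp))
print(recall(tp, fn))
```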
4. Now calculate the accuracy. You have the same two options as in Question 1.
5. One of the problems with accuracy is that it can badly overestimate the aspects of system performance that you care about. This is especially true when what you care about most is identifying the positive cases, and even more so when the positive cases are rare.
To see how this works, suppose that our data set contains four phone calls to the police emergency number that are true emergencies. Our job is to build a program that classifies a phone call as an emergency when it is, in fact, an emergency. Of the four true emergencies, our system labels only three as emergencies. Also suppose that if a situation is not an emergency, the system always says, correctly, that it is not an emergency. Calculate the accuracy as the number of true negatives goes up from 0 to 100 (which means that the positives become rarer and rarer). Graph this with the number of true negatives on the x axis and the accuracy on the y axis. As always, make the range for accuracy on the y axis be from 0 to 1.0.
To clarify: your first data point will be 3 true positives, one false negative, and no true negatives or false positives. The next data point will be 3 true positives, one false negative, one true negative, and no false positives. Continue until you have the data point for 3 true positives, 100 true negatives, one false negative, and no false positives.
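The series of data points described above can be generated with a short loop. The sketch below only computes the points; it assumes the standard definition of accuracy (correct answers divided by all answers), and the names `accuracy` and `points` are illustrative. Plotting is left to whatever tool you prefer.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all examples that the system labeled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Fixed counts from the emergency-call scenario.
TP, FN, FP = 3, 1, 0

# One (true negatives, accuracy) pair for each value from 0 to 100 inclusive.
points = [(tn, accuracy(TP, tn, FP, FN)) for tn in range(0, 101)]

first_tn, first_acc = points[0]
last_tn, last_acc = points[-1]
print(f"{first_tn} true negatives: accuracy {first_acc:.3f}")
print(f"{last_tn} true negatives: accuracy {last_acc:.3f}")
```

The list `points` has 101 entries, one per x-axis value, ready to be passed to a plotting library or pasted into a spreadsheet.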
6. Now let’s see how F-measure is affected by the rarity of the positive cases. We’ll model the same situation: the true negatives go up and up, while the number of correctly and incorrectly labeled positives (i.e., true positives and false negatives) stays the same. Plot the true negatives on the x axis and the F-measure on the y axis. As always, make the range for F-measure on the y axis be from 0 to 1.0.
7. What is the bug in this line of code?
f.measure = 2 * precision * recall
8. In what situation will your calculation of accuracy always cause your program to crash unless you check for the relevant input and/or catch the resulting exception?