This is the web site for the Spring 2017 section of Natural Language Processing at Université Paris-Centrale. If you’re a student in the class, I recommend that you follow the blog. You’ll find links to readings, homework, and the like here.
1. Building the gold standard data for creating a program
Find a friend. Both of you tag the part of speech of each word in the following ten tweets. Then calculate the agreement between the two of you, and explain to me two sources of disagreement between the two of you. For this question, you will turn in:
- The set of tags that you used
- The text that you tagged
- The tags that you assigned
- The tags that your friend assigned
- An explanation of two sources of disagreement between the two of you–“we disagreed about nouns and verbs” is a statement, not an explanation
- The calculated agreement. For this, you have two choices:
  - Do it by hand. In this case, scan the paper with your calculations and add that to your PDF…
  - …or, write a program to calculate those numbers, and send me your code and your output. Again, this should be in your PDF. Your code can be a program in R, Python, or the programming language of your choice, or even an Excel spreadsheet.
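If you take the programming route, here is a minimal sketch in Python of one way to calculate agreement. It computes the raw proportion of tokens on which the two annotators agree, plus Cohen's kappa, which corrects that proportion for chance agreement. Kappa is a common choice for this kind of task, but it is my assumption here, not a requirement of the assignment. The tag names and the two tag sequences are hypothetical; substitute your own.

```python
from collections import Counter


def observed_agreement(tags_a, tags_b):
    """Fraction of tokens on which the two annotators chose the same tag."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("Both annotators must tag the same non-empty token list")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


def cohens_kappa(tags_a, tags_b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(tags_a)
    p_o = observed_agreement(tags_a, tags_b)
    counts_a = Counter(tags_a)
    counts_b = Counter(tags_b)
    # Expected agreement if each annotator assigned tags at random
    # according to their own tag frequencies.
    p_e = sum(counts_a[tag] * counts_b[tag] for tag in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same tag throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical tags for the tokens "my head hurts so much"
mine = ["PRON", "NOUN", "VERB", "ADV", "ADV"]
friend = ["PRON", "NOUN", "VERB", "ADV", "ADJ"]

print("observed agreement:", observed_agreement(mine, friend))
print("Cohen's kappa:", round(cohens_kappa(mine, friend), 3))
```

Both annotators must tag exactly the same token sequence for this to work, so agree on the tokenization before you start tagging.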
The ten tweets
- If you work in the gaming industry we need your help, please vote for the Child Brain Injury Trust
  #gamesaid #ukgaming #nigaming
- That poor child probably has a brain injury after that celebration. Good lawd
- A common priority arising during treatment in women who have had a head/brain injury is their inner child; their inner young girl.
- #Traumaticbraininjuries in #children and #adolescents could lead to #alcohol #abuse in later #life…
- Since Children have one of the HIGHEst risk of sustaining a Brain Injury, would love to help out Parents with…
- @leightomilton was diagnosed with acquired brain injury in secondary school following surgery for a brain tumour https://goo.gl/fBDEhN
- my head hurts….but a wip! new oc i think
- and my head hurts so much from thinking about the same thing over and over again
- #90DayFiance Courtney’s stupidity hurts my head
- Reading, huh. I used to read a lot when I was younger. My head hurts with how I only really read reference books these days….
These 10 tweets were selected from a number of searches. The bolded words tell you what my search was, but are not relevant to anything else about this exercise. I did not select the tweets randomly; neither did I edit them.
2. Given a set of correct answers and a set of answers from your program, determine the true positives, true negatives, false positives, and false negatives.
Suppose that we have a system that classifies genes as being potentially druggable, or not. The data is in this file. Y means that a gene is druggable, and n means that it is not. The column labelled gold.standard specifies the correct answer. The column labelled system.output is what our system thinks the answer is. Note that since we have a binary classification (either yes or no) and a defined set of examples with no boundary issues, we can determine the number of true negatives, which isn’t always the case in bioinformatics.
For this question, submit the counts of true positives, true negatives, false positives, and false negatives. So, your answer should look something like this:
true positives: 613
true negatives: 1024
false positives: 1789
false negatives: 1871
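As a rough illustration of the counting logic, here is a minimal sketch in Python. The two lists are hypothetical stand-ins for the gold.standard and system.output columns; reading the actual file is left out because its format is not described here. Comparing labels case-insensitively is also an assumption on my part.

```python
def confusion_counts(gold, system, positive="y"):
    """Count true/false positives and negatives for a binary classification."""
    if len(gold) != len(system):
        raise ValueError("gold and system must have the same number of rows")
    counts = {"true positives": 0, "true negatives": 0,
              "false positives": 0, "false negatives": 0}
    for gold_label, system_label in zip(gold, system):
        gold_is_positive = gold_label.strip().lower() == positive
        system_is_positive = system_label.strip().lower() == positive
        if gold_is_positive and system_is_positive:
            counts["true positives"] += 1
        elif not gold_is_positive and not system_is_positive:
            counts["true negatives"] += 1
        elif system_is_positive:
            counts["false positives"] += 1  # system said yes, gold says no
        else:
            counts["false negatives"] += 1  # system said no, gold says yes
    return counts


# Hypothetical data, not the contents of the real file
gold_standard = ["Y", "Y", "n", "n", "Y", "n"]
system_output = ["Y", "n", "n", "Y", "Y", "n"]

counts = confusion_counts(gold_standard, system_output)
for name, value in counts.items():
    print(f"{name}: {value}")
```

The four counts should always add up to the number of rows in the file, which is a quick sanity check on your answer.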
3. Modeling the effect of rarity on accuracy
One of the problems with accuracy is that it can be a terrible over-estimate of the aspects of system performance that you care about. This is especially true in the situation where what you care about the most is identifying the positive situations, and even more so, when the positive situations are rare.
To see how this works, suppose that our data set contains four phone calls to the admissions desk of an emergency room that are true emergencies. Our job is to build a program that classifies a phone call as an emergency when it is, in fact, an emergency. Of the four true emergencies, our system labels only three as emergencies. Also suppose that if a situation is not an emergency, the system always says, correctly, that it is not an emergency. Calculate the accuracy as the number of true negatives goes up from 0 to 100, which means that the positives become rarer and rarer. Graph this with the number of true negatives on the x axis and the accuracy on the y axis. As always, make the range for accuracy on the y axis be from 0 to 1.0.
To clarify: your first data point will be 3 true positives, one false negative, and no true negatives or false positives. The next data point will be 3 true positives, one false negative, one true negative, and no false positives. Continue until you have the data point for 3 true positives, 100 true negatives, one false negative, and no false positives.
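The data points themselves can be generated with a few lines of code. This is a minimal sketch in Python that uses the standard definition of accuracy, (TP + TN) / (TP + TN + FP + FN); the variable names are my own, and the plotting step is left to the tool of your choice.

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of all decisions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Fixed by the scenario: 3 emergencies found, 1 missed, no false alarms
TP, FP, FN = 3, 0, 1

# One (true negatives, accuracy) pair for each value from 0 to 100
points = [(tn, accuracy(TP, tn, FP, FN)) for tn in range(0, 101)]

# Recall does not depend on the true negatives, so it never changes
recall = TP / (TP + FN)

for tn, acc in points[::25]:
    print(f"true negatives = {tn:3d}  accuracy = {acc:.3f}  recall = {recall:.2f}")
```

Printing recall next to accuracy makes the point of the exercise visible: the system misses one emergency in four no matter how many true negatives are added, while accuracy climbs toward 1.0.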
R is not the most obvious programming language to use for text mining, but it makes analysis of your output so much easier to do that I’ve been using it more and more–for text mining. Any language that doesn’t have a good testing framework is probably going to land you in trouble sooner or later; R does have a testing framework, and it’s worth learning to use it. Here are some exercises for the R testthat library–have fun.
Nice piece on responsible Big Data research from Matthew Zook and a bunch of other people.
In order to help you interpret your performance on the final exam, you’ll find an analysis of the exam scores here. I’ll give you the R code that I used to do the analysis, so that you can see how my thinking works here even if my prose is not clear.
Brief overview of the data
First, let’s look at the overall scores on the final exam. The format of this exam was 10 basic questions, plus 10 questions that you could think of as more advanced. (We’ll return later to the question of whether or not the basic questions were really more basic than the advanced questions, and vice versa.) By “basic questions,” I mean that if you know the answers to these, then you will be able to understand most conversations about natural language processing. By “advanced questions,” I mean that if you know the answers to those, then you will be able to be an active participant in those conversations.
How I’ll test this
Since I didn’t write any of my own functions, I don’t have a way to break this down into unit tests. So, I’ll calculate some of the scores for individual students manually, and make sure that the program calculates the scores for those students correctly. To make sure that unexpected inputs don’t do anything horrible to the calculations, I’ll also do this check for students who didn’t take the test, and therefore have NA values for both sets of questions.
Statistics on the overall scores for the final exam
# the column page.01 is the scores for the first page of the exam
# the column page.02 is the scores for the second page of the exam
# ...so, the score on the final exam is the sum of those two columns.
scores.total <- data$page.01 + data$page.02
summary(scores.total)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    3.50   12.00   13.00   13.25   15.00   20.00       2
shapiro.test(scores.total)
##
##  Shapiro-Wilk normality test
##
## data:  scores.total
## W = 0.97045, p-value = 0.06652
hist(scores.total, main = "Histogram of scores on final exam")
What do we learn from this? First of all, the typical grade was around 13, whether you look at the mean or the median. So, most people passed the final exam. We also know that some people got excellent scores: there were a couple of 20s. And, we know that some people got terrible scores. The fact that a couple of people got 20s is consistent with the idea that the materials in the course covered the materials on the final exam. The fact that some people failed, including some people who got very low scores, suggests that the exam was sufficiently difficult to be appropriate for the student population in this grande école.
Statistics on the first page (basic questions)
summary(data$page.01)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   3.500   6.625   7.500   7.519   8.875  10.000       2
shapiro.test(data$page.01)
##
##  Shapiro-Wilk normality test
##
## data:  data$page.01
## W = 0.96359, p-value = 0.02508
hist(data$page.01, main = "Histogram of scores on basic questions")
What do we learn from this? Bearing in mind that the highest possible score on the first page was 10 points, the fact that the mean and median scores were around 7.5 suggests that students typically got (the equivalent of) passing scores on the basic questions; that is to say, students were typically safely above a score of 5. The fact that several students got scores lower than that suggests that even the basic questions were difficult enough for this context, and again, the fact that a large number of students got scores of 9 or above suggests that the course covered the basic aspects of natural language processing thoroughly.
On the other hand, personally, I was somewhat disappointed with the scores on the basic questions. My hope was that all students would walk out of the course with a solid understanding of the basics of the subject. Although the high proportion of 9s and 10s makes me confident that I covered that material thoroughly, it’s difficult to avoid the conclusion that the large number of absences in every one of the 2nd through 6th class sessions (that is, all of the class sessions but the first) affected students’ overall performance pretty heavily. I don’t have numbers on the attendance rates, so I can’t test this hypothesis quantitatively.
Statistics on the second page (advanced questions)
summary(data$page.02)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   0.000   5.000   6.000   5.731   7.000  10.000       2
shapiro.test(data$page.02)
##
##  Shapiro-Wilk normality test
##
## data:  data$page.02
## W = 0.95442, p-value = 0.00714
hist(data$page.02, main = "Histogram of scores on advanced questions")
Were the advanced questions really more advanced?
If the gradient of difficulty that I was looking for was there, then very few people should have done better on the second page (advanced questions) than they did on the first page (basic questions). When we plot the difference between the first page and the second page, most people should be at zero (equally difficult) or higher (second page more difficult). Very few people should have a negative value (which would indicate that they did better on the questions that I expected to be more advanced; certainly this could happen sometimes, but it shouldn’t happen very often).
difference.between.first.and.second <- data$page.01 - data$page.02
hist(difference.between.first.and.second)
Do differences between the basic and advanced scores correlate with the final score? Maybe people who did really poorly or really well will show unusual relationships there.
# let's visualize the data before we do anything else
plot(difference.between.first.and.second,
     (data$page.01 + data$page.02),
     main = "Relationship between the gap between scores on the basic and advanced questions, versus total score",
     ylab = "Total score on the exam",
     xlab = "Score on basic questions minus score on advanced questions")
Not much point in looking for a correlation here–we can see from the plot that there won’t be much of one.
However, we’ll do a t-test to see if the difference between the mean scores on the two pages is significant…
t.test(data$page.01, data$page.02)
##
##  Welch Two Sample t-test
##
## data:  data$page.01 and data$page.02
## t = 6.9339, df = 151.12, p-value = 1.115e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.278847 2.298077
## sample estimates:
## mean of x mean of y
##  7.519231  5.730769
…and, yes: it is, and very much so. The first page was easier than the second page, and the second page was harder than the first page. I hope this is helpful!
Here are some suggested readings for Week 7. Remember that I do not distribute my lecture notes. Note also that you are responsible for all of the material on which I lecture. These readings are not required, but they are intended to cover everything that I talk about in our lectures (modulo the caution in the preceding sentence). All of them are available for free online.
Intrinsic versus extrinsic evaluation
Resnik, Philip, and Jimmy Lin. “Evaluation of NLP Systems.” The handbook of computational linguistics and natural language processing 57 (2010).
Scope in information extraction
Ding, Jing, et al. “Mining MEDLINE: abstracts, sentences, or phrases.” Proceedings of the pacific symposium on biocomputing. Vol. 7. 2002.
Distance in information extraction
Blaschke, Christian, and Alfonso Valencia. “The frame-based module of the SUISEKI information extraction system.” IEEE Intelligent Systems 17.2 (2002): 14-20.
Combining and weighting evidence sources
Lu, Zhiyong, K. Bretonnel Cohen, and Lawrence Hunter. “Finding GeneRIFs via gene ontology annotations.” Pacific Symposium on Biocomputing. NIH Public Access, 2006.
In lieu of another homework, here are some optional exercises. These exercises will (should you elect to do them) help you understand some of the basic skills involved in natural language processing. Some things that you will be better able to do after completing these exercises include:
- Perform several language processing tasks
- Estimate the time required to perform a “simple” natural language processing task
- Estimate the time required to obtain evaluation data for a natural language processing task
- Analyze the difficulties involved in solving “simple” natural language processing problems
- Explain the challenges of natural language processing to non-experts
It’s worth being aware that the first one of these is worth about $45 an hour for a CDI. In contrast, being able to do the other four is worth about $75 an hour for a CDI or $200 an hour for consulting work.
- Spend one hour writing a program to split texts into sentences. Evaluate it with a large body of textual data. What kinds of problems do you still have after an hour of writing code to do this? Alternatively: Google for publicly available natural language processing libraries, and use one of your choice to split a large body of textual data into sentences. What kinds of errors does the library make on this task?
- Spend one hour writing a program to split textual inputs into tokens. Evaluate it with a large body of textual data. What kinds of problems do you still have after an hour of writing code to do this? Alternatively: Google for publicly available natural language processing libraries, and use one of your choice to split a large body of textual data into tokens. What kinds of errors does the library make on this task?
- Spend three hours writing a program to find all mentions of persons, places, and organizations in textual inputs. Evaluate it on a large body of textual data. What kinds of problems do you still have after three hours of writing code to do this? Alternatively: Google for publicly available natural language processing libraries, and use one of your choice to find all mentions of persons, places, and organizations in a large body of textual data. What kinds of errors does the library make on this task?
- Spend eight hours writing a program to find all pairwise relationships between people, places, and organizations in a large body of textual data. For example, the following input (copied from Wikipedia and then simplified for expository purposes) should give you the following output:
Louise Michel was born at the Château of Vroncourt (Haute-Marne) on 29 May 1830, the illegitimate daughter of a serving-maid, Marianne Michel, and the châtelain, Etienne Charles Demahis. She was brought up by her mother and her father’s parents near the village of Vroncourt-la-Côte and received a liberal education. She became interested in traditional customs, folk myths and legends. In 1866 a feminist group called the Société pour la Revendication du Droit des Femmes began to meet at the house of André Léo. Members included Louise Michel.
She was with the Commune de Paris, which made its last stand in the cemetery of Montmartre, and was closely allied with Théophile Ferré, who was executed in November 1871. Michel was loaded onto the ship Virginie on 8 August 1873, to be deported to New Caledonia, where she arrived four months later. Whilst on board, she became acquainted with Henri Rochefort, a famous polemicist, who became her friend until her death. She remained in New Caledonia for seven years, refusing special treatment reserved for women. Michel’s grave is in the cemetery of Levallois-Perret, in one of the suburbs of Paris.
Your program’s output should include the following persons:
- Louise Michel
- Théophile Ferré
- Henri Rochefort
- André Léo
…and the following locations:
- Château of Vroncourt (Haute-Marne)
- cemetery of Montmartre
- New Caledonia
- cemetery of Levallois-Perret
- the house of André Léo
…and the following organizations:
- Société pour la Revendication du Droit des Femmes
- Commune de Paris
Note: if you see inconsistencies in the desired outputs that I’ve given you, how would you suggest resolving them, and what would the resulting changes in the desired outputs be?
This would be a false positive for the semantic class of “person” if your program returned it:
- Virginie (it’s the name of the ship on which Louise Michel was taken to New Caledonia)
…and a false positive for the semantic class of “location:”
- Commune de Paris
Some correct pairwise relations would be:
- Marianne Michel and Etienne Charles Demahis
- Louise Michel and Etienne Charles Demahis
- Louise Michel and Marianne Michel
- Louise Michel and Château of Vroncourt (Haute-Marne)
- Louise Michel and Société pour la Revendication du Droit des Femmes
- Louise Michel and Commune de Paris
Some incorrect pairwise relations would be:
- Marianne Michel and Société pour la Revendication du Droit des Femmes
- Etienne Charles Demahis and Société pour la Revendication du Droit des Femmes
- Henri Rochefort and Société pour la Revendication du Droit des Femmes
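To get a feel for why this task is hard, it can help to start from a deliberately naive baseline. The sketch below, in Python, assumes that the entity mentions are already known and proposes a relation between any two entities that appear in the same sentence. Both the regular-expression sentence splitter and the co-occurrence rule are simplifications of my own choosing, not a recommended solution, and the entity list is typed in by hand from the example above.

```python
import re
from itertools import combinations


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def cooccurrence_pairs(text, entities):
    """Propose a relation for every pair of entities sharing a sentence."""
    pairs = set()
    for sentence in split_sentences(text):
        found = sorted(entity for entity in entities if entity in sentence)
        pairs.update(combinations(found, 2))
    return pairs


# Three sentences taken from the example text above
text = (
    "Louise Michel was born at the Château of Vroncourt (Haute-Marne) on "
    "29 May 1830, the illegitimate daughter of a serving-maid, Marianne "
    "Michel, and the châtelain, Etienne Charles Demahis. "
    "In 1866 a feminist group called the Société pour la Revendication du "
    "Droit des Femmes began to meet at the house of André Léo. "
    "Members included Louise Michel."
)

# Entity mentions entered by hand; a real system would have to find these
entities = [
    "Louise Michel",
    "Marianne Michel",
    "Etienne Charles Demahis",
    "Château of Vroncourt (Haute-Marne)",
    "Société pour la Revendication du Droit des Femmes",
    "André Léo",
]

pairs = cooccurrence_pairs(text, entities)
for pair in sorted(pairs):
    print(" and ".join(pair))
```

On this excerpt the baseline finds the pair Marianne Michel and Etienne Charles Demahis, but it misses Louise Michel and Société pour la Revendication du Droit des Femmes, because the sentence “Members included Louise Michel” never names the group. Resolving that kind of reference across sentences is one of the problems you will still have after eight hours.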
Sample questions for the final exam
All human languages, but no computer languages, have the property of ____________________.
When a system returns increasing numbers of false positives, then even if we do not know the number of true negatives, the ________________ goes down. (Two possible correct answers, just give one)
Languages like Finnish, Turkish, and Hungarian, in which words can have multiple morphemes but it is relatively easy to separate them, are known as __________________________.
I may also give you some graphs to examine and ask you to interpret them.
Answers to example questions
All human languages, but no computer languages, have the property of ambiguity.
When a system returns increasing numbers of false positives, then even if we do not know the number of true negatives, the precision or F-measure goes down. (Two possible correct answers, just give one)
Languages like Finnish, Turkish, and Hungarian, in which words can have multiple morphemes but it is relatively easy to separate them, are known as agglutinative.
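The second answer can be checked with a little arithmetic. This is a minimal sketch in Python using the standard definitions of precision, recall, and the balanced F-measure; the counts are hypothetical, and the number of true negatives never appears in any of the formulas.

```python
def precision(tp, fp):
    """Of everything the system returned, how much was correct?"""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of everything that should have been returned, how much was found?"""
    return tp / (tp + fn)


def f_measure(tp, fp, fn):
    """Balanced F-measure: harmonic mean of precision and recall."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    return 2 * p * r / (p + r)


# Hypothetical counts: true positives and false negatives stay fixed
TP, FN = 8, 2

rows = [(fp, precision(TP, fp), recall(TP, FN), f_measure(TP, fp, FN))
        for fp in (0, 2, 8)]

for fp, p, r, f in rows:
    print(f"false positives = {fp}  precision = {p:.3f}  "
          f"recall = {r:.3f}  F-measure = {f:.3f}")
```

As the false positives increase, precision and F-measure both fall while recall stays where it was, which is why either of the two is an acceptable answer.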