
This is the web site for the Spring 2017 section of Natural Language Processing at Université Paris-Centrale. If you’re a student in the class, I recommend that you follow the blog. You’ll find links to readings, homework, and the like here.
Find a friend. Each of you should independently tag the part of speech of every word in the following ten tweets. Then calculate the agreement between the two of you, and explain to me two sources of disagreement. For this question, you will turn in:
These 10 tweets were selected from a number of searches. The bolded words tell you what my search was, but are not relevant to anything else about this exercise. I did not select the tweets randomly; neither did I edit them.
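If you want to check your agreement numbers, here is a minimal R sketch of raw agreement and Cohen's kappa. The tag vectors below are hypothetical; substitute your own tags, one per token, in the same order for both annotators.

# hypothetical tag vectors for two annotators, one tag per token
annotator.a <- c("NOUN", "VERB", "DET", "NOUN", "ADJ")
annotator.b <- c("NOUN", "VERB", "DET", "PROPN", "ADJ")

# observed (raw) agreement: proportion of tokens with identical tags
observed.agreement <- mean(annotator.a == annotator.b)

# chance agreement for Cohen's kappa: product of each tag's marginal
# proportions for the two annotators, summed over all tags
tags <- union(annotator.a, annotator.b)
p.a <- table(factor(annotator.a, levels = tags)) / length(annotator.a)
p.b <- table(factor(annotator.b, levels = tags)) / length(annotator.b)
chance.agreement <- sum(p.a * p.b)

cohens.kappa <- (observed.agreement - chance.agreement) / (1 - chance.agreement)
observed.agreement
cohens.kappa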
Suppose that we have a system that classifies genes as being potentially druggable, or not. The data is in this file. Y means that a gene is druggable, and n means that it is not. The column labelled gold.standard specifies the correct answer. The column labelled system.output is what our system thinks the answer is. Note that since we have a binary classification (either yes or no) and a defined set of examples with no boundary issues, we can determine the number of true negatives, which isn’t always the case in bioinformatics.
For this question, submit the counts of true positives, true negatives, false positives, and false negatives. So, your answer should look something like this:
true positives: 613
true negatives: 1024
false positives: 1789
false negatives: 1871
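If you would rather compute the counts programmatically than by hand, here is a minimal R sketch. I'm assuming the file is tab-delimited and that the labels are the literal strings Y and n; the file name druggable.txt is a placeholder for wherever you saved the data.

# read the data; adjust the file name and separator to match your copy
d <- read.delim("druggable.txt", stringsAsFactors = FALSE)

true.positives  <- sum(d$gold.standard == "Y" & d$system.output == "Y")
true.negatives  <- sum(d$gold.standard == "n" & d$system.output == "n")
false.positives <- sum(d$gold.standard == "n" & d$system.output == "Y")
false.negatives <- sum(d$gold.standard == "Y" & d$system.output == "n")

c(true.positives, true.negatives, false.positives, false.negatives)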
One of the problems with accuracy is that it can be a terrible over-estimate of the aspects of system performance that you care about. This is especially true when what you care about most is identifying the positive cases, and even more so when those positive cases are rare.
To see how this works, suppose that our data set contains four phone calls to the admissions desk in an emergency room. Our job is to build a program that correctly classifies a phone call as an emergency when it is, in fact, an emergency. Of the four true emergencies, our system recognizes only three as emergencies. Also suppose that when a situation is not an emergency, the system always says, correctly, that it is not an emergency. Calculate the accuracy as the number of true negatives goes up, which means that the positives become rarer and rarer, from 0 true negatives to 100 true negatives. Graph this with the number of true negatives on the x axis and the accuracy on the y axis. As always, make the range for accuracy on the y axis be from 0 to 1.0.
To clarify: your first data point will be three true positives, one false negative, and no true negatives or false positives. The next data point will be three true positives, one false negative, one true negative, and no false positives. Continue until you reach the data point with three true positives, 100 true negatives, one false negative, and no false positives.
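Here is a minimal R sketch of one way to generate that graph; the variable names are just my own choices.

tp <- 3
fn <- 1
fp <- 0
tn <- 0:100   # the number of true negatives at each data point

accuracy <- (tp + tn) / (tp + tn + fp + fn)

plot(tn, accuracy,
     type = "l",
     ylim = c(0, 1.0),
     xlab = "Number of true negatives",
     ylab = "Accuracy",
     main = "Accuracy as true negatives increase")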
R is not the most obvious programming language to use for text mining, but it makes analysis of your output so much easier that I've been using it more and more for the text mining itself. Any language that doesn't have a good testing framework is probably going to land you in trouble sooner or later; R does have a testing framework, and it's worth learning to use it. Here are some exercises for the R testthat library; have fun.
https://www.r-bloggers.com/unit-testing-in-r-using-testthat-library-exercises/
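Before you start the exercises, here is a minimal sketch of what a testthat test looks like; the precision() function is just something I made up for the example.

library(testthat)

# a toy function to test: precision is true positives over everything returned
precision <- function(tp, fp) {
  tp / (tp + fp)
}

test_that("precision is computed correctly", {
  expect_equal(precision(3, 1), 0.75)
  expect_equal(precision(5, 0), 1.0)
})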
Nice piece on responsible Big Data research from Matthew Zook and a bunch of other people.
In order to help you interpret your performance on the final exam, you’ll find an analysis of the exam scores here. I’ll give you the R code that I used to do the analysis, so that you can see how my thinking works here even if my prose is not clear.
First, let’s look at the overall scores on the final exam. The format of this exam was 10 basic questions, plus 10 questions that you could think of as more advanced. (We’ll return later to the question of whether or not the basic questions really were more basic than the advanced questions, and vice versa.) By “basic questions,” I mean that if you know the answers to these, then you will be able to understand most conversations about natural language processing. By “advanced questions,” I mean that if you know the answers to those, then you will be able to be an active participant in those conversations.
Since I didn’t write any of my own functions, I don’t have a way to break this down into unit tests. So, I’ll calculate some of the scores for individual students manually, and make sure that the program calculates the scores for those students correctly. To make sure that unexpected inputs don’t do anything horrible to the calculations, I’ll also do this check for students who didn’t take the test, and therefore have NA values for both sets of questions.
# the column page.01 is the scores for the first page of the exam
# the column page.02 is the scores for the second page of the exam
# ...so, the score on the final exam is the sum of those two columns.
scores.total <- data$page.01 + data$page.02
summary(scores.total)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    3.50   12.00   13.00   13.25   15.00   20.00       2
shapiro.test(scores.total)
##
##  Shapiro-Wilk normality test
##
## data:  scores.total
## W = 0.97045, p-value = 0.06652
hist(scores.total, main = "Histogram of scores on final exam")
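As an aside, here is a sketch of the kind of spot-check I described above; the row numbers and hand-calculated totals are hypothetical, not the real values.

# check a couple of students' totals against totals calculated by hand
stopifnot(scores.total[3] == 14.5)    # hypothetical hand-calculated total for student 3
stopifnot(scores.total[17] == 11.0)   # hypothetical hand-calculated total for student 17

# students who didn't take the exam have NA on both pages, and their
# totals should come out as NA rather than as some arbitrary number
stopifnot(all(is.na(scores.total[is.na(data$page.01) & is.na(data$page.02)])))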
What do we learn from this? First of all, the typical grade was around 13, whether you look at the mean or the median. So, most people passed the final exam. We also know that some people got excellent scores; there were a couple of 20s. And we know that some people got terrible scores. The fact that a couple of people got 20s is consistent with the idea that the course covered the material on the final exam. The fact that some people failed, including some who got very low scores, suggests that the exam was sufficiently difficult to be appropriate for the student population in this grande école.
summary(data$page.01)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   3.500   6.625   7.500   7.519   8.875  10.000       2
shapiro.test(data$page.01)
##
##  Shapiro-Wilk normality test
##
## data:  data$page.01
## W = 0.96359, p-value = 0.02508
hist(data$page.01, main = "Histogram of scores on basic questions")
What do we learn from this? Bearing in mind that the highest possible score on the first page was 10 points, the fact that the mean and median scores were around 7.5 suggests that students typically got (the equivalent of) passing scores on the basic questions; that is to say, students were typically safely above a score of 5. The fact that several students got scores lower than that suggests that even the basic questions were difficult enough for this context, and again, the fact that a large number of students got scores of 9 or above suggests that the course covered the basic aspects of natural language processing thoroughly.
On the other hand, personally, I was somewhat disappointed with the scores on the basic questions. My hope was that all students would walk out of the course with a solid understanding of the basics of the subject. Although the high proportion of 9s and 10s makes me confident that I covered that material thoroughly, it’s difficult to avoid the conclusion that the large number of absences in every one of the 2nd through 6th class sessions (that is, all of the class sessions but the first) affected students’ overall performance pretty heavily. I don’t have numbers on the attendance rates, so I can’t test this hypothesis quantitatively.
summary(data$page.02)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   0.000   5.000   6.000   5.731   7.000  10.000       2
shapiro.test(data$page.02)
##
##  Shapiro-Wilk normality test
##
## data:  data$page.02
## W = 0.95442, p-value = 0.00714
hist(data$page.02, main = "Histogram of scores on advanced questions")
If the gradient of difficulty that I was looking for was there, then very few people should have done better on the second page (advanced questions) than they did on the first page (basic questions). When we plot the difference between the first page and the second page, most people should be at zero (equally difficult) or higher (second page more difficult). Very few people should have a negative value (which would indicate that they did better on the questions that I expected to be more advanced; certainly this could happen sometimes, but it shouldn’t happen very often).
difference.between.first.and.second <- data$page.01 - data$page.02
hist(difference.between.first.and.second)
Do differences between the basic and advanced scores correlate with the final score? Maybe people who did really poorly or really well will show unusual relationships there.
# let's visualize the data before we do anything else
plot(difference.between.first.and.second, (data$page.01 + data$page.02),
     main = "Relationship between the gap between scores on the basic and advanced questions, versus total score",
     ylab = "Total score on the exam",
     xlab = "Score on basic questions minus score on advanced questions")
Not much point in looking for a correlation here–we can see from the plot that there won’t be much of one.
However, we’ll do a t-test to see if the difference between the mean scores on the two pages is significant…
t.test(data$page.01, data$page.02)
##
##  Welch Two Sample t-test
##
## data:  data$page.01 and data$page.02
## t = 6.9339, df = 151.12, p-value = 1.115e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.278847 2.298077
## sample estimates:
## mean of x mean of y
##  7.519231  5.730769
…and, yes: it is, and very much so. The first page (the basic questions) was clearly easier than the second page (the advanced questions). I hope this is helpful!
Here are some suggested readings for Week 7. Remember that I do not distribute my lecture notes. Note also that you are responsible for all of the material on which I lecture. These readings are not required, but they are intended to cover everything that I talk about in our lectures (modulo the caution in the preceding sentence). All of them are available for free on line.
Resnik, Philip, and Jimmy Lin. “Evaluation of NLP Systems.” The Handbook of Computational Linguistics and Natural Language Processing 57 (2010).
Ding, Jing, et al. “Mining MEDLINE: abstracts, sentences, or phrases?” Proceedings of the Pacific Symposium on Biocomputing. Vol. 7. 2002.
Blaschke, Christian, and Alfonso Valencia. “The frame-based module of the SUISEKI information extraction system.” IEEE Intelligent Systems 17.2 (2002): 14-20.
Lu, Zhiyong, K. Bretonnel Cohen, and Lawrence Hunter. “Finding GeneRIFs via gene ontology annotations.” Pacific Symposium on Biocomputing. NIH Public Access, 2006.
In lieu of another homework, here are some optional exercises. These exercises will (should you elect to do them) help you understand some of the basic skills involved in natural language processing. Some things that you will be better able to do after completing these exercises include:
Be aware that the first one of these skills is worth about $45 an hour for a CDI. In contrast, being able to do the other four is worth about $75 an hour for a CDI, or $200 an hour for consulting work.
Louise Michel was born at the Château of Vroncourt (Haute-Marne) on 29 May 1830, the illegitimate daughter of a serving-maid, Marianne Michel, and the châtelain, Etienne Charles Demahis.[1] She was brought up by her mother and her father’s parents near the village of Vroncourt-la-Côte and received a liberal education. She became interested in traditional customs, folk myths and legends.[1] In 1866 a feminist group called the Société pour la Revendication du Droit des Femmes began to meet at the house of André Léo. Members included Louise Michel.
She was with the Commune de Paris, which made its last stand in the cemetery of Montmartre, and was closely allied with Théophile Ferré, who was executed in November 1871. Michel was loaded onto the ship Virginie on 8 August 1873,[7] to be deported to New Caledonia, where she arrived four months later. Whilst on board, she became acquainted with Henri Rochefort, a famous polemicist, who became her friend until her death. She remained in New Caledonia for seven years, refusing special treatment reserved for women. Michel’s grave is in the cemetery of Levallois-Perret, in one of the suburbs of Paris.
Your program’s output should include the following persons:
…and the following locations:
…and the following organizations:
Note: if you see inconsistencies in the desired outputs that I’ve given you, how would you suggest resolving them, and what would the resulting changes in the desired outputs be?
This would be a false positive for the semantic class of “person” if your program returned it:
…and a false positive for the semantic class of “location”:
Some correct pairwise relations would be:
Some incorrect pairwise relations would be:
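Once you have your lists, here is a minimal R sketch of how you might score one semantic class against a gold-standard list. The entity lists below are a hypothetical fragment that I made up for illustration; they are not the answer key.

# hypothetical gold standard and system output for the "person" class
gold.persons      <- c("Louise Michel", "Marianne Michel", "Etienne Charles Demahis")
predicted.persons <- c("Louise Michel", "Marianne Michel", "Vroncourt")

tp <- sum(predicted.persons %in% gold.persons)      # correct persons returned
fp <- sum(!(predicted.persons %in% gold.persons))   # things returned that are not persons
fn <- sum(!(gold.persons %in% predicted.persons))   # persons that were missed

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f.measure <- 2 * precision * recall / (precision + recall)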
Sample questions for the final exam
All human languages, but no computer languages, have the property of ____________________.
When a system returns increasing numbers of false positives, then even if we do not know the number of true negatives, the ________________ goes down. (Two possible correct answers, just give one)
Languages like Finnish, Turkish, and Hungarian, in which words can have multiple morphemes but it is relatively easy to separate them, are known as __________________________.
I may also give you some graphs to examine and ask you to interpret them.
Answers to example questions
All human languages, but no computer languages, have the property of ambiguity.
When a system returns increasing numbers of false positives, then even if we do not know the number of true negatives, the precision or F-measure goes down. (Two possible correct answers, just give one)
Languages like Finnish, Turkish, and Hungarian, in which words can have multiple morphemes but it is relatively easy to separate them, are known as agglutinative.
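If you want to convince yourself of that second answer, recall that precision is true positives divided by (true positives plus false positives), and F-measure is the harmonic mean of precision and recall; neither formula involves true negatives. Here is a quick numerical illustration in R, with made-up counts.

tp <- 50
fn <- 10
fp <- c(0, 10, 50, 100)   # increasing numbers of false positives

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)                                # unaffected by false positives
f.measure <- 2 * precision * recall / (precision + recall)

round(precision, 2)   # 1.00 0.83 0.50 0.33
round(f.measure, 2)   # 0.91 0.83 0.62 0.48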
Here are some suggested readings for Week 6. Remember that I do not distribute my lecture notes. Note also that you are responsible for all of the material on which I lecture. These readings are not required, but they are intended to cover everything that I talk about in our lectures (modulo the caution in the preceding sentence). All of them are available for free on line.
Here are some suggested readings for Week 5. Remember that I do not distribute my lecture notes. Note also that you are responsible for all of the material on which I lecture. These readings are not required, but they are intended to cover everything that I talk about in our lectures (modulo the caution in the preceding sentence). All of them are available for free on line except for the books (although the Good and Hardin book is available for free, as well). All of them should be available in an academic library. Feel free to contact me if you have trouble finding a copy of either.
Here are some suggested readings for Weeks 3 and 4 of class. Remember that I do not distribute my lecture notes. Note also that you are responsible for all of the material on which I lecture. These readings are not required, but they are intended to cover everything that I talk about in our lectures (modulo the caution in the previous sentence). All of them are available for free on line except for the book by Jackson and Moulinier. It is truly excellent, and well worth the cost.