This is the web site for the Spring 2017 section of Natural Language Processing at Université Paris-Centrale. If you’re a student in the class, I recommend that you follow the blog. You’ll find links to readings, homework, and the like here.
R is not the most obvious programming language to use for text mining, but it makes analyzing your output so much easier that I’ve been using it more and more for text mining. Any language that doesn’t have a good testing framework is probably going to land you in trouble sooner or later; R does have a testing framework, and it’s worth learning to use it. Here are some exercises for the R testthat library–have fun.
Nice piece on responsible Big Data research from Matthew Zook and a bunch of other people.
In order to help you interpret your performance on the final exam, you’ll find an analysis of the exam scores here. I’ll give you the R code that I used to do the analysis, so that you can see how my thinking works here even if my prose is not clear.
Brief overview of the data
First, let’s look at the overall scores on the final exam. The format of this exam was 10 basic questions, plus 10 questions that you could think of as more advanced. (We’ll return later to the question of whether or not the basic questions were really more basic than the advanced questions, and vice versa.) By “basic questions,” I mean that if you know the answers to these, then you will be able to understand most conversations about natural language processing. By “advanced questions,” I mean that if you know the answers to those, then you will be able to be an active participant in those conversations.
How I’ll test this
Since I didn’t write any of my own functions, I don’t have a way to break this down into unit tests. So, I’ll calculate some of the scores for individual students manually, and make sure that the program calculates the scores for those students correctly. To make sure that unexpected inputs don’t do anything horrible to the calculations, I’ll also do this check for students who didn’t take the test, and therefore have NA values for both sets of questions.
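The check described above could be sketched roughly like this. The data frame here is a tiny made-up one (the students and their scores are hypothetical, not the real exam data); only the column names page.01 and page.02 are taken from this analysis. It uses base R's stopifnot rather than testthat, so that it runs without installing anything.

```r
# Hypothetical toy data: three students, one of whom did not take the exam.
# The column names match the ones used in the analysis; the values are made up.
toy <- data.frame(
  student = c("A", "B", "C"),
  page.01 = c(7.5, 9, NA),
  page.02 = c(6, 4, NA)
)

# Same calculation as in the analysis: total = first page + second page.
toy$total <- toy$page.01 + toy$page.02

# Totals worked out by hand for the students who took the exam.
stopifnot(isTRUE(all.equal(toy$total[toy$student == "A"], 13.5)))
stopifnot(isTRUE(all.equal(toy$total[toy$student == "B"], 13)))

# A student with NA on both pages should end up with NA, not 0.
stopifnot(is.na(toy$total[toy$student == "C"]))

# The mean has to skip the NA explicitly, or it would be NA as well.
stopifnot(isTRUE(all.equal(mean(toy$total, na.rm = TRUE), 13.25)))
```

If any of these checks fails, R stops with an error that names the failing expression, which is usually enough to find the problem in a script this small.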
Statistics on the overall scores for the final exam
    # the column page.01 is the scores for the first page of the exam
    # the column page.02 is the scores for the second page of the exam
    # ...so, the score on the final exam is the sum of those two columns.
    scores.total <- data$page.01 + data$page.02
    summary(scores.total)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    ##    3.50   12.00   13.00   13.25   15.00   20.00       2
    shapiro.test(scores.total)
    ##
    ##  Shapiro-Wilk normality test
    ##
    ## data:  scores.total
    ## W = 0.97045, p-value = 0.06652
    hist(scores.total, main = "Histogram of scores on final exam")
What do we learn from this? First of all, the typical grade was around 13, whether you look at the mean or the median. So, most people passed the final exam. We also know that some people got excellent scores–there were a couple of 20s. And, we know that some people got terrible scores. The fact that a couple of people got 20s is consistent with the idea that the materials in the course covered the materials on the final exam. The fact that some people failed, including some people who got very low scores, suggests that the exam was sufficiently difficult to be appropriate for the student population in this grande école.
Statistics on the first page (basic questions)
    summary(data$page.01)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    ##   3.500   6.625   7.500   7.519   8.875  10.000       2
    shapiro.test(data$page.01)
    ##
    ##  Shapiro-Wilk normality test
    ##
    ## data:  data$page.01
    ## W = 0.96359, p-value = 0.02508
    hist(data$page.01, main = "Histogram of scores on basic questions")
What do we learn from this? Bearing in mind that the highest possible score on the first page was 10 points, the fact that the mean and median scores were around 7.5 suggests that students typically got (the equivalent of) passing scores on the basic questions–that is to say, students were typically safely above a score of 5. The fact that several students got scores lower than that suggests that even the basic questions were difficult enough for this context, and again, the fact that a large number of students got scores of 9 or above suggests that the course covered the basic aspects of natural language processing thoroughly.
On the other hand, personally, I was somewhat disappointed with the scores on the basic questions. My hope was that all students would walk out of the course with a solid understanding of the basics of the subject. Although the high proportion of 9s and 10s makes me confident that I covered that material thoroughly, it’s difficult to avoid the conclusion that the large number of absences in every one of the 2nd through 6th class sessions (that is, all of the class sessions but the first) affected students’ overall performance pretty heavily. I don’t have numbers on the attendance rates, so I can’t test this hypothesis quantitatively.
Statistics on the second page (advanced questions)
    summary(data$page.02)
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    ##   0.000   5.000   6.000   5.731   7.000  10.000       2
    shapiro.test(data$page.02)
    ##
    ##  Shapiro-Wilk normality test
    ##
    ## data:  data$page.02
    ## W = 0.95442, p-value = 0.00714
    hist(data$page.02, main = "Histogram of scores on advanced questions")
Were the advanced questions really more advanced?
If the gradient of difficulty that I was looking for was there, then very few people should have done better on the second page (advanced questions) than they did on the first page (basic questions). When we plot the difference between the first page and the second page, most people should be at zero (equally difficult) or higher (second page more difficult). Very few people should have a negative value (which would indicate that they did better on the questions that I expected to be more advanced; certainly this could happen sometimes, but it shouldn’t happen very often).
    difference.between.first.and.second <- data$page.01 - data$page.02
    hist(difference.between.first.and.second)
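To put a number on “very few people,” one option is to count the negative differences directly rather than reading them off the histogram. Here is a minimal sketch of that idea; the scores are made up for illustration (they are not the real exam data), and the variable names are my own.

```r
# Hypothetical scores for six students (not the real exam data);
# the last student did not take the exam.
page.01 <- c(8, 7.5, 6, 9, 5, NA)
page.02 <- c(6, 7.5, 7, 4, 5, NA)

# Positive: better on the basic page. Negative: better on the advanced page.
difference <- page.01 - page.02

# Count the students in each group, ignoring students with missing scores.
n.better.on.advanced <- sum(difference < 0, na.rm = TRUE)
n.equal              <- sum(difference == 0, na.rm = TRUE)
n.better.on.basic    <- sum(difference > 0, na.rm = TRUE)

# Proportion of exam-takers who did better on the advanced questions.
prop.better.on.advanced <- mean(difference < 0, na.rm = TRUE)

print(c(advanced = n.better.on.advanced,
        equal    = n.equal,
        basic    = n.better.on.basic))
print(prop.better.on.advanced)
```

With real data, a small proportion here would support the idea that the second page really was the harder one for most students.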
Do differences between the basic and advanced scores correlate with the final score? Maybe people who did really poorly or really well will show unusual relationships there.
    # let's visualize the data before we do anything else
    plot(difference.between.first.and.second, (data$page.01 + data$page.02),
         main = "Relationship between the gap between scores on the basic and advanced questions, versus total score",
         ylab = "Total score on the exam",
         xlab = "Score on basic questions minus score on advanced questions")
Not much point in looking for a correlation here–we can see from the plot that there won’t be much of one.
However, we’ll do a t-test to see if the difference between the mean scores on the two pages is significant…
    t.test(data$page.01, data$page.02)
    ##
    ##  Welch Two Sample t-test
    ##
    ## data:  data$page.01 and data$page.02
    ## t = 6.9339, df = 151.12, p-value = 1.115e-10
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  1.278847 2.298077
    ## sample estimates:
    ## mean of x mean of y
    ##  7.519231  5.730769
…and, yes: it is, and very much so. The first page was easier than the second page, and the second page was harder than the first page. I hope this is helpful!
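A note on the choice of test: the Welch test above treats the two pages as independent samples, but every student answered both pages, so the two sets of scores are paired. A paired t-test is the more usual choice for that design. The sketch below shows the difference on made-up scores; it does not reproduce the exam result, and the numbers and variable names are purely illustrative.

```r
# Hypothetical scores for eight students (not the real exam data).
basic    <- c(8, 7.5, 6, 9, 5, 7, 10, 6.5)
advanced <- c(6, 7.5, 7, 4, 5, 5.5, 8, 6)

# Welch test: treats the two pages as two independent groups.
welch <- t.test(basic, advanced)

# Paired test: compares each student's two scores with each other.
paired <- t.test(basic, advanced, paired = TRUE)

# The paired test is equivalent to a one-sample test on the differences.
one.sample <- t.test(basic - advanced)

print(welch$p.value)
print(paired$p.value)
print(one.sample$p.value)
```

The last two p-values are identical, which is a handy way to remember what the paired test is doing: it asks whether the mean of the per-student differences is zero.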
Here are some suggested readings for Week 7. Remember that I do not distribute my lecture notes. Note also that you are responsible for all of the material on which I lecture. These readings are not required, but they are intended to cover everything that I talk about in our lectures (modulo the caution in the preceding sentence). All of them are available for free on line.
Intrinsic versus extrinsic evaluation
Resnik, Philip, and Jimmy Lin. “Evaluation of NLP Systems.” The handbook of computational linguistics and natural language processing 57 (2010).
Scope in information extraction
Ding, Jing, et al. “Mining MEDLINE: abstracts, sentences, or phrases.” Proceedings of the pacific symposium on biocomputing. Vol. 7. 2002.
Distance in information extraction
Blaschke, Christian, and Alfonso Valencia. “The frame-based module of the SUISEKI information extraction system.” IEEE Intelligent Systems 17.2 (2002): 14-20.
Combining and weighting evidence sources
Lu, Zhiyong, K. Bretonnel Cohen, and Lawrence Hunter. “Finding GeneRIFs via gene ontology annotations.” Pacific Symposium on Biocomputing. NIH Public Access, 2006.
In lieu of another homework, here are some optional exercises. These exercises will (should you elect to do them) help you understand some of the basic skills involved in natural language processing. Some things that you will be better able to do after completing these exercises include:
- Perform several language processing tasks
- Estimate the time required to perform a “simple” natural language processing task
- Estimate the time required to obtain evaluation data for a natural language processing task
- Analyze the difficulties involved in solving “simple” natural language processing problems
- Explain the challenges of natural language processing to non-experts
It’s worth being aware that the first one of these is worth about $45 an hour for a CDI. In contrast, being able to do the other four is worth about $75 an hour for a CDI or $200 an hour for consulting work.
- Spend one hour writing a program to split texts into sentences. Evaluate it with a large body of textual data. What kinds of problems do you still have after an hour of writing code to do this? Alternatively: Google for publicly available natural language processing libraries, and use one of your choice to split a large body of textual data into sentences. What kinds of errors does the library make on this task?
- Spend one hour writing a program to split textual inputs into tokens. Evaluate it with a large body of textual data. What kinds of problems do you still have after an hour of writing code to do this? Alternatively: Google for publicly available natural language processing libraries, and use one of your choice to split a large body of textual data into tokens. What kinds of errors does the library make on this task?
- Spend three hours writing a program to find all mentions of persons, places, and organizations in textual inputs. Evaluate it on a large body of textual data. What kinds of problems do you still have after three hours of writing code to do this? Alternatively: Google for publicly available natural language processing libraries, and use one of your choice to find all mentions of persons, places, and organizations in a large body of textual data. What kinds of errors does the library make on this task?
- Spend eight hours writing a program to find all pairwise relationships between people, places, and organizations in a large body of textual data. For example, the following input (copied from Wikipedia and then simplified for expository purposes) should give you the following output:
Louise Michel was born at the Château of Vroncourt (Haute-Marne) on 29 May 1830, the illegitimate daughter of a serving-maid, Marianne Michel, and the châtelain, Etienne Charles Demahis. She was brought up by her mother and her father’s parents near the village of Vroncourt-la-Côte and received a liberal education. She became interested in traditional customs, folk myths and legends. In 1866 a feminist group called the Société pour la Revendication du Droit des Femmes began to meet at the house of André Léo. Members included Louise Michel.
She was with the Commune de Paris, which made its last stand in the cemetery of Montmartre, and was closely allied with Théophile Ferré, who was executed in November 1871. Michel was loaded onto the ship Virginie on 8 August 1873, to be deported to New Caledonia, where she arrived four months later. Whilst on board, she became acquainted with Henri Rochefort, a famous polemicist, who became her friend until her death. She remained in New Caledonia for seven years, refusing special treatment reserved for women. Michel’s grave is in the cemetery of Levallois-Perret, in one of the suburbs of Paris.
Your program’s output should include the following persons:
- Louise Michel
- Théophile Ferré
- Henri Rochefort
- André Léo
…and the following locations:
- Château of Vroncourt (Haute-Marne)
- cemetery of Montmartre
- New Caledonia
- cemetery of Levallois-Perret
- the house of André Léo
…and the following organizations:
- Société pour la Revendication du Droit des Femmes
- Commune de Paris
Note: if you see inconsistencies in the desired outputs that I’ve given you, how would you suggest resolving them, and what would the resulting changes in the desired outputs be?
This would be a false positive for the semantic class of “person” if your program returned it:
- Virginie (it’s the name of the ship on which Louise Michel was taken to New Caledonia)
…and a false positive for the semantic class of “location”:
- Commune de Paris
Some correct pairwise relations would be:
- Marianne Michel and Etienne Charles Demahis
- Louise Michel and Etienne Charles Demahis
- Louise Michel and Marianne Michel
- Louise Michel and Château of Vroncourt (Haute-Marne)
- Louise Michel and Société pour la Revendication du Droit des Femmes
- Louise Michel and Commune de Paris
Some incorrect pairwise relations would be:
- Marianne Michel and Société pour la Revendication du Droit des Femmes
- Etienne Charles Demahis and Société pour la Revendication du Droit des Femmes
- Henri Rochefort and Société pour la Revendication du Droit des Femmes
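Once your program produces output, you will want to score it against the desired output. Here is a minimal sketch of one way to do that for the “person” class, using exact string matching. The desired persons are copied from the list above; the system output is hypothetical, and exact matching is an assumption on my part (a real evaluation would have to decide how to treat partial matches).

```r
# Desired persons, copied from the list above.
gold.persons <- c("Louise Michel", "Théophile Ferré",
                  "Henri Rochefort", "André Léo")

# Hypothetical output from a program: it found three of the four persons,
# and it mistook the ship Virginie for a person.
system.persons <- c("Louise Michel", "Henri Rochefort",
                    "André Léo", "Virginie")

# Compare the two sets by exact string match.
true.positives  <- intersect(system.persons, gold.persons)
false.positives <- setdiff(system.persons, gold.persons)
false.negatives <- setdiff(gold.persons, system.persons)

# Precision: how much of what the program returned is correct.
# Recall: how much of what it should have returned it actually found.
precision <- length(true.positives) / length(system.persons)
recall    <- length(true.positives) / length(gold.persons)

print(false.positives)
print(false.negatives)
print(c(precision = precision, recall = recall))
```

Looking at the false positives and false negatives themselves, and not only at the two numbers, is usually where you learn what kinds of errors your program makes.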
Sample questions for the final exam
All human languages, but no computer languages, have the property of ____________________.
When a system returns increasing numbers of false positives, then even if we do not know the number of true negatives, the ________________ goes down. (Two possible correct answers, just give one)
Languages like Finnish, Turkish, and Hungarian, in which words can have multiple morphemes but it is relatively easy to separate them, are known as __________________________.
I may also give you some graphs to examine and ask you to interpret them.
Answers to example questions
All human languages, but no computer languages, have the property of ambiguity.
When a system returns increasing numbers of false positives, then even if we do not know the number of true negatives, the precision or F-measure goes down. (Two possible correct answers, just give one)
Languages like Finnish, Turkish, and Hungarian, in which words can have multiple morphemes but it is relatively easy to separate them, are known as agglutinative.
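To see why the answer to the second question does not depend on the true negatives, it can help to compute the measures directly. This is a small sketch with hypothetical counts; the function names are my own, and the F-measure here is the balanced one (the harmonic mean of precision and recall).

```r
# Precision, recall, and F-measure from counts of true positives (tp),
# false positives (fp), and false negatives (fn).
# True negatives do not appear in any of the three formulas.
precision <- function(tp, fp) tp / (tp + fp)
recall    <- function(tp, fn) tp / (tp + fn)
f.measure <- function(tp, fp, fn) {
  p <- precision(tp, fp)
  r <- recall(tp, fn)
  2 * p * r / (p + r)
}

# Hypothetical counts: hold tp and fn fixed, and let fp grow.
tp <- 8
fn <- 2
fp <- c(0, 2, 8)

print(precision(tp, fp))      # 1.0 0.8 0.5
print(recall(tp, fn))         # 0.8
print(f.measure(tp, fp, fn))  # decreases as fp grows
```

Recall stays where it is because false positives do not enter into it; precision drops, and the F-measure drops along with it.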
Here are some suggested readings for Week 6. Remember that I do not distribute my lecture notes. Note also that you are responsible for all of the material on which I lecture. These readings are not required, but they are intended to cover everything that I talk about in our lectures (modulo the caution in the preceding sentence). All of them are available for free on line.
- Mark Liberman: Lessons for reproducible science from DARPA’s programs on Human Language Technology
- Aurélie Névéol, Reproducibility in computer science
- Chapman et al. (2011), Reproducibility and shared tasks
- Collberg, Christian, et al. “Measuring reproducibility in computer systems research.” Department of Computer Science, University of Arizona, Tech. Rep (2014).
- Fokkens, Antske, et al. “Offspring from Reproduction Problems: What Replication Failure Teaches Us.” ACL (1). 2013.
Causation and experimental confounds
- Banko, Michele, and Eric Brill. “Scaling to very very large corpora for natural language disambiguation.” Proceedings of the 39th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2001.
- Cohen, K. Bretonnel, William A. Baumgartner Jr, and Lawrence Hunter. “Software testing and the naturally occurring data assumption in natural language processing.” Software engineering, testing, and quality assurance for natural language processing. Association for Computational Linguistics, 2008.
- Regression testing (Wikipedia)