Optional exercises in lieu of another homework assignment, and some example questions for the final exam

In lieu of another homework, here are some optional exercises.  If you choose to do them, they will help you understand some of the basic skills involved in natural language processing.  After completing these exercises, you should be better able to:

  • Perform several language processing tasks
  • Estimate the time required to perform a “simple” natural language processing task
  • Estimate the time required to obtain evaluation data for a natural language processing task
  • Analyze the difficulties involved in solving “simple” natural language processing problems
  • Explain the challenges of natural language processing to non-experts

Keep in mind that the first of these skills is worth about $45 an hour for a CDI.  In contrast, being able to do the other four is worth about $75 an hour for a CDI or $200 an hour for consulting work.

  1. Spend one hour writing a program to split texts into sentences. Evaluate it with a large body of textual data.  What kinds of problems do you still have after an hour of writing code to do this?  Alternatively: Google for publicly available natural language processing libraries, and use one of your choice to split a large body of textual data into sentences.  What kinds of errors does the library make on this task?
  2. Spend one hour writing a program to split textual inputs into tokens. Evaluate it with a large body of textual data.  What kinds of problems do you still have after an hour of writing code to do this?  Alternatively: Google for publicly available natural language processing libraries, and use one of your choice to split a large body of textual data into tokens.  What kinds of errors does the library make on this task?
  3. Spend three hours writing a program to find all mentions of persons, places, and organizations in textual inputs. Evaluate it on a large body of textual data.  What kinds of problems do you still have after three hours of writing code to do this?  Alternatively: Google for publicly available natural language processing libraries, and use one of your choice to find all mentions of persons, places, and organizations in a large body of textual data.  What kinds of errors does the library make on this task?
  4. Spend eight hours writing a program to find all pairwise relationships between people, places, and organizations in a large body of textual data. For example, the following input (copied from Wikipedia and then simplified for expository purposes) should give you the following output:

Louise Michel was born at the Château of Vroncourt (Haute-Marne) on 29 May 1830, the illegitimate daughter of a serving-maid, Marianne Michel, and the châtelain, Etienne Charles Demahis.[1]  She was brought up by her mother and her father’s parents near the village of Vroncourt-la-Côte and received a liberal education. She became interested in traditional customs, folk myths and legends.[1]   In 1866 a feminist group called the Société pour la Revendication du Droit des Femmes began to meet at the house of André Léo. Members included Louise Michel.

She was with the Commune de Paris, which made its last stand in the cemetery of Montmartre, and was closely allied with Théophile Ferré, who was executed in November 1871.  Michel was loaded onto the ship Virginie on 8 August 1873,[7] to be deported to New Caledonia, where she arrived four months later. Whilst on board, she became acquainted with Henri Rochefort, a famous polemicist, who became her friend until her death. She remained in New Caledonia for seven years, refusing special treatment reserved for women.  Michel’s grave is in the cemetery of Levallois-Perret, in one of the suburbs of Paris.

Your program’s output should include the following persons:

  • Louise Michel
  • Théophile Ferré
  • Henri Rochefort
  • André Léo

…and the following locations:

…and the following organizations:

  • Société pour la Revendication du Droit des Femmes
  • Commune de Paris

Note: if you see inconsistencies in the desired outputs that I’ve given you, how would you suggest resolving them, and what would the resulting changes in the desired outputs be?

This would be a false positive for the semantic class of “person” if your program returned it:

  • Virginie (it’s the name of the ship on which Louise Michel was taken to New Caledonia)

…and a false positive for the semantic class of “location”:

  • Commune de Paris
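One way to make the false positive idea concrete is to score a system's output against the desired output as two sets. The following is only a minimal sketch: the gold list is the list of persons given above, while `system_persons` is a made-up system output (it is not the result of running any real program), and the `score` helper is a hypothetical name introduced for illustration.

```python
def score(predicted, gold):
    """Compare two collections of entity strings.

    Returns (precision, recall, false_positives, false_negatives).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = predicted & gold
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(gold) if gold else 0.0
    return precision, recall, predicted - gold, gold - predicted


# Desired persons, copied from the list above.
gold_persons = {"Louise Michel", "Théophile Ferré", "Henri Rochefort", "André Léo"}

# Hypothetical system output (an assumption for illustration):
# it misses André Léo and wrongly returns the ship's name as a person.
system_persons = {"Louise Michel", "Théophile Ferré", "Henri Rochefort", "Virginie"}

precision, recall, false_positives, false_negatives = score(system_persons, gold_persons)
print(f"precision={precision:.2f} recall={recall:.2f}")
print("false positives:", sorted(false_positives))
print("false negatives:", sorted(false_negatives))
```

Exact string matching is itself a design choice: a system that returned “Michel” instead of “Louise Michel” would be counted as wrong here, and deciding whether that is fair is part of the exercise.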

Some correct pairwise relations would be:

  • Marianne Michel and Etienne Charles Demahis
  • Louise Michel and Etienne Charles Demahis
  • Louise Michel and Marianne Michel
  • Louise Michel and Château of Vroncourt (Haute-Marne)
  • Louise Michel and Société pour la Revendication du Droit des Femmes
  • Louise Michel and Commune de Paris

Some incorrect pairwise relations would be:

  • Marianne Michel and Société pour la Revendication du Droit des Femmes
  • Etienne Charles Demahis and Société pour la Revendication du Droit des Femmes
  • Henri Rochefort and Société pour la Revendication du Droit des Femmes
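When you check your program's relations against lists like the two above, the order of the two entities in a pair should not matter. A minimal sketch of that comparison follows, assuming pairs are treated as unordered sets. The names `as_pairs` and `system_pairs` are hypothetical, and the system output is invented for illustration. Because the lists above are only “some” of the correct and incorrect relations, the sketch counts matches against each list rather than computing recall.

```python
def as_pairs(relations):
    """Turn (a, b) tuples into unordered pairs, so that (a, b) equals (b, a)."""
    return {frozenset(pair) for pair in relations}


SOCIETE = "Société pour la Revendication du Droit des Femmes"

# Copied from the "correct pairwise relations" list above.
known_correct = as_pairs([
    ("Marianne Michel", "Etienne Charles Demahis"),
    ("Louise Michel", "Etienne Charles Demahis"),
    ("Louise Michel", "Marianne Michel"),
    ("Louise Michel", "Château of Vroncourt (Haute-Marne)"),
    ("Louise Michel", SOCIETE),
    ("Louise Michel", "Commune de Paris"),
])

# Copied from the "incorrect pairwise relations" list above.
known_incorrect = as_pairs([
    ("Marianne Michel", SOCIETE),
    ("Etienne Charles Demahis", SOCIETE),
    ("Henri Rochefort", SOCIETE),
])

# Hypothetical system output (an assumption for illustration).
system_pairs = as_pairs([
    ("Etienne Charles Demahis", "Marianne Michel"),  # correct pair, reversed order
    ("Louise Michel", "Commune de Paris"),
    ("Henri Rochefort", SOCIETE),                    # listed above as incorrect
])

confirmed = system_pairs & known_correct
errors = system_pairs & known_incorrect
unjudged = system_pairs - known_correct - known_incorrect

print("confirmed correct:", len(confirmed))
print("known errors:", len(errors))
print("not covered by either list:", len(unjudged))
```

Any pair that lands in the “not covered” group needs a human judgment, which is one reason obtaining evaluation data takes longer than it first appears.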

Sample questions for the final exam

All human languages, but no computer languages, have the property of ____________________.

When a system returns increasing numbers of false positives, then even if we do not know the number of true negatives, the ________________ goes down.  (Two possible correct answers, just give one)

Languages like Finnish, Turkish, and Hungarian, in which words can have multiple morphemes but it is relatively easy to separate them, are known as __________________________.

I may also give you some graphs to examine and ask you to interpret them.

Answers to example questions

All human languages, but no computer languages, have the property of ambiguity.

When a system returns increasing numbers of false positives, then even if we do not know the number of true negatives, the precision or F-measure goes down.  (Two possible correct answers, just give one)
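To see why this answer holds, note that precision is TP / (TP + FP), recall is TP / (TP + FN), and the balanced F-measure is their harmonic mean; none of these uses the number of true negatives. The sketch below holds the true positives and false negatives fixed and raises only the false positives. The counts are invented for illustration, and `precision_recall_f1` is a hypothetical helper name.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and balanced F-measure from raw counts.

    True negatives do not appear anywhere in these formulas.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up counts: true positives and false negatives stay fixed,
# only the number of false positives grows.
tp, fn = 80, 20
results = [precision_recall_f1(tp, fp, fn) for fp in (0, 20, 80)]

for fp, (p, r, f) in zip((0, 20, 80), results):
    print(f"FP={fp:3d}  precision={p:.3f}  recall={r:.3f}  F={f:.3f}")
```

Recall stays the same in every row because it depends only on true positives and false negatives, which is why it is not one of the accepted answers.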

Languages like Finnish, Turkish, and Hungarian, in which words can have multiple morphemes but it is relatively easy to separate them, are known as agglutinative.

 
