: My goal is to index wikipedia in order to demonstrate search to a class of
: middle school kids that I've volunteered to teach for a couple of hours.
: Which brings me to my next question...
twitter data is a little easier to ingest than the wikipedia markup (the json based streaming API gives you each tweet on its own line in a way that's really trivial to convert into CSV with a perl script; a rough sketch is at the end of this mail) and might seem more interesting to kids than wikipedia, while still having some interesting metadata (user, post date, hash tags) and lexicographic challenges (synonyms, abbreviations, @ and # markup, etc...)

: One idea I have is to bring some actual "documents", say a poster board with a
: sentence written largely on it, have the students physically *tokenize* the
: document by cutting it up and lexicographically building the term dictionary.
: Thoughts on taking it further welcome!

cutting up a paper document is a great way to teach textual analysis, but i think the real key is having two copies of multiple documents (3 would be enough) ... cut up one copy of each doc to build the term dictionary and tape all of those to the wall; then tape the second copy of every doc on the wall around them and draw lines from each term to the documents it's in (using a different paper color for each doc would be an easy way to spot which term is in which doc, and what the term frequency is). what you wind up with on the wall is essentially an inverted index; there's a small code sketch of that at the end of this mail too.

-Hoss
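p.s. here's a minimal sketch of the kind of perl script i mean. the field names (user.screen_name, created_at, text) are assumptions based on the classic streaming payload, and the CSV escaping is deliberately naive, so adjust both to whatever your feed actually emits:

#!/usr/bin/perl
# minimal sketch: read line-delimited tweet JSON on stdin, emit CSV on stdout
use strict;
use warnings;
use JSON::PP;   # core module since perl 5.14

my $json = JSON::PP->new;

print "user,created_at,text\n";
while (my $line = <STDIN>) {
    # keep-alive newlines and malformed records just get skipped
    my $tweet = eval { $json->decode($line) } or next;
    # assumed field names; adjust to match your feed
    my @fields = (
        $tweet->{user}{screen_name} // '',
        $tweet->{created_at}        // '',
        $tweet->{text}              // '',
    );
    for (@fields) {
        s/"/""/g;       # naive CSV escaping: double any embedded quotes
        tr/\r\n/  /;    # and flatten newlines inside the tweet text
    }
    print join(',', map { qq{"$_"} } @fields), "\n";
}

pipe the stream into it and you get one CSV row per tweet, ready to index.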
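p.p.s. if you want to show the class the same wall exercise in code afterwards, here's a tiny sketch with made-up documents standing in for the poster boards; sorting the terms gives you the lexicographic term dictionary, and the counts are the term frequencies:

#!/usr/bin/perl
# tiny sketch of the wall exercise: tokenize a few toy "documents", build
# the term dictionary, and record which docs each term appears in (and how
# often), i.e. a miniature inverted index
use strict;
use warnings;

# hypothetical stand-ins for the poster-board documents
my %docs = (
    doc1 => 'The quick brown fox jumps over the lazy dog',
    doc2 => 'A lazy dog naps in the sun',
    doc3 => 'Quick kids ask quick questions',
);

my %index;   # term => { doc id => term frequency }
for my $id (keys %docs) {
    $index{lc $_}{$id}++ for grep { length } split /\W+/, $docs{$id};
}

# the sorted term dictionary, with one "line drawn" per doc a term occurs in
for my $term (sort keys %index) {
    my $postings = $index{$term};
    printf "%-10s -> %s\n", $term,
        join(', ', map { "$_ (tf=$postings->{$_})" } sort keys %$postings);
}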