nedLinks

Results depend on WHAT you search for, HOW you search for it, and WHAT's OUT THERE to be found.

Introduction

nedLinks [1] is a tool for summarizing and organizing large collections of text. It works by identifying key phrases in documents and sorting them into ordered lists.

nedLinks is based on the indexes of print books. It turns out that entries in a book index follow consistent patterns of type and number.

Indexable phrases are either proper nouns, or noun phrases of more than one word.
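The rule above can be approximated without a full grammatical parser. The following is only a rough sketch and not necessarily how nedLinks does it: it swaps in a stopword-delimited chunker as a stand-in for real noun-phrase detection. The stopword list and the function name `candidate_phrases` are hypothetical, and the heuristic will wrongly absorb verbs that are not in the stopword list.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real one would be longer).
STOPWORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to", "for", "by", "with",
    "and", "or", "but", "is", "are", "was", "were", "be", "been", "it",
    "this", "that", "as", "from", "into", "its", "their", "his", "her",
}

def candidate_phrases(text):
    """Return candidate index phrases, in order of appearance.

    Heuristic stand-in for noun-phrase detection:
    - runs of words between stopwords/punctuation are candidates;
    - runs of two or more words are kept;
    - a single word is kept only if it is capitalized and not sentence-initial.
    """
    phrases = []
    for sentence in re.split(r"[.!?]+", text):
        # Words, plus any other non-space character as a separate breaker token.
        tokens = re.findall(r"[A-Za-z][A-Za-z'-]*|[^\sA-Za-z]", sentence)
        run = []  # (token index, word) pairs of the current run
        for i, tok in enumerate(tokens + [","]):  # trailing breaker flushes the last run
            if tok[0].isalpha() and tok.lower() not in STOPWORDS:
                run.append((i, tok))
                continue
            if len(run) > 1:
                phrases.append(" ".join(word for _, word in run))
            elif len(run) == 1:
                idx, word = run[0]
                if word[0].isupper() and idx > 0:
                    phrases.append(word)
            run = []
    return phrases

sample = "The search engine is a research tool. It was built at Stanford."
counts = Counter(candidate_phrases(sample))
print(counts.most_common())
```

Counting the returned phrases with a `Counter` gives the citation numbers that appear in parentheses in the listings further down.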

The Resnikoff-Dolby 30:1 Rule [2] tells us how large the index should be.
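As a worked example, the rule can be turned into simple arithmetic. This sketch assumes the rule is read as "the index should be about one thirtieth the size of the text", with both measured in the same unit; the function name `target_index_size` is hypothetical.

```python
def target_index_size(text_size, ratio=30):
    """Suggested index size for a text of the given size.

    Assumes index size is roughly text size / ratio, with both sizes
    in the same unit (characters, bytes, or words).
    """
    return round(text_size / ratio)

print(target_index_size(18_000))  # an 18K document -> 600
print(target_index_size(70_000))  # a 70K document -> 2333
```

Under that reading, an 18K document calls for roughly 600 characters of index entries, and a 70K document for a bit over 2,300.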

The most-cited terms in an index summarize the text's topic.

The distribution of most-cited terms within an index tells us what sort of text it is. If few terms are highly cited, it is an overview or anthology; if many are highly cited, it is an in-depth treatment of a single topic.
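The two observations above can be sketched in a few lines. The cutoff for "highly cited" is not given in this document, so the `high` threshold below is an arbitrary assumption, as is the function name `summarize`. The sample counts are a handful of entries taken from the Thirty Years War listing in section 1.

```python
from collections import Counter

def summarize(citations, top_n=5, high=5):
    """Return (most-cited phrases, share of phrases cited at least `high` times).

    citations: Counter mapping phrase -> number of citations.
    `high` is an illustrative cutoff, not a value taken from nedLinks.
    """
    top = [phrase for phrase, _ in citations.most_common(top_n)]
    highly_cited = sum(1 for count in citations.values() if count >= high)
    share = highly_cited / len(citations) if citations else 0.0
    return top, share

citations = Counter({
    "Catholic": 20, "Germany": 18, "Ferdinand": 17,
    "Westphalia": 4, "Catholic League": 2, "Peace of Augsburg": 2,
})
top, share = summarize(citations, top_n=3, high=5)
print(top)    # ['Catholic', 'Germany', 'Ferdinand']
print(share)  # 0.5
```

A low share would point toward an overview or anthology, a high share toward an in-depth treatment; where to draw the line is a judgment call.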

nedLinks is also used to create composite indexes for a collection of documents. The set of most-cited phrases summarizes the collection.

Comparing phrases across documents (the same phrase is generally used by multiple authors) also imposes a link-like structure that can be used to organize the collection in various ways. Phrases are bigger than words, yet smaller than documents, and so provide an intermediate level at which to organize data.
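A minimal sketch of both ideas, assuming each document has already been reduced to a count of its indexable phrases. The function names and the document names `doc_a`, `doc_b`, `doc_c` are hypothetical; nedLinks itself may store this differently.

```python
from collections import Counter
from itertools import combinations

def composite_index(doc_phrases):
    """Sum phrase citations over the whole collection.

    doc_phrases: dict mapping document name -> Counter of phrase citations.
    """
    total = Counter()
    for counts in doc_phrases.values():
        total.update(counts)  # Counter.update adds counts together
    return total

def phrase_links(doc_phrases):
    """Map each pair of documents to the set of phrases they share."""
    links = {}
    for a, b in combinations(sorted(doc_phrases), 2):
        shared = set(doc_phrases[a]) & set(doc_phrases[b])
        if shared:
            links[(a, b)] = shared
    return links

doc_phrases = {
    "doc_a": Counter({"Thirty Years War": 3, "Peace of Augsburg": 1}),
    "doc_b": Counter({"Thirty Years War": 2, "Catholic League": 2}),
    "doc_c": Counter({"food pyramid": 4}),
}
composite = composite_index(doc_phrases)
links = phrase_links(doc_phrases)
print(composite.most_common(1))  # [('Thirty Years War', 5)]
print(links)                     # {('doc_a', 'doc_b'): {'Thirty Years War'}}
```

Here the shared phrase acts as the link between `doc_a` and `doc_b`, while `doc_c` stays unconnected.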

1. Phrases as a summary of a document


The most-cited entries of an index summarize a document. Here are two examples. The link will show you the original document. The number of citations for each indexable phrase is given in parentheses.

The Thirty Years War (18K) [3].
	Catholic(20) Germany(18) Ferdinand(17) Habsburg(13) Protestant(13)
	Wallenstein(11) Gustavus(9) Bohemia(9) Empire(9) Spain(9) Gustavus
	Adolphus(8) Maximilian(8) French(7) Spanish(7) Netherlands(7)
	Protestantism(6) northern Germany(6) France(6) Baltic(6) Europe(5)
	Sweden(5) Lutheran(5) Protestant princes(5) Swedes(5) Bohemian(5)
	Westphalia(4) southern Germany(4) Catholicism(4) imperial authority(3)
	Catholic League(2) Peace of Augsburg(2) central Europe(2) southern
	shores(2) Edict of Restitution(2) northern provinces(2) Treaty of
	Pyrenees(2) Treaty of Westphalia(2) northern kingdom(2) Catholic
	revival(2) Catholic Reformation(2) German Protestant princes(2)
      



The original Brin & Page paper describing the structure of Google [4] (70K).
	search engine(86) Google(53) Web(33) PageRank(27) docID(19) search
	result(14) wordID(13) data structures(12) anchor text(12) Page(12)
	bill clinton(12) Bill(12) Clinton(12) quality search(9) Conference(9)
	information retrieval(9) high quality(8) commercial search(7) million
	pages(7) forward index(6) commercial search engines(6) Stanford(6)
	World Wide Web(6) link structure(6) document index(5) Repository(5)
	McBryan(5) link text(5) Figure(4) President(4) external meta
	information(4) ranking function(4) search engine technology(4)
	operating systems(4) anchor hits(4) Marchiori(4) quality search
	results(4) computer science(4) Lawrence Page(4) large-scale search(4)
	business model(4) cellular phone(4) World Wide Web Worm(3) word
	occurrence(3) high quality search(3) research tool(3) high
	precision(3) higher quality search(3) information retrieval systems(3)
	damping factor(3) search quality(3) Pinkerton(3) Sergey Brin(3)
	compact encoding(3) hash table(3) Inverted Index(3) storage space(3)
	large-scale search engine(3) link points(3) reasonable cost(3)
	Stanford University(3) full text(3) Hector Garcia-Molina(3) random
	surfer(3) high PageRank(3) great deal(2) single word(2) robots
	exclusion protocol(2) cooperative agreement(2) Driver Attention(2)
	random page(2) manipulating search engines(2) citation importance(2)
	Order to the Web(2) html version(2) plain hits(2) intuitive
	justification(2) Future Work(2) Text Retrieval Conference(2) search
	terms(2) additional information present(2) multiple file systems(2)
	hits occurring(2) full version(2) storage requirements(2) academic
	search engines(2) data mining(2) current version(2) large amount(2)
	extra words(2) file descriptors(2) hypertextual information(2)
	indexing phase(2) challenging task(2) capitalization information(2)
	name servers(2) HTML tags(2) structure present(2) forward barrels(2)
	reasonable number(2) Rajeev Motwani(2) major search engine(2) higher
	quality search results(2) large part(2) main goal(2) Bill Clinton Joke
	of the Day(2) existing systems(2) Terry Winograd(2) worth looking(2)
	anchors file(2) links database(2) query caching(2) indexer performs(2)
	broken link(2) huge amount(2) research interests(2) complex system(2)
	Santa Clara(2) multiple word(2) ranking system(2) designing Google(2)
	Google search engine(2) short barrel(2) National Science(2) indexing
	system(2) million words(2) relevant documents(2) fair amount(2) fancy
	hits(2) administration cost(2)
      



2. nedLinks, the search tool

This tool shows search results with a combined index for the group, as well as card catalog-like summaries of each document.

The combined index is useful for assessing the overall content of the collection, identifying alternative search terms, and filtering divergent meanings of a search term (e.g. inflation can refer to both economics and cosmology).

(Note that the buttons on the page aren't connected to anything; you're looking at a saved page of results. The links to documents, however, are active, so you can compare the original documents to their summaries. To return to this page, use the 'back' button on your browser.)

#1: search term = 'pyramid'. Note that 'pyramid' can refer to various things: tombs, a dietary guide (= 'the food pyramid'), or swindles ('pyramid schemes').

#2: search term = 'history'. The search term is quite broad. The heterogeneity of the collection is indicated by the rapid drop-off in the number of citations in the combined index: only 5 terms are mentioned in more than 4 documents.

#3: search term = 'middle ages'. The search term is more specific. 9 terms are mentioned in 5 or more documents, showing the increasing homogeneity of the collection.

#4: search term = 'thirty years war'. The search term is quite specific. 18 terms are mentioned in more than 4 documents. Interestingly, quite a number of documents have nothing to do with the 17th-century conflict, other than the name.
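The homogeneity measure used in examples #2 to #4 (how many terms are mentioned in at least some number of documents) can be sketched as follows. The function names and toy data are hypothetical; the default cutoff of 5 mirrors the "more than 4 documents" used above.

```python
from collections import Counter

def document_frequency(doc_phrases):
    """Count, for each phrase, how many documents mention it at least once.

    doc_phrases: iterable of phrase collections, one per document.
    """
    df = Counter()
    for phrases in doc_phrases:
        df.update(set(phrases))  # set() so repeats within a document count once
    return df

def widely_shared(df, min_docs=5):
    """Phrases mentioned in at least `min_docs` documents, sorted by name."""
    return sorted(phrase for phrase, n in df.items() if n >= min_docs)

docs = [
    ["Middle Ages", "Roman Empire", "Middle Ages"],
    ["Middle Ages", "Black Death"],
    ["Middle Ages", "Roman Empire"],
]
df = document_frequency(docs)
print(widely_shared(df, min_docs=2))  # ['Middle Ages', 'Roman Empire']
```

The length of the list returned by `widely_shared` is the number quoted in each example: the longer it is, the more homogeneous the result set.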




3. Automated Directories

The largest collection I have worked on is a subset of the data provided for the Google 2002 programming contest. I created a database of phrases from 190,000 documents comprising 223MB [5]. While this is still peanuts in the grand scheme of things, it provided a look at what might be possible.

The most common phrases in the database were:
	university(11883) program(4778) science(4506) center(4381)
	student(3616) service(3325) research(3141) graduate student(2940)
	college(2896) education(2813) school(2541) library(2437)
	department(2375) united state(2268) new york(2237) state(2214)
	information(2039) office(1956) english(1856) california(1792) home
	page(1788) course(1749) faculty member(1738) web(1651)
      

By taking the most common words in the database and linking them to the most common phrases, I created a directory-like structure of phrases to serve as an index for the entire collection. As with indexes for individual documents, the directory size is guided by the 30:1 rule. The resulting list of phrases referred to 21% of the database in a directory of about a page and a half, and gave a fair sense [6] of the documents' likely origin.
	UNIVERSITY: University of California(1330) University of
	Minnesota(1038) University of Michigan(781) State University(738)
	University of Chicago(684) University of Maryland(680) University of
	Iowa(672) Case Western Reserve University(656) Colorado State
	University(598) University of Arizona(566) Northwestern
	University(552) Boston University(549)

	GRADUATE: graduate students(2940) Graduate School(1295) graduate
	program(1033) Graduate Studies(373) graduate courses(277) graduate
	study(249) graduate degree(175) Graduate College(167) graduate
	education(144) Graduate Assistant(135) graduate credit(131) Graduate
	Admissions(112)

	STATE: United States(2268) State University(738) Colorado State
	University(598) Iowa State University(511) Penn State(496) Florida
	State University(443) Wayne State University(408) Kansas State
	University(404) Arizona State University(275) Ohio State
	University(246) Diego State University(232) Iowa State(207)

	STUDENT: graduate students(2940) international students(634)
	undergraduate students(489) Student Affairs(470) Student Services(435)
	student organizations(423) transfer students(241) Student Life(238)
	Dean of Students(216) Prospective Students(190) medical students(188)
	doctoral students(176)

	SCIENCE: Computer Science(1610) social sciences(984) Political
	Science(749) National Science Foundation(429) Biological Sciences(412)
	Health Sciences(323) Life Sciences(283) Information Science(246)
	Environmental Science(243) Materials Science(232) physical
	sciences(215) Natural Sciences(209)

	SCHOOL: Graduate School(1295) high school(1017) School of
	Medicine(586) Law School(579) medical school(526) School of Law(381)
	School of Music(287) School of Education(215) School of
	Engineering(210) Business School(174) high school students(168)
	Eastman School of Music(159)

	PROGRAM: graduate program(1033) degree program(735) Academic
	Programs(335) research program(269) Honors Program(266) undergraduate
	program(264) certificate program(245) training program(204) doctoral
	program(197) education programs(172) Program Director(146)
	International Programs(145)

	INFORMATION: information technology(816) Information Systems(479)
	General Information(389) additional information(364) Contact
	Information(257) Information Science(246) information resources(149)
	Information Services(147) Geographic Information Systems(113)
	Information Center(113) information technologies(103) information
	session(91)

	EDUCATION: higher education(775) College of Education(313) distance
	education(233) School of Education(215) special education(194) general
	education(193) Physical Education(191) medical education(183)
	education programs(172) general education requirements(169) teacher
	education(151) Health Education(147)

	COURSE: course work(713) Course Description(538) graduate courses(277)
	core courses(234) course materials(188) course requirements(169)
	Course Schedule(161) course offerings(145) undergraduate courses(137)
	level courses(135) elective courses(127) online course(97)

	COLLEGE: College of Engineering(434) College of Arts(420) community
	college(314) College of Education(313) Wellesley College(256) College
	Park(235) Graduate College(167) College of Medicine(162) Fellows of
	Harvard College(156) College of Business(151) College of Liberal
	Arts(137) Mills College(136)

	RESEARCH: research project(616) research interests(417) Research
	Center(318) research program(269) Research Associate(190) research
	paper(177) Research Assistant(151) research group(148) research
	methods(138) undergraduate research(122) Operations Research(121)
	current research(120)

	SERVICE: Student Services(435) Career Services(417) Health
	Services(311) public service(296) community service(248) Human
	Services(177) support services(167) Computing Services(149)
	Information Services(147) Student Health Service(141) food
	service(141) Dining Services(126)

	CENTER: Medical Center(550) Research Center(318) Career Center(253)
	Resource Center(222) Health Sciences Center(151) Student Center(145)
	Counseling Center(145) Cancer Center(131) Health Center(117)
	Information Center(113) Learning Center(113) Computing Center(105)

	SYSTEM: Information Systems(479) operating system(448) Davis Health
	System(163) computer system(161) solar system(148) Health System(134)
	nervous system(113) Geographic Information Systems(113) system
	administrator(89) Systems Engineering(85) control systems(82) immune
	system(76)

	DEPARTMENT: U. S. Department(411) department head(249) department
	chair(220) academic department(142) Computer Science Department(124)
	Department of Mathematics(117) Department of Education(116) Department
	of Chemistry(114) Department of Physics(108) English Department(107)
	Police Department(105) Department of Computer Science(96)

	OFFICE: office hours(501) Office of Admissions(180) Office of the
	Registrar(166) Registrar's Office(137) Dean's Office(125) Admissions
	Office(123) Office of Research(118) Office of the University
	Registrar(111) Financial Aid Office(94) News Office(86) Dean of
	Students Office(80) Study Abroad Office(78)

	LIBRARY: University Library(178) Law Library(159) Library of
	Congress(132) Library Catalog(122) Digital Library(98) Science
	Library(96) Music Library(90) Main Library(89) Engineering Library(81)
	Library Services(77) library staff(77) Library Resources(76)
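A directory like the one above can be sketched by grouping multi-word phrases under each head word and keeping the most-cited ones. This is an illustration under stated assumptions, not the actual nedLinks code: `stem` is a crude plural-stripper standing in for whatever normalization was really used, `build_directory` is a hypothetical name, and the default of twelve entries per head simply matches the listing above. The sample counts are taken from that listing.

```python
def stem(word):
    """Crude plural stripping (an assumption, not the author's stemmer)."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def build_directory(phrase_counts, heads, per_head=12):
    """Group multi-word phrases under each head word, most-cited first.

    phrase_counts: dict mapping phrase -> citation count.
    heads: head words, assumed to be already ordered by frequency.
    """
    directory = {}
    for head in heads:
        matches = [
            (phrase, count)
            for phrase, count in phrase_counts.items()
            if len(phrase.split()) > 1
            and stem(head) in {stem(w) for w in phrase.split()}
        ]
        matches.sort(key=lambda pc: -pc[1])  # highest count first
        directory[head.upper()] = matches[:per_head]
    return directory

phrase_counts = {
    "graduate students": 2940, "Computer Science": 1610,
    "Graduate School": 1295, "high school": 1017,
    "social sciences": 984, "international students": 634,
}
directory = build_directory(phrase_counts, ["student", "school"], per_head=2)
for head, entries in directory.items():
    print(head + ": " + " ".join(f"{p}({c})" for p, c in entries))
```

Because a phrase is filed under every head word it contains, the same entry can appear more than once, just as "Graduate School(1295)" appears under both GRADUATE and SCHOOL in the listing.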
      



4. Conclusion

These are examples of how large bodies of text can be self-organized by linking together indexable phrases. The strength of the process is that it is guided by the theory and practice of conventional indexes.

For a search engine company, these techniques can add value as a data-processing step that organizes data into self-imposed categories. Users can be prompted with alternative search terms, or can simply navigate information space without necessarily pre-selecting any particular search term.

On the desktop, these techniques can underlie an application that assists in sifting large volumes of text by making the various phrases 'clickable'. Re-ordering a quantity of data can make it tangible, whether one is researching a new TV or a term paper.



Notes

[1] I'm not attached to the name, but I had to call it something.

[2] See, for example, "Yes, there are scaling constants to knowledge catagorization", or do a Google search on "Resnikoff Dolby Rule".

[3] And I think this is about all we need to know about the Thirty Years War :-).

[4] You may disregard the disclaimer that Google is not affiliated with the authors of this page nor responsible for its content.

[5] That is, 223MB of text; HTML not included.

[6] It seems safe to say that Google chose to provide data generally from university web sites.