nedLinks

Results depend on WHAT you search for, HOW you search for it, and WHAT's OUT THERE to be found.

Introduction

nedLinks [1] is a tool for summarizing and organizing large collections of text. It works by identifying key phrases in documents and sorting them into ordered lists.

nedLinks is based on the indexes of print books. It turns out that entries in a book index follow consistent patterns of type and number.

Indexable phrases are either proper nouns, or noun phrases of more than one word.
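As a rough illustration of this rule, here is a minimal sketch of a phrase extractor. It is not the nedLinks code. It assumes a crude heuristic in place of real noun-phrase detection: break the text at punctuation and at a small, hypothetical stopword list, then keep runs of two or more words, plus single capitalized words as stand-ins for proper nouns. The name `candidate_phrases` and the stopword list are made up for the example, and the heuristic will over-generate (it cannot tell a verb from a noun).

```python
import re

# Small illustrative stopword list (an assumption; a real list would be longer).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "and", "or", "is",
             "are", "was", "were", "to", "by", "for", "with", "it", "that"}

def candidate_phrases(text):
    """Return capitalized single words and multi-word runs, in text order."""
    phrases = []
    # Break at punctuation so a phrase never spans a clause boundary.
    for chunk in re.split(r'[.,;:!?()"]+', text):
        run = []
        for word in chunk.split() + [""]:  # the trailing "" flushes the last run
            if word and word.lower() not in STOPWORDS:
                run.append(word)
                continue
            # A stopword (or end of chunk) closes the current run.
            if len(run) > 1 or (len(run) == 1 and run[0][0].isupper()):
                phrases.append(" ".join(run))
            run = []
    return phrases
```

For the sentence "The search engine was built at Stanford." this returns `['search engine', 'Stanford']`: one multi-word phrase and one capitalized word.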

The Resnikoff-Dolby 30:1 Rule [2] tells us how large the index should be.
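To make the arithmetic concrete, here is a small sketch that assumes the rule is read as "the index should be about one-thirtieth the size of the text", with size measured in characters. Both the reading and the unit are my assumptions for the example, and the function names are hypothetical.

```python
def index_budget(text_chars, ratio=30):
    """Target index size in characters under a 30:1 text-to-index ratio."""
    return text_chars // ratio

def trim_to_budget(ranked_entries, budget):
    """Keep the highest-ranked entries whose printed form fits the budget.

    ranked_entries is a list of (phrase, count) pairs, most-cited first.
    """
    kept, used = [], 0
    for phrase, count in ranked_entries:
        cost = len(f"{phrase}({count})") + 1  # +1 for the separating space
        if used + cost > budget:
            break
        kept.append((phrase, count))
        used += cost
    return kept
```

Under these assumptions, an 18,000-character text gets an index budget of 600 characters, and the ranked phrase list is simply cut off where it would overrun that budget.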

The most-cited terms in an index summarize the text's topic.

The distribution of most-cited terms within an index tells us what sort of text it is. If few terms are highly cited, it is an overview or anthology; if many are highly cited, it is an in-depth treatment of a single topic.
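The counting behind these two observations can be sketched as follows. This assumes a "citation" is simply one occurrence of a phrase in the text, and it leaves the cutoff for "highly cited" as a parameter, since no particular value is fixed here. The function names are hypothetical.

```python
from collections import Counter

def citation_counts(phrases):
    """Count how often each indexable phrase occurs in one document."""
    return Counter(phrases)

def highly_cited(counts, threshold):
    """Phrases cited at least `threshold` times, most-cited first."""
    return [p for p, c in counts.most_common() if c >= threshold]
```

The first few entries of `counts.most_common()` serve as the topic summary. The length of `highly_cited(counts, threshold)` relative to the whole index gives a rough sense of whether citations are concentrated in a few terms or spread across many.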

nedLinks is also used to create composite indexes for a collection of documents. The set of most-cited phrases summarizes the collection.

Because the same phrases are generally used by multiple authors, comparing phrases across documents also imposes a link-like structure that can be used to organize the collection in various ways. Phrases are bigger than words, yet smaller than documents, and provide an intermediate level at which to organize data.
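One way to picture the link-like structure is the sketch below. It assumes each document has already been reduced to a list of indexable phrases, and it treats two documents as linked whenever they share a phrase. The data layout and function names are illustrative assumptions, not a description of the actual tool.

```python
from collections import defaultdict
from itertools import combinations

def composite_index(doc_phrases):
    """Map each phrase to the set of documents that use it.

    doc_phrases maps a document id to that document's list of phrases.
    """
    index = defaultdict(set)
    for doc_id, phrases in doc_phrases.items():
        for phrase in phrases:
            index[phrase].add(doc_id)
    return index

def shared_phrase_links(index):
    """Map each pair of documents to the phrases they have in common."""
    links = defaultdict(set)
    for phrase, docs in index.items():
        for a, b in combinations(sorted(docs), 2):
            links[(a, b)].add(phrase)
    return links
```

The composite index doubles as the collection summary (sort phrases by how many documents use them), while the pairwise links give a graph that can be walked or clustered.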

1. Phrases as a summary of a document


The most-cited entries of an index summarize a document. Here are two examples. The link will show you the original document. The number of citations for each indexable phrase is given in parentheses.

The Thirty Years War (18K) [3].
Catholic(20) Germany(18) Ferdinand(17) Habsburg(13) Protestant(13)
Wallenstein(11) Gustavus(9) Bohemia(9) Empire(9) Spain(9) Gustavus
Adolphus(8) Maximilian(8) French(7) Spanish(7) Netherlands(7)
Protestantism(6) northern Germany(6) France(6) Baltic(6) Europe(5)
Sweden(5) Lutheran(5) Protestant princes(5) Swedes(5) Bohemian(5)
Westphalia(4) southern Germany(4) Catholicism(4) imperial authority(3)
Catholic League(2) Peace of Augsburg(2) central Europe(2) southern
shores(2) Edict of Restitution(2) northern provinces(2) Treaty of
Pyrenees(2) Treaty of Westphalia(2) northern kingdom(2) Catholic
revival(2) Catholic Reformation(2) German Protestant princes(2)



The original Brin & Page paper describing the structure of Google [4] (70K).
search engine(86) Google(53) Web(33) PageRank(27) docID(19) search
result(14) wordID(13) data structures(12) anchor text(12) Page(12)
bill clinton(12) Bill(12) Clinton(12) quality search(9) Conference(9)
information retrieval(9) high quality(8) commercial search(7) million
pages(7) forward index(6) commercial search engines(6) Stanford(6)
World Wide Web(6) link structure(6) document index(5) Repository(5)
McBryan(5) link text(5) Figure(4) President(4) external meta
information(4) ranking function(4) search engine technology(4)
operating systems(4) anchor hits(4) Marchiori(4) quality search
results(4) computer science(4) Lawrence Page(4) large-scale search(4)
business model(4) cellular phone(4) World Wide Web Worm(3) word
occurrence(3) high quality search(3) research tool(3) high
precision(3) higher quality search(3) information retrieval systems(3)
damping factor(3) search quality(3) Pinkerton(3) Sergey Brin(3)
compact encoding(3) hash table(3) Inverted Index(3) storage space(3)
large-scale search engine(3) link points(3) reasonable cost(3)
Stanford University(3) full text(3) Hector Garcia-Molina(3) random
surfer(3) high PageRank(3) great deal(2) single word(2) robots
exclusion protocol(2) cooperative agreement(2) Driver Attention(2)
random page(2) manipulating search engines(2) citation importance(2)
Order to the Web(2) html version(2) plain hits(2) intuitive
justification(2) Future Work(2) Text Retrieval Conference(2) search
terms(2) additional information present(2) multiple file systems(2)
hits occurring(2) full version(2) storage requirements(2) academic
search engines(2) data mining(2) current version(2) large amount(2)
extra words(2) file descriptors(2) hypertextual information(2)
indexing phase(2) challenging task(2) capitalization information(2)
name servers(2) HTML tags(2) structure present(2) forward barrels(2)
reasonable number(2) Rajeev Motwani(2) major search engine(2) higher
quality search results(2) large part(2) main goal(2) Bill Clinton Joke
of the Day(2) existing systems(2) Terry Winograd(2) worth looking(2)
anchors file(2) links database(2) query caching(2) indexer performs(2)
broken link(2) huge amount(2) research interests(2) complex system(2)
Santa Clara(2) multiple word(2) ranking system(2) designing Google(2)
Google search engine(2) short barrel(2) National Science(2) indexing
system(2) million words(2) relevant documents(2) fair amount(2) fancy
hits(2) administration cost(2)



2. nedLinks, the search tool

This tool shows search results with a combined index for the group, as well as card catalog-like summaries of each document.

The combined index is useful for assessing the overall content of the collection, identifying alternative search terms, and filtering divergent meanings of a search term (e.g. inflation can refer to both economics and cosmology).

(Be advised that the buttons on the page aren't connected to anything; you're looking at a saved page of results. The links to documents, however, are active, so you can compare the original documents to their summaries. To return to this page, use the 'back' button on your browser.)

#1: search term = 'pyramid'. Note that 'pyramid' can refer to various things: tombs, a dietary guide (= 'the food pyramid'), or swindles ('pyramid schemes').

#2: search term = 'history'. The search term is quite broad. The heterogeneity of the collection is indicated by the rapid drop-off in the number of citations in the combined index: only 5 terms are mentioned in more than 4 documents.

#3: search term = 'middle ages'. The search term is more specific. 9 terms are mentioned in 5 or more documents, showing the increasing homogeneity of the collection.

#4: search term = 'thirty years war'. The search term is quite specific. 18 terms are mentioned in more than 4 documents. Interestingly, quite a number of documents have nothing to do with the 17th-century conflict, other than the name.
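The homogeneity figures quoted in these examples (terms mentioned in more than 4 documents) can be computed along the following lines. This is a sketch that assumes each search result has already been reduced to a list of phrases; the function names and the default cutoff of 5 documents are illustrative.

```python
from collections import Counter

def document_frequency(doc_phrases):
    """Count, for each phrase, how many documents mention it at least once."""
    df = Counter()
    for phrases in doc_phrases:
        df.update(set(phrases))  # set() so repeats within a document count once
    return df

def widely_shared_terms(doc_phrases, min_docs=5):
    """Phrases mentioned in at least `min_docs` documents, alphabetically."""
    df = document_frequency(doc_phrases)
    return sorted(p for p, n in df.items() if n >= min_docs)
```

The length of the returned list is the number being compared across searches: a short list suggests a heterogeneous result set, a longer one a more homogeneous set.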




3. Automated Directories

The largest collection I have worked on is a subset of the data provided for the Google 2002 programming contest. I created a database of phrases from 190,000 documents comprising 223 MB [5]. While this is still peanuts in the grand scheme of things, it provided a look at what might be possible.

The most common phrases in the database were:
university(11883) program(4778) science(4506) center(4381)
student(3616) service(3325) research(3141) graduate student(2940)
college(2896) education(2813) school(2541) library(2437)
department(2375) united state(2268) new york(2237) state(2214)
information(2039) office(1956) english(1856) california(1792) home
page(1788) course(1749) faculty member(1738) web(1651)

By taking the most common words in the database, and linking them to the most common phrases, I created a directory-like structure of phrases to serve as an index for the entire collection. As with indexes for individual documents, the directory size is guided by the 30:1 rule. The resulting list of phrases alluded to 21% of the database in a directory of about a page and a half, and gave a fair sense [6] of their likely origin.
UNIVERSITY: University of California(1330) University of
Minnesota(1038) University of Michigan(781) State University(738)
University of Chicago(684) University of Maryland(680) University of
Iowa(672) Case Western Reserve University(656) Colorado State
University(598) University of Arizona(566) Northwestern
University(552) Boston University(549)

GRADUATE: graduate students(2940) Graduate School(1295) graduate
program(1033) Graduate Studies(373) graduate courses(277) graduate
study(249) graduate degree(175) Graduate College(167) graduate
education(144) Graduate Assistant(135) graduate credit(131) Graduate
Admissions(112)

STATE: United States(2268) State University(738) Colorado State
University(598) Iowa State University(511) Penn State(496) Florida
State University(443) Wayne State University(408) Kansas State
University(404) Arizona State University(275) Ohio State
University(246) Diego State University(232) Iowa State(207)

STUDENT: graduate students(2940) international students(634)
undergraduate students(489) Student Affairs(470) Student Services(435)
student organizations(423) transfer students(241) Student Life(238)
Dean of Students(216) Prospective Students(190) medical students(188)
doctoral students(176)

SCIENCE: Computer Science(1610) social sciences(984) Political
Science(749) National Science Foundation(429) Biological Sciences(412)
Health Sciences(323) Life Sciences(283) Information Science(246)
Environmental Science(243) Materials Science(232) physical
sciences(215) Natural Sciences(209)

SCHOOL: Graduate School(1295) high school(1017) School of
Medicine(586) Law School(579) medical school(526) School of Law(381)
School of Music(287) School of Education(215) School of
Engineering(210) Business School(174) high school students(168)
Eastman School of Music(159)

PROGRAM: graduate program(1033) degree program(735) Academic
Programs(335) research program(269) Honors Program(266) undergraduate
program(264) certificate program(245) training program(204) doctoral
program(197) education programs(172) Program Director(146)
International Programs(145)

INFORMATION: information technology(816) Information Systems(479)
General Information(389) additional information(364) Contact
Information(257) Information Science(246) information resources(149)
Information Services(147) Geographic Information Systems(113)
Information Center(113) information technologies(103) information
session(91)

EDUCATION: higher education(775) College of Education(313) distance
education(233) School of Education(215) special education(194) general
education(193) Physical Education(191) medical education(183)
education programs(172) general education requirements(169) teacher
education(151) Health Education(147)

COURSE: course work(713) Course Description(538) graduate courses(277)
core courses(234) course materials(188) course requirements(169)
Course Schedule(161) course offerings(145) undergraduate courses(137)
level courses(135) elective courses(127) online course(97)

COLLEGE: College of Engineering(434) College of Arts(420) community
college(314) College of Education(313) Wellesley College(256) College
Park(235) Graduate College(167) College of Medicine(162) Fellows of
Harvard College(156) College of Business(151) College of Liberal
Arts(137) Mills College(136)

RESEARCH: research project(616) research interests(417) Research
Center(318) research program(269) Research Associate(190) research
paper(177) Research Assistant(151) research group(148) research
methods(138) undergraduate research(122) Operations Research(121)
current research(120)

SERVICE: Student Services(435) Career Services(417) Health
Services(311) public service(296) community service(248) Human
Services(177) support services(167) Computing Services(149)
Information Services(147) Student Health Service(141) food
service(141) Dining Services(126)

CENTER: Medical Center(550) Research Center(318) Career Center(253)
Resource Center(222) Health Sciences Center(151) Student Center(145)
Counseling Center(145) Cancer Center(131) Health Center(117)
Information Center(113) Learning Center(113) Computing Center(105)

SYSTEM: Information Systems(479) operating system(448) Davis Health
System(163) computer system(161) solar system(148) Health System(134)
nervous system(113) Geographic Information Systems(113) system
administrator(89) Systems Engineering(85) control systems(82) immune
system(76)

DEPARTMENT: U. S. Department(411) department head(249) department
chair(220) academic department(142) Computer Science Department(124)
Department of Mathematics(117) Department of Education(116) Department
of Chemistry(114) Department of Physics(108) English Department(107)
Police Department(105) Department of Computer Science(96)

OFFICE: office hours(501) Office of Admissions(180) Office of the
Registrar(166) Registrar's Office(137) Dean's Office(125) Admissions
Office(123) Office of Research(118) Office of the University
Registrar(111) Financial Aid Office(94) News Office(86) Dean of
Students Office(80) Study Abroad Office(78)

LIBRARY: University Library(178) Law Library(159) Library of
Congress(132) Library Catalog(122) Digital Library(98) Science
Library(96) Music Library(90) Main Library(89) Engineering Library(81)
Library Services(77) library staff(77) Library Resources(76)
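A directory of this shape could be assembled roughly as follows. This is a minimal sketch, not the code that produced the listing above. It assumes phrase counts are held in a dictionary, that a head word matches a phrase when their crudely de-pluralized forms agree, and that each heading keeps its twelve most common multi-word phrases, as the listing does. The names `stem` and `build_directory` are hypothetical.

```python
def stem(word):
    """Crude singular form: lowercase and drop a trailing 's' (an assumption)."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def build_directory(phrase_counts, head_words, per_head=12):
    """Group the most common multi-word phrases under each head word."""
    directory = {}
    for head in head_words:
        matches = [(p, c) for p, c in phrase_counts.items()
                   if len(p.split()) > 1
                   and stem(head) in {stem(w) for w in p.split()}]
        # Most common first; ties broken alphabetically for a stable order.
        matches.sort(key=lambda pc: (-pc[1], pc[0]))
        directory[head.upper()] = matches[:per_head]
    return directory
```

Because a phrase is filed under every head word it contains, the same phrase can appear under more than one heading, which is how 'graduate students' ends up under both GRADUATE and STUDENT.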



4. Conclusion

These are examples of how large bodies of text can be self-organized by linking together indexable phrases. The strength of the process is that it is guided by the theory and practice of conventional indexes.

For a search engine company, these data processing techniques can add value by organizing data into self-imposed categories. Users can be prompted with alternative search terms, or can simply navigate the information space without pre-selecting any particular search term.

On the desktop, these techniques can underlie an application that helps sift large volumes of text by making the various phrases 'clickable'. Re-ordering a quantity of data can make it tangible, whether one is researching a new TV or a term paper.



Notes

[1] I'm not attached to the name, but I had to call it something.

[2] See, for example, "Yes, there are scaling constants to knowledge catagorization", or do a Google search on "Resnikoff Dolby Rule".

[3] And I think this is about all we need to know about the Thirty Years War :<).

[4] You may disregard the disclaimer that Google is not affiliated with the authors of this page nor responsible for its content.

[5] Of text; HTML not included.

[6] It seems safe to say that Google chose to provide data generally from university web sites.