False Web memories: A case study on finding information about Andrei Broder
First Monday

by Judit Bar–Ilan



Abstract
Andrei Broder, the well-known Internet researcher, does not have a home page of his own. This complicates finding information about him, especially since he has changed employers several times during the last ten years. Of particular interest is the page research.compaq.com/people/Andrei_Broder/bio.html, which, almost seven years after Andrei Broder left Compaq, still appears among the top ten results displayed by Google for the query Andrei Broder as of June 2006. The title of this page is “No such user” and its content is “Sorry, Andrei Broder is no longer working in Compaq Corporate Research.” The case becomes even more interesting because the actual page and the whole site research.compaq.com have been inaccessible since at least March 2006, and Google’s cached copy is from December 2005!

In this paper we investigate the placement of this page at various search engines over the years, and describe searchers’ efforts to find information about the job title and business address of Andrei Broder as of May 2005, when he was still working at IBM.

Contents

Introduction
Rankings of the “No such user” page
Finding current information about Andrei Broder
Two detailed search logs
Conclusion

 


 

Introduction

The Web has become a major source of information for the developed world. With billions of Web pages, one of the major problems is information overload. Search engines are the major information retrieval tools, and they rank their results according to proprietary ranking algorithms. Since the date of publication is not reliable and not always relevant, search engines either do not weigh this factor in their ranking algorithms or assign it a very low weight. Other factors, such as the occurrence, frequency and position of search terms and the links to the specific document, are assigned more considerable weights.

When looking for current information about a person, the date of publication is an important factor, especially if the person changes positions. Queries about the address, position and e–mail of a person are examples of navigational queries (Broder, 2002). Broder (2002), in his paper on “a taxonomy of Web search”, provides a specific example: suppose you want to get information about Donald Knuth; then the page you are probably targeting is Donald Knuth’s homepage at http://www-cs-faculty.stanford.edu/~knuth/. But what happens if you want information about a person who does not have a homepage (like Andrei Broder)? In this case one has to rely on information published by other sources, and Web users should be aware that the information may be inaccurate or out–of–date.

In this paper, we follow the “Web tracks” of Andrei Broder. Andrei worked at DEC SRC from 1984 to 1998. Earlier information about him pre–dates the Web, and thus is not presented here, but the reader may consult Yahoo! Media Relations (2005) or the Yale Information Society Project (2005). In mid–1998 DEC merged with Compaq, and from 1998 to 1999 he was a senior research staff member at Compaq SRC (in 2002 Compaq merged with HP, and Compaq SRC became HP SRC, but by then Andrei Broder had already left Compaq). In August 1999, Andrei Broder was appointed Chief Technology Officer of AltaVista. He joined IBM Research in February 2002 as Distinguished Engineer and CTO of the Institute for Search and Text Analysis, and since November 2005 he has been a Research Fellow and Vice President of Emerging Search Technology at Yahoo! Research (Yahoo! Media Relations, 2005). Thus during the last eight years, Andrei Broder has had a number of job titles and employers. This situation is best demonstrated by the top results of the query Andrei Broder bio retrieved by Google on 1 May 2006 (see Figure 1). Note that the second search result looks very intriguing: “No such user”. In this paper we observe the placement of this page with various search engines over the years. In addition, we analyze the efforts of 49 participants at the 2005 searcher competition in Israel (http://www.afeka.ac.il/siteFiles/1/68/571.asp) to find information about Andrei Broder’s job title and work address at the beginning of May 2005.

++++++++++

Rankings of the “No such user” page

Andrei Broder left Compaq in August 1999. The “no such user” page has two interchangeable URLs, http://research.compaq.com/people/Andrei_Broder/bio.html and http://www.research.digital.com/people/Andrei_Broder/bio.html. The Internet Archive (http://archive.org) has a copy of this page from January 1999, when Andrei Broder was still a Compaq employee (see Figure 2). There are no archived pages from 2000. There are two archived copies from 2001, but both redirect to the copy captured by the Internet Archive in September 2002 (see Figure 3). The page has not changed since then, but for the last few months (at least since March 2006) the server research.compaq.com has been unreachable, even though Google, Yahoo!, Ask and Exalead still index this page as of June 2006, and Google has a cached copy dated December 2005 (the others do not provide cached copies of the page).

We first noticed Google’s search results for Andrei Broder in January 2003; at that time the “no such user” page was the top result (see Figure 4). We reexamined the issue in May 2003, checking the search results of Google, AltaVista, AlltheWeb and Teoma. This time the “no such user” page was number three on Google, number one on AltaVista, number 11 on Teoma and not among the top 30 results on AlltheWeb. At the beginning of May 2005, the page was number two on Google, number three on Teoma (with the http://www.research.digital.com/people/Andrei_Broder/bio.html URL), not among the top 100 on Yahoo! and not among the top 50 on MSN. Finally, at the beginning of May 2006, the page is the eighth result on Google and not among the 1,000 results displayed by Yahoo! (Yahoo! indexed both URLs of the page; they are retrieved as the last two results, numbers 263 and 264, for the query Andrei Broder bio). The URLs are not indexed by MSN. On Ask, variations of the first URL come up as results 24 and 25 (http://www.research.compaq.com/people/Andrei_Broder/bio.html%202), and the second URL is retrieved as result number 82.
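The rank observations reported in this section can be collected into a small data structure, which makes the page's persistence on Google easy to see. The following Python sketch is an illustration only: the `RankObservation` class and its field names are hypothetical, the month-level dates are taken from the text, and we assume that all listed observations refer to the query Andrei Broder. The additional May 2006 details for Yahoo!, MSN and Ask are omitted for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RankObservation:
    """One manual check of where the "no such user" page appeared (hypothetical helper)."""
    month: str            # "YYYY-MM", as reported in the text
    engine: str
    rank: Optional[int]   # None: not found within the results that were checked
    note: str = ""


OBSERVATIONS = [
    RankObservation("2003-01", "Google", 1),
    RankObservation("2003-05", "Google", 3),
    RankObservation("2003-05", "AltaVista", 1),
    RankObservation("2003-05", "Teoma", 11),
    RankObservation("2003-05", "AlltheWeb", None, "not among the top 30"),
    RankObservation("2005-05", "Google", 2),
    RankObservation("2005-05", "Teoma", 3),
    RankObservation("2005-05", "Yahoo!", None, "not among the top 100"),
    RankObservation("2005-05", "MSN", None, "not among the top 50"),
    RankObservation("2006-05", "Google", 8),
]


def ranks_for(engine: str) -> List[Tuple[str, int]]:
    """Return (month, rank) pairs for one engine, oldest first, skipping misses."""
    found = [(o.month, o.rank) for o in OBSERVATIONS
             if o.engine == engine and o.rank is not None]
    return sorted(found)


print(ranks_for("Google"))
# [('2003-01', 1), ('2003-05', 3), ('2005-05', 2), ('2006-05', 8)]
```

Keeping a miss as `None` with a note, rather than as a large number, avoids suggesting a rank that was never actually observed.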

Figure 1: Results of the query Andrei Broder bio on Google as of 1 May 2006.

 

Figure 2: http://www.research.digital.com/SRC/staff/broder/bio.html, as of 17 January 1999.

 

Figure 3: The “no such user” page captured on 6 September 2002 by the Internet Archive.

 

Figure 4: Search results by Google as of 22 January 2003.

Why is this probably non–existent and obviously non–relevant page indexed by three out of the four major search engines six and a half years after Andrei Broder left Compaq? One potential reason is the density of the search terms: the Web page contains only thirteen words, and Andrei and Broder each appear twice in the text and a third time in the URL. Another potential reason could be that a large number of Web pages link to this page. Google does not retrieve a single link to this page; however, it is well known that Google does not retrieve all the links to a Web page (see Forums.Searchenginewatch, 2004 or Bar–Ilan, 2002). Yahoo! retrieved eight pages with links to http://www.research.digital.com/people/Andrei_Broder/bio.html and none to http://www.research.compaq.com/people/Andrei_Broder/bio.html. Thus the page’s high rank is clearly not due to a large number of pages linking to it.
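To put a number on the word counts mentioned above, the share of the page's words that match the query can be computed directly. The short Python sketch below is illustrative only: `query_term_share` is a hypothetical helper, and nothing here is meant to suggest how any search engine actually weighs term frequency.

```python
def query_term_share(term_counts, total_words):
    """Fraction of the words on a page that are query terms."""
    return sum(term_counts.values()) / total_words


# Counts reported in the text: thirteen words, each name appearing twice.
share = query_term_share({"Andrei": 2, "Broder": 2}, total_words=13)
print(f"{share:.1%}")  # 30.8%
```

Under these counts, almost a third of the words on the page are query terms, which is unusually high for a Web page.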

++++++++++

Finding current information about Andrei Broder

Assume that we are in May 2005, and our task is to find Andrei Broder’s job title and business address. This is a typical navigational task (as defined by Broder [2002]); thus the average Internet user would turn to Google (comScore, 2005), submit the query Andrei Broder and take a look at the top three search results (Enquiro, 2005). In this case this strategy does not lead to satisfactory results (see Figure 5). The first result does not say anything about Andrei Broder’s affiliation, while the second and third results convey seemingly contradictory information: the second result indicates that Broder is not affiliated with Compaq, while the third result seems to imply the contrary. Actually, if the third result is clicked, it says “Our collaborators include … Andrei Broder (AltaVista)…”. Note that in May 2005, Andrei Broder was affiliated with IBM. Perhaps the average user would just give up at this point. However, suppose that you are at a search competition where one of the tasks is to answer exactly this question. In this case you would probably not give up so fast, especially since this question seems to be one of the easier questions in the competition.

This was the exact scenario at the semi–finals of the Second Searcher Championship held on 1 May 2005 in Israel. At this stage 60 contestants were invited to the computer labs of the Afeka College and had two hours to answer as many questions as possible correctly. In the preliminary phase, potential contestants had to submit answers to six questions within a limited amount of time. The answers had to be based on information available from the Web. Their answers were graded by the judges, and those contestants whose grades passed a threshold were invited to the semi–finals. Over half of them were either students in various disciplines or programmers/hi–tech workers. More than two–thirds either had university degrees or were studying towards one, and more than two–thirds of the participants reported that they spend twenty hours or more per week on the Internet.

The list of questions (in Hebrew) can be found at http://www.afeka.ac.il/search-form-1.asp?info_id=607. The best four participants competed in the finals at the Teldan Info2005 Conference. The participants were informed that their activities on the Internet were being logged by a monitoring program. The log files enabled us to analyze their search steps towards solving question number 5 (translated from Hebrew): What is the current job title and business address of the Internet researcher Andrei Broder? The question was formulated in Hebrew, but the name Andrei Broder appeared in Latin characters.

Figure 5: Search results by Google as of 3 May 2005.

Almost all of the contestants (49 participants, 82 percent) attempted to answer the question. Their answers are summarized in Table 1. Note that nearly 25 percent of the contestants reached the conclusion that Andrei Broder was still affiliated with AltaVista in May 2005.

 

Table 1: The contestants’ answers (N=49).

| Answer | Number of contestants | Percent of contestants |
|---|---|---|
| correct answer | 16 | 32.7% |
| AltaVista | 12 | 24.5% |
| IBM Watson | 8 | 16.3% |
| correct job title, no address | 6 | 12.2% |
| IBM research | 4 | 8.2% |
| correct address, no job title | 2 | 4.1% |
| IBM + Princeton | 1 | 2.0% |
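The percentages in Table 1 follow directly from the counts. As a quick check, the Python sketch below recomputes them; the dictionary keys copy the answer categories from the table, while the variable and function names are our own illustrative choices. The participation rate uses the 60 contestants invited to the semi–finals.

```python
# Counts copied from Table 1.
ANSWER_COUNTS = {
    "correct answer": 16,
    "AltaVista": 12,
    "IBM Watson": 8,
    "correct job title, no address": 6,
    "IBM research": 4,
    "correct address, no job title": 2,
    "IBM + Princeton": 1,
}
INVITED = 60  # contestants invited to the semi-finals


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


attempted = sum(ANSWER_COUNTS.values())
for answer, count in ANSWER_COUNTS.items():
    print(f"{answer:<30} {count:>2}  {percent(count, attempted):>4}%")
print(f"attempted the question: {attempted} of {INVITED}")
```

The counts sum to 49, which matches the number of contestants who attempted the question.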

 

The first query of the majority of the contestants was simply Andrei Broder (as suggested by Broder [2002] for queries that aim to locate a person’s homepage). Sixteen of the logs were unfortunately missing or corrupted, and Table 2 presents the opening query as recorded in the remaining 33 logs.

 

Table 2: The opening queries (N=33).

| The query | Number of contestants | Percent of contestants |
|---|---|---|
| Andrei Broder | 20 | 60.6% |
| Andrei Broder + extra word(s) | 7 | 21.2% |
| "Andrei Broder" + extra word(s) | 3 | 9.1% |
| "Andrei Broder" | 2 | 6.1% |
| Andrei Broder in Hebrew | 1 | 3.0% |

 

Phrase searches (search strings enclosed in quotation marks) are usually employed to obtain more focused results. In this case, however, this did not prove to be the best strategy, since this kind of search excludes pages where Andrei Broder’s name appears with his middle initial (Z), like the two pages on which all the correct answers were based (see Figures 6 and 7).
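The effect of the middle initial on phrase searches can be illustrated with a toy matcher. The Python sketch below is a simplification under stated assumptions: the tokenizer, the two matching functions and the sample page texts are all hypothetical, and real search engines tokenize and match in their own, undisclosed ways.

```python
import re


def tokens(text):
    """Lowercase word tokens; punctuation such as the period after an initial is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_all_terms(query, page_text):
    """Unquoted query: every query term must occur somewhere on the page."""
    page = set(tokens(page_text))
    return all(term in page for term in tokens(query))


def matches_phrase(query, page_text):
    """Quoted query: the query terms must occur adjacently and in order."""
    q, p = tokens(query), tokens(page_text)
    return any(p[i:i + len(q)] == q for i in range(len(p) - len(q) + 1))


with_initial = "Andrei Z. Broder, IBM Research"   # hypothetical page text
without_initial = "Talk by Andrei Broder"         # hypothetical page text

print(matches_all_terms("Andrei Broder", with_initial))   # True
print(matches_phrase("Andrei Broder", with_initial))      # False
print(matches_phrase("Andrei Broder", without_initial))   # True
```

In this toy model the initial Z sits between the two query terms, so the terms are no longer adjacent and the phrase match fails, while the unquoted query still matches.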

Among the participants starting out with a phrase search, only one was able to answer the question fully and correctly. On the other hand, none of the participants in this group (i.e., starting with a phrase search) reached the conclusion that Andrei Broder was affiliated with AltaVista as of May 2005.

Figure 6: http://www.research.ibm.com/journal/sj/433/brodeaut.html.

 

Figure 7: http://www.computer.org/tab/tclist/tcmf.htm (only accessible through the Internet Archive, captured on 7 March 2005).

Table 3 lists the URLs that were cited more than once as the source of the answer to the question on Andrei Broder’s job title and business address as of May 2005.

 

Table 3: Most frequently provided URLs as source of answer.

| # | URL on which the answer was based | Info in URL | Number of times chosen |
|---|---|---|---|
| 1 | http://www.research.ibm.com/journal/sj/433/brodeaut.html | correct | 14 (29%) |
| 2 | http://www.mitacs.ca/interchange/bios/border.html | job title | 4 (8%) |
| 3 | http://www.mail-archive.com/nlpatumd@yahoogroups.com/msg00008.html | researcher at IBM Watson | 3 (6%) |
| 4 | http://www.cs.toronto.edu/colloq/1999/broder.html | AltaVista | 3 (6%) |
| 5 | http://www.computer.org/tab/tclist/tcmf.htm | correct | 2 (4%) |
| 6 | http://www.infotoday.com/newsbreaks/nb000320-1.htm + http://www.altavista.com/about | AltaVista | 2 (4%) |
| 7 | http://www.cc.gatech.edu/~lingliu/panels/WWWPanelMay13am.htm | job title | 2 (4%) |
| 8 | http://www.ee.technion.ac.il/people/zivby/papers/decay/decay.pdf | address | 2 (4%) |

 

The second URL in Table 3 would probably have occurred more frequently if Broder had been spelled correctly in the URL (notice that the URL ends with border.html). Even so, this URL was visited by eight participants during their search for a complete answer. The third URL was visited by six participants; however, only three of them were satisfied with the answer provided on this page, and the others continued their search. Note the following comment on this page, summarizing the essence of the issue: “Strangely enough, for a person who does research on the web, Andrei doesn’t have a home page, or at least doesn’t have one that can easily be found. Perhaps that should tell us something? :)”. The fourth URL was visited by seven participants, but only three of them based their answer on this URL, even though five of them reached the conclusion that Andrei Broder was affiliated with AltaVista in May 2005.

The average number of URLs visited until the answer was found was eight, calculated over the 33 logs that were not corrupted or missing. For the participants who found the full answer to the question, the average number of Web pages visited was 9.2. Note that the monitoring software did not work properly all the time, so the actual number of Web pages visited could have been larger. Some participants visited more than twenty pages before answering the question, while others visited only two or three.

All except three participants started their searches at Google; the three exceptions started at Yahoo!, AltaVista and Answers.com. Most of them remained “faithful” to Google throughout the process: in addition to using Google, only eight participants submitted queries to AltaVista, five to Yahoo! and one to Teoma. The local search engines at the IBM site and at the ACM Digital Library were used by ten and three participants, respectively. Two participants tried their luck with Yahoo!’s People Search as well. The numbers presented here are based on 33 logs.

As we saw before (see Table 2), most participants started out with the obvious query Andrei Broder; however, the top results for this query on Google were not very helpful. The participants either browsed the search results and modified their queries based on the snippets, or visited a few pages and then either conducted local searches or modified their queries based on the information in the visited pages. These information–seeking processes are consistent with Marcia Bates’ berrypicking model (Bates, 1989), in which users conduct berrypicking (collecting small pieces of information) through evolving searches (the query is modified in accordance with the information gathered over time). In a few cases, the participants based their answer on more than one URL (for example, one for the job title and one for the business address).

These evolving, berrypicking searches are illustrated by detailed descriptions of the search processes of two participants, Participant A who reached the correct answer, and Participant B who decided that Andrei Broder was Vice President of AltaVista.

++++++++++

Two detailed search logs

Participant A

Figure 8: A screen shot from the competition.

  • He decided to visit another page from the search results of the query Andrei Broder haifa: the page http://www2005.org/program-wed.html, which appeared on the results page, since David Carmel from IBM Haifa was a coauthor of Andrei Broder. On this page Andrei Broder’s affiliation appears as the IBM Watson Research Lab.
  • The two above–mentioned pages convinced the participant to modify the query again, this time to Andrei Broder watson site:ibm.com. Before running the new query, he also decided to try the query Andrei Broder on Google Scholar, but seemingly did not visit any of the links from the Google Scholar result page.
  • The snippet, for one of the results, “Towards the next generation of enterprise search technology — Author Bios” (http://www.research.ibm.com/journal/sj/433/brodeaut.html) must have looked promising, because it displayed some address. The participant decided to visit the page that contained the complete answer to the question (see Figure 6).
  • The whole process took fifteen minutes.

Participant B

  • Participant B also started out by searching Google for Andrei Broder. He also visited the top result, DBLP: Andrei Z. Broder (http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/b/Broder:Andrei_Z=.html).
  • Next he decided to visit another page from the search results for Andrei Broder. He skipped the “no such user” result and another page from the Compaq site and clicked on the page entitled “www9 paper” (http://www.almaden.ibm.com/cs/k53/www9.final, now accessible only through the Internet Archive). This is the “Graph structure of the Web” paper, and Andrei Broder’s affiliation is given as the AltaVista Company, San Mateo, CA.
  • In spite of this information, the next search conducted by participant B was Andrei Broder IBM (perhaps because of the snippet of the result following the “www9 paper” page). He did not visit any pages linked from the search result page.
  • He submitted a new query to Google, Andrei Broder AltaVista. At the same time he also visited Yahoo! People Search, but seemingly no actual searches were conducted from there.
  • From the search results of the Andrei Broder AltaVista query, he decided to visit the page entitled “AltaVista Introduces Advanced Search Center, New Resources for Power Searchers and Webmasters” (http://www.infotoday.com/newsbreaks/nb000320-1.htm). The participant did not notice or did not care that this article was dated 20 March 2000. The relevant sentence in this document states “Andrei Broder, vice president of research at AltaVista”.
  • He followed a link from the previous page to AltaVista (http://www.altavista.com) and chose the About AltaVista link (http://www.altavista.com/about/), which contained AltaVista’s address: 74 North Pasadena Avenue, 3rd Floor, Pasadena, California 91103.

In this case the answer was based on two pages, one for the job title and one for the business address. Participant B worked on the answer for six minutes.

++++++++++

Conclusion

One can find information on almost any imaginable topic on the Web; however, sometimes searching the Web is just like looking for a needle in a haystack. The search for current information regarding Andrei Broder in May 2005 is an excellent example of such a search. The first bit of information the user receives is that there is “no such user”. As of today (May 2006) it is much easier to find information on Andrei Broder, because his move to Yahoo! Research was heavily publicized (see Figure 9). As of June 2006, Andrei Broder already has his own page at Yahoo! Research (http://research.yahoo.com/~azbroder), but the profile part of the page is still empty. Thus it seems that Andrei Broder continues to try to keep a low Internet profile, though without much success, since a search for Andrei Broder returns 170,000 results on Google and 37,800 results on Yahoo! as of 14 May 2006. It will be interesting to follow the fate of the “no such user” page. When will it finally disappear from the search results? Note that this page was visited by five participants of the competition.

Figure 9: Google Trends — Broder — peak C is due to Andrei Broder.

The paper also provides insight into the ways experienced Internet searchers try to locate information on the Web. We recommend conducting additional and more comprehensive user studies related to searching on the Web.

 

About the author

Judit Bar–Ilan is a senior lecturer at the Department of Information Science of the Bar–Ilan University, Israel. She received her PhD in computer science from the Hebrew University of Jerusalem. She started her research in information science in the mid–1990s. Her areas of interest include: information retrieval, Internet research, informetrics, information behavior and usability.

 

Acknowledgements

I thank Zvi Schwartzman for coming up with the idea of the search competition, for organizing it and for inviting me to serve as one of the judges of the competition. The competition was held under the auspices of the Afeka — Tel Aviv Academic College of Engineering (http://www.afeka.ac.il/) and Teldan Information Systems Ltd. (http://www.teldan.com). I also thank Andrei Broder for his comments and for suggesting the title of the paper.

 

References

Judit Bar–Ilan, 2002. “How much information search engines disclose on the links to a Web page? – A longitudinal case study of the ‘Cybermetrics’ home page,” Journal of Information Science, volume 28, number 6, pp. 455–466. http://dx.doi.org/10.1177/016555150202800602

Marcia Bates, 1989. “The design of browsing and berrypicking techniques for the online search interface,” Online Review, volume 13, pp. 407–424, at http://www.gseis.ucla.edu/faculty/bates/berrypicking.html, accessed 4 May 2006.

Andrei Broder, 2002. “A taxonomy of Web search,” ACM SIGIR Forum, volume 36, number 2, at http://www.sigir.org/forum/F2002/broder.pdf, accessed 4 May 2006.

comScore, 2005. “comScore reports July 2005 search engine rankings,” at http://www.comscore.com/press/release.asp?press=622, accessed 4 May 2006.

Enquiro, 2005. “Did–It, Enquiro and Eyetools uncover search’s golden triangle,” at http://www.enquiro.com/eye-tracking-pr.asp, accessed 4 May 2006.

Forums.Searchenginewatch, 2004. “Google say not reporting all backlinks ):,” at http://forums.searchenginewatch.com/showthread.php?t=2423&page=1&pp=20, accessed 4 May 2006.

Yahoo! Media Relations, 2005. “Web Expert and Search Pioneer to Apply Information Retrieval Expertise to Yahoo! Research Projects,” at http://docs.yahoo.com/docs/pr/release1271.html, accessed 4 May 2006.

Yale Information Society Project, 2005. At http://islandia.law.yale.edu/isp/regulatingsearch.html#broder, accessed 4 May 2006.

 


Editorial history

Paper received 29 June 2006; accepted 20 July 2006.


Creative Commons License
This work is licensed under a Creative Commons Attribution–Noncommercial–No Derivative Works 2.5 License.

False Web memories: A case study on finding information about Andrei Broder
by Judit Bar–Ilan
First Monday, Volume 11, Number 9 — 4 September 2006
http://www.firstmonday.org/ojs/index.php/fm/article/view/1401/1319






© First Monday, 1995-2017. ISSN 1396-0466.