CS691: Big Data Systems

 

Fall 2012

     

Accessing these traces requires an account on the department's 'obelix' cluster. If you do not already have an account, contact Tyler Trafford <trafford.cs.umass.edu> to obtain one.

Notes from Emmanuel Cecchet about this dataset are below.

The data is on bigbackup:/srv/backup2/wikibench, which is also mounted on obelix:
cecchet@obelix.cs.umass.edu:/nfs/bigbackup/wikibench/traces>ls
2009-10 2010-01 2010-03 2010-06 filter_en_wikibooks.sh index.html?C=D;O=D index.html?C=N;O=A index.html?C=S;O=D
2009-11 2010-02 2010-04 2010-07 index.html index.html?C=M;O=A index.html?C=N;O=D mysql
2009-12 2010-02.03-wikibooks 2010-05 2010-08 index.html?C=D;O=A index.html?C=M;O=D index.html?C=S;O=A wikibooks

There is one directory per month.
A month of data is approximately 250 GB compressed:
cecchet@obelix.cs.umass.edu:/nfs/bigbackup/wikibench/traces>du --si 2010-05
257G 2010-05

The compression factor is about 6, so expect one month of traces to be about 1.5 TB uncompressed:
cecchet@obelix.cs.umass.edu:/nfs/bigbackup/wikibench/traces/2010-05>ls -lh wiki.1274957416.gz
-rw-r--r-- 1 cecchet lass 63M May 27 2010 wiki.1274957416.gz
cecchet@obelix.cs.umass.edu:/nfs/bigbackup/wikibench/traces/2010-05>ls -lh wiki.1274957416
-rw-r--r-- 1 cecchet lass 376M May 27 2010 wiki.1274957416
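
The factor of 6 and the 1.5 TB estimate can be checked from the listings above. A minimal sketch of the arithmetic, using only the sizes shown there (the variable names are illustrative, and the small difference between the binary units of ls -lh and the decimal units of du --si is ignored):

```python
# Sizes copied from the listings above.
compressed_mb = 63         # wiki.1274957416.gz, from ls -lh
uncompressed_mb = 376      # wiki.1274957416, from ls -lh
month_compressed_gb = 257  # 2010-05 directory, from du --si

# The ratio is unitless, so it can be applied to the monthly total.
ratio = uncompressed_mb / compressed_mb
month_uncompressed_tb = month_compressed_gb * ratio / 1000

print(f"compression factor: {ratio:.2f}")
print(f"one month uncompressed: about {month_uncompressed_tb:.2f} TB")
```

This is an estimate from a single file; other files may compress differently.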

Each line in the log consists of:

  • a unique request id,
  • a timestamp,
  • the URL being accessed (there is no information on the origin of the request, though you can sometimes find more information embedded in the URL).

154132716 2010-05-27T09:31:55.63 http://en.m.wikipedia.org/wiki/Judge_Me_Tender -
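
A line like the one above can be split into its fields with a few lines of Python. This is a minimal sketch, not an official parser: the names TraceRecord and parse_trace_line are hypothetical, the timestamp format is inferred from the single sample line, and the meaning of the trailing "-" field is not documented here, so it is kept as an opaque string.

```python
from datetime import datetime
from typing import NamedTuple


class TraceRecord(NamedTuple):
    request_id: int
    timestamp: datetime
    url: str
    extra: str  # trailing field ("-" in the sample); meaning not documented


def parse_trace_line(line: str) -> TraceRecord:
    """Split one trace line into its whitespace-separated fields."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}: {line!r}")
    request_id, raw_ts, url = fields[0], fields[1], fields[2]
    extra = " ".join(fields[3:])
    # Assumption: every timestamp looks like the sample, with a fractional part.
    # Python reads ".63" as 0.63 s; the traces may intend something else.
    timestamp = datetime.strptime(raw_ts, "%Y-%m-%dT%H:%M:%S.%f")
    return TraceRecord(int(request_id), timestamp, url, extra)


sample = "154132716 2010-05-27T09:31:55.63 http://en.m.wikipedia.org/wiki/Judge_Me_Tender -"
record = parse_trace_line(sample)
print(record.request_id, record.timestamp.isoformat(), record.url)
```

Since a month of traces is large, it is probably easier to stream the compressed files line by line (for example with gzip.open(path, "rt") from the standard library) than to decompress them to disk first.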

The traces originally come from Guillaume Pierre in Amsterdam. They published a paper about the workload: http://dl.acm.org/citation.cfm?id=1551224

Here is what I figured out so far from the URLs:

Wikibooks traces

The parameters accepted by index.php are described at http://www.mediawiki.org/wiki/Manual:Parameters_to_index.php

  • Wiki page access (GET):
    • http://en.wikibooks.org/wiki/NameOfThePage (this is what the user has clicked on)
    • Show random book (read interaction):
      • http://en.wikibooks.org/w/api.php?action=query&indexpageids=1&generator=random&grnnamespace=0%7C110&grnlimit=10&prop=categories&cllimit=100&format=json&callback=showRandBookCB&requestid=rb4
    • Rendering (bookcmd=rendering):
      • http://en.wikibooks.org/w/index.php?title=Special:Book&bookcmd=rendering&return_to=Special%3ABook&collection_id=2d1a78448da71d72&writer=rl
  • Write and Special actions:
    • POST
      • Edit a page:
        • http://en.wikibooks.org/w/index.php?title=Oracle_Programming/10g_Advanced_SQL&action=edit&section=3
      • Submit (when edit is done):
        • http://en.wikibooks.org/w/index.php?title=A-level_Biology/Human_Health_and_Disease/infectious_diseases&action=submit
    • RSS feeds (subscription to an RSS or Atom feed):
      • http://en.wikibooks.org/w/index.php?title=Special:RecentChanges&feed=atom
      • http://en.wikibooks.org/w/index.php?title=Special:RecentChanges&feed=rss
    • Search (when &suggest is appended, the request was generated automatically by the web browser each time the user hit a key in the search field, in order to fetch search suggestions):
      • http://en.wikibooks.org/w/api.php?action=opensearch&search=carcinoi&namespace=0%7C4%7C112&suggest
    • Login:
      • http://en.wikibooks.org/w/index.php?title=Special:UserLogin&type=signup
      • From a page:
        • http://en.wikibooks.org/w/index.php?title=Special:UserLogin&returnto=Help:Collections
        • http://en.wikibooks.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login&returnto=Help:Collections
    • Logout:
      • http://en.wikibooks.org/w/index.php?title=Special:UserLogout&returnto=Main_Page
    • Versioning (requires revisions of pages):
      • http://en.wikibooks.org/w/index.php?title=Na%27vi/Verbs&diff=1720852&oldid=prev
      • http://en.wikibooks.org/w/index.php?diff=1725464&oldid=1725463&rcid=1733365&diffonly=1&action=render
      • http://192.168.245.200/w/index.php?title=Talk:French&action=history
  • Ignored interactions:
    • Images (downloaded as part of the page):
      • http://upload.wikimedia.org/wikibooks/en/1/18/CompRad1.jpg (actually located at http://IP/images/1/18/CompRad1.jpg)
      • http://upload.wikimedia.org/wikibooks/en/thumb/0/00/Nonlinear_separable.JPG/300px-Nonlinear_separable.JPG
    • CentralAuth (http://www.mediawiki.org/wiki/Extension:CentralAuth) for authentication among multiple wikis (shared accounts):
      • http://en.wikibooks.org/w/api.php?action=query&meta=globaluserinfo&guiprop=merged%7Cunattached&format=json&guiuser=Karmine201
    • Talk?
      • http://en.wikibooks.org/w/api.php?inprop=protection%7Ctalkid%7Csubjectid%7Curl%7Creadable&format=json&rvprop=content%7Cids%7Cflags%7Ctimestamp%7Cuser%7Ccomment%7Csize&prop=revisions%7Cinfo&titles=User%20talk%3ALaleena&rvlimit=1&action=query
    • Redirects:
      • http://en.wikibooks.org/w/api.php?redirects=1&tllimit=500&format=json&rvprop=ids%7Ccontent%7Ctimestamp%7Cuser&prop=revisions%7Ccategories&titles=General+Chemistry%2FThermodynamics%2FThe+First+Law+of+Thermodynamics%7CGeneral+Chemistry%2FSolubility%7CTemplate:Element+color%2FAlkaline+earth+metals%7CTemplate:Element+color%2FAlkaline+earth+metals%2FPrint%7CGeneral+Chemistry%2FIntroduction%7CGeneral+Chemistry%2FThermodynamics%2FIntroduction%7CGeneral+Chemistry%2FChemical+Equilibria%2FLe+Chatelier%27s+Principle%7CGeneral+Chemistry%2FChemical+Equilibria%2FEquilibrium%7CTemplate:General+Chemistry%2FNavigation%7CTemplate:General+Chemistry%2FNavigation%2FPrint%7CGeneral+Chemistry%2FReaction+Mechanisms%7CGeneral+Chemistry%2FChemical+Equilibria%2FSolutions+in+Equilibrium%7CGeneral+Chemistry%2FThermodynamics%2FThe+Second+Law+of+Thermodynamics%7CGeneral+Chemistry%2FBook+Cover%7CGeneral+Chemistry%2FChemistries+of+Various+Elements%2FGroup+2&action=query&imlimit=500
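
The URL patterns listed above can be turned into a first-pass filter keyed on the host, the path and the query parameters. The sketch below is only an illustration under stated assumptions: the function name classify and the label strings are hypothetical, the rules cover only the patterns listed above, and anything else falls into a catch-all label.

```python
from urllib.parse import urlsplit, parse_qs


def first(query: dict, key: str) -> str:
    """Return the first value of a query parameter, or '' if it is absent."""
    return query.get(key, [""])[0]


def classify(url: str) -> str:
    """Return a coarse interaction label for a trace URL (labels are hypothetical)."""
    parts = urlsplit(url)
    # keep_blank_values=True keeps a bare flag such as "&suggest" in the result.
    query = parse_qs(parts.query, keep_blank_values=True)
    path = parts.path
    title = first(query, "title")
    action = first(query, "action")

    if parts.netloc == "upload.wikimedia.org":
        return "image"
    if path.startswith("/wiki/"):
        return "page_view"
    if path.endswith("/api.php"):
        if action == "opensearch":
            return "search_suggest" if "suggest" in query else "search"
        if first(query, "generator") == "random":
            return "random_book"
        if first(query, "meta") == "globaluserinfo":
            return "central_auth"
        return "api_other"
    if path.endswith("/index.php"):
        if "feed" in query:
            return "feed"
        if title.startswith("Special:UserLogin"):
            return "login"
        if title.startswith("Special:UserLogout"):
            return "logout"
        if first(query, "bookcmd") == "rendering":
            return "rendering"
        if action == "edit":
            return "edit"
        if action == "submit":
            return "submit"
        if "diff" in query or "oldid" in query or action == "history":
            return "versioning"
        return "index_other"
    return "other"


examples = [
    "http://en.wikibooks.org/wiki/NameOfThePage",
    "http://en.wikibooks.org/w/index.php?title=Special:RecentChanges&feed=atom",
    "http://en.wikibooks.org/w/api.php?action=opensearch&search=carcinoi&namespace=0%7C4%7C112&suggest",
    "http://192.168.245.200/w/index.php?title=Talk:French&action=history",
]
for example in examples:
    print(classify(example), example)
```

The rules are checked in a fixed order, so a login URL that also carries action=submitlogin is labeled as a login rather than falling through to the other action checks. Which labels count as "ignored" is left to the caller.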