Version 0.6
Copyright © 2009 - 2010 Lars Vogel
11.01.2010
| Revision History | ||
|---|---|---|
| Revision 0.1 | 15.09.2009 | Lars Vogel |
| Created | ||
| Revision 0.2 | 21.09.2009 | Lars Vogel |
| Added Pagerank | ||
| Revision 0.3 | 16.11.2009 | Lars Vogel |
| Add Google File System | ||
| Revision 0.4 | 09.01.2010 | Lars Vogel |
| Update on bigtable description | ||
| Revision 0.5 | 10.01.2010 | Lars Vogel |
| Added link to Google research papers | ||
| Revision 0.6 | 11.01.2010 | Lars Vogel |
| Added mapreduce | ||
Google uses an amazing set of technologies. I use this article to collect information about these technologies and to keep pointers to publicly available material about them.
PageRank is the importance weight that Google calculates for a web page.
The exact calculation is not public and involves many factors, but the principle is simple. A web page is more important if other web pages link to it. The more important a web page is (through inbound links from other sites), the more weight its outbound links carry.
A good description of the PageRank calculation is available in the following article: http://www.ams.org/featurecolumn/archive/pagerank.html .
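The principle above can be illustrated with a small power iteration. This is only a sketch of the textbook form of PageRank, not Google's actual calculation: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count and the four-page example graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration (illustrative only).

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small base share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page without outbound links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" has the most inbound links, "d" has none.
web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(web)
```

In this toy graph the page with the most inbound links ends up with the highest rank, and the page nobody links to ends up with the lowest.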
MapReduce is a programming model for solving large-scale data problems in a functional style. It is based on the definition of a map function and a reduce function.
The map function processes key/value pairs to generate a different set of intermediate key/value pairs.
The reduce function merges all intermediate values associated with the same intermediate key.
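The interplay of the two functions can be sketched in a few lines of single-process Python, using a word count as the example. The driver `map_reduce` and the names `count_map` and `count_reduce` are hypothetical helpers for this illustration; a real MapReduce system distributes the same steps over many machines.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to each (key, value) input, group the intermediate
    pairs by key, then apply reduce_fn to each group."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def count_map(doc_name, text):
    # Emit the intermediate pair (word, 1) for every word in the document.
    for word in text.lower().split():
        yield word, 1


def count_reduce(word, counts):
    # Merge all intermediate values that share the same word.
    return sum(counts)


documents = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
word_counts = map_reduce(documents, count_map, count_reduce)
```

The grouping step in the middle is what guarantees that the reduce function sees all values for one intermediate key together.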
Typical applications are:
Distributed Grep
Reverse Web-Link Graph: The map function outputs (target URL, source URL) pairs from an input web page (the source). The reduce function concatenates the list of all source URLs associated with a given target URL and returns (target, list(sources)).
Word count in a number of documents
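As an illustration of one of these applications, the reverse web-link graph can be sketched as follows. The sketch assumes that the outbound links of each page have already been extracted; the function names and the example URLs are hypothetical, and the grouping step that a MapReduce framework would perform is written out by hand.

```python
from collections import defaultdict


def link_map(source_url, outbound_links):
    # Emit one (target, source) pair per link found on the source page.
    for target in outbound_links:
        yield target, source_url


def link_reduce(target_url, sources):
    # Collect all sources that link to the same target.
    return target_url, sorted(sources)


# Hypothetical input: page URL -> links found on that page.
pages = {
    "a.example/index": ["b.example/home", "c.example/docs"],
    "b.example/home": ["c.example/docs"],
}

# Group the intermediate pairs by target URL (done by the framework in practice).
grouped = defaultdict(list)
for source, links in pages.items():
    for target, src in link_map(source, links):
        grouped[target].append(src)

reverse_graph = dict(link_reduce(t, s) for t, s in grouped.items())
```

The result maps every target URL to the list of pages that link to it, which is the (target, list(sources)) output described above.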
MapReduce at Google is described in the Google MapReduce Research paper .
A Java implementation (Apache Hadoop) is available and is described in the MapReduce Tutorial with Apache Hadoop .
Google uses the Google File System (GFS), a distributed file system built for large, multi-gigabyte files. This file system is described in the Google File System Whitepaper .
GFS does not handle the replication between different data centers.
Google uses a facility called Bigtable as its data storage. Bigtable is not a relational database.
Bigtable is a distributed, persistent, multidimensional sorted map. In Bigtable you can store strings under an index which consists of a row key, a column key and a timestamp. This index points to an uninterpreted array of bytes (a string). According to the Bigtable paper, row keys are arbitrary strings of up to 64 KB.
(row:string, column:string, time:int64) -> string
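To make the mapping above more concrete, here is a minimal in-memory sketch of a sorted map keyed by row, column and timestamp. The class `TinyTable` and its methods are hypothetical and only model the data structure; they say nothing about how Bigtable distributes or persists its data.

```python
import bisect


class TinyTable:
    """In-memory stand-in for a (row, column, timestamp) -> bytes sorted map."""

    def __init__(self):
        self._keys = []   # sorted list of (row, column, -timestamp)
        self._cells = {}  # key tuple -> value bytes

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negated so newer versions sort first
        if key not in self._cells:
            bisect.insort(self._keys, key)
        self._cells[key] = value

    def get_latest(self, row, column):
        # The first key at or after (row, column, -inf) is the newest version.
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._cells[self._keys[i]]
        return None

    def scan_rows(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._cells[self._keys[i]]
            i += 1


table = TinyTable()
table.put("row-a", "contents", 1, b"first version")
table.put("row-a", "contents", 2, b"second version")
table.put("row-b", "contents", 1, b"other row")
latest = table.get_latest("row-a", "contents")
```

Because the keys are kept sorted, reading the newest version of a cell and scanning a range of neighboring rows are both cheap, which is the main benefit of a sorted map over a plain hash map.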
The key can be a numeric ID generated by the database or it can be created by the application. This also applies to the timestamp. If the application is responsible for creating the key, it must ensure that the key is unique.
For example, in the Google Webtable (used for Google search) the reversed URL is used as the row key, the columns hold different attributes of the web page, and the timestamp indicates when the data was retrieved. The data this key points to is some content from the web page.
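A reversed URL row key of this kind could be built as in the following sketch. The function name `webtable_row_key` and the example URLs are hypothetical; the sketch assumes that reversing means reversing the dot-separated components of the host name, so that pages from the same domain end up next to each other in the sorted map.

```python
from urllib.parse import urlsplit


def webtable_row_key(url):
    """Reverse the host name components so pages of one domain sort together."""
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    return reversed_host + parts.path


urls = [
    "http://maps.example.com/index.html",
    "http://www.example.org/",
    "http://www.example.com/about",
]
row_keys = sorted(webtable_row_key(u) for u in urls)
```

After sorting, both `example.com` pages are adjacent, while with the original host names `maps.example.com` and `www.example.com` would be separated by every host that sorts between them.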
Bigtable is built upon the Google File System and its data is stored in an immutable data structure called SSTable. The application can define how many entries, based on the timestamp, should be kept. Alternatively, the application can specify how long entries should be kept. Bigtable cleans up obsolete data by deleting the SSTables which only contain irrelevant data, using a mark-and-sweep algorithm.
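The two retention settings can be illustrated with a small function that decides which versions of a single cell survive. This is only a sketch of the policy: the name `prune_versions` and its parameters are hypothetical, and it does not model SSTables or the mark-and-sweep clean-up itself.

```python
def prune_versions(versions, max_versions=None, max_age=None, now=None):
    """Return the versions of one cell that survive the retention settings.

    versions is a list of (timestamp, value) pairs. Either keep only the
    newest max_versions entries, or keep entries not older than max_age.
    """
    kept = sorted(versions, key=lambda v: v[0], reverse=True)  # newest first
    if max_age is not None:
        kept = [(ts, value) for ts, value in kept if now - ts <= max_age]
    if max_versions is not None:
        kept = kept[:max_versions]
    return kept


cell = [(100, b"a"), (200, b"b"), (300, b"c")]
by_count = prune_versions(cell, max_versions=2)
by_age = prune_versions(cell, max_age=50, now=320)
```

With `max_versions=2` the two newest versions are kept; with `max_age=50` and a current time of 320 only the version written at timestamp 300 is kept.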
For more information on Bigtable see the Google Whitepaper for Bigtable .
On the Google App Engine, Google provides memcache as a caching mechanism.
Memcache is a high-performance, distributed memory object caching system, primarily intended for fast access to cached results of datastore queries.
Similar to Bigtable, it works like a map of keys to objects. If the memory consumption of memcache becomes too big, memory is released automatically based on a Least Recently Used (LRU) strategy.
Google provides an API to put something into memcache and to remove something again from memcache.
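The map-like behavior and the LRU eviction described above can be sketched with a small in-memory cache. The class `LruCache` is a hypothetical stand-in and not the App Engine memcache API; as a simplification it limits the number of entries instead of the memory consumed.

```python
from collections import OrderedDict


class LruCache:
    """Tiny key/object cache that evicts the least recently used entry."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()  # oldest entry first, newest last

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)  # writing counts as a use
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the least recently used entry

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # reading counts as a use
        return self._data[key]

    def delete(self, key):
        if key in self._data:
            del self._data[key]
            return True
        return False


cache = LruCache(max_entries=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "a" is now more recently used than "b"
cache.set("c", 3)  # the cache is full, so "b" is evicted
```

A caller must always be prepared for `get` to return nothing, because an entry can be evicted at any time; the cached value then has to be recomputed or read again from the datastore.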