Notes: Data

http://en.wikipedia.org/wiki/Machine-readable_data

Machine-readable data is data (or metadata) which is in a format that can be understood by a computer.

There are two types:

  • human-readable data that is marked up so that it can also be read by machines (examples: microformats, RDFa)
  • data file formats intended principally for machines (RDF, XML, JSON)

For purposes of implementation of the GPRA Modernization Act (GPRAMA), the Office of Management and Budget (OMB) defines “machine readable” as follows: “Format in a standard computer language (not English text) that can be read automatically by a web browser or computer system. (e.g.; xml). Traditional word processing documents, hypertext markup language (HTML) and portable document format (PDF) files are easily read by humans but typically are difficult for machines to interpret. Other formats such as extensible markup language (XML), (JSON), or spreadsheets with header columns that can be exported as comma separated values (CSV) are machine readable formats. It is possible to make traditional word processing documents and other formats machine readable but the documents must include enhanced structural elements.”[1]

Publishing public data in an open standard, machine-readable format is a best practice (good operating practice).

Presentational markup
The kind of markup used by traditional word-processing systems: binary codes embedded within document text that produce the WYSIWYG effect. Such markup is usually designed to be hidden from human users, even those who are authors or editors.
Procedural markup
Markup is embedded in text and provides instructions for programs that are to process the text. Well-known examples include troff, LaTeX, and PostScript. It is expected that the processor will run through the text from beginning to end, following the instructions as encountered. Text with such markup is often edited with the markup visible and directly manipulated by the author. Popular procedural-markup systems usually include programming constructs, so macros or subroutines can be defined and invoked by name.
Descriptive markup
Markup is used to label parts of the document rather than to provide specific instructions as to how they should be processed. The objective is to decouple the inherent structure of the document from any particular treatment or rendition of it. Such markup is often described as “semantic”. An example of descriptive markup would be HTML’s <cite> tag, which is used to label a citation. Descriptive markup — sometimes called logical markup or conceptual markup — encourages authors to write in a way that describes the material conceptually, rather than visually.[3]

Root element

  • The root element of an XHTML document must be html
  • must contain an xmlns attribute to associate it with the XHTML namespace
  • The namespace URI for XHTML is http://www.w3.org/1999/xhtml
  • xml:lang attribute to identify the document with a natural language
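For illustration, a minimal XHTML root element carrying the namespace and language attributes just listed:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  ...
</html>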

Aryabhata (Sanskrit: आर्यभट; IAST: Āryabhaṭa) or Aryabhata I[1][2] (476–550 CE)[3][4] was the first in the line of great mathematician-astronomers from the classical age of Indian mathematics and Indian astronomy. His works include the Āryabhaṭīya (499 CE, when he was 23 years old)[5] and the Arya-siddhanta.

Georges Ifrah concludes in his Universal History of Numbers:

Thus it would seem highly probable under the circumstances that the discovery of zero and the place-value system were inventions unique to the Indian civilization. As the Brahmi notation of the first nine whole numbers (incontestably the graphical origin of our present-day numerals and of all the decimal numeral systems in use in India, Southeast and Central Asia and the Near East) was autochthonous and free of any outside influence, there can be no doubt that our decimal place-value system was born in India and was the product of Indian civilization alone.[3]

Aryabhata stated “sthānam sthānam daśa guṇam” meaning “From place to place, ten times in value”. Indian mathematicians and astronomers also developed Sanskrit positional number words to describe astronomical facts or algorithms using poetic sutras. A key argument against the positional system was its susceptibility to easy fraud by simply putting a number at the beginning or end of a quantity, thereby changing (e.g.) 100 into 5100, or 100 into 1000. Modern cheques require a natural language spelling of an amount, as well as the decimal amount itself, to prevent such fraud. For the same reason the Chinese also use natural language numerals, for example 100 is written as 壹佰, which can never be forged into 壹仟(1000) or 伍仟壹佰(5100).

Babylonian tablet YBC 7289 showing the sexagesimal number 1;24,51,10 approximating √2

The square root of 2, the length of the diagonal of a unit square, was approximated by the Babylonians of the Old Babylonian Period (1900 BC – 1650 BC) as

1;24,51,10 = 1 + 24/60 + 51/60² + 10/60³ = 30547/21600 ≈ 1.414212… [15]

Because √2 is an irrational number, it cannot be expressed exactly in sexagesimal numbers, but its sexagesimal expansion does begin 1;24,51,10,7,46,6,4,44 …
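A quick way to see where those digits come from is to peel them off numerically. A minimal C++ sketch (double precision is only trustworthy for the first several places):

#include <cmath>
#include <cstdio>

// Print the leading sexagesimal digits of sqrt(2): 1;24,51,10,7
int main() {
    double x = std::sqrt(2.0);
    int whole = (int)x;
    std::printf("%d;", whole);            // integer part: 1
    double frac = x - whole;
    for (int i = 0; i < 4; ++i) {
        frac *= 60.0;                     // shift one base-60 place
        int digit = (int)frac;
        std::printf("%d%s", digit, i < 3 ? "," : "\n");
        frac -= digit;
    }
}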

The length of the tropical year in Neo-Babylonian astronomy (see Hipparchus), 365.24579… days, can be expressed in sexagesimal as 6,5;14,44,51 (6×60 + 5 + 14/60 + 44/60² + 51/60³) days. The average length of a year in the Gregorian calendar is exactly 6,5;14,33 in the same notation because the values 14 and 33 were the first two values for the tropical year from the Alfonsine tables, which were in sexagesimal notation.

The value of π as used by the Greek mathematician and scientist Claudius Ptolemaeus (Ptolemy) was 3;8,30 = 3 + 8/60 + 30/60² = 377/120 ≈ 3.141666….[16] Jamshīd al-Kāshī, a 15th-century Persian mathematician, calculated π in sexagesimal numbers to an accuracy of nine sexagesimal digits; his value for 2π was 6;16,59,28,1,34,51,46,14,50.[17][18]

In computing, the binary (base-2) and hexadecimal (base-16) bases are used. Computers, at the most basic level, deal only with sequences of conventional zeroes and ones, thus it is easier in this sense to deal with powers of two. The hexadecimal system is used as “shorthand” for binary—every 4 binary digits (bits) relate to one and only one hexadecimal digit. In hexadecimal, the six digits after 9 are denoted by A, B, C, D, E, and F (and sometimes a, b, c, d, e, and f).

The octal numbering system is also used as another way to represent binary numbers. In this case the base is 8 and therefore only digits 0, 1, 2, 3, 4, 5, 6, and 7 are used. When converting from binary to octal every 3 bits relate to one and only one octal digit.
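The digit-grouping correspondence is easy to see with the standard stream manipulators; a small sketch:

#include <bitset>
#include <iostream>

int main() {
    unsigned value = 0x1BD;                       // 445 in decimal
    std::cout << std::bitset<12>(value) << "\n";  // 000110111101 (binary)
    std::cout << std::oct << value << "\n";       // 675: one digit per 3 bits
    std::cout << std::hex << value << "\n";       // 1bd: one digit per 4 bits
}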

Base-12 systems (duodecimal or dozenal) have been popular because multiplication and division are easier than in base-10, with addition and subtraction being just as easy. Twelve is a useful base because it has many factors. It is the smallest common multiple of one, two, three, four and six. There is still a special word for “dozen” in English, and by analogy with the word for 102, hundred, commerce developed a word for 122, gross. The standard 12-hour clock and common use of 12 in English units emphasize the utility of the base. In addition, prior to its conversion to decimal, the old British currency Pound Sterling (GBP) partially used base-12; there were 12 pence (d) in a shilling (s), 20 shillings in a pound (£), and therefore 240 pence in a pound. Hence the term LSD or, more properly, £sd.

The Maya civilization and other civilizations of pre-Columbian Mesoamerica used base-20 (vigesimal), as did several North American tribes (two being in southern California). Evidence of base-20 counting systems is also found in the languages of central and western Africa.

Remnants of a Gaulish base-20 system also exist in French, as seen today in the names of the numbers from 60 through 99. For example, sixty-five is soixante-cinq (literally, “sixty [and] five”), while seventy-five is soixante-quinze (literally, “sixty [and] fifteen”). Furthermore, for any number between 80 and 99, the “tens-column” number is expressed as a multiple of twenty (somewhat similar to the archaic English manner of speaking of “scores”, probably originating from the same underlying Celtic system). For example, eighty-two is quatre-vingt-deux (literally, four twenty[s] [and] two), while ninety-two is quatre-vingt-douze (literally, four twenty[s] [and] twelve). In Old French, forty was expressed as two twenties and sixty was three twenties, so that fifty-three was expressed as two twenties [and] thirteen, and so on.

The Irish language also used base-20 in the past, twenty being fichid, forty dhá fhichid, sixty trí fhichid and eighty ceithre fhichid. A remnant of this system may be seen in the modern word for 40, daichead.

The Welsh language continues to use a base-20 counting system, particularly for the age of people, dates and in common phrases. 15 is also important, with 16–19 being “one on 15”, “two on 15” etc. 18 is normally “two nines”. A decimal system is commonly used.

Danish numerals display a similar base-20 structure.

The Maori language of New Zealand also has evidence of an underlying base-20 system as seen in the terms Te Hokowhitu a Tu referring to a war party (literally “the seven 20s of Tu”) and Tama-hokotahi, referring to a great warrior (“the one man equal to 20”).

The binary system was used in the Egyptian Old Kingdom, 3000 BC to 2050 BC. It was cursive, rounding off rational numbers smaller than 1 to 1/2 + 1/4 + 1/8 + 1/16 + 1/32 + 1/64, with a 1/64 term thrown away (the system was called the Eye of Horus).

A number of Australian Aboriginal languages employ binary or binary-like counting systems. For example, in Kala Lagaw Ya, the numbers one through six are urapon, ukasar, ukasar-urapon, ukasar-ukasar, ukasar-ukasar-urapon, ukasar-ukasar-ukasar.

North and Central American natives used base-4 (quaternary) to represent the four cardinal directions. Mesoamericans tended to add a second base-5 system to create a modified base-20 system.

A base-5 system (quinary) has been used in many cultures for counting. Plainly it is based on the number of digits on a human hand. It may also be regarded as a sub-base of other bases, such as base-10, base-20, and base-60.

A base-8 system (octal) was devised by the Yuki tribe of Northern California, who used the spaces between the fingers to count, corresponding to the digits one through eight.[6] There is also linguistic evidence which suggests that the Bronze Age Proto-Indo-Europeans (from whom most European and Indic languages descend) might have replaced a base-8 system (or a system which could only count up to 8) with a base-10 system. The evidence is that the word for 9, newm, is suggested by some to derive from the word for “new”, newo-, suggesting that the number 9 had been recently invented and called the “new number”.[7]

Many ancient counting systems use five as a primary base, almost surely coming from the number of fingers on a person’s hand. Often these systems are supplemented with a secondary base, sometimes ten, sometimes twenty. In some African languages the word for five is the same as “hand” or “fist” (Dyola language of Guinea-Bissau, Banda language of Central Africa). Counting continues by adding 1, 2, 3, or 4 to combinations of 5, until the secondary base is reached. In the case of twenty, this word often means “man complete”. This system is referred to as quinquavigesimal. It is found in many languages of the Sudan region.

The Telefol language, spoken in Papua New Guinea, is notable for possessing a base-27 numeral system.

Each position does not need to be positional itself. Babylonian sexagesimal numerals were positional, but in each position were groups of two kinds of wedges representing ones and tens (a narrow vertical wedge ( | ) and an open left pointing wedge (<))—up to 14 symbols per position (5 tens (<<<<<) and 9 ones ( ||||||||| ) grouped into one or two near squares containing up to three tiers of symbols, or a place holder (\) for the lack of a position).[8] Hellenistic astronomers used one or two alphabetic Greek numerals for each position (one chosen from 5 letters representing 10–50 and/or one chosen from 9 letters representing 1–9, or a zero symbol).[9]

The bijective base-10 system is also known as decimal without a zero. It is a base ten positional numeral system that does not use a digit to represent zero. It instead has a digit to represent ten, such as A.

The system of common numerals used in ancient Greece prior to the Hellenistic Age was a bijective base-10 number system in which letters of the Greek alphabet were assigned values between 1 and 900. This was the system used to reckon the year based on the four-year Olympiads, so for instance 480 BCE (the date of the Battle of Thermopylae) would be written ἔτει αʹ Ὀλυμπιάδος οδʹ, that is, the 1st year of the 74th Olympiad. These numbers are still commonly used in Greece for ordinals.

The bijective base-26 system is also known as base 26 without a zero. It is a base twenty-six positional numeral system that does not use a digit to represent zero. It uses digits “A” to “Z” to represent one to twenty-six.

The number sequence goes A, B, C, …, X, Y, Z, AA, AB, AC, …, AX, AY, AZ, BA, BB, BC, …

Each digit position represents a power of twenty-six, so for example, ABC is “one 26², plus two 26¹, plus three 26⁰” since A is worth one, B is worth two, and C is worth three.

In this representation the number WIKIPEDIA is:

23×26⁸ + 9×26⁷ + 11×26⁶ + 9×26⁵ + 16×26⁴ + 5×26³ + 4×26² + 9×26¹ + 1×26⁰ = 4,878,821,185,187

Many spreadsheets including Microsoft Excel use the 26-adic counting system with the “digits” A–Z to label the columns of a spreadsheet, starting A, B, C, …, Z, AA, AB, …, AZ, BA, …, ZZ, AAA, etc. The numbering starts at 1 or A, so the empty string is not used. A variant of this system is used to name variable stars; it can be applied to any problem where a systematic naming using letters is desired, while using the shortest possible strings.
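A sketch of that column numbering (bijective base-26); the n -= 1 step is exactly what removes the zero digit:

#include <iostream>
#include <string>

// Convert 1, 2, 3, ... to A, B, ..., Z, AA, AB, ... (bijective base-26).
std::string columnName(unsigned n) {
    std::string name;
    while (n > 0) {
        n -= 1;  // shift the digit into 0..25: there is no zero digit
        name.insert(name.begin(), static_cast<char>('A' + n % 26));
        n /= 26;
    }
    return name;
}

int main() {
    std::cout << columnName(1) << " "      // A
              << columnName(26) << " "     // Z
              << columnName(27) << " "     // AA
              << columnName(703) << "\n";  // AAA
}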

The fact that every non-negative integer has a unique representation in bijective base-k (k ≥ 1), is a “folk theorem” that has been rediscovered many times. Early instances are Foster, J. E. (1947), Smullyan (1961) for the case k = 2, and Böhm (1964) for all k ≥ 1 (the latter using these representations to perform computations in the programming language P′′). Knuth (1969) mentions the special case of k = 10, and Salomaa (1973) discusses the cases k ≥ 2. Forslund (1995) considers that if ancient numeration systems used bijective base-k, they might not be recognized as such in archaeological documents, due to general unfamiliarity with this system. (The latter article is notable in that it does not cite existing literature on this system, but appears to unwittingly reinvent it.)

HASH TABLE
TYPE Unordered associative array
INVENTED 1953

The idea of hashing arose independently in different places. In January 1953, H. P. Luhn wrote an internal IBM memorandum that used hashing with chaining.[23] G. N. Amdahl, E. M. Boehme, N. Rochester, and Arthur Samuel implemented a program using hashing at about the same time. Open addressing with linear probing (relatively prime stepping) is credited to Amdahl, but Ershov (in Russia) had the same idea.[23]

BINARY SEARCH TREE
TYPE Tree
INVENTED 1960
INVENTED BY P.F. Windley, A.D. Booth, A.J.T. Colin, and T.N. Hibbard

ftp://ftp.math.utah.edu/pub/tex/bib/idx/compj/3/2/84-88.html

Entry Windley:1960:TFR from compj.bib

@Article{Windley:1960:TFR,
  author = "P. F. Windley",
  title = "Trees, forests and rearranging",
}

Windley, P. F., 2(1)47, 2(2)49, 3(1)

@Article{Windley:1959:TMD,
  author = "P. F. Windley",
  title = "Transposing matrices in a digital computer",
}

@Article{Windley:1959:ISA,
  author = "P. F. Windley",
  title = "The influence of storage access time on merging processes in a computer",
}

AVL TREE
TYPE Tree
INVENTED 1962
INVENTED BY G. M. Adelson-Velskii and E. M. Landis

The AVL tree is named after its two Soviet inventors, G. M. Adelson-Velskii and E. M. Landis, who published it in their 1962 paper “An algorithm for the organization of information”.[2] It was the first such data structure to be invented.

A B-tree of order 2 (Bayer & McCreight 1972) or order 5 (Knuth 1998).

Etymology unknown

Rudolf Bayer and Ed McCreight invented the B-tree while working at Boeing Research Labs in 1971 (Bayer & McCreight 1972), but they did not explain what, if anything, the B stands for. Douglas Comer explains:

The origin of “B-tree” has never been explained by the authors. As we shall see, “balanced,” “broad,” or “bushy” might apply. Others suggest that the “B” stands for Boeing. Because of his contributions, however, it seems appropriate to think of B-trees as “Bayer”-trees. (Comer 1979, p. 123 footnote 1)

Donald Knuth speculates on the etymology of B-trees in his May 1980 lecture on the topic “CS144C classroom lecture about disk storage and B-trees”, suggesting the “B” may have originated from Boeing or from Bayer’s name.[2]

After a talk at CPM 2013 (24th Annual Symposium on Combinatorial Pattern Matching, Bad Herrenalb, Germany, June 17–19, 2013), Ed McCreight answered a question on B-tree’s name by Martin Farach-Colton saying: “Bayer and I were in a lunch time where we get to think a name. And we were, so, B, we were thinking… B is, you know… We were working for Boeing at the time, we couldn’t use the name without talking to lawyers. So, there is a B. It has to do with balance, another B. Bayer was the senior author, who did have several years older than I am and had many more publications than I did. So there is another B. And so, at the lunch table we never did resolve whether there was one of those that made more sense than the rest. What really lives to say is: the more you think about what the B in B-trees means, the better you understand B-trees.”[3]

An ancestry chart which maps to a perfect depth-4 binary tree


A singly linked list whose nodes contain two fields: an integer value and a link to the next node. The last node is linked to a terminator used to signify the end of the list.
A doubly linked list whose nodes contain three fields: an integer value, the link forward to the next node, and the link backward to the previous node.
A circular linked list.




Diagram of a red–black binary tree: the black root node has two red children and four black grandchildren; the child nodes of the grandchildren are black nil pointers or red nodes with black nil pointers.








A small complete binary tree stored in an array.
An example of converting an n-ary tree to a binary tree.

Data rectangles organized in a Hilbert R-tree

Collection

  • shared significance
  • problem being solved
  • operated upon together
  • controlled

Generally, the data items will be of the same type or, in languages supporting inheritance, derived from some common ancestor type.

  • lists, sets, bags (or multisets),
  • trees and graphs.
  • An enumerated type may be either a list or a set.
  • Variable-sized arrays are generally considered collections, and fixed-size arrays may likewise be considered a collection, albeit with limitations.

A fixed-size table (or array) is usually not considered a collection because it holds a fixed number of items, although tables/arrays commonly play a role in the implementation of collections.

Linear Collection

data in a line, ordered in some way, with access to one or both ends.

List

Duplicate data items are permitted.

  • list or sequence
  • abstract data type
  • finite ordered collection of values
  • where the same value may occur more than once
  • concept of a finite sequence;
  • the (potentially) infinite analog of a list is a stream.
  • Lists are a basic example of containers, as they contain other values.
  • Each instance of a value in the list is usually called an item, entry, or element of the list; if the same value occurs multiple times, each occurrence is considered a distinct item
  • lists only allow sequential access,
  • arrays allow random access
  • queue or FIFO special case of the list collection
  • stack or LIFO special case of the list collection

Other specialized operations on lists include sorting, where, again, the order of items is of great importance.

Priority queues: minimum or maximum at the head, while the remaining elements are kept in a bag.

Some languages do not offer a list data structure, but offer the use of associative arrays or some kind of table to emulate lists.

Associative Collection

  • sets
  • multisets
  • Associative arrays

Other collections can instead be interpreted as a sort of function: given an input “key”, the collection yields an output value.

A set can be interpreted as a specialized multiset, which in turn is a specialized map, in each case by limiting the possible values – considering a set as represented by its indicator function.

Sets = unordered, unique elements.

Some languages support sets directly; in others, sets can be implemented by a hash table with dummy values, where only the keys are used in representing the set.

bag = multiset

Associative Array

  • map
  • dictionary
  • lookup table
  • The “value” might be a reference to a compound data structure.
  • A hash table is usually an efficient implementation, and thus this data type is often known as a “hash”.

Hash table

load factor = n/k

where n is the number of entries and k is the number of buckets.

  • If the load factor is kept reasonable, the hash table should perform well, provided the hashing is good.
  • If the load factor grows too large, the hash table will become slow, or it may fail to work (depending on the method used).
  • The expected constant-time property of a hash table assumes that the load factor is kept below some bound.
  • With a fixed number of buckets, the time for a lookup grows with the number of entries and so does not achieve the desired constant time.
  • The variance of the number of entries per bucket also matters.
  • A low load factor is not especially beneficial: as it approaches zero, unused space grows without any reduction in search cost (see the sketch below).
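std::unordered_map exposes these quantities directly, which makes the n/k definition concrete (a minimal sketch):

#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> table;
    table.max_load_factor(0.75f);   // rehash before n/k exceeds 0.75
    for (int i = 0; i < 1000; ++i)
        table[i] = i * i;
    std::cout << "entries n = " << table.size()
              << ", buckets k = " << table.bucket_count()
              << ", load factor n/k = " << table.load_factor() << "\n";
}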

Linked List

  • group of nodes which together represent a sequence.
  • each node is composed of a datum and a reference (in other words, a link) to the next node
  • more complex variants add additional links.
  • This structure allows for efficient insertion or removal of elements from any position in the sequence.
  • Linked lists are among the simplest and most common data structures.
  • They can be used to implement several other common abstract data types, including lists (the abstract data type), stacks, queues, associative arrays, and S-expressions, though it is not uncommon to implement the other data structures directly without using a list as the basis of implementation.
  • list elements can easily be inserted or removed without reallocation or reorganization of the entire structure
  • allow insertion and removal of nodes at any point in the list, and can do so with a constant number of operations
  • On the other hand, simple linked lists by themselves do not allow random access to the data, or any form of efficient indexing. Thus, many basic operations may require scanning most or all of the list elements
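A minimal singly linked list sketch matching the description above; insertion at the front is a constant number of pointer updates, but lookup must scan:

#include <iostream>

struct Node {
    int value;
    Node* next;  // link to the next node; nullptr terminates the list
};

// Insert a new node at the front: O(1), nothing else is moved.
Node* pushFront(Node* head, int value) {
    return new Node{value, head};
}

int main() {
    Node* head = nullptr;
    for (int v : {3, 2, 1})
        head = pushFront(head, v);
    for (Node* p = head; p; p = p->next)
        std::cout << p->value << " ";     // prints: 1 2 3
    std::cout << "\n";
    while (head) {                        // free the nodes
        Node* next = head->next;
        delete head;
        head = next;
    }
}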

Queue

 a collection in which the entities in the collection are kept in order

First-In-First-Out (FIFO) data structure:

  • principal (or only) operations on the collection are
  • addition of entities to the rear terminal position, known as enqueue
  • removal of entities from the front terminal position, known as dequeue.
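In std::queue terms, push is the enqueue and pop is the dequeue (a trivial sketch):

#include <iostream>
#include <queue>

int main() {
    std::queue<int> q;
    q.push(1);  // enqueue at the rear
    q.push(2);
    q.push(3);
    while (!q.empty()) {
        std::cout << q.front() << " ";  // 1 2 3: first in, first out
        q.pop();                        // dequeue from the front
    }
    std::cout << "\n";
}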

Tree

Elements

  • Root • the topmost node in a tree
  • Parent (or ancestor node, or superior) • a node that has a child
  • Siblings • nodes with the same parent
  • Leaves (external node, outer node, leaf node, or terminal node) • nodes with no children
  • Internal nodes (inner node, inode for short, or branch node) • nodes with at least one child
  • Degree • the number of subtrees of a node
  • Edge • the connection between one node and another
  • Level • the level of a node is defined by initially letting the root be at level 1
  • Forest • a forest is a set of n ≥ 0 disjoint trees

by convention, trees are drawn growing downwards

Binary tree

  • The number of nodes n in a perfect binary tree can be found using this formula: n = 2^(h+1) − 1, where h is the depth of the tree.
  • The number of nodes n in a binary tree of height h is at least n = h + 1 and at most n = 2^(h+1) − 1, where h is the depth of the tree.
  • The number of leaf nodes l in a perfect binary tree can be found using this formula: l = 2^h, where h is the depth of the tree.
  • The number of nodes n in a perfect binary tree can also be found using this formula: n = 2l − 1, where l is the number of leaf nodes in the tree.
  • The number of null links (i.e., absent children of nodes) in a complete binary tree of n nodes is n + 1.
  • The number of internal nodes (i.e., non-leaf nodes) in a complete binary tree of n nodes is ⌊n/2⌋.
  • For any non-empty binary tree with n₀ leaf nodes and n₂ nodes of degree 2, n₀ = n₂ + 1.[6]
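For example, a perfect binary tree of depth h = 3 has 2⁴ − 1 = 15 nodes, of which 2³ = 8 are leaves and ⌊15/2⌋ = 7 are internal; indeed n₀ = 8 = n₂ + 1 = 7 + 1.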

Binary search tree

  • binary search tree (BST) also called an ordered or sorted binary tree
  • node-based binary tree
  • each node has a comparable key (and an associated value)
  • satisfies the restriction that the key in any node is larger than the keys in all nodes in that node’s left subtree and smaller than the keys in all nodes in that node’s right sub-tree
  • Each node has no more than two child nodes
  • Each child must either be a leaf node or the root of another binary search tree
  • The left sub-tree contains only nodes with keys less than the parent node
  • the right sub-tree contains only nodes with keys greater than the parent node.
  • BSTs are also dynamic data structures, and the size of a BST is only limited by the amount of free memory in the operating system.
  • binary search trees remain ordered, which provides quicker search times than many other data structures.
  • A binary search tree is fast at insertion and deletion when balanced.
  • Very efficient, and its code is simpler than that of many other data structures.
  • Stores keys in the nodes in a way that searching, insertion and deletion can be done efficiently.
  • Implementation is very simple in binary search trees.
  • Nodes in the tree are dynamic in nature.
  • Generally, the information represented by each node is a record rather than a single data element.
  • for sequencing purposes, nodes are compared according to their keys rather than any part of their associated records.
  • sorting algorithms and search algorithms such as in-order traversal can be very efficient
  • Binary search trees are a fundamental data structure used to construct more abstract data structures such as sets, multisets, and associative arrays.

The common properties of binary search trees are as follows:[1]

  • The left subtree of a node contains only nodes with keys less than the node’s key.
  • The right subtree of a node contains only nodes with keys greater than the node’s key.
  • The left and right subtree each must also be a binary search tree.
  • Each node can have up to two successor nodes.
  • There must be no duplicate nodes.
  • A unique path exists from the root to every other node.
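A minimal BST sketch illustrating these properties (smaller keys go left, larger go right, duplicates are rejected); the names here are illustrative only:

#include <iostream>

struct Node {
    int key;
    Node* left = nullptr;
    Node* right = nullptr;
};

// Insert a key, preserving the ordering property; duplicates are ignored.
Node* insert(Node* root, int key) {
    if (!root) return new Node{key};
    if (key < root->key)
        root->left = insert(root->left, key);
    else if (key > root->key)
        root->right = insert(root->right, key);
    return root;  // key already present: no change
}

bool contains(const Node* root, int key) {
    while (root && root->key != key)
        root = key < root->key ? root->left : root->right;
    return root != nullptr;
}

int main() {
    Node* root = nullptr;
    for (int k : {8, 3, 10, 1, 6})
        root = insert(root, k);
    std::cout << contains(root, 6) << contains(root, 7) << "\n";  // prints: 10
    // (nodes intentionally leaked in this tiny demo)
}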

Some of their disadvantages are as follows:

  • The shape of the binary search tree totally depends on the order of insertions, and it can be degenerated.
  • When inserting or searching for an element in a binary search tree, the key of each visited node has to be compared with the key of the element to be inserted or found, so searching an element can take a long time.
  • The keys in the binary search tree may be long and the run time may increase.

Binary tree: In short, a binary tree is a tree where each node has up to two children. In a binary tree, the left and right child nodes contain values which can be greater than, less than, or equal to the parent node’s.

Binary search tree: In a binary search tree, the left child contains only nodes with values less than the parent node, and the right child contains only nodes with values greater than or equal to the parent.

There are many types of binary search trees.

  • AVL trees
  • red-black trees
  • A splay tree automatically moves frequently accessed elements nearer to the root.
  • In a treap (tree heap), each node also holds a (randomly chosen) priority, and the parent node has higher priority than its children.
  • Tango trees are trees optimized for fast searches.

Degenerate tree

  • A complete binary tree is a tree with n levels, where for each level d ≤ n − 1, the number of existing nodes at level d is equal to 2^d.
  • This means all possible nodes exist at these levels.
  • An additional requirement for a complete binary tree is that for the nth level, while every node does not have to exist, the nodes that do exist must fill from left to right.
  • A degenerate tree is a tree where for each parent node, there is only one associated child node. It is unbalanced and, in the worst case, performance degrades to that of a linked list.
  • If your insertion function does not handle re-balancing, then you can easily construct a degenerate tree by feeding it data that is already sorted.
  • What this means is that in a performance measurement, the tree will essentially behave like a linked list data structure.

AVL tree

  • AVL tree (Adelson-Velskii and Landis’ tree, named after the inventors) is a self-balancing binary search tree.
  • It was the first such data structure to be invented.[1]
  • In an AVL tree, the heights of the two child subtrees of any node differ by at most one;
  • if at any time they differ by more than one, rebalancing is done to restore this property.
  • Lookup, insertion, and deletion all take O(log n) time in both the average and worst cases, where n is the number of nodes in the tree prior to the operation.
  • Insertions and deletions may require the tree to be rebalanced by one or more tree rotations.
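A small recursive checker for the height-balance property (a sketch with a hypothetical Node layout; it returns the subtree height, or −1 as soon as two sibling heights differ by more than one):

#include <algorithm>
#include <cstdlib>

struct Node {
    int key;
    Node* left;
    Node* right;
};

// Height of the subtree, or -1 if the AVL property is violated anywhere.
int avlHeight(const Node* n) {
    if (!n) return 0;
    int lh = avlHeight(n->left);
    int rh = avlHeight(n->right);
    if (lh < 0 || rh < 0 || std::abs(lh - rh) > 1)
        return -1;
    return 1 + std::max(lh, rh);
}

int main() {
    Node a{1, nullptr, nullptr}, c{3, nullptr, nullptr};
    Node b{2, &a, &c};                    // a balanced 3-node tree
    return avlHeight(&b) >= 0 ? 0 : 1;    // exit code 0: balanced
}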

Comparison Red/Black

  • AVL trees are often compared with red-black trees because both support the same set of operations and take O(log n) time for the basic operations.
  • For lookup-intensive applications, AVL trees are faster than red-black trees because they are more rigidly balanced.[3]
  • Similar to red-black trees, AVL trees are height-balanced.
  • Both are in general not weight-balanced, nor μ-balanced for any μ ≤ ½;[4] that is, sibling nodes can have hugely differing numbers of descendants.

AVL trees are more rigidly balanced than red-black trees, leading to slower insertion and removal but faster retrieval.

Red–black tree

In addition to the requirements imposed on a binary search tree the following must be satisfied by a red–black tree:[5]

  1. A node is either red or black.
  2. The root is black. (This rule is sometimes omitted. Since the root can always be changed from red to black, but not necessarily vice-versa, this rule has little effect on analysis.)
  3. All leaves (NIL) are black. (All leaves are the same color as the root.)
  4. Every red node must have two black child nodes.
  5. Every path from a given node to any of its descendant leaves contains the same number of black nodes.
  • painting each node of the tree with one of two colors (typically called ‘red’ and ‘black’) in a way that satisfies certain properties
  • collectively constrain how unbalanced the tree can become in the worst case.
  • When the tree is modified, the new tree is subsequently rearranged and repainted to restore the coloring properties, which can be done efficiently.
  • The balancing of the tree is not perfect but it is good enough to allow it to guarantee searching in O(log n) time, where n is the total number of elements in the tree.
  • The insertion and deletion operations, along with the tree rearrangement and recoloring, are also performed in O(log n) time.[1]
  • Tracking the color of each node requires only 1 bit of information per node because there are only two colors.
  • The tree does not contain any other data specific to its being a red–black tree so its memory footprint is almost identical to a classic (uncolored) binary search tree.
  • In many cases the additional bit of information can be stored at no additional memory cost.
  • organize pieces of comparable data, such as text fragments or numbers.
  • leaf nodes do not contain data.
  • leaf nodes need not be explicit in computer memory—a null child pointer can encode the fact that this child is a leaf—but it simplifies some algorithms for operating on red–black trees if the leaves really are explicit nodes.
  • To save memory, sometimes a single sentinel node performs the role of all leaf nodes; all references from internal nodes to leaf nodes then point to the sentinel node.
  • allow efficient in-order traversal (that is: in the order Left–Root–Right) of their elements.
  • Search time is proportional to the traversal from root to leaf, so a balanced tree of n nodes, having the least possible tree height, results in O(log n) search time.
  • These constraints enforce a critical property of red–black trees: that the path from the root to the furthest leaf is no more than twice as long as the path from the root to the nearest leaf.
  • The result is that the tree is roughly height-balanced.
  • Since operations such as inserting, deleting, and finding values require worst-case time proportional to the height of the tree, this theoretical upper bound on the height allows red–black trees to be efficient in the worst case, unlike ordinary binary search trees.
  • The shortest possible path has all black nodes, and the longest possible path alternates between red and black nodes.
  • Since all maximal paths have the same number of black nodes, by property 5, this shows that no path is more than twice as long as any other path.
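A sketch of a checker for properties 1, 4 and 5, treating null children as black NIL leaves (hypothetical node layout; it returns the black-height, or −1 on violation):

#include <cstdio>

struct Node {
    int key;
    bool red;     // property 1: every node is red or black
    Node* left;
    Node* right;
};

// Black-height of the subtree, or -1 if a red node has a red child
// (property 4) or two paths disagree on black counts (property 5).
int blackHeight(const Node* n) {
    if (!n) return 1;  // NIL leaf counts as one black node
    if (n->red && ((n->left && n->left->red) || (n->right && n->right->red)))
        return -1;
    int lh = blackHeight(n->left);
    int rh = blackHeight(n->right);
    if (lh < 0 || rh < 0 || lh != rh)
        return -1;
    return lh + (n->red ? 0 : 1);
}

int main() {
    Node l{1, true, nullptr, nullptr}, r{3, true, nullptr, nullptr};
    Node root{2, false, &l, &r};  // black root with two red children
    std::printf("black-height: %d\n", blackHeight(&root));  // prints 2
}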

K-ary Tree

A full k-ary tree is a k-ary tree where, within each level, every node has either 0 or k children.

  • A perfect k-ary tree is a full[1] k-ary tree in which all leaf nodes are at the same depth.[2]
  • A complete k-ary tree is a k-ary tree which is maximally space efficient: it must be completely filled on every level except the last, and if the last level is not complete, the nodes that do exist must be filled in “as far left as possible”.[1]

B-tree

  • Not to be confused with Binary tree
  • B-tree is a tree data structure that keeps data sorted
    • searches
    • sequential access
    • insertions
    • deletions
  • keeps keys in sorted order for sequential traversing
  • uses a hierarchical index to minimize the number of disk reads
  • uses partially full blocks to speed insertions and deletions
  • keeps the index balanced with an elegant recursive algorithm
  • minimizes waste by making sure the interior nodes are at least half full.
  • can handle an arbitrary number of insertions and deletions.
  • logarithmic time considerations
  • The B-tree is a generalization of a binary search tree in that a node can have more than two children (Comer 1979, p. 123).
  • Unlike self-balancing binary search trees, the B-tree is optimized for systems that read and write large blocks of data.
  • It is commonly used in databases and filesystems.

The database problem

This section describes a problem faced by database designers, outlines a series of increasingly effective solutions to the problem, and ends by describing how the B-tree solves the problem completely.

Time to search a sorted file

Usually, sorting and searching algorithms have been characterized by the number of comparison operations that must be performed using order notation.

  • A binary search of a sorted table with N records, for example, can be done in roughly ⌈log₂ N⌉ comparisons.
  • If the table had 1,000,000 records, then a specific record could be located with at most 20 comparisons: ⌈log₂ 1,000,000⌉ = 20.
  • Large databases have historically been kept on disk drives.
  • The time to read a record on a disk drive far exceeds the time needed to compare keys once the record is available.
  • The time to read a record from a disk drive involves a seek time and a rotational delay.
  • The seek time may be 0 to 20 or more milliseconds, and the rotational delay averages about half the rotation period. For a 7200 RPM drive, the rotation period is 8.33 milliseconds.
  • For a drive such as the Seagate ST3500320NS, the track-to-track seek time is 0.8 milliseconds and the average reading seek time is 8.5 milliseconds.[4] For simplicity, assume reading from disk takes about 10 milliseconds.
  • Naively, then, the time to locate one record out of a million would take 20 disk reads times 10 milliseconds per disk read, which is 0.2 seconds.
  • The time won’t be that bad because individual records are grouped together in a disk block.
  • A disk block might be 16 kilobytes. If each record is 160 bytes, then 100 records could be stored in each block.
  • The disk read time above was actually for an entire block.
  • Once the disk head is in position, one or more disk blocks can be read with little delay.
  • With 100 records per block, the last 6 or so comparisons don’t need to do any disk reads—the comparisons are all within the last disk block read.
  • To speed the search further, the first 13 to 14 comparisons (which each required a disk access) must be sped up.
  • A significant improvement can be made with an index.
  • In the example above, initial disk reads narrowed the search range by a factor of two.
  • That can be improved substantially by creating an auxiliary index that contains the first record in each disk block (sometimes called a sparse index).
  • This auxiliary index would be 1% of the size of the original database, but it can be searched more quickly.
  • Finding an entry in the auxiliary index would tell us which block to search in the main database; after searching the auxiliary index, we would have to search only that one block of the main database—at a cost of one more disk read.
  • The index would hold 10,000 entries, so it would take at most 14 comparisons. Like the main database, the last 6 or so comparisons in the aux index would be on the same disk block. The index could be searched in about 8 disk reads, and the desired record could be accessed in 9 disk reads.
  • The trick of creating an auxiliary index can be repeated to make an auxiliary index to the auxiliary index. That would make an aux-aux index that would need only 100 entries and would fit in one disk block.
  • Instead of reading 14 disk blocks to find the desired record, we only need to read 3 blocks. Reading and searching the first (and only) block of the aux-aux index identifies the relevant block in aux-index.
  • Reading and searching that aux-index block identifies the relevant block in the main database.
  • Instead of 150 milliseconds, we need only 30 milliseconds to get the record.
  • This reduces a binary search requiring roughly log₂ N disk reads to one requiring only log_b N disk reads, where b is the blocking factor (the number of entries per block: b = 100 entries per block; log₁₀₀ 1,000,000 = 3 reads).
  • In practice, if the main database is being frequently searched, the aux-aux index and much of the aux index may reside in a disk cache, so they would not incur a disk read.

Delete

  • Deleting records from a database doesn’t cause much trouble. The index can stay the same, and the record can just be marked as deleted.
  • The database stays in sorted order.
  • If there is a large number of deletions, then the searching and storage become less efficient.

Insert

  • Insertions can be very slow in a sorted sequential file because room for the inserted record must be made.
  • Inserting a record before the first record in the file requires shifting all of the records down one.
  • Such an operation is just too expensive to be practical.
  • Instead of densely storing all the records in a block, the block can have some free space to allow for subsequent insertions (marked as if they were “deleted” records)
  • If an insertion won’t fit on the block, then some free space on some nearby block must be found and the auxiliary indices adjusted.
  • The hope is that enough space is nearby such that a lot of blocks do not need to be reorganized.
  • some out-of-sequence disk blocks may be used.

Technical description

Terminology

Unfortunately, the literature on B-trees is not uniform in its terminology (Folk & Zoellick 1992, p. 362).

Bayer & McCreight (1972), Comer (1979), and others define the order of a B-tree as the minimum number of keys in a non-root node.

Folk & Zoellick (1992) point out that this terminology is ambiguous because the maximum number of keys is not clear.

An order 3 B-tree might hold a maximum of 6 keys or a maximum of 7 keys.

Knuth (1998, p. 483) avoids the problem by defining the order to be the maximum number of children (which is one more than the maximum number of keys).

The term leaf is also inconsistent. Bayer & McCreight (1972) considered the leaf level to be the lowest level of keys, but Knuth considered the leaf level to be one level below the lowest keys (Folk & Zoellick 1992, p. 363).

There are many possible implementation choices.

In some designs, the leaves may hold the entire data record; in other designs, the leaves may only hold pointers to the data record. Those choices are not fundamental to the idea of a B-tree.[5]

There are also unfortunate choices like using the variable k to represent the number of children when k could be confused with the number of keys.

For simplicity, most authors assume there are a fixed number of keys that fit in a node. The basic assumption is the key size is fixed and the node size is fixed. In practice, variable length keys may be employed (Folk & Zoellick 1992, p. 379).

Best case and worst case heights

Let h be the height of the classic B-tree. Let n > 0 be the number of entries in the tree.[6] Let m be the maximum number of children a node can have. Each node can have at most m−1 keys.

It can be shown (by induction, for example) that a B-tree of height h with all its nodes completely filled has n = m^h − 1 entries. Hence, the best case height of a B-tree is:

⌈log_m(n + 1)⌉.

Let d be the minimum number of children an internal (non-root) node can have. For an ordinary B-tree, d=⌈m/2⌉.

Comer (1979, p. 127) and Cormen et al. (2001, pp. 383–384) give the worst case height of a B-tree (where the root node is considered to have height 0) as

h ≤ ⌊log_d((n + 1)/2)⌋.
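For example, with m = 10 (so d = 5) and n = 999,999 entries, these bounds give a best-case height of ⌈log₁₀ 1,000,000⌉ = 6 and a worst-case height of at most ⌊log₅ 500,000⌋ = 8.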

Example

http://runnable.com/Us5yYIzciVFWAAYA/how-to-create-a-new-linked-list-for-c%2B%2B

//A simple example of hashtable
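The code this comment introduced did not survive the copy; in its place, a minimal separate-chaining sketch (fixed bucket count, non-negative integer keys, no resizing; all simplifying assumptions):

#include <iostream>
#include <list>
#include <string>
#include <utility>
#include <vector>

// A toy hash table: separate chaining, fixed bucket count, no resizing.
class HashTable {
    std::vector<std::list<std::pair<int, std::string>>> buckets;
public:
    explicit HashTable(std::size_t k) : buckets(k) {}

    void put(int key, const std::string& value) {
        auto& chain = buckets[key % buckets.size()];
        for (auto& kv : chain)
            if (kv.first == key) { kv.second = value; return; }  // update
        chain.push_back({key, value});                           // insert
    }

    const std::string* get(int key) const {
        for (const auto& kv : buckets[key % buckets.size()])
            if (kv.first == key) return &kv.second;
        return nullptr;  // not found
    }
};

int main() {
    HashTable table(16);
    table.put(1, "one");
    table.put(17, "seventeen");  // collides with key 1: 17 % 16 == 1
    if (const std::string* v = table.get(17))
        std::cout << *v << "\n";
}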


 http://stackoverflow.com/questions/14945834/search-hash-in-hash-table

http://www.cprogramming.com/tutorial/computersciencetheory/queuecode.html


http://www.programming-techniques.com/2011/11/binary-tree-implementation-on-c.html

cpp-btree

Memory usage comparison

For trees containing more than just a few elements, and especially for those with small value types, the C++ B-tree containers promise to reduce memory usage significantly. The following table shows the average number of bytes per value in the Red-Black set/map compared with the B-tree set/map for several types. Numbers are given for both sorted and random insertion, because the B-tree treats sorted insertion as a special case, yielding a full tree. Keep in mind that in the worst case, B-tree nodes will be 66% full, meaning average bytes per value can be up to 50% greater than the sorted insertion results shown here.

Type                 Insertion  B-Tree (32-bit)  Red-Black (32-bit)  B-Tree (64-bit)  Red-Black (64-bit)
set<int32_t>         sorted     4.19             20.00               4.40             40.00
set<int32_t>         random     4.90             20.00               5.15             40.00
set<int64_t>         sorted     8.39             24.00               8.80             40.00
set<int64_t>         random     9.96             24.00               10.47            40.00
set<string>          sorted     24.57            40.00               33.60            64.00
set<string>          random     29.49            40.00               40.74            64.00
map<int32_t, void*>  sorted     8.39             24.00               8.80             48.00
map<int32_t, void*>  random     9.96             24.00               10.47            48.00
map<int64_t, void*>  sorted     12.60            28.00               13.20            48.00
map<int64_t, void*>  random     15.16            28.00               15.92            48.00
map<string, void*>   sorted     28.67            44.00               38.16            72.00
map<string, void*>   random     34.49            44.00               46.53            72.00

 

Performance comparison

The chart below shows the average time to insert randomly-ordered keys using set<int32_t>, btree_set<int32_t>, map<int32_t, void*>, and btree_map<int32_t, void*>, on a 64-bit platform, as a function of container size.

The next chart shows the average time to lookup randomly-ordered keys using the same containers, also as a function of container size.

Limitations

As described below, there are a few limitations of the C++ B-tree.

  • Insertions into and deletions from B-tree containers invalidate existing iterators. See the discussion of “safe” maps and sets, below.
  • Iterator increment and decrement operations are slightly slower than their standard counterparts.
  • Due to a certain optimization (see the discussion of 3-way compare functions, below), this library depends on <type_traits> from the C++11 standard.
  • The C++11 emplace operations are not yet implemented (a TODO).

Example Use: btree_map

The following example returns a new map of integer keys and pointers to MyObject. In the inner loop, it checks whether the key already exists, and if not inserts a new pair.
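That code block is missing from this copy; a sketch of its likely shape, using the library’s btree_map header (MyObject stands in for the caller’s type named in the description above):

#include <utility>
#include <vector>
#include "btree_map.h"  // cpp-btree

struct MyObject { int id; };

// Build a map of integer keys to MyObject pointers, one entry per distinct key.
btree::btree_map<int, MyObject*> buildMap(const std::vector<int>& keys) {
    btree::btree_map<int, MyObject*> m;
    for (int k : keys) {
        if (m.find(k) == m.end())                 // key not present yet
            m.insert(std::make_pair(k, new MyObject{k}));
    }
    return m;  // caller owns the MyObject allocations
}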

Example Use: btree_set

This example counts the number of duplicated integers in a series.
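That example is also missing; a plausible sketch using the insert return value (the bool is false when the value was already present):

#include <iostream>
#include <vector>
#include "btree_set.h"  // cpp-btree

// Count how many values in the series have been seen before.
int countDuplicates(const std::vector<int>& series) {
    btree::btree_set<int> seen;
    int duplicates = 0;
    for (int v : series)
        if (!seen.insert(v).second)  // insert failed: v is a duplicate
            ++duplicates;
    return duplicates;
}

int main() {
    std::cout << countDuplicates({1, 2, 2, 3, 1}) << "\n";  // prints 2
}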

Safe B-Tree maps and sets

If iterator invalidation would otherwise prevent you from using the B-tree, it may make sense to consider safe_btree_map or safe_btree_set. These containers keep a generation count, which is incremented every time there is an insertion or deletion. Iterators for these containers keep a copy of their current position’s key and the last-valid generation count. When accessed, the iterator will reposition itself using its last-current key.

Safe versions of btree_multiset and btree_multimap are not provided.

In one common scenario, the only reason to keep an iterator across mutations to the container is to erase elements while performing a sequential scan. For this limited use-case, the erase() method returns an iterator at the next position in the tree, so the scan can continue.

Intra-node search

When searching a B-tree container, within each node in the tree there are several approaches taken based on the kind of key and the type of compare function provided.

For integer and floating point keys, we choose linear search through the array of keys. Typically, this is faster than binary search due to CPU branch-prediction. For other types of key, we use binary search within each node, as you might expect.

Ordinarily, tree searches are performed using less than, a boolean function. Some data types (e.g., std::string) provide a natural 3-way compare function, which returns more information and can therefore reduce the number of comparisons needed. C++ B-tree containers support the use of 3-way compare functions by providing a Compare class that subclasses btree_key_compare_to_tag, for example:
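The example itself is missing from this copy; based on the description, its shape would be a comparator tagged with btree_key_compare_to_tag whose call operator returns an int (a sketch, not verified against the library):

#include <string>
#include "btree_map.h"  // cpp-btree

// A 3-way comparator: the tag tells the container to expect an int result.
struct MyStringCompareTo : public btree::btree_key_compare_to_tag {
    int operator()(const std::string& a, const std::string& b) const {
        return a.compare(b);  // < 0, 0, or > 0
    }
};

btree::btree_map<std::string, int, MyStringCompareTo> scores;  // usage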

The user-provided compare function should return < 0 if a is less than b, 0 if a is equal to b, and > 0 if a is greater than b. C++ B-tree automatically configures the use of 3-way compare for std::string keys.

Biased insertion and deletion

The insertion and deletion routines have special support to avoid unnecessary rebalancing of perfectly full trees. Biased insertion: special-case code ensures that when keys are inserted in ascending or descending order, the resulting tree will be completely full. Biased deletion: special-case code avoids rebalancing when removing the first or last element.

B+ tree

  • B+ tree is an n-ary tree with a variable but often large number of children per node
  • B+ tree consists of a root, internal nodes and leaves.[1]
  •  The root may be either a leaf or a node with two or more children.[2]
  • A B+ tree can be viewed as a B-tree in which each node contains only keys (not key-value pairs), and to which an additional level is added at the bottom with linked leaves.
  • The primary value of a B+ tree is in storing data for efficient retrieval in a block-oriented storage context — in particular, filesystems.
  • B+ trees have very high fanout (number of pointers to child nodes in a node,[1] typically on the order of 100 or more), which reduces the number of I/O operations required to find an element in the tree.
  • The order, or branching factor, b of a B+ tree measures the capacity of nodes (i.e., the number of children nodes) for internal nodes in the tree.
  • The actual number of children for a node, referred to here as m, is constrained for internal nodes so that ⌈b/2⌉ ≤ m ≤ b.
  • The root is an exception: it is allowed to have as few as two children.[1]

For example, if the order of a B+ tree is 7, each internal node (except for the root) may have between 4 and 7 children; the root may have between 2 and 7. Leaf nodes have no children, but are constrained so that the number of keys must be at least ⌊b/2⌋ and at most b − 1. In the situation where a B+ tree is nearly empty, it only contains one node, which is a leaf node. (The root is also the single leaf, in this case.) This node is permitted to have as few as one key if necessary, and at most b.

Node Type                             Children Type                 Min Children  Max Children  Example b = 7  Example b = 100
Root Node (when it is the only node)  Records                       1             b             1–7            1–100
Root Node                             Internal Nodes or Leaf Nodes  2             b             2–7            2–100
Internal Node                         Internal Nodes or Leaf Nodes  ⌈b/2⌉         b             4–7            50–100
Leaf Node                             Records                       ⌊b/2⌋         b − 1         3–6            50–99

Algorithms

Search

The root of a B+ Tree represents the whole range of values in the tree, where every internal node is a subinterval.

We are looking for a value k in the B+ Tree. Starting from the root, we are looking for the leaf which may contain the value k. At each node, we figure out which internal pointer we should follow. An internal B+ Tree node has at most d ≤ b children, where every one of them represents a different sub-interval. We select the corresponding node by searching on the key values of the node.

This pseudocode assumes that no duplicates are allowed.
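The pseudocode itself did not survive the copy; a compact search sketch under an assumed node layout (sorted keys, child pointers, leaf flag):

#include <algorithm>
#include <vector>

struct BPlusNode {
    bool leaf;
    std::vector<int> keys;             // sorted within the node
    std::vector<BPlusNode*> children;  // empty when leaf is true
};

// Descend from the root to the leaf that may hold k, then scan that leaf.
bool search(const BPlusNode* node, int k) {
    while (!node->leaf) {
        // index of the first key strictly greater than k picks the child
        std::size_t i =
            std::upper_bound(node->keys.begin(), node->keys.end(), k) -
            node->keys.begin();
        node = node->children[i];
    }
    return std::binary_search(node->keys.begin(), node->keys.end(), k);
}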

prefix key compression

  • It is important to increase fan-out, as this allows searches to be directed to the leaf level more efficiently.
  • Index entries are only there to “direct traffic”; thus we can compress them.
  • Consequently, we can fit more index entries on index pages!

Insertion

Perform a search to determine what bucket the new record should go into.

  • If the bucket is not full (at most b – 1 entries after the insertion), add the record.
  • Otherwise, split the bucket.
    • Allocate new leaf and move half the bucket’s elements to the new bucket.
    • Insert the new leaf’s smallest key and address into the parent.
    • If the parent is full, split it too.
      • Add the middle key to the parent node.
    • Repeat until a parent is found that need not split.
  • If the root splits, create a new root which has one key and two pointers. (That is, the value that gets pushed to the new root gets removed from the original node)

B-trees grow at the root and not at the leaves.[1]

Deletion

  • Start at root, find leaf L where entry belongs.
  • Remove the entry.
    • If L is at least half-full, done!
    • If L has fewer entries than it should,
      • Try to re-distribute, borrowing from sibling (adjacent node with same parent as L).
      • If re-distribution fails, merge L and sibling.
  • If merge occurred, must delete entry (pointing to L or sibling) from parent of L.
  • Merge could propagate to root, decreasing height.

Merging can propagate upward:

  • But … the occupancy factor of L dropped below 50% (d = 2), which is not acceptable.
  • Thus, L needs to be either
  • i) merged with its sibling,
  • or ii) redistributed with its sibling.

Bulk-loading

Given a collection of data records, we want to create a B+ tree index on some key field. One approach is to insert each record into an empty tree. However, it is quite expensive, because each entry requires us to start from the root and go down to the appropriate leaf page. An efficient alternative is to use bulk-loading.

  • The first step is to sort the data entries according to a search key.
  • We allocate an empty page to serve as the root, and insert a pointer to the first page of entries into it.
  • When the root is full, we split the root, and create a new root page.
  • Keep inserting entries into the rightmost index page just above the leaf level, until all entries are indexed.

Note that (1) when the rightmost index page above the leaf level fills up, it is split; (2) this action may, in turn, cause a split of the rightmost index page one step closer to the root; and (3) splits only occur on the rightmost path from the root to the leaf level.

Characteristics

For a b-order B+ tree with h levels of index:

  • The maximum number of records stored is n_max = b^h − b^(h−1)
  • The minimum number of records stored is n_min = 2⌈b/2⌉^(h−1)
  • The minimum number of keys is n_kmin = 2⌈b/2⌉^h − 1
  • The maximum number of keys is n_kmax = n^h
  • The space required to store the tree is O(n)
  • Inserting a record requires O(log_b n) operations
  • Finding a record requires O(log_b n) operations
  • Removing a (previously located) record requires O(log_b n) operations
  • Performing a range query with k elements occurring within the range requires O(log_b n + k) operations
  • Performing a pagination query with page size s and page number p requires O(p·s) operations

Implementation

The leaves (the bottom-most index blocks) of the B+ tree are often linked to one another in a linked list; this makes range queries or an (ordered) iteration through the blocks simpler and more efficient (though the aforementioned upper bound can be achieved even without this addition). This does not substantially increase space consumption or maintenance on the tree. This illustrates one of the significant advantages of a B+tree over a B-tree; in a B-tree, since not all keys are present in the leaves, such an ordered linked list cannot be constructed. A B+tree is thus particularly useful as a database system index, where the data typically resides on disk, as it allows the B+tree to actually provide an efficient structure for housing the data itself (this is described in [6] as index structure “Alternative 1”).

If a storage system has a block size of B bytes, and the keys to be stored have a size of k, arguably the most efficient B+ tree is one where b=(B/k)-1. Although theoretically the one-off is unnecessary, in practice there is often a little extra space taken up by the index blocks (for example, the linked list references in the leaf blocks). Having an index block which is slightly larger than the storage system’s actual block represents a significant performance decrease; therefore erring on the side of caution is preferable.

If nodes of the B+ tree are organized as arrays of elements, then it may take a considerable time to insert or delete an element as half of the array will need to be shifted on average. To overcome this problem, elements inside a node can be organized in a binary tree or a B+ tree instead of an array.

B+ trees can also be used for data stored in RAM. In this case a reasonable choice for block size would be the size of the processor’s cache line.

Space efficiency of B+ trees can be improved by using some compression techniques. One possibility is to use delta encoding to compress keys stored into each block. For internal blocks, space saving can be achieved by either compressing keys or pointers. For string keys, space can be saved by using the following technique: Normally the ith entry of an internal block contains the first key of block i+1. Instead of storing the full key, we could store the shortest prefix of the first key of block i+1 that is strictly greater (in lexicographic order) than the last key of block i. There is also a simple way to compress pointers: if we suppose that some consecutive blocks i, i+1, …, i+k are stored contiguously, then it will suffice to store only a pointer to the first block and the count of consecutive blocks.

All the above compression techniques have some drawbacks. First, a full block must be decompressed to extract a single element. One technique to overcome this problem is to divide each block into sub-blocks and compress them separately. In this case searching or inserting an element will only need to decompress or compress a sub-block instead of a full block. Another drawback of compression techniques is that the number of stored elements may vary considerably from one block to another depending on how well the elements are compressed inside each block.

R-tree

R-trees are tree data structures used for spatial access methods, i.e., for indexing multi-dimensional information such as geographical coordinates, rectangles or polygons. The R-tree was proposed by Antonin Guttman in 1984[1] and has found significant use in both theoretical and applied contexts.[2] A common real-world usage for an R-tree might be to store spatial objects such as restaurant locations or the polygons that typical maps are made of (streets, buildings, outlines of lakes, coastlines, etc.) and then answer queries quickly, such as “Find all museums within 2 km of my current location”, “retrieve all road segments within 2 km of my location” (to display them in a navigation system) or “find the nearest gas station” (although not taking roads into account).

R-tree idea

The key idea of the data structure is to group nearby objects and represent them with their minimum bounding rectangle in the next higher level of the tree; the “R” in R-tree is for rectangle. Since all objects lie within this bounding rectangle, a query that does not intersect the bounding rectangle also cannot intersect any of the contained objects. At the leaf level, each rectangle describes a single object; at higher levels each rectangle describes the aggregation of an increasing number of objects. This can also be seen as an increasingly coarse approximation of the data set.
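
The pruning property can be made concrete with two small helpers (a minimal sketch; rectangles are assumed to be (xmin, ymin, xmax, ymax) tuples):

def intersects(a, b):
    """True iff axis-aligned rectangles a and b overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def mbr(rects):
    """Minimum bounding rectangle of a collection of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

# Since every child lies inside mbr(children), a query box that fails
# intersects(query, mbr(children)) cannot match anything in the subtree,
# so the whole subtree is skipped.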

Similar to the B-tree, the R-tree is also a balanced search tree (so all leaf nodes are at the same depth), organizes the data in pages, and is designed for storage on disk (as used in databases). Each page can contain a maximum number of entries, often denoted as M. It also guarantees a minimum fill (except for the root node); however, best performance has been experienced with a minimum fill of 30%–40% of the maximum number of entries (B-trees guarantee 50% page fill, and B*-trees even 66%). The reason for this is the more complex balancing required for spatial data as opposed to the linear data stored in B-trees.

As with most trees, the searching algorithms (e.g., intersection, containment, nearest neighbor search) are rather simple. The key idea is to use the bounding boxes to decide whether or not to search inside a subtree. In this way, most of the nodes in the tree are never read during a search. Like B-trees, this makes R-trees suitable for large data sets and databases, where nodes can be paged to memory when needed, and the whole tree cannot be kept in main memory.

The key difficulty of R-trees is to build an efficient tree that, on one hand, is balanced (so the leaf nodes are at the same depth) and, on the other hand, has rectangles that do not cover too much empty space and do not overlap too much (so that during search, fewer subtrees need to be processed). For example, the original idea for inserting elements to obtain an efficient tree is to always insert into the subtree that requires the least enlargement of its bounding box. Once that page is full, the data is split into two sets that should each cover a minimal area. Most of the research and improvements for R-trees aim at improving the way the tree is built and can be grouped into two objectives: building an efficient tree from scratch (known as bulk-loading) and performing changes on an existing tree (insertion and deletion).

R-trees do not guarantee good worst-case performance, but generally perform well with real-world data.[3] While more of theoretical interest, the (bulk-loaded) Priority R-tree variant of the R-tree is worst-case optimal,[4] but due to the increased complexity, has not received much attention in practical applications so far.

When data is organized in an R-tree, the k nearest neighbors (for any Lp-Norm) of all points can efficiently be computed using a spatial join.[5] This is beneficial for many algorithms based on the k nearest neighbors, for example the Local Outlier Factor. DeLi-Clu,[6] Density-Link-Clustering is a cluster analysis algorithm that uses the R-tree structure for a similar kind of spatial join to efficiently compute an OPTICS clustering.

Algorithm

Data layout

Data in R-trees is organized in pages that can have a variable number of entries (up to some pre-defined maximum, and usually above a minimum fill). Each entry within a non-leaf node stores two pieces of data: a way of identifying a child node, and the bounding box of all entries within this child node. Leaf nodes store the data required for each child, often a point or bounding box representing the child and an external identifier for the child. For point data, the leaf entries can be just the points themselves. For polygon data (which often requires the storage of large polygons) the common setup is to store only the MBR (minimum bounding rectangle) of the polygon along with a unique identifier in the tree.
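
This layout can be sketched with a hypothetical node class (an assumption for illustration, not any particular library's on-disk format):

from dataclasses import dataclass, field
from typing import List, Tuple, Union

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)

@dataclass
class Node:
    is_leaf: bool
    # Non-leaf entry: (bounding box of the child, child Node).
    # Leaf entry: (MBR or point box of the object, external object id).
    entries: List[Tuple[Rect, Union["Node", int]]] = field(default_factory=list)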

Search

The input is a search rectangle (query box). Searching is quite similar to searching in a B+ tree. The search starts from the root node of the tree. Every internal node contains a set of rectangles and pointers to the corresponding child nodes, and every leaf node contains the rectangles of spatial objects (possibly together with pointers to those objects). For every rectangle in a node, it has to be decided whether it overlaps the search rectangle or not. If yes, the corresponding child node has to be searched as well. Searching continues in this recursive manner until all overlapping nodes have been traversed. When a leaf node is reached, the contained bounding boxes (rectangles) are tested against the search rectangle, and their objects (if there are any) are put into the result set if they lie within the search rectangle.
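
Using the Node sketch and the intersects() helper from above, the whole search is a few lines (again a hedged sketch, not a production implementation):

def search(node, query, results):
    """Collect the identifiers of all leaf entries whose rectangles
    overlap the query rectangle."""
    for rect, payload in node.entries:
        if not intersects(rect, query):
            continue                          # prune this entry's subtree
        if node.is_leaf:
            results.append(payload)           # payload is an object id
        else:
            search(payload, query, results)   # payload is a child Node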

Insertion

To insert an object, the tree is traversed recursively from the root node. At each step, all rectangles in the current directory node are examined, and a candidate is chosen using a heuristic such as choosing the rectangle which requires the least enlargement. The search then descends into this page, until reaching a leaf node. If the leaf node is full, it must be split before the insertion is made. Again, since an exhaustive search is too expensive, a heuristic is employed to split the node into two. When the newly created node is added to the previous level, that level can again overflow, and these overflows can propagate up to the root node; when this node also overflows, a new root node is created and the tree increases in height.

Choosing the insertion subtree

At each level, the algorithm needs to decide in which subtree to insert the new data object. When a data object is fully contained in a single rectangle, the choice is clear. When there are multiple options or rectangles in need of enlargement, the choice can have a significant impact on the performance of the tree.

In the classic R-tree, objects are inserted into the subtree that needs the least enlargement. In the more advanced R*-tree, a mixed heuristic is employed. At leaf level, it tries to minimize the overlap (in case of ties, preferring least enlargement and then least area); at the higher levels, it behaves similarly to the R-tree, but on ties again preferring the subtree with smaller area. The decreased overlap of rectangles in the R*-tree is one of its key benefits over the traditional R-tree (this is also a consequence of the other heuristics used, not only the subtree choice).
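
A minimal sketch of the classic least-enlargement choice (with least area as the tie-breaker), reusing the rectangle convention from the earlier sketches:

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def enlarge(r, s):
    """MBR of rectangles r and s."""
    return (min(r[0], s[0]), min(r[1], s[1]),
            max(r[2], s[2]), max(r[3], s[3]))

def choose_entry(entries, new_rect):
    """entries: list of (rect, child) pairs of a directory node.
    Pick the entry needing the least area enlargement to include
    new_rect; break ties by smaller area."""
    def cost(entry):
        r = entry[0]
        return (area(enlarge(r, new_rect)) - area(r), area(r))
    return min(entries, key=cost)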

Splitting an overflowing node

Since redistributing all objects of a node into two nodes has an exponential number of options, a heuristic needs to be employed to find the best split. In the classic R-tree, Guttman proposed two such heuristics, called QuadraticSplit and LinearSplit. In quadratic split, the algorithm searches for the pair of rectangles that is the worst combination to have in the same node, and puts them as initial objects into the two new groups. It then repeatedly searches for the entry with the strongest preference for one of the groups (in terms of area increase) and assigns it to that group, until all objects are assigned (satisfying the minimum fill).
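
A simplified sketch of quadratic split, reusing area() and enlarge() from above (the real algorithm also picks the remaining entries in order of strongest preference and enforces the minimum fill, both omitted here for brevity):

from itertools import combinations

def quadratic_split(rects):
    # Seeds: the pair whose combined MBR wastes the most area.
    def waste(a, b):
        return area(enlarge(a, b)) - area(a) - area(b)
    s1, s2 = max(combinations(rects, 2), key=lambda p: waste(*p))
    g1, g2 = [s1], [s2]
    m1, m2 = s1, s2                       # running group MBRs
    for r in rects:
        if r is s1 or r is s2:
            continue
        d1 = area(enlarge(m1, r)) - area(m1)
        d2 = area(enlarge(m2, r)) - area(m2)
        if d1 <= d2:                      # assign to the cheaper group
            g1.append(r); m1 = enlarge(m1, r)
        else:
            g2.append(r); m2 = enlarge(m2, r)
    return g1, g2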

There are other splitting strategies such as Greene’s Split,[7] the R*-tree splitting heuristic[8] (which again tries to minimize overlap, but also prefers square-like pages) or the linear split algorithm proposed by Ang and Tan[9] (which, however, can produce very irregular rectangles that perform worse for many real-world range and window queries). In addition to having a more advanced splitting heuristic, the R*-tree also tries to avoid splitting a node by reinserting some of the node’s members, which is similar to the way a B-tree balances overflowing nodes. This was shown to also reduce overlap and thus increase tree performance.

Finally, the X-tree[10] can be seen as an R*-tree variant that can also decide not to split a node but to construct a so-called super-node containing all the extra entries, when it does not find a good split (in particular for high-dimensional data).

Deletion

Deleting an entry from a page may require updating the bounding rectangles of parent pages. However, when a page is underfull, it will not be balanced with its neighbors. Instead, the page will be dissolved and all its children (which may be subtrees, not only leaf objects) will be reinserted. If during this process the root node has a single element, the tree height can decrease.

Bulk-loading

  • Nearest-X – Objects are sorted by their first coordinate (“X”) and then split into pages of the desired size.
  • Packed Hilbert R-tree – variation of Nearest-X, but sorting using the Hilbert value of the center of a rectangle instead of using the X coordinate. There is no guarantee the pages will not overlap.
  • Sort-Tile-Recursive (STR):[11] Another variation of Nearest-X, which estimates the total number of leaves required as l = ⌈(number of objects) / (capacity)⌉ and the required split factor in each dimension to achieve this as s = ⌈l^(1/d)⌉, then repeatedly splits each dimension successively into s equal-sized partitions using one-dimensional sorting. The resulting pages, if they occupy more than one page, are again bulk-loaded using the same algorithm. For point data, the leaf nodes will not overlap and will “tile” the data space into approximately equal-sized pages (see the worked example after this list).
  • Priority R-tree
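
As a worked example of STR (with assumed numbers): for 10,000 points, a leaf capacity of 100 and d = 2 dimensions, l = ⌈10000/100⌉ = 100 leaves and s = ⌈100^(1/2)⌉ = 10, so the points are sorted by x and cut into 10 vertical slices of roughly 1,000 points each, and each slice is then sorted by y and cut into 10 leaf pages.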

R+ tree

An R+ tree is a method for looking up data using a location, often (x, y) coordinates, and often for locations on the surface of the earth. Searching on one number is a solved problem; searching on two or more, and asking for locations that are nearby in both x and y directions, requires craftier algorithms.

Fundamentally, an R+ tree is a tree data structure, a variant of the R tree, used for indexing spatial information.

Difference between R+ trees and R trees

R+ trees are a compromise between R-trees and kd-trees: they avoid overlapping of internal nodes by inserting an object into multiple leaves if necessary. Coverage is the total area needed to cover all related rectangles. Overlap is the total area that is contained in two or more nodes.[1] Minimal coverage reduces the amount of “dead space” (empty area) covered by the nodes of the R-tree. Minimal overlap reduces the set of search paths to the leaves (even more critical for the access time than minimal coverage). Efficient search requires both minimal coverage and minimal overlap.

R+ trees differ from R trees in that:

  • Nodes are not guaranteed to be at least half filled
  • The entries of any internal node do not overlap
  • An object ID may be stored in more than one leaf node

Advantages

  • Because nodes do not overlap with each other, point query performance benefits, since all spatial regions are covered by at most one node.
  • A single path is followed and fewer nodes are visited than with the R-tree

Disadvantages

  • Since rectangles are duplicated, an R+ tree can be larger than an R tree built on the same data set.
  • Construction and maintenance of R+ trees is more complex than the construction and maintenance of R trees and other variants of the R tree.

R* tree

R*-trees are a variant of R-trees used for indexing spatial information. R*-trees have slightly higher construction cost than standard R-trees, as the data may need to be reinserted; but the resulting tree will usually have better query performance. Like the standard R-tree, it can store both point and spatial data. It was proposed by Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger in 1990.[1]

Difference between R*-trees and R-trees

Minimization of both coverage and overlap is crucial to the performance of R-trees. Overlap means that, on data query or insertion, more than one branch of the tree needs to be expanded (due to the way data is split into regions which may overlap). Minimized coverage improves pruning performance, allowing whole pages to be excluded from search more often, in particular for negative range queries.

The R*-tree attempts to reduce both, using a combination of a revised node split algorithm and the concept of forced reinsertion at node overflow. This is based on the observation that R-tree structures are highly susceptible to the order in which their entries are inserted, so an insertion-built (rather than bulk-loaded) structure is likely to be sub-optimal. Deletion and reinsertion of entries allows them to “find” a place in the tree that may be more appropriate than their original location.

When a node overflows, a portion of its entries are removed from the node and reinserted into the tree. (To avoid an indefinite cascade of reinsertions caused by subsequent node overflow, the reinsertion routine may be called only once at each level of the tree when inserting any one new entry.) This has the effect of producing better-clustered groups of entries in nodes, reducing node coverage. Furthermore, actual node splits are often postponed, causing average node occupancy to rise. Reinsertion can be seen as a method of incremental tree optimization triggered on node overflow.
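
The mechanism can be sketched roughly as follows (hypothetical helper names throughout; the published algorithm reinserts the entries whose centers lie farthest from the center of the node's MBR, typically about 30% of them):

def handle_overflow(tree, node, level, reinserted_levels):
    """On overflow, reinsert the most 'outlying' entries once per level;
    otherwise fall back to a regular node split. center_of(), distance(),
    split() and tree.insert_at_level() are assumed helpers; mbr() is the
    helper from the earlier R-tree sketch."""
    if level not in reinserted_levels:
        reinserted_levels.add(level)      # at most one reinsertion per level
        c = center_of(mbr([e[0] for e in node.entries]))
        node.entries.sort(key=lambda e: distance(center_of(e[0]), c))
        keep = len(node.entries) * 70 // 100
        victims, node.entries = node.entries[keep:], node.entries[:keep]
        for rect, child in victims:       # reinsert the farthest ~30%
            tree.insert_at_level(rect, child, level)
    else:
        split(node)                       # topological split, described below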

Performance

  • Improved split heuristic produces pages that are more rectangular and thus better for many applications.
  • Reinsertion method optimizes the existing tree, but increases complexity.
  • Efficiently supports point and spatial data at the same time.

Effect of different splitting heuristics on a database with Germany postal districts:

  • R-tree with Guttman quadratic split:[2] many pages extend from east to west across all of Germany, and pages overlap heavily. This is not beneficial for most applications, which often need only a small rectangular area that would intersect many such slices.
  • R-tree with Ang–Tan linear split:[3] while the slices do not extend as far as with Guttman’s split, the slicing problem affects almost every leaf page. Leaf pages overlap little, but directory pages do.
  • R*-tree topological split:[1] the pages overlap very little, since the R*-tree tries to minimize page overlap, and the reinsertions further optimize the tree. The split strategy also does not prefer slices; the resulting pages are much more useful for common map applications.

Algorithm and complexity

  • The R*-tree uses the same algorithm as the regular R-tree for query and delete operations.
  • When inserting, the R*-tree uses a combined strategy. For leaf nodes, overlap is minimized, while for inner nodes, enlargement and area are minimized.
  • When splitting, the R*-tree uses a topological split that chooses a split axis based on perimeter, then minimizes overlap.
  • In addition to an improved split strategy, the R*-tree also tries to avoid splits by reinserting objects and subtrees into the tree, inspired by the concept of balancing a B-tree.

Worst-case query and delete complexity are thus identical to the R-tree. The insertion strategy of the R*-tree, at O(M log M), is more complex than the linear split strategy (O(M)) of the R-tree, but less complex than the quadratic split strategy (O(M²)), for a page size of M objects, and has little impact on the total complexity. The total insert complexity is still comparable to the R-tree: reinsertions affect at most one branch of the tree, and thus cost O(log n) reinsertions, comparable to performing a split on a regular R-tree. So overall, the complexity of the R*-tree is the same as that of a regular R-tree.

An implementation of the full algorithm must address many corner cases and tie situations not discussed here.

Hilbert R-tree

The Hilbert R-tree, an R-tree variant, is an index for multidimensional objects such as lines, regions, 3-D objects, or high-dimensional feature-based parametric objects. It can be thought of as an extension of the B+ tree for multidimensional objects.

The performance of R-trees depends on the quality of the algorithm that clusters the data rectangles on a node. Hilbert R-trees use space-filling curves, and specifically the Hilbert curve, to impose a linear ordering on the data rectangles.

There are two types of Hilbert R-trees, one for static databases, and one for dynamic databases. In both cases Hilbert space-filling curves are used to achieve better ordering of multidimensional objects in the node. This ordering has to be ‘good’, in the sense that it should group ‘similar’ data rectangles together, to minimize the area and perimeter of the resulting minimum bounding rectangles (MBRs). Packed Hilbert R-trees are suitable for static databases in which updates are very rare or in which there are no updates at all.

The dynamic Hilbert R-tree is suitable for dynamic databases where insertions, deletions, or updates may occur in real time. Moreover, dynamic Hilbert R-trees employ a flexible deferred-splitting mechanism to increase space utilization. Every node has a well-defined set of sibling nodes.

The Hilbert R-tree achieves this by imposing an ordering on the R-tree nodes: it sorts rectangles according to the Hilbert value of the center of the rectangle (i.e., of its MBR). (The Hilbert value of a point is the length of the Hilbert curve from the origin to the point.) Given the ordering, every node has a well-defined set of sibling nodes; thus, deferred splitting can be used, and by adjusting the split policy the Hilbert R-tree can achieve as high a space utilization as desired. In contrast, other R-tree variants have no control over space utilization.

The basic idea

Although the following example is for a static environment, it explains the intuitive principles for good R-tree design. These principles are valid for both static and dynamic databases. Roussopoulos and Leifker proposed a method for building a packed R-tree that achieves almost 100% space utilization. The idea is to sort the data on the x or y coordinate of one of the corners of the rectangles; sorting on any of the four coordinates gives similar results. In this discussion points or rectangles are sorted on the x coordinate of the lower left corner of the rectangle, and the Roussopoulos and Leifker method is referred to as the lowx packed R-tree. The sorted list of rectangles is scanned; successive rectangles are assigned to the same R-tree leaf node until that node is full; a new leaf node is then created and the scanning of the sorted list continues. Thus, the nodes of the resulting R-tree will be fully packed, with the possible exception of the last node at each level, and the utilization is ≈100%. Higher levels of the tree are created in a similar way.

Figure 1 highlights the problem of the lowx packed R-tree. Figure 1 [Right] shows the leaf nodes of the R-tree that the lowx packing method will create for the points of Figure 1 [Left]. The fact that the resulting father nodes cover little area explains why the lowx packed R-tree achieves excellent performance for point queries. However, the fact that the fathers have large perimeters explains the degradation of performance for region queries. This is consistent with the analytical formulas for R-tree performance.[1] Intuitively, the packing algorithm should ideally assign nearby points to the same leaf node. Because the lowx packed R-tree ignores the y coordinate, it tends to violate this empirical rule.

Figure 1: [Left] 200 points uniformly distributed; [Right] MBR of nodes generated by the ‘lowx packed R-tree’ algorithm

This section describes two variants of the Hilbert R-trees. The first index is suitable for the static database in which updates are very rare or in which there are no updates at all. The nodes of the resulting R-tree will be fully packed, with the possible exception of the last node at each level. Thus, the space utilization is ≈100%; this structure is called a packed Hilbert R-tree. The second index, called a Dynamic Hilbert R-tree, supports insertions and deletions, and is suitable for a dynamic environment.

Packed Hilbert R-trees

The following provides a brief introduction to the Hilbert curve. The basic Hilbert curve on a 2×2 grid, denoted by H1, is shown in Figure 2. To derive a curve of order i, each vertex of the basic curve is replaced by the curve of order i − 1, which may be appropriately rotated and/or reflected. Figure 2 also shows the Hilbert curves of order two and three. When the order of the curve tends to infinity, like other space-filling curves, the resulting curve is a fractal with a fractal dimension of two.[1][2] The Hilbert curve can be generalized for higher dimensionalities. Algorithms for drawing the two-dimensional curve of a given order can be found in [3] and [2]. An algorithm for higher dimensionalities is given in [4].

The path of a space-filling curve imposes a linear ordering on the grid points; this path may be calculated by starting at one end of the curve and following the path to the other end. The actual coordinate values of each point can be calculated. However, for the Hilbert curve this is much harder than, for example, for the Z-order curve. Figure 2 shows one such ordering for a 4×4 grid (see curve H2). For example, the point (0,0) on the H2 curve has a Hilbert value of 0, while the point (1,1) has a Hilbert value of 2.
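
The Hilbert value of a grid point can be computed with the standard iterative conversion (a sketch; n is the grid side length, assumed to be a power of two):

def xy2d(n, x, y):
    """Hilbert value of point (x, y) on an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)   # which quadrant, in curve order
        if ry == 0:                    # rotate/flip so the pattern recurses
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

# Consistent with the H2 example above:
assert xy2d(4, 0, 0) == 0 and xy2d(4, 1, 1) == 2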

Figure 2: Hilbert curves of order 1, 2, and 3

The Hilbert curve imposes a linear ordering on the data rectangles; the packing algorithm then traverses the sorted list, assigning each set of C rectangles to a node in the R-tree. The final result is that the set of data rectangles on the same node will be close to each other in the linear ordering, and most likely in the native space as well; thus, the resulting R-tree nodes will have smaller areas. Figure 2 illustrates the intuitive reasons why such Hilbert-based methods result in good performance. The data is composed of points (the same points as given in Figure 1). By grouping the points according to their Hilbert values, the MBRs of the resulting R-tree nodes tend to be small, square-like rectangles. This indicates that the nodes will likely have small area and small perimeter. Small area values result in good performance for point queries; small area and small perimeter values lead to good performance for larger queries.

Algorithm Hilbert-Pack

(packs rectangles into an R-tree)
Step 1. Calculate the Hilbert value for each data rectangle
Step 2. Sort data rectangles on ascending Hilbert values
Step 3. /* Create leaf nodes (level l=0) */

  • While (there are more rectangles)
    • generate a new R-tree node
    • assign the next C rectangles to this node

Step 4. /* Create nodes at higher level (l + 1) */

  • While (there are > 1 nodes at level l)
    • sort nodes at level l ≥ 0 on ascending creation time
    • repeat Step 3

The assumption here is that the data is static or the frequency of modification is low. This is a simple heuristic for constructing an R-tree with 100% space utilization that at the same time has as good a response time as possible.
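
In code, the packing step might look like this (a compact sketch assuming integer coordinates and the xy2d() helper from the earlier sketch; C is the node capacity):

def hilbert_pack(rects, C, grid=1 << 16):
    """Group rectangles into fully packed leaf nodes in Hilbert order."""
    def hilbert_value(r):
        cx = (r[0] + r[2]) // 2       # center of the rectangle's MBR
        cy = (r[1] + r[3]) // 2
        return xy2d(grid, cx, cy)
    ordered = sorted(rects, key=hilbert_value)                       # Steps 1-2
    leaves = [ordered[i:i + C] for i in range(0, len(ordered), C)]  # Step 3
    # Step 4 repeats the same grouping on each level's nodes (sorted by
    # creation time) until a single root remains; omitted here.
    return leaves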

Dynamic Hilbert R-trees

As noted above, the performance of R-trees depends on how well the data rectangles are clustered into nodes; Hilbert R-trees use the Hilbert curve to impose a linear ordering on the data rectangles. The Hilbert value of a rectangle is defined as the Hilbert value of its center.

Tree structure

The Hilbert R-tree has the following structure. A leaf node contains at most Cl entries, each of the form (R, obj_id), where Cl is the capacity of the leaf, R is the MBR of the real object (xlow, xhigh, ylow, yhigh) and obj_id is a pointer to the object description record. The main difference between the Hilbert R-tree and the R*-tree[5] is that non-leaf nodes also contain information about the LHVs (Largest Hilbert Values). Thus, a non-leaf node in the Hilbert R-tree contains at most Cn entries of the form (R, ptr, LHV), where Cn is the capacity of a non-leaf node, R is the MBR that encloses all the children of that node, ptr is a pointer to the child node, and LHV is the largest Hilbert value among the data rectangles enclosed by R. Notice that since the non-leaf node picks one of the Hilbert values of its children to be its own LHV, there is no extra cost for calculating the Hilbert values of the MBRs of non-leaf nodes. Figure 3 illustrates some rectangles organized in a Hilbert R-tree. The Hilbert values of the centers are the numbers near the ‘x’ symbols (shown only for the parent node ‘II’). The LHVs are in [brackets]. Figure 4 shows how the tree of Figure 3 is stored on disk; the contents of the parent node ‘II’ are shown in more detail. Every data rectangle in node ‘I’ has a Hilbert value v ≤ 33; similarly, every rectangle in node ‘II’ has a Hilbert value greater than 33 and ≤ 107, etc.

Figure 3: Data rectangles organized in a Hilbert R-tree (Hilbert values and LHV’s are in Brackets)

A plain R-tree splits a node on overflow, creating two nodes from the original one; this is called a 1-to-2 splitting policy. It is also possible to defer the split, waiting until two nodes split into three (similar to the B*-tree split policy); this is referred to as the 2-to-3 splitting policy. In general, this can be extended to an s-to-(s+1) splitting policy, where s is the order of the splitting policy. To implement the order-s splitting policy, the overflowing node tries to push some of its entries to one of its s – 1 siblings; if all of them are full, then an s-to-(s+1) split needs to be performed. The s – 1 siblings are called the cooperating siblings. For example, with s = 2 an overflowing node first tries to push entries to its one cooperating sibling, and only when that sibling is also full are the two nodes split into three. Next, the algorithms for searching, insertion, and overflow handling are described in detail.

Searching

The searching algorithm is similar to the one used in other R-tree variants. Starting from the root, it descends the tree and examines all nodes that intersect the query rectangle. At the leaf level, it reports all entries that intersect the query window w as qualified data items.

Algorithm Search(node Root, rect w):
S1. Search nonleaf nodes:

Invoke Search for every entry whose MBR intersects the query window w.

S2. Search leaf nodes:

Report all entries that intersect the query window w as candidates.

Figure 4: The file structure for the Hilbert R-tree

Insertion

To insert a new rectangle r in the Hilbert R-tree, the Hilbert value h of the center of the new rectangle is used as a key. At each level, the node chosen is the one whose LHV is minimal among all siblings with LHV greater than h. When a leaf node is reached, the rectangle r is inserted in its correct order according to h. After a new rectangle is inserted in a leaf node N, AdjustTree is called to fix the MBR and LHV values in the upper-level nodes.

Algorithm Insert(node Root, rect r): /* Inserts a new rectangle r in the Hilbert R-tree. h is the Hilbert value of the rectangle*/
I1. Find the appropriate leaf node:

Invoke ChooseLeaf(r, h) to select a leaf node L in which to place r.

I2. Insert r in a leaf node L:

If L has an empty slot, insert r in L in the
appropriate place according to the Hilbert order and return.
If L is full, invoke HandleOverflow(L, r), which
will return a new leaf if a split was inevitable.

I3. Propagate changes upward:

Form a set S that contains L, its cooperating siblings
and the new leaf (if any)
Invoke AdjustTree(S).

I4. Grow tree taller:

If node split propagation caused the root to split, create
a new root whose children are the two resulting nodes.

Algorithm ChooseLeaf(rect r, int h):
/* Returns the leaf node in which to place a new rectangle r. */
C1. Initialize:

Set N to be the root node.

C2. Leaf check:

If N is a leaf, return N.

C3. Choose subtree:

If N is a non-leaf node, choose the entry (R, ptr, LHV)
with the minimum LHV value greater than h.

C4. Descend until a leaf is reached:

Set N to the node pointed by ptr and repeat from C2.

Algorithm AdjustTree(set S):
/* S is a set of nodes that contains the node being updated, its cooperating siblings (if overflow has occurred) and the newly
created node NN (if split has occurred). The routine ascends from the leaf level towards the root, adjusting MBR and LHV of nodes that cover the nodes in S. It propagates splits (if any) */
A1. If root level is reached, stop.
A2. Propagate node split upward:

Let Np be the parent node of N.
If N has been split, let NN be the new node.
Insert NN in Np in the correct order according to its Hilbert
value if there is room. Otherwise, invoke HandleOverflow(Np, NN).
If Np is split, let PP be the new node.

A3. Adjust the MBR’s and LHV’s in the parent level:

Let P be the set of parent nodes for the nodes in S.
Adjust the corresponding MBR’s and LHV’s of the nodes in P appropriately.

A4. Move up to next level:

Let S become the set of parent nodes P, with
NN = PP, if Np was split.
Repeat from A1.

Deletion

In the Hilbert R-tree, there is no need to re-insert orphaned nodes whenever a father node underflows. Instead, keys can be borrowed from the siblings or the underflowing node is merged with its siblings. This is possible because the nodes have a clear ordering (according to Largest Hilbert Value, LHV); in contrast, in R-trees there is no such concept concerning sibling nodes. Notice that deletion operations require s cooperating siblings, while insertion operations require s – 1 siblings.

Algorithm Delete(r):
D1. Find the host leaf:

Perform an exact match search to find the leaf node L
that contains r.

D2. Delete r:

Remove r from node L.

D3. If L underflows:

borrow some entries from the s cooperating siblings;
if all the siblings are ready to underflow,
merge s + 1 nodes into s nodes and
adjust the resulting nodes.

D4. Adjust MBR and LHV in parent levels:

Form a set S that contains L and its cooperating
siblings (if underflow has occurred).
Invoke AdjustTree(S).

Overflow handling

The overflow handling algorithm in the Hilbert R-tree treats an overflowing node either by moving some of its entries to one of the s – 1 cooperating siblings or by splitting s nodes into s + 1 nodes.

Algorithm HandleOverflow(node N, rect r):
/* return the new node if a split occurred. */
H1. Let ε be a set that contains all the entries from N

and its s – 1 cooperating siblings.

H2. Add r to ε.
H3. If at least one of the s – 1 cooperating siblings is not full,

distribute ε evenly among the s nodes according to Hilbert values.

H4. If all the s cooperating siblings are full,

create a new node NN and
distribute ε evenly among the s + 1 nodes according
to Hilbert values;
return NN.