Notes: Ontology

Douglas Adams, I find, has always has good advice…even when i feel like an infinitely morose robot… ;D

  1. possible
  2. conceptual
  3. intended
  • Is existence a property?
  • How do the properties of an object relate to the object itself?

Build on XML’s ability to define customized tagging schemes and RDF’s flexible approach to representing data.

The first level above RDF required for the Semantic Web is an ontology language what can formally describe the meaning of terminology used in Web documents; beyond the basic semantics of RDF Schema.

  • XML provides a surface syntax for structured documents, but imposes no semantic constraints on the meaning of these documents.
  • XML Schema is a language for restricting the structure of XML documents and also extends XML with datatypes.
  • RDF is a datamodel for objects (“resources”) and relations between them, provides a simple semantics for this datamodel, and these datamodels can be represented in an XML syntax.
  • RDF Schema is a vocabulary for describing properties and classes of RDF resources, with a semantics for generalization-hierarchies of such properties and classes.
  • OWL adds more vocabulary for describing properties and classes: among others, relations between classes (e.g. disjointness), cardinality (e.g. “exactly one”), equality, richer typing of properties, characteristics of properties (e.g. symmetry), and enumerated classes.

http://www.w3.org/2003/08/owlfaq

Q. What does the acronym “OWL” stand for?

A. Actually, OWL is not a real acronym. The language started out as the “Web Ontology Language” but the Working Group disliked the acronym “WOL.” We decided to call it OWL. The Working Group became more comfortable with this decision when one of the members pointed out the following justification for this decision from the noted ontologist A.A. Milne who, in his influential book “Winnie the Pooh” stated of the wise character OWL:

“He could spell his own name WOL, and he could spell Tuesday so that you knew it wasn’t Wednesday…”

Ontologies are often equated with taxonomic hierarchies of classes, class definitions, and the subsumption relation, but ontologies need not be limited to these forms. Ontologies are also not limited to conservative definitions — that is, definitions in the traditional logic sense that only introduce terminology and do not add any knowledge about the world.[11] To specify a conceptualization, one needs to state axioms that do constrain the possible interpretations for the defined terms.[1]

http://www.ontotext.com/

ontology in a functional frame.

not gonna say i disagree.

Werner Ceusters has noted the confusion caused by the significant differences in the meaning of word ontology when used by philosophy compared with the use of the word ontology in computer science, and advocates for greater precision in use of the word ontology so that members of the various disciplines using various definitions of the word ontology can communicate. He writes ‘before one is able to answer the question ‘what is an ontology?’, one must provide first an answer to the question ‘what does the word ontology mean?’.[58]

Pizza has base?

  •  RDF/XML is the normative syntax and should be used to exchange information between systems.
  • Notice that ontology does not have to express all the possible constraints – the level of details in conceptualization depends on the requirements of the intended application and expressing conceptualization in ontology in addition depends on the used ontology language.
  • Every knowledge base, knowledge-based system or agent is committed to some conceptualization, explicitly or implicitly. For these systems, what “exists” is what can be represented.

http://www.obitko.com/tutorials/ontologies-semantic-web/owl-dl-semantics.html

Ontology

Parmenides was among the first to propose an ontological characterization of the fundamental nature of reality. Ontology is the philosophical study of the nature of

  • what entities exist or can be said to exist
  • how such entities can be grouped, related within a hierarchy,
  •  subdivided according to similarities and differences

Some philosophers contend that all nouns (including abstract nouns) refer to existent entities. Other philosophers contend that nouns do not always name entities, but that some provide a kind of shorthand for reference to a collection of either objects or events.

  1. mind, instead of referring to an entity, refers to a collection of mental events experienced by a person;
  2. society refers to a collection of persons with some shared characteristics, and
  3. geometry refers to a collection of a specific kind of intellectual activity.[1]

Some fundamental questions

Principal questions of ontology include:

  • “What can be said to exist?”
  • “Into what categories, if any, can we sort existing things?”
  • “What are the meanings of being?”
  • “What are the various modes of being of entities?”

One common approach is to divide the extant subjects and predicates into groups called categories. The categories are, properly speaking,[2] the ways in which a being can be addressed simply as a being, such as

  • what it is (its ‘whatness’, quidditas or essence)
  • how it is (its ‘howness’ or qualitativeness), how much it is (quantitativeness)
  • where it is
  • its relatedness to other beings, etc

Further examples of ontological questions include:[citation needed]

  • What is existence, i.e. what does it mean for a being to be ?
  • Is existence a property?
  • Is existence a genus or general class that is simply divided up by specific differences?
  • Which entities, if any, are fundamental?
  • Are all entities objects?
  • How do the properties of an object relate to the object itself?
  • What features are the essential, as opposed to merely accidental attributes of a given object?
  • How many levels of existence or ontological levels are there? And what constitutes a ‘level’?
  • What is a physical object?
  • Can one give an account of what it means to say that a physical object exists?
  • Can one give an account of what it means to say that a non-physical entity exists?
  • What constitutes the identity of an object?
  • When does an object go out of existence, as opposed to merely changing?
  • Do beings exist other than in the modes of objectivity and subjectivity, i.e. is the subject/object split of modern philosophy inevitable?

Concepts

Essential ontological dichotomies include:

Types

Philosophers can classify ontologies in various ways using criteria such as the degree of abstraction and field of application:[3]

  1. Upper ontology: concepts supporting development of an ontology, meta-ontology
  2. Domain ontology: concepts relevant to a particular topic or area of interest, for example, information technology or computer languages, or particular branches of science
  3. Interface ontology: concepts relevant to the juncture of two disciplines
  4. Process ontology: inputs, outputs, constraints, sequencing information, involved in business or engineering processes

Heidegger distinguished human being as existence from the being of things in the world. Heidegger proposes that our way of being human and the way the world is for us are cast historically through a fundamental ontological questioning. These fundamental ontological categories provide the basis for communication in an age: a horizon of unspoken and seemingly unquestionable background meanings, such as human beings understood unquestioningly as subjects and other entities understood unquestioningly as objects. Because these basic ontological meanings both generate and are regenerated in everyday interactions, the locus of our way of being in a historical epoch is the communicative event of language in use.[11] For Heidegger, however, communication in the first place is not among human beings, but language itself shapes up in response to questioning (the inexhaustible meaning of) being.[14]Even the focus of traditional ontology on the ‘whatness’ or ‘quidditas’ of beings in their substantial, standing presence can be shifted to pose the question of the ‘wholeness’ of human being itself.[15]How to determine the ‘fitness’ of a ‘language’ to the world then becomes a subject for investigation.

Reality and actuality

According to A.N. Whitehead, for ontology, it is useful to distinguish the terms ‘reality’ and ‘actuality’.

  • There is no going behind an actual entity, to find something more fundamental in fact or in efficacy. This criterion is to be regarded as expressing an axiom, or postulated distinguished doctrine.
  • An actual entity must be completely determinate in the sense that there can be no confusion about its identity that would allow it to be confounded with another actual entity. In this sense an actual entity is completely concrete, with no potential to be something other than itself. It is what it is. It is of course a source of potentiality for the creation of other actual entities, of which it may be said to be a part cause. Likewise it is the concretion or realization of potentialities of other actual entities which are its partial causes.
  • Causation between actual entities is essential to their actuality. Consequently, for Whitehead, each actual entity has its distinct and definite extension in physical Minkowski space, and so is uniquely identifiable. A description in Minkowski space supports descriptions in time and space for particular observers.
  • It is part of the aim of the philosophy of such an ontology as Whitehead’s that the actual entities should be all alike, qua actual entities; they should all satisfy a single definite set of well stated ontological criteria of actuality.
  • Whitehead proposed that his notion of an occasion of experience satisfies the criteria for its status as the philosophically preferred definition of an actual entity.
  • From a purely logical point of view, each occasion of experience has in full measure the characters of both objective and subjective reality.
  • Subjectivity and objectivity refer to different aspects of an occasion of experience, and in no way do they exclude each other.[22

Examples of other philosophical proposals or candidates as actual entities, in this view, are Aristotle’s ‘substances’, Leibniz’ monads, and Descartes ′res verae’ , and the more modern ‘states of affairs’.

  • Aristotle’s substances, such as Socrates, have behind them as more fundamental the ‘primary substances’, and in this sense do not satisfy Whitehead’s criteria.
  • Whitehead is not happy with L Leibniz’ monads are “windowless” and do not cause each other.
  • ‘States of affairs’ are often not closely defined, often without specific mention of extension in physical Minkowski space; they are therefore not necessarily processes of becoming, but may be as their name suggests, simply static states in some sense.
  • States of affairs are contingent on particulars, and are therefore have something behind them.[23]
  • One summary of the Whiteheadian actual entity is that it is a process of becoming.
  • Another summary, referring to its causal linkage to other actual entities, is that it is “all window”, in contrast with Leibniz’ windowless monads. they have existence as abstractions, with reality only derived from their reference to actual entities.
  • A Whiteheadian actual entity has a unique and completely definite place and time.
  • Whiteheadian abstractions are not so tightly defined in time and place, and in the extreme, some are timeless and placeless, or ‘eternal’ entities. All abstractions have logical or conceptual rather than efficacious existence; their lack of definite time does not make them unreal if they refer to actual entities. Whitehead calls this ‘the ontological principle’.

Ontology (information science)

  1. set of types
  2. set of properties
  3. set of relationship types

There is also generally an expectation that the features of the model in an ontology should closely resemble the real world (related to the object).[3] What many ontologies have in common in both computer science and in philosophy is the representation of entities, ideas, and events, along with their properties and relations, according to a system of categories. According to Gruber (1993):

Ontologies are often equated with taxonomic hierarchies of classes, class definitions, and the subsumption relation, but ontologies need not be limited to these forms. Ontologies are also not limited to conservative definitions — that is, definitions in the traditional logic sense that only introduce terminology and do not add any knowledge about the world.[11] To specify a conceptualization, one needs to state axioms that do constrain the possible interpretations for the defined terms.[1]

Components

Contemporary ontologies share many structural similarities, regardless of the language in which they are expressed. As mentioned above, most ontologies describe individuals (instances), classes (concepts), attributes, and relations. In this section each of these components is discussed in turn. Common components of ontologies include:

  • Individuals: instances or objects (the basic or “ground level” objects)
  • Classes: sets, collections, concepts, classes in programmingtypes of objects, or kinds of things
  • Attributes: aspects, properties, features, characteristics, or parameters that objects (and classes) can have
  • Relations: ways in which classes and individuals can be related to one another
  • Function terms: complex structures formed from certain relations that can be used in place of an individual term in a statement
  • Restrictions: formally stated descriptions of what must be true in order for some assertion to be accepted as input
  • Rules: statements in the form of an if-then (antecedent-consequent) sentence that describe the logical inferences that can be drawn from an assertion in a particular form
  • Axioms: assertions (including rules) in a logical form that together comprise the overall theory that the ontology describes in its domain of application. This definition differs from that of “axioms” in generative grammar and formal logic. In those disciplines, axioms include only statements asserted as a priori knowledge. As used here, “axioms” also include the theory derived from axiomatic statements
  • Events: the changing of attributes or relations

Ontologies are commonly encoded using ontology languages.

OWL DL Semantics

Let us illustrate the use of OWL vocabulary on an example ontology (inspired by OWL Pizzas): “Pizza has PizzaBase as its base; Pizza is disjoint with PizzaBase; NonVegetarianPizza is exactly Pizza that is not VegetarianPizza; isIngredientOf is a transitive property; isIngredientOf is inverse of hasIngredient”. The example expressed in the description logic syntax follows: owl pizza The same example expressed using OWL Abstract Syntax formulates the same information using LISP-like notation, and in addition uses URI for identification of all classes and properties:

When embedding the example OWL ontology to RDF, every statement must be converted to triples – see the figure below. For example, the ∃R.C restriction is formed by anonymous resource of type owl:Restriction. This anonymous resource (blank node) is a subject for two properties owl:onProperty and owl:someValuesFrom that relate the restriction relation (property) and concept (class). The anonymous resource is then used to be related to the constrained class (by rdfs:subClassOf in our case). The example expressed in triples and serialized in N3 follows:

pizza owl rdf graph

Pizza OWL ontology expressed in RDF triples

The OWL DL uses all the SHOIN(D) features. The overview of the possible descriptions, data ranges, properties, individuals and data values is shown in the table in the previous pageThe DL description of the semantics was introduced in one of the previous sections. The domain of individuals in the model is ΔI, the domain of data values ΔID was added to specify semantics of data ranges. The ontology is formed by constraints on a model. The axioms that can be used to constrain a model are summarized in the table in the previous page. In addition to the standard description logic features there are so called annotation properties added. In addition to RDFS annotation properties (such as rdfs:comment andrdfs:label) there are properties that allow to state for example version information, state compatibility or incompatibility between ontologies. There is also a constructowl:imports

Notes: Database, Part 2

http://en.wikipedia.org/wiki/Turtle_(robot) (Seymour Aubrey Papert and Logo)

Google: hidden classes in V8 behind JavaScript so what is a “prototypal” language?  Smalltalk and Self.  Prototype is in the OO conceptual line but a modification of the form due to issues of fragile class hierarchy over time.

Prototype-based programming is a style of object-oriented programming in which behaviour reuse (known as inheritance) is performed via a process of cloning existing objects that serve as prototypes. This model can also be known as prototypalprototype-oriented, classless, or instance-based programming. Delegation is the language feature that supports prototype-based programming.

Alan Kay has commented that despite the attention given to objects, messaging is the most important concept in Smalltalk: “The big idea is ‘messaging’ — that is what the kernel of Smalltalk/Squeak is all about (and it’s something that was never quite completed in our Xerox PARC phase).”[18]

Experience with early OO languages like Smalltalk showed that this sort of issue came up again and again. Systems would tend to grow to a point and then become very rigid, as the basic classes deep below the programmer’s code grew to be simply “wrong”. Without some way to easily change the original class, serious problems could arise.

This issue is one of the motivating factors behind prototypes. Unless one can predict with certainty what qualities a set of objects and classes will have in the distant future, one cannot design a class hierarchy properly.

class/prototype.  source/image.

celluar automata/Codd and 12 rules and whatnot.

the first DOM models from HTML and XML are really tied together (more than i previously knew).  XML has a strange history imo…just wanna learn a tiny bit more about it.

These days feeling like going to the ppl and their motivations is helping to make more sense from some of the more confusing aspects of data formats and history.

from original XML spec…just pondering what it means…lol…

Markup encodes a description of the document’s storage layout, structure, and arbitrary attribute-value pairs associated with that structure.

(yet another) funny t.b. lee (’93…?) document on SGML http://www.w3.org/MarkUp/SGML/TimComments

HTML and how and *why* it’s connected to XML.  what isn’t data or procedure (function)?  IBM and GML.  i forget that IBM was dominant for most of the first part of computer history…HAL…lol.

interesting thinking about Tim Berners-Lee document pondering (lol, really funny doc, imo) turning SGML from “markup” to a “language”.  what was the intention, really, here with GML and the relation to document presentation.

The system was an IBM 1130 computer, a machine the size of a desk with 8KB (sic!) of main memory, a 512KB disk drive, a Teletype CX paper tape reader and BRPE paper tape punch, and a Photon 713 photomechanical typesetter. The assignment was my first experience with managing a machine-readable document database: I learned to roll the punched paper tape carefully so that it could be stored neatly in cylindrical waste paper baskets.

In the meantime, though I didn’t know about it, the roots of generalized markup were being planted. Historically, electronic manuscripts contained control codes or macros that caused the document to be formatted in a particular way (“specific coding”). In contrast, generic coding, which began in the late 1960s, uses descriptive tags (for example, “heading”, rather than “format-17”).

Many credit the start of the generic coding movement to a presentation made by William Tunnicliffe, chairman of the Graphic Communications Association (GCA) Composition Committee, during a meeting at the Canadian Government Printing Office in September 1967: his topic — the separation of information content of documents from their format.

Bill went on teaching the world about “generic coding” under the auspices of Norm Scharpf and the GCA, then as now (and for all the years in between) unflagging believers, contributors, and promoters of the cause. At the same time, a New York book designer named Stanley Rice was publishing articles about “Standardized Editorial Structures”, parameterized style macros based on the structural elements of publications.

Later in 1969, together with Ed Mosher and Ray Lorie, I invented Generalized Markup Language (GML) to solve the data representation problem. GML was not merely an alternative to procedural markup, but the logical representation that motivated all processing. Ed recalls:

We called it Text Description Language at first, because I think that’s what we thought it was. We certainly very early intended to use it as a common and general markup to be “translated” into Script [formatting] controls, ATMS & TERMTEXT & MTSC [formatting] codes, STAIRS [information retrieval descriptor] paragraph codes, as well as using an un-filled-in outline of tags as a prompter from which to create a new document.

i still look at this early IBM markup language and just…i dunno…

actually “Text Description Language”.  it just helps me to know that was, at least at one time, their picture of it.

http://ccollins.wordpress.com/2008/03/03/a-brief-history-of-xml/

Extensible Markup Language (XML)

W3C Working Draft 14-Nov-96

This version:
http://www.w3.org/pub/WWW/TR/WD-xml-961114.html
Previous versions:
Latest version:
http://www.textuality.com/sgml-erb/WD-xml.html
Editors:
Tim Bray (Textuality) <tbray@textuality.com>
C. M. Sperberg-McQueen (University of Illinois at Chicago) <cmsmcq@uic.edu>

Status of this memo

This is a W3C Working Draft for review by W3C members and other interested parties. It is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than “work in progress”. A list of current W3C working drafts can be found at: http://www.w3.org/pub/WWW/TR

Note: since working drafts are subject to frequent change, you are advised to reference the above URL, rather than the URLs for working drafts themselves.

This work is part of the W3C SGML Activity.


Abstract

Extensible Markup Language (XML) is an extremely simple dialect of SGML which is completely described in this document. The goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. For this reason, XML has been designed for ease of implementation, and for interoperability with both SGML and HTML.
Note on status of this document: This is even more of a moving target than the typical W3C working draft. Several important decisions on the details of XML are still outstanding – members of the W3C SGML Working Group will recognize these areas of particular volatility in the spec, but those who are not intimately familiar with the deliberative process should be careful to avoid actions based on the content of this document, until the notice you are now reading has been removed.


1. Introduction

Extensible Markup Language, abbreviated XML, describes a class of data objects stored on computers and partially describes the behavior of programs which process these objects. Such objects are called XML documents. XML is an application profile or restricted form of SGML, the Standard Generalized Markup Language [ISO 8879].

XML documents are made up of storage units called entities, which contain either text or binary data. Text is made up of characters, some of which form the character data in the document, and some of which form markup. Markup encodes a description of the document’s storage layout, structure, and arbitrary attribute-value pairs associated with that structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.

A software module called an XML processor is used to read XML documents and provide access to their content and structure. It is assumed that an XML processor is doing its work on behalf of another module, referred to as the application. This specification describes some of the required behavior of an XML processor in terms of the manner it must read XML data, and the information it must provide to the application.

1.1 Origin and Goals

XML was developed by a Generic SGML Editorial Review Board formed under the auspices of the W3 Consortium in 1996 and chaired by Jon Bosak of Sun Microsystems, with the very active participation of a Generic SGML Working Group also organized by the W3C. The membership of these groups is given in an appendix.

The design goals for XML are:

  1. XML shall be straightforwardly usable over the Internet.
  2. XML shall support a wide variety of applications.
  3. XML shall be compatible with SGML.
  4. It shall be easy to write programs which process XML documents.
  5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
  6. XML documents should be human-legible and reasonably clear.
  7. The XML design should be prepared quickly.
  8. The design of XML shall be formal and concise.
  9. XML documents shall be easy to create.
  10. Terseness is of minimal importance.

This specification, together with the associated standards, provides all the information necessary to understand XML version 1.0 and construct computer programs to process it.

This version of the XML specification (0.01) is for internal discussion within the SGML ERB only. It should not be distributed outside the ERB.

Known problems in version 0.01:

  1. Several items in the bibliography have no references to them; several references in the text do not point to anything in the bibliograpy.
  2. The EBNF grammar has not been checked for completeness, and has at least two start productions.
  3. The description of conformance in the final section is incomplete.
  4. Language exists in the spec which describes the effect of several decisions which have not been taken. Specifically, XML may have INCLUDE/IGNORE marked sections as does SGML, the comment syntax may change, XML may have CONREF attributes, the 8879 syntax for EMPTY elements may be outlawed, XML may choose to rule out what 8879 calls “ambiguous” content models, XML may choose to prohibit overlap between enumerated attribute values for different attributes, the handling for attribute values in the absence of a DTD may be specified, there may be a way to signal whether the DTD is complete, the DTD summary may be dropped, and XML may support parameter entities, and XML may predefine a large number of character entities, for example those from HTML 3.2.

1.2 Relationship to Other Standards

Other standards relevant to users and implementors of XML include:

  1. SGML (ISO 8879-1986). Valid XML Documents are SGML documents in the sense described in ISO standard 8879.
  2. Unicode, ISO 10646. This specification depends on ISO standard 10646 and the technically identical Unicode Standard, Version 2.0, which define the encodings and meanings of the characters which make up XML text data.
  3. IETF RFC 1738. This specification defines the syntax and semantics of Uniform Resource Locators, or URLs.
  4. World-Wide Web Consortium Working Draft WD-html32-960909: HTML 3.2 Reference Specification. This includes the repertoire of characters to be predefined by an XML processor.

SGML definition ’96: http://xml.coverpages.org//sgmlsyn/sgmlsyn.htm#C12.1

Charles F. Goldfarb is known as the father of SGML and grandfather of HTML and the World Wide Web. He co-invented the concept of markup languages. In 1969 Charles Goldfarb, leading a small team at IBM, developed the first markup language, called Generalized Markup Language, or GML. In an interview with Web Techniques Magazine editor Michael Floyd, Dr. Goldfarb explains that he coined the term GML to be an initialism for the three researchers, Charles Goldfarb, Ed Mosher and Ray Lorie who worked on the project.

In 1974, he designed SGML and subsequently wrote the first SGML parser, ARCSGML. Goldfarb went on working to turn SGML into the ISO 8879 standard, and served as its editor in the standardization committee.

Goldfarb holds an LL.B. from Harvard Law School. He worked at IBM‘s Almaden Research Center for many years and is now an independent consultant based inSaratoga, California.

http://www.sgmlsource.com/

Charles F. Goldfarb’s

SGML SOURCE HOME PAGE

Charles F. Goldfarb Welcome to the SGML Source Home Page.SGML is the International Standard (ISO 8879) language for structured data and document representation, the basis of HTML and XML and many others. I invented SGML in 1974 and led a 12-year technical effort by several hundred people to develop its present form as an International Standard.

http://www.w3.org/DesignIssues/N3Logic

Adjacency List (including Parent/Child) and Nested Sets – sql/directed graph: http://explainextended.com/2009/09/24/adjacency-list-vs-nested-sets-postgresql/

^i find the lateness in coming of recursive thought in the major rdbms possibly telling…in a way…i dunno it’s like there’s been a disconnect.  between data structure and function.

http://computationalculture.net/article/relational-and-non-relational-models-in-the-entextualization-of-bureaucracy

RELATIONAL AND NON-RELATIONAL MODELS IN THE ENTEXTUALIZATION OF BUREAUCRACY

Abstract

To what can we ascribe the commercial dominance and organizational ubiquity of relational databases in the 1990s and early 2000s? By integrating perspectives from classical sociological theory, the history of administrative computing, and linguistic anthropology with papers and proceedings from the nascent database research community, this paper argues that the success of relational databases (and their corresponding tabular data models) can be ascribed to three distinctive components: 1) a semiotic distinctiveness from previously-existing (and now largely forgotten) hierarchical and network models; 2) a bureaucratic correspondence with practices of file organization and index-based retrieval; and 3) support for transaction processing, which facilitated adoption by a wide variety of organizations in commerce, banking, and finance. These three aspects are in turn used to suggest ways for understanding the potential significance (or lack thereof) of an emerging 21st-century market for non-relational and/or “Big Data” database systems.

http://www.researchgate.net/publication/239595573_Software_for_random_access_processing

Charles Bachman

Charles Bachman 2012.jpg

http://www.tavne.com/3_Specification_and_Design.html

Basic structure of navigational CODASYL database model[1]

Charles William “Charlie” Bachman III (born December 11, 1924) is an American computer scientist, who spent his entire career as an industrial researcher, developer, and manager rather than in academia. He is particularly known for his work in the early development of database management systems. His techniques of layered architecture include his namesake Bachman diagrams.

Biography

Charles Bachman was born in Manhattan, Kansas in 1924, where his father, Charles Bachman Jr., was the head football coach at Kansas State College. He attended high school in East Lansing, Michigan.

In World War II he joined the United States Army and spent March 1944 through February 1946 in the South West Pacific Theater serving in the Anti-Aircraft Artillery Corps in New Guinea, Australia, and the Philippine Islands. There he was first exposed to and used fire control computers for aiming 90 mm guns.[2]

After his discharge in 1946 he attended Michigan State College and graduated in 1948 with a bachelor’s degree in Mechanical Engineering where he was a member of Tau Beta Pi. In mid-1949 he married Connie Hadley.[3] He then attended the University of Pennsylvania. In 1950, he graduated with a master’s degree in Mechanical Engineering, and had also completed three-quarters of the requirements for an MBA from the university’s Wharton School of Business.[2]

Work

Bachman spent his entire career as a practicing software engineer or manager in industry rather than in academia. In 1950 he started working at Dow Chemical in Midland, Michigan. In 1957 he became Dow’s first data processing manager. He worked with the IBM user group SHARE, on developing new version of report generator software, which became known as 9PAC. However, the planned IBM 709 order was cancelled before it arrived.[4] In 1960 he joinedGeneral Electric, where by 1963 he developed the Integrated Data Store (IDS) one of the first database management systems using what came to be known as the navigational database model in the Manufacturing Information And Control System (MIACS) product. Working for customer Weyerhaeuser Lumber, he developed the first multiprogramming network access to the IDS database, an early online transaction processing system called WEYCOS in 1965. Later at GE he developed the “dataBasic” product that offered database support to the Basic language timesharing users. In 1970, GE sold its computer business to Honeywell Information Systems, and so he an his family moved from Phoenix, Arizona to Lexington, Massachusetts.[5]

He received the Turing Award from the Association for Computing Machinery (ACM) in 1973 for “his outstanding contributions to database technology”. He was elected as a Distinguished Fellow of the British Computer Society in 1977 for his pioneering work in database systems.[6] In 1981, he joined a smaller firm, Cullinane Information Systems (later Cullinet), which offered a version of IDS that was called IDMS and supported IBM mainframes.[5]

Bachman Information Systems

In 1983, he founded Bachman Information Systems, which developed a line of computer-aided software engineering(CASE) products. The centerpiece of these products was the BACHMAN/Data Analyst, which provided graphic support to the creation and maintenance of Bachman Diagrams. It was featured in IBM’s Reengineering Cycle marketing program,[citation needed] combining:

  1. the reverse engineering of obsolete mainframe databases,
  2. data modeling,
  3. forward engineering to new physical databases, and
  4. optimization of physical database designs for performance and DBMS specifics.

In 1991 Bachman Information Systems had their initial public offering, trading on the NASDAQ with the symbol BACH. After reaching a high of $37.75 in February 1992, the price hit $1.75 in 1995. In 1996, his company merged with Cadre Technology to form Cayenne Software.[7] He served as president of the combined company for a year, and then retired to Tucson, Arizona. He continued to serve as chairman of the board of Cayenne, which was acquired by Sterling Software in 1998.[2][8]

Publications

Bachman published dozens of publications and papers.[9] A selection:

  • 1962. “Precedence Diagrams: The Key to Production Planning, Scheduling and Control.” In: ProCo Features. Supplement No 24, August 24. .
  • 1965. “Integrated Data Store.” in: DPMA Quarterly, January 1965.
  • 1969. “Software for Random Access Processing.” in: Datamation April 1965.
  • 1969. “Data Structure Diagrams.” in: DataBase: A Quarterly Newsletter of SIGBDP. vol. 1, no. 2, Summer 1969.
  • 1972. “Architecture Definition Technique: Its Objectives, Theory, Process, Facilities, and Practice.” co-authored with J. Bouvard. in: Data Description, Access and Control: Proceedings of the 1972 ACM-SIGFIDET Workshop, November 29-December 1, 1972.
  • 1972. “The Evolution of Storage Structures.” In: Communications of the ACM vol. 15, no. 7, July 1972.
  • 1972-73. “Set Concept for Data Structure.” In: Encyclopedia of Computer Science, 1972-1973.
  • 1973. “The Programmer as Navigator.” 1973 ACM Turing Award lecture. In: Communications of the ACM vol. 16, no. 11, November 1973. (pdf)
  • 1974. “Implementation Techniques for Data Structure Sets.” In: Data Base Management Systems, 1974.
  • 1977. “Why Restrict the Modeling Capability of Codasyl Data Structure Sets?” In: National Computer Conference vol. 46, 1977.
  • 1978. “Commentary on the CODASYL Systems Committee’s Interim Report on Distributed Database Technology.” National Computer Conference vol. 47, 1978.
  • 1978. “DDP Will Be Infinitely Affected, So Managers Beware!” in: DM, March 1978.
  • 1980. “The Impact of Structured Data Throughout Computer-Based Information Systems.” In: Information Processing 80, 1980.
  • 1980. “The Role Data Model Approach to Data Structures.” In; International Conference on Data Bases, March 24, 1980.
  • 1982. “Toward a More Complete Reference Model of Computer-Based Information Systems.” Co-authored with Ronald G. Ross. In: Computers and Standards 1, 1982.
  • 1983. “The Structuring Capabilities of the Molecular Data Model.” In; Entity-Relationship Approach to Software Engineering. C. G. Davis, S. Jajodia, and R. T. Yeh. eds. June 1983.
  • 1987. “A Case for Adaptable Programming.” In: Logic vol. 2, no. 1, Spring 1987.
  • 1989. “A Personal Chronicle: Creating Better Information Systems, with Some Guiding Principles.” In: IEEE Transactions on Knowledge and Data Engineering vol. 1, no. 1, March 1989.

After his retirement, Bachman volunteered to help record the history of early software development. In 2002 he gave a lecture at the Computer History Museum on assembling the Integrated Data Store,[10] and an oral history for the ACM in 2004.[4] Bachman papers from 1951 to 2007 are available from the Charles Babbage Instituteat the University of Minnesota.[9] In 2011, he contributed an oral history to the Institute of Electrical and Electronics Engineers.[5]

Edgar F Codd.jpg

Edgar Frank “Ted” Codd (August 19, 1923 – April 18, 2003) was an English computer scientistwho, while working for IBM, invented the relational model for database management, the theoretical basis for relational databases. He made other valuable contributions tocomputer science, but the relational model, a very influential general theory of data management, remains his most mentioned achievement.[6][7]

Database

database is an organized collection of data. The data are typically organized to model relevant aspects of reality in a way that supports processes requiring this information.

Database management systems (DBMSs) are specially designed software applications that interact with

  • the user
  • other applications
  • and the database itself

to capture and analyze data

A general-purpose DBMS is a software system designed to allow the definition, creation, querying, update, and administration of databases.

Well-known DBMSs include MySQLMariaDBPostgreSQLSQLiteMicrosoft SQL Server,Oracle,SAP HANAdBASEFoxProIBM DB2LibreOffice Base and FileMaker Pro.

A database is not generally portable across different DBMSs, but different DBMSs can interoperate by using standards such as SQL and ODBC or JDBC to allow a single application to work with more than one database.

Formally, “database” refers to the data themselves and supporting data structures.

Databases are set up so that one set of software programs provides all users with access to all the data.

A “database management system” (DBMS) is a suite of computer software providing the interface between users and a database or databases.

The interactions catered for by most existing DBMSs fall into four main groups:

  • Data definition – Defining new data structures for a database, removing data structures from the database, modifying the structure of existing data.
  • Update – Inserting, modifying, and deleting data.
  • Retrieval – Obtaining information either for end-user queries and reports or for processing by applications.
  • Administration – Registering and monitoring users, enforcing data security, monitoring performance, maintaining data integrity, dealing with concurrency control, and recovering information if the system fails.

A DBMS is responsible for maintaining the integrity and security of stored data, and for recovering information if the system fails.

Both a database and its DBMS conform to the principles of a particular database model.[2]“Database system” refers collectively to the

  • database model
  • database management system
  • database

Physically, database servers are dedicated computers that hold the actual databases and run only the DBMS and related software.

Hardware database accelerators, connected to one or more servers via a high-speed channel, are also used in large volume transaction processing environments.

DBMSs may be built around a custom multitasking kernel with built-in networking support, but modern DBMSs typically rely on a standard operating system to provide these functions.[citation needed]

Databases and DBMSs can be categorized according to the

  1. database model(s) that they support (such as relational or XML), the type(s) of computer they run on (from a server cluster to a mobile phone),
  2. the query language(s) used to access the database (such as SQL or XQuery),
  3. and their internal engineering, which affects performance, scalability, resilience, and security.

Client-server or transactional DBMSs are often complex to maintain highperformance,availability and security when many users are querying and updating the database at the same time.

Personal, desktop-based database systems tend to be less complex.

For example, FileMaker and Microsoft Access come with built-in graphical user interfaces.

Increasingly, databases are not only used to support the internal operations of the organization, but also to underpin its online interactions with customers and suppliers (see Enterprise software).

Some general-purpose DBMSs such as AdabasOracle and DB2 have been undergoing upgrades since the 1970s.

General-purpose DBMSs aim to meet the needs of as many applications as possible, which adds to the complexity

However, a general-purpose DBMS is not always the optimal solution: in some cases a general-purpose DBMS may introduce unnecessary overhead. Therefore, there are many examples of systems that use special-purpose databases. A common example is an email system

wire protocol

or more likely through an application programming interface.

The development of database technology can be divided into three eras based on data model or structure:

  1. navigational
  2. SQL/relational
  3. post-relational

The two main early navigational data models were the hierarchical model, epitomized by IBM’s IMS system, and the CODASYL model (network model), implemented in a number of products such as IDMS.

The relational model, first proposed in 1970 by Edgar F. Codd, departed from this tradition by insisting that applications should search for data by content, rather than by following links.

The relational model employs sets of ledger-style tables, each used for a different type of entity.

Only in the mid-1980s did computing hardware became powerful enough to allow the wide deployment of relational systems (DBMSs plus applications).

By the early 1990s, however, relational systems dominated in all large-scale data processing applications, and as of 2014 they remain dominant except in niche areas.

The dominant database language, standardised SQL for the relational model, has influenced database languages for other data models.

CODASYL

CODASYL (often spelled Codasyl) is an acronym for “Conference on Data Systems Languages”. This was a consortium formed in 1959 to guide the development of a standard programming language that could be used on many computers. This effort led to the development of COBOLand other standards.

CODASYL’s members were individuals from industry and government involved in data processing activity. Its larger goal was to promote more effective data systems analysis, design, and implementation. The organization published specifications for various languages over the years, handing these over to official standards bodies (ISO,ANSI, or their predecessors) for formal standardization.

History

CODASYL is remembered almost entirely for two activities: its work on the development of theCOBOL language and its activities in standardizing database interfaces. It also worked on a wide range of other topics, including end-user form interfaces and operating-system control languages, but these projects had little lasting impact.

The remainder of this section is concerned with CODASYL’s database activities.

In 1965 CODASYL formed a List Processing Task Force. This group was chartered to develop COBOL language extensions for processing collections of records; the name arose because Charles Bachman‘s IDS system (which was the main technical input to the project) managed relationships between records using chains of pointers. In 1967 the group renamed itself the Data Base Task Group (DBTG), and its first report in January 1968 was entitled COBOL extensions to handle data bases.

The “set”, the basic structure of the CODASYL database model. A set consists of one owner record and n member records (these are labeled as “parent” and “child” in the diagram, but the Codasyl terminology is “owner” and “member”). In the above example, we were looking at a basic set which embodies a 1:N (Owner:Member) relationship.[1]

In October 1969 the DBTG published its first language specifications for the network database model which became generally known as the Codasyl Data Model.

This specification in fact defined several separate languages:

  • data definition language (DDL) to define the schema of the database, another DDL to create one or more subschemas defining application views of the database;
  • and a data manipulation language (DML) defining verbs for embedding in the COBOL programming language to request and update data in the database.

Although the work was focused on COBOL, the idea of a host-language independent database was starting to emerge, prompted by IBM‘s advocacy of PL/I as a COBOL replacement.

In 1971, largely in response to the need for programming language independence, the work was reorganized: development of the Data Description Language was continued by the Data Description Language Committee, while the COBOL DML was taken over by the COBOL language committee.

With hindsight, this split had unfortunate consequences. The two groups never quite managed to synchronize their specifications, leaving vendors to patch up the differences. The inevitable consequence was a lack of interoperability among implementations.

A number of vendors implemented database products conforming (roughly) to the DBTG specifications: the most well-known implementations were

ANSI and ISO adopted the Codasyl database specifications under the name Network Database Language (NDL), with work taking place within the same working group (X3H2) as SQLstandardization. An ISO standard for NDL was ratified as ISO 8907:1987,[2] but, as it never had any practical effect on implementations, it was formally withdrawn in 1998.

Some of the CODASYL committees continue their work today, but CODASYL itself no longer exists. The records of CODASYL were donated to the Charles Babbage Institute and a catalog may be found at its website.

Interest in CODASYL gradually faded due to growing interest in relational databases beginning in the early 1980s.

Example of CODASYL: http://www.scit.wlv.ac.uk/~cm1958/cp4011/codasyl/codasyl.htm

Relational database

relational database is a database that has a collection of tables of data items, all of which is formally described and organized according to the relational model

Data in a single table represents a relation, from which the name of the database type comes.
In typical solutions, tables may have additionally defined relationships with each other.

In the relational model, each table schema must identify a column or group of columns, called the primary key, to uniquely identify each row.

The term relational does not just refer to relationships between tables: firstly, it refers to the table itself[citation needed], or rather, the relationship between columns within a table; and secondly, it refers to links between tables.

Relational databases displaced hierarchical databases because the ability to add new relations made it possible to add new information that was valuable but “broke” a database’s original hierarchical conception.

The trend continues as a networked planet and social media create the world of “big data”which is larger and less structured than the datasets and tasks that relational databases handle well (it is instructive to compare Hadoop).

Tuple

In mathematicscomputer sciencelinguistics,[1] and philosophy[2] a tuple is an ordered list of elements. In set theory, an (ordered) n-tuple is a sequence (or ordered list) of n elements, where n is a non-negative integer. There is only one 0-tuple, an empty sequence. An n-tuple is defined inductively using the construction of anordered pair. Tuples are usually written by listing the elements within parentheses “(text{ })” and separated by commas; for example, (2, 7, 4, 1, 7) denotes a 5-tuple. Sometimes other delimiters are used, such as square brackets “[text{ }]” or angle brackets “langletext{ }rangle“. Braces “{}” are almost never used for tuples, as they are the standard notation for sets.

Tuples are often used to describe other mathematical objects, such as vectors. In computer science, tuples are directly implemented as product types in most functional programming languages. More commonly, they are implemented as record types, where the components are labeled instead of being identified by position alone. This approach is also used in relational algebra. Tuples are also used in relation to programming the semantic web with Resource Description Framework or RDF.

Codd’s 12 rules

Codd’s twelve rules: database management system in order for it to be considered relational, i.e., a relational database management system (RDBMS).[1][2] They are sometimes jokingly referred to as “Codd’s Twelve Commandments”.

Codd produced these rules as part of a personal campaign to prevent his vision of the relational database being diluted, as database vendors scrambled in the early 1980s to repackage existing products with a relational veneer. Rule 12 was particularly designed to counter such a positioning. Even if such repackaged non-relational products eventually gave way to SQL DBMSs, no popular “relational” DBMSs are actually relational, be it by Codd’s twelve rules or by the more formal definitions in his papers, in his books or in succeeding works in the academia or by its coworkers and successors, Christopher J. DateHugh Darwen, David McGoveran and Fabian Pascal. Only less known DBMSs, most of them academic, strive to comply. The only commercial example, as of December 2010, is Dataphor. Some rules are controversial, especially rule three, because of the debate on three-valued logic.

Rules

Rule 0: The Foundation rule:

A relational database management system must manage its stored data using only its relational capabilities. The system must qualify as relational, as a database, and as a management system. For a system to qualify as a relational database management system (RDBMS), that system must use its relational facilities (exclusively) to manage the database.

Rule 1: The information rule:

All information in a relational database (including table and column names) is represented in only one way, namely as a value in a table.

Rule 2: The guaranteed access rule:

All data must be accessible. This rule is essentially a restatement of the fundamental requirement for primary keys. It says that every individual scalar value in the database must be logically addressable by specifying the name of the containing table, the name of the containing column and the primary key value of the containing row.

Rule 3: Systematic treatment of null values:

The DBMS must allow each field to remain null (or empty). Specifically, it must support a representation of “missing information and inapplicable information” that issystematic, distinct from all regular values (for example, “distinct from zero or any other number”, in the case of numeric values), and independent of data type. It is also implied that such representations must be manipulated by the DBMS in a systematic way.

Rule 4: Active online catalog based on the relational model:

The system must support an online, inline, relational catalog that is accessible to authorized users by means of their regular query language. That is, users must be able to access the database’s structure (catalog) using the same query language that they use to access the database’s data.

Rule 5: The comprehensive data sublanguage rule:

The system must support at least one relational language that

  1. Has a linear syntax
  2. Can be used both interactively and within application programs,
  3. Supports data definition operations (including view definitions), data manipulation operations (update as well as retrieval), security and integrity constraints, and transaction management operations (begin, commit, and rollback).

Rule 6: The view updating rule:

All views that are theoretically updatable must be updatable by the system.

Rule 7: High-level insert, update, and delete:

The system must support set-at-a-time insertupdate, and delete operators. This means that data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables. This rule states that insert, update, and delete operations should be supported for any retrievable set rather than just for a single row in a single table.

Rule 8: Physical data independence:

Changes to the physical level (how the data is stored, whether in arrays or linked lists etc.) must not require a change to an application based on the structure.

Rule 9: Logical data independence:

Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application based on the structure. Logical data independence is more difficult to achieve than physical data independence.

Rule 10: Integrity independence:

Integrity constraints must be specified separately from application programs and stored in the catalog. It must be possible to change such constraints as and when appropriate without unnecessarily affecting existing applications.

Rule 11: Distribution independence:

The distribution of portions of the database to various locations should be invisible to users of the database. Existing applications should continue to operate successfully:

  1. when a distributed version of the DBMS is first introduced; and
  2. when existing distributed data are redistributed around the system.

Rule 12: The nonsubversion rule:

If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert the system, for example, bypassing a relational security or integrity constraint.

Terminology

Relational database terminology.

The relational database was first defined in June 1970 by Edgar Codd, of IBM’s San Jose Research Laboratory.[1] Codd’s view of what qualifies as an RDBMS is summarized in Codd’s 12 rules. A relational database has become the predominant choice in storing data. Other models besides the relational modelinclude the hierarchical database model and the network model.

The table below summarizes some of the most important relational database terms and theirSQLequivalents.

SQL TERM RELATIONAL DATABASE TERM DESCRIPTION
Row Tuple or record A data set representing a single item
Column Attribute or field A labeled element of a tuple, e.g. “Address” or “Date of birth”
Table Relation or Baserelvar A set of tuples sharing the same attributes; a set of columns and rows
View or result set Derived relvar Any set of tuples; a data report from the RDBMS in response to a query

Relations or Tables

relation is defined as a set of tuples that have the same attributes.

A tuple usually represents an object and information about that object.

Objects are typically physical objects or concepts. A relation is usually described as a table, which is organized into rows and columns. All the data referenced by an attribute are in the same domain and conform to the same constraints.

The relational model specifies that the tuples of a relation have no specific order and that the tuples, in turn, impose no order on the attributes.

Since every attribute has an associated domain, there are constraints (domain constraints). The two principal rules for the relational model are known as entity integrity andreferential integrity.

B-tree indexes result in query times proportional to log(n) where n is the number of rows in a table and hash indexes result in constant time queries (no size dependency so long as the relevant part of the index fits into memory).

The NTFSReiserFSNSSXFSJFS, and ReFS filesystems all use this type of tree for metadata indexing. Relational database management systems such as IBM DB2,[3] Informix,[3] Microsoft SQL Server,[3] Oracle 8,[3] Sybase ASE,[3] and SQLite[4]support this type of tree for table indices. Key-value database management systems such as CouchDB[5] and Tokyo Cabinet[6]support this type of tree for data access.

The B tree was first described in the paper Organization and Maintenance of Large Ordered Indices. Acta Informatica 1: 173–189 (1972) by Rudolf Bayer and Edward M. McCreight. There is no single paper introducing the B+ tree concept. Instead, the notion of maintaining all data in leaf nodes is repeatedly brought up as an interesting variant. An early survey of B trees also covering B+ trees is Douglas Comer: “The Ubiquitous B-Tree“, ACM Computing Surveys 11(2): 121–137 (1979). Comer notes that the B+ tree was used in IBM’s VSAM data access software and he refers to an IBM published article from 1973.

Queries made against the relational database, and the derived relvars in the database are expressed in a relational calculus or a relational algebra. In his original relational algebra, Codd introduced eight relational operators in two groups of four operators each. The first four operators were based on the traditional mathematical set operations:

  • The union operator combines the tuples of two relations and removes all duplicate tuples from the result. The relational union operator is equivalent to the SQL UNION operator.
  • The intersection operator produces the set of tuples that two relations share in common. Intersection is implemented in SQL in the form of the INTERSECT operator.
  • The difference operator acts on two relations and produces the set of tuples from the first relation that do not exist in the second relation. Difference is implemented in SQL in the form of the EXCEPT or MINUS operator.
  • The cartesian product of two relations is a join that is not restricted by any criteria, resulting in every tuple of the first relation being matched with every tuple of the second relation. The cartesian product is implemented in SQL as the CROSS JOIN operator.

The remaining operators proposed by Codd involve special operations specific to relational databases:

  • The selection, or restriction, operation retrieves tuples from a relation, limiting the results to only those that meet a specific criterion, i.e. a subset in terms of set theory. The SQL equivalent of selection is the SELECT query statement with a WHERE clause.
  • The projection operation extracts only the specified attributes from a tuple or set of tuples.
  • The join operation defined for relational databases is often referred to as a natural join. In this type of join, two relations are connected by their common attributes. SQL’s approximation of a natural join is the INNER JOIN operator. In SQL, an INNER JOIN prevents a cartesian product from occurring when there are two tables in a query. For each table added to a SQL Query, one additional INNER JOIN is added to prevent a cartesian product. Thus, for N tables in a SQL query, there must be N-1 INNER JOINS to prevent a cartesian product.
  • The relational division operation is a slightly more complex operation, which involves essentially using the tuples of one relation (the dividend) to partition a second relation (the divisor). The relational division operator is effectively the opposite of the cartesian product operator (hence the name).

Other operators have been introduced or proposed since Codd’s introduction of the original eight including relational comparison operators and extensions that offer support for nesting and hierarchical data, among others.

Database normalization

Edgar F. Codd, the inventor of the relational model, introduced the concept of normalization and what we now know as the First Normal Form (1NF) in 1970.[1] Codd went on to define the Second Normal Form (2NF) and Third Normal Form (3NF) in 1971,[2] and Codd and Raymond F. Boyce defined the Boyce-Codd Normal Form (BCNF) in 1974.[3] Informally, a relational database table is often described as “normalized” if it is in the Third Normal Form.[4] Most 3NF tables are free of insertion, update, and deletion anomalies.

A standard piece of database design guidance is that the designer should first create a fully normalized design; then selective denormalization can be performed for performance reasons.[5]

A basic objective of the first normal form defined by Edgar Frank “Ted” Codd in 1970 was to permit data to be queried and manipulated using a “universal data sub-language” grounded infirst-order logic.[6] (SQL is an example of such a data sub-language, albeit one that Codd regarded as seriously flawed.)[7]

The objectives of normalization beyond 1NF (First Normal Form) were stated as follows by Codd:

1. To free the collection of relations from undesirable insertion, update and deletion dependencies;
2. To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs;
3. To make the relational model more informative to users;
4. To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.

— E.F. Codd, “Further Normalization of the Data Base Relational Model”[8]

The sections below give details of each of these objectives.

Free the database of modification anomalies

An update anomaly. Employee 519 is shown as having different addresses on different records.

An insertion anomaly. Until the new faculty member, Dr. Newsome, is assigned to teach at least one course, his details cannot be recorded.

deletion anomaly. All information about Dr. Giddens is lost if he temporarily ceases to be assigned to any courses.

Background to normalization: definitions

Functional dependency
In a given table, an attribute Y is said to have a functional dependency on a set of attributes X (written X → Y) if and only if each X value is associated with precisely one Yvalue. For example, in an “Employee” table that includes the attributes “Employee ID” and “Employee Date of Birth”, the functional dependency {Employee ID} → {Employee Date of Birth} would hold. It follows from the previous two sentences that each {Employee ID} is associated with precisely one {Employee Date of Birth}.
Full functional dependency
An attribute is fully functionally dependent on a set of attributes X if it is:

  • functionally dependent on X, and
  • not functionally dependent on any proper subset of X. {Employee Address} has a functional dependency on {Employee ID, Skill}, but not a full functional dependency, because it is also dependent on {Employee ID}.Even by the removal of {Skill} functional dependency still holds between {Employee Address} and {Employee ID}.
Transitive dependency
transitive dependency is an indirect functional dependency, one in which XZ only by virtue of XY and YZ.
Trivial functional dependency
A trivial functional dependency is a functional dependency of an attribute on a superset of itself. {Employee ID, Employee Address} → {Employee Address} is trivial, as is {Employee Address} → {Employee Address}.
Multivalued dependency
multivalued dependency is a constraint according to which the presence of certain rows in a table implies the presence of certain other rows.
Join dependency
A table T is subject to a join dependency if T can always be recreated by joining multiple tables each having a subset of the attributes of T.
Superkey
superkey is a combination of attributes that can be used to uniquely identify a database record. A table might have many superkeys.
Candidate key
candidate key is a special subset of superkeys that do not have any extraneous information in them: it is a minimal superkey.
Example:

A table with the fields <Name>, <Age>, <SSN> and <Phone Extension> has many possible superkeys. Three of these are <SSN>, <Phone Extension, Name> and <SSN, Name>. Of those, only <SSN> is a candidate key as the others contain information not necessary to uniquely identify records (‘SSN’ here refers to Social Security Number, which is unique to each person).
Non-prime attribute
A non-prime attribute is an attribute that does not occur in any candidate key. Employee Address would be a non-prime attribute in the “Employees’ Skills” table.
Prime attribute
A prime attribute, conversely, is an attribute that does occur in some candidate key.
Primary key
One candidate key in a relation may be designated the primary key. While that may be a common practice (or even a required one in some environments), it is strictly notational and has no bearing on normalization. With respect to normalization, all candidate keys have equal standing and are treated the same.

Normal forms

The normal forms (abbrev. NF) of relational database theory provide criteria for determining a table’s degree of immunity against logical inconsistencies and anomalies. The higher the normal form applicable to a table, the less vulnerable it is. Each table has a “highest normal form” (HNF): by definition, a table always meets the requirements of its HNF and of all normal forms lower than its HNF; also by definition, a table fails to meet the requirements of any normal form higher than its HNF.

The normal forms are applicable to individual tables; to say that an entire database is in normal form n is to say that all of its tables are in normal form n.

Newcomers to database design sometimes suppose that normalization proceeds in an iterative fashion, i.e. a 1NF design is first normalized to 2NF, then to 3NF, and so on. This is not an accurate description of how normalization typically works. A sensibly designed table is likely to be in 3NF on the first attempt; furthermore, if it is 3NF, it is overwhelmingly likely to have an HNF of 5NF. Achieving the “higher” normal forms (above 3NF) does not usually require an extra expenditure of effort on the part of the designer, because 3NF tables usually need no modification to meet the requirements of these higher normal forms.

The main normal forms are summarized below.

NORMAL FORM DEFINED BY IN BRIEF DEFINITION
1NF First normal form Two versions:E.F. Codd(1970), C.J. Date (2003) 1970[1]and 2003[9] The domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain.[10]
2NF Second normal form E.F. Codd 1971[2] No non-prime attribute in the table is functionally dependent on a proper subset of any candidate key
3NF Third normal form Two versions:E.F. Codd(1971), C. Zaniolo (1982) 1971[2]and 1982[11] Every non-prime attribute is non-transitively dependent on every candidate key in the table. The attributes that do not contribute to the description of the primary key are removed from the table. In other words, no transitive dependency is allowed.
EKNF Elementary Key Normal Form C. Zaniolo 1982[11] Every non-trivial functional dependency in the table is either the dependency of an elementary key attribute or a dependency on a superkey
BCNF Boyce–Codd normal form Raymond F. BoyceandE.F. Codd 1974[12] Every non-trivial functional dependency in the table is a dependency on a superkey
4NF Fourth normal form Ronald Fagin 1977[13] Every non-trivial multivalued dependency in the table is a dependency on a superkey
5NF Fifth normal form Ronald Fagin 1979[14] Every non-trivial join dependency in the table is implied by the superkeys of the table
DKNF Domain/key normal form Ronald Fagin 1981[15] Every constraint on the table is a logical consequence of the table’s domain constraints and key constraints
6NF Sixth normal form C.J. Date,Hugh Darwen, and Nikos Lorentzos 2002[16] Table features no non-trivial join dependencies at all (with reference to generalized join operator)

Denormalization

  • OLTP high volume of small transactions; each transaction will leave the database in a consistent state.
  • OLAP operations are primarily “read mostly” databases.  extract historical data that has accumulated over a long period of time.
  • dimensional tables in a star schema often contain denormalized data.
  • alternative to the star schema is the snowflake schema
  • The denormalized or redundant data must be carefully controlled during extract, transform, load (ETL) processing, and users should not be permitted to see the data until it is in a consistent state.

In many cases, the need for denormalization has waned as computers and RDBMS software have become more powerful, but since data volumes have generally increased along with hardware and software performance, OLAP databases often still use denormalized schemas.

Denormalization is also used to improve performance on smaller computers as in computerized cash-registers and mobile devices, since these may use the data for look-up only (e.g. price lookups). Denormalization may also be used when no RDBMS exists for a platform (such as Palm), or no changes are to be made to the data and a swift response is crucial.

Non-first normal form (NF² or N1NF)

Denormalization is the opposite of normalization. In recognition that denormalization can be deliberate and useful, the non-first normal form is a definition of database designs which do not conform to first normal form, by allowing

“sets and sets of sets to be attribute domains” (Schek 1982).

The languages used to query and manipulate data in the model must be extended accordingly to support such values.

One way of looking at this is to consider such structured values as being specialized types of values (domains), with their own domain-specific languages. However, what is usually meant by non-1NF models is the approach in which the relational model and the languages used to query it are extended with a general mechanism for such structure; for instance, the nested relational model supports the use of relations as domain values, by adding two additional operators (nest and unnest) to the relational algebra that can create and flatten nested relations, respectively.

Consider the following table:

First Normal Form
PERSON FAVOURITE COLOUR
Bob blue
Bob red
Jane green
Jane yellow
Jane red

Assume a person has several favourite colours. Obviously, favourite colours consist of a set of colours modeled by the given table. To transform a 1NF into an NF² table a “nest” operator is required which extends the relational algebra of the higher normal forms. Applying the “nest” operator to the 1NF table yields the following NF² table:

Non-First Normal Form
PERSON FAVOURITE COLOURS
Bob
FAVOURITE COLOUR
blue
red
Jane
FAVOURITE COLOUR
green
yellow
red

To transform this NF² table back into a 1NF an “unnest” operator is required which extends the relational algebra of the higher normal forms. The unnest, in this case, would make “colours” into its own table.

Although “unnest” is the mathematical inverse to “nest”, the operator “nest” is not always the mathematical inverse of “unnest”. Another constraint required is for the operators to bebijective, which is covered by the Partitioned Normal Form (PNF).

CICS was preceded by an earlier, single threaded transaction processing system, IBM MTCS. An ‘MTCS-CICS bridge’ was later developed to allow these transactions to execute under CICS with no change to the original application programs.

CICS was originally developed in the United States at an IBM Development Center in Des Plaines, Illinois, beginning in 1966 to address requirements from the public utility industry. The first CICS product was released in 1968, named Public Utility Customer Information Control System, or PU-CICS. It became clear immediately that it had applicability to many other industries, so the Public Utility prefix was dropped with the introduction of the first release of the CICS Program Product on July 8, 1969, not long after IMS database management system.

In 1974, CICS development responsibility was shifted to the IBM Hursley Site in the United Kingdom, where development work continues today.

Customer Information Control System (CICS) is a transaction server that runs primarily on IBM mainframe systems under z/OS and z/VSE.

CICS is middleware designed to support rapid, high-volume online transaction processing. A CICStransaction is a unit of processing initiated by a single request that may affect one or more objects.[1]This processing is usually interactive (screen-oriented), but background transactions are possible.

Programming

Programming considerations

Multiple-user interactive-transaction application programs were required to be quasireentrant in order to support multiple concurrent transaction threads.

A software coding error in one application could block all users from the system.

The modular design of CICS reentrant / reusable control programs meant that, with judicious “pruning,” multiple users with multiple applications could be executed on a computer with just 32K of expensive magnetic core physical memory (including the operating system).

Considerable effort was required by CICS application programmers to make their transactions as efficient as possible.

A common technique was to limit the size of individual programs to no more than 4,096 bytes, or 4K, so that CICS could easily reuse the memory occupied by any program not currently in use for another program or other application storage needs.

When virtual memory was added to versions OS/360 in 1972, the 4K strategy became even more important to reduce paging and thrashing unproductive resource-contention overhead.

The efficiency of compiled high-level COBOL and PL/I language programs left much to be desired. Many CICS application programs continued to be written in assembler language, even after COBOL and PL/I support became available.

With 1960s-and-1970s hardware resources expensive and scarce, a competitive “game” developed among system optimization analysts. When critical path code was identified, a code snippet was passed around from one analyst to another. Each person had to either (a) reduce the number of bytes of code required, or (b) reduce the number of CPU cycles required. Younger analysts learned from what more-experienced mentors did. Eventually, when no one could do (a) or (b), the code was considered optimized, and they moved on to other snippets. Small shops with only one analyst learned CICS optimization very slowly (or not at all).

Because application programs could be shared by many concurrent threads, the use of static variables embedded within a program (or use of operating system memory) was restricted (by convention only).

Unfortunately, many of the “rules” were frequently broken, especially by COBOL programmers who might not understand the internals of their programs or fail to use the necessary restrictive compile time options. This resulted in “non-re-entrant” code that was often unreliable, leading to spurious storage violations and entire CICS system crashes.

The entire partition, or Multiple Virtual Storage (MVS) region, operated with the same memory protection key including the CICS kernel code. Program corruption and CICS control block corruption was a frequent cause of system downtime. A software error in one application program could overwrite the memory (code or data) of one or all currently running application transactions. Locating the offending application code for complex transient timing errors could be a very-difficult operating-system analyst problem.

These serious shortcomings persisted for multiple new releases of CICS over a period of more than 20 years. CICS application transactions were often mission-critical for public utility companies, large banks and other multi-billion-dollar financial institutions. Top-quality CICS skills were in high demand and short supply. The complex learning curve was shallow and long. Unqualified novice developers could have a major negative impact on company operations.

Eventually, it became possible to provide a measure of advance application protection by performing all testing under control of a monitoring program that also served to provide Test and Debug features. One such software offering was known as OLIVER, which prevented application programs corrupting memory by using instruction set simulation of the application code, providing partial virtualization.

Network model

Example of a Network Model.

The network model is a database model conceived as a flexible way of representing objects and their relationships. Its distinguishing feature is that the schema, viewed as a graph in which object types are nodes and relationship types are arcs, is not restricted to being a hierarchy or lattice.

While the hierarchical database model structures data as a tree of records, with each record having one parent record and many children, the network model allows each record to have multiple parent and child records, forming a generalized graph structure.

This property applies at two levels: the schema is a generalized graph of record types connected by relationship types (called “set types” in CODASYL), and the database itself is a generalized graph of record occurrences connected by relationships (CODASYL “sets”).

Cycles are permitted at both levels. The chief argument in favour of the network model, in comparison to the hierarchic model, was that it allowed a more natural modeling of relationships between entities.

Although the model was widely implemented and used, it failed to become dominant for two main reasons. Firstly, IBM chose to stick to the hierarchical model with semi-network extensions in their established products such as IMS and DL/I.

Secondly, it was eventually displaced by the relational model, which offered a higher-level, more declarative interface. Until the early 1980s the performance benefits of the low-level navigational interfaces offered by hierarchical and network databases were persuasive for many large-scale applications, but as hardware became faster, the extra productivity and flexibility of the relational model led to the gradual obsolescence of the network model in corporate enterprise usage.

Database systems

Some well-known database systems that use the network model include:

History

The network model’s original inventor was Charles Bachman, and it was developed into a standard specification published in 1969 by the Conference on Data Systems Languages (CODASYL) Consortium. This was followed by a second publication in 1971, which became the basis for most implementations. Subsequent work continued into the early 1980s, culminating in an ISO specification, but this had little influence on products.

Edgar F. Codd

In 1953, angered by Senator Joseph McCarthy, Codd moved to OttawaCanada.

showing that a set of eight states was sufficient for universal computation and construction.[13]His design for a self-replicating computer was only implemented in 2010.

In the 1940s and 50’s, John von Neumann posed the following problem:[1]

  • What kind of logical organization is sufficient for an automaton to be able to reproduce itself?

He was able to construct a cellular automaton with 29 states, and with it a universal constructor.

Codd modified von Neumann’s question:

  • What kind of logical organization is necessary for an automaton to be able to reproduce itself?

Three years after Codd’s work, Edwin Roger Banks showed a 4-state CA in his PhD thesis that was also capable of universal computation and construction, but again did not implement a self-reproducing machine.[3]

John Devore, in his 1973 masters thesis, tweaked Codd’s rules to greatly reduce the size of Codd’s design, to the extent that it could be implemented in the computers of that time. However, the data tape for self-replication was too long; Devore’s original design was later able to complete replication using Golly.

Codd designed a self-replicating computer in the cellular automaton, based on Wang’s W-machine. However, the design was so colossal that it evaded implementation until 2009, when Tim Hutton constructed an explicit configuration.[5] There were some minor errors in Codd’s design, so Hutton’s implementation differs slightly, in both the configuration and the ruleset.

In the 1960s and 1970s he worked out his theories of data arrangement, issuing his paper “A Relational Model of Data for Large Shared Data Banks” [4] in 1970, after an internal IBM paper one year earlier.[14] To his disappointment, IBM proved slow to exploit his suggestions until commercial rivals started implementing them.

IBM Information Management System

IBM Information Management System (IMS) is a joint hierarchical database andinformation management system with extensive transaction processing capabilities.

History

IBM designed IMS with Rockwell and Caterpillar starting in 1966 for the Apollo program, where it was used to inventory the very large bill of materials (BOM) for the Saturn V moon rocket and Apollo space vehicle.

The first “IMS READY” message appeared on an IBM 2740 terminal in Downey, California, on 14 August 1968.[citation needed] In the interim period, IMS has undergone many developments as IBMSystem/360 technology evolved into the current z/OS and System z9 and z10 technologies. For example, IMS now supports the Java programming languageJDBCXML, and, since late 2005,web services.

Vern Watts was IMS’s chief architect for many years. Watts joined IBM in 1956 and worked at IBM’s Silicon Valley development labs until his death on April 4, 2009.[1] He had continuously worked on IMS since the 1960s.[2]

Database

The IMS Database component stores data using a hierarchical model, which is quite different from IBM’s later released relational databaseDB2.

In IMS, the hierarchical model is implemented using blocks of data known as segments.

Each segment can contain several pieces of data, which are called fields.

For example, a customer database may have a root segment (or the segment at the top of the hierarchy) with fields such as phone, name, and age. Child segments may be added underneath another segment, for instance, one order segment under each customer segment representing each order a customer has placed with a company. Likewise, each order segment may have many children segments for each item on the order. Unlike other databases, you do not need to define all of the data in a segment to IMS. A segment may be defined with a size of 40 bytes but only define one field that is six bytes long as a key field that you can use to find the segment when performing queries. IMS will retrieve and save all 40 bytes as directed by a program but may not understand (or care) what the other bytes represent. In practice, often all data in a segment may map to a COBOL copybook. Besides DL/I query usage, a field may be defined in IMS so that the data can be hidden from certain applications for security reasons. The database component of IMS can be purchased standalone, without the transaction manager component, and used by systems such as CICS.

There are three basic forms of IMS hierarchical databases:

“Full function” databases

  • Directly descended from the Data Language Interface (DL/I) databases originally developed for Apollo, full function databases can have primary and secondary indexes, accessed using DL/I calls from an application program, like SQL calls to DB2 or Oracle.
  • Full function databases can be accessed by a variety of methods, although Hierarchical Direct (HDAM) and Hierarchical Indexed Direct (HIDAM) dominate. The other formats are Simple Hierarchical Indexed Sequential (SHISAM), Hierarchical Sequential (HSAM), and Hierarchical Indexed Sequential (HISAM).
  • Full function databases store data using VSAM, a native z/OS access method, or Overflow Sequential (OSAM), an IMS-specific access method that optimizes the I/O channel program for IMS access patterns. In particular, OSAM performance benefits from sequential access of IMS databases (OSAM Sequential Buffering).

“Fast path” databases

  • Fast Path databases are optimized for extremely high transaction rates.[3] Data Entry Databases (DEDBs) and Main Storage Databases (MSDBs) are the two types of fast path databases. Neither provide any indexing. Virtual Storage Option (VSO) DEDBs can replace MSDBs in modern IMS releases, so MSDBs are gradually disappearing.

High Availability Large Databases (HALDBs)

  • IMS V7 introduced HALDBs, an extension of IMS full function databases to provide better availability, better handling of extremely large data volumes, and, with IMS V9, online reorganization to support continuous availability. (Third party tools exclusively provided online reorganization prior to IMS V9.) A HALDB can store in excess of 40 terabytes of data.[4]

Fast path DEDBs can only be built atop VSAM. DL/I databases can be built atop either VSAM or OSAM, with some restrictions depending on database organization. Although the maximum size of a z/OS VSAM dataset increased to 128 TB a few years ago, IMS still limits a VSAM dataset to 4 GB (and OSAM to 8 GB). This “limitation” simply means that IMS customers will use multiple datasets for large amounts of data. VSAM and OSAM are usually referred to as the access methods, and the IMS “logical” view of the database is referred to as the database “organization” (HDAM, HIDAM, HISAM, etc.) Internally the data are linked using 4-byte pointers or addresses. In the database datasets (DBDSs) the pointers are referred to as RBAs (relative byte addresses).

Collectively the database-related IMS capabilities are often called IMS DB. IMS DB has grown and evolved over nearly four decades to support myriad business needs.

Transaction Manager

IMS is also a robust transaction manager (IMS TM, also known as IMS DC) — one of the “big three” classic transaction managers along with CICS and BEA (now Oracle) Tuxedo.

A transaction manager interacts with an end user (connected through VTAM or TCP/IP, including 3270 and Web user interfaces) or another application, processes a business function (such as a banking account withdrawal), and maintains state throughout the process, making sure that the system records the business function correctly to a data store.

Thus IMS TM is quite like a Web application, operating through a CGI program (for example), to provide an interface to query or update a database. IMS TM typically uses either IMS DB or DB2 as its backend database. When used alone with DB2 the IMS TM component can be purchased without the IMS DB component.

IMS TM uses a messaging and queuing paradigm.

  • An IMS control program receives a transaction entered from a terminal (or Web browser or other application) and then stores the transaction on a message queue (in memory or in a dataset).
  • IMS then invokes its scheduler on the queued transaction to start the business application program in a message processing region.
  • The message processing region retrieves the transaction from the IMS message queue and processes it, reading and updating IMS and/or DB2 databases, assuring proper recording of the transaction.
  • Then, if required, IMS enqueues a response message back onto the IMS message queue.
  • Once the output message is complete and available the IMS control program sends it back to the originating terminal.

IMS TM can handle this whole process thousands (or even tens of thousands) of times per second.

Prior to IMS, businesses and governments had to write their own transaction processing environments. IMS TM provides a straightforward, easy-to-use, reliable, standard environment for high performance transaction execution. In fact, much of the world’s banking industry relies on IMS[citation needed], including the U.S. Federal Reserve. For example, chances are that withdrawing money from an automated teller machine (ATM) will trigger an IMS transaction. Several Chinese banks have recently purchased IMS to support that country’s burgeoning financial industry.

Today IMS complements DB2, IBM’s relational database system, introduced in 1982. In general, IMS performs faster than DB2 for the common tasks but may require more programming effort to design and maintain for non-primary duties.

Relational databases have generally proven superior in cases where the requirements, especially reporting requirements, change frequently or require a variety of viewpoint “angles” outside the primary or original function.

A relational “data warehouse” may be used to supplement an IMS database. For example, IMS may provide primary ATM transactions because it performs well for such a specific task. However, nightly copies of the IMS data may be copied to relational systems such that a variety of reports and processing tasks may be performed on the data. This allows each kind of database to focus best on its relative strength.

Initially, IBM refused to implement the relational model in order to preserve revenue from IMS/DB.

Codd then showed IBM customers the potential of the implementation of its model, and they in turn pressured IBM.

Then IBM included in its Future Systems project a System R subproject — but put in charge of it developers who were not thoroughly familiar with Codd’s ideas, and isolated the team from Codd[citation needed].

As a result, they did not use Codd’s own Alpha language but created a non-relational one, SEQUEL. Even so, SEQUEL was so superior to pre-relational systems that it was copied, in 1979, based on pre-launch papers presented at conferences, by Larry Ellison, of Relational software Inc, in his Oracle Database, which actually reached market before SQL/DS — because of the then-already proprietary status of the original name, SEQUEL had been renamed SQL.

Codd continued to develop and extend his relational model, sometimes in collaboration with Chris Date. One of the normalized forms, the Boyce–Codd normal form, is named after him.

Codd’s theorem, a result proven in his seminal work on the relational model, equates the expressive power of relational algebra and relational calculus (which, in essence, is equivalent to first-order logic).

As the relational model started to become fashionable in the early 1980s, Codd fought a sometimes bitter campaign to prevent the term being misused by database vendors who had merely added a relational veneer to older technology. As part of this campaign, he published his 12 rules to define what constituted a relational database. This made his position in IBM increasingly difficult, so he left to form his own consulting company with Chris Date and others.

Codd coined the term Online analytical processing (OLAP) and wrote the “twelve laws of online analytical processing”.[15] Controversy erupted, however, after it was discovered that this paper had been sponsored by Arbor Software (subsequently Hyperion, now acquired by Oracle), a conflict of interest that had not been disclosed, and ComputerWorld withdrew the paper.[16]

In 2004, SIGMOD renamed its highest prize to the SIGMOD Edgar F. Codd Innovations Award, in his honour.

Object databases developed in the 1980s to overcome the inconvenience of object-relational impedance mismatch, which led to the coining of the term “post-relational” and also the development of hybrid object-relational databases.

The next generation of post-relational databases in the late 2000s became known as NoSQL databases, introducing fast key-value stores and document-oriented databases. A competing “next generation” known as NewSQL databases attempted new implementations that retained the relational/SQL model while aiming to match the high performance of NoSQL compared to commercially available relational DBMSs.

1960s, navigational DBMS

Basic structure of navigational CODASYL database model.

Further information: Navigational database

The introduction of the term database coincided with the availability of direct-access storage (disks and drums) from the mid-1960s onwards.

The term represented a contrast with the tape-based systems of the past, allowing shared interactive use rather than daily batch processing.

The Oxford English dictionary cites[6] a 1962 report by the System Development Corporation of California as the first to use the term “data-base” in a specific technical sense.

As computers grew in speed and capability, a number of general-purpose database systems emerged; by the mid-1960s a number of such systems had come into commercial use. Interest in a standard began to grow, and Charles Bachman, author of one such product, the Integrated Data Store (IDS), founded the “Database Task Group” within CODASYL, the group responsible for the creation and standardization of COBOL. In 1971 theDatabase Task Group delivered their standard, which generally became known as the “CODASYL approach”, and soon a number of commercial products based on this approach entered the market.

The CODASYL approach relied on the “manual” navigation of a linked data set which was formed into a large network. Applications could find records by one of three methods:

  • use of a primary key (known as a CALC key, typically implemented by hashing)
  • navigating relationships (called sets) from one record to another
  • scanning all the records in a sequential order.

Later systems added B-Trees to provide alternate access paths. Many CODASYL databases also added a very straightforward query language. However, in the final tally, CODASYL was very complex and required significant training and effort to produce useful applications.

IBM also had their own DBMS system in 1968, known as IMSIMS was a development of software written for the Apollo program on the System/360. IMS was generally similar in concept to CODASYL, but used a strict hierarchy for its model of data navigation instead of CODASYL’s network model. Both concepts later became known as navigational databases due to the way data was accessed, and Bachman’s 1973 Turing Award presentation was The Programmer as Navigator. IMS is classified[by whom?] as a hierarchical database. IDMS andCincom Systems‘ TOTAL database are classified as network databases. IMS remains in use as of 2014.[7]

1970s, relational DBMS[edit]

Edgar Codd worked at IBM in San Jose, California, in one of their offshoot offices that was primarily involved in the development of hard disk systems. He was unhappy with the navigational model of the CODASYL approach, notably the lack of a “search” facility.

In 1970, he wrote a number of papers that outlined a new approach to database construction that eventually culminated in the groundbreaking A Relational Model of Data for Large Shared Data Banks.[8]

In this paper, he described a new system for storing and working with large databases.

  • Instead of linked list of free-form records
  • table” of fixed-length records, with each table used for a different type of entity

A linked-list system would be very inefficient when storing “sparse” databases where some of the data for any one record could be left empty. The relational model solved this by splitting the data into a series of normalized tables (or relations), with optional elements being moved out of the main table to where they would take up room only if needed. Data may be freely inserted, deleted and edited in these tables, with the DBMS doing whatever maintenance needed to present a table view to the application/user.

In the relational model, related records are linked together with a “key”

The relational model also allowed the content of the database to evolve without constant rewriting of links and pointers. The relational part comes from entities referencing other entities in what is known as one-to-many relationship, like a traditional hierarchical model, and many-to-many relationship, like a navigational (network) model. Thus, a relational model can express both hierarchical and navigational models, as well as its native tabular model, allowing for pure or combined modeling in terms of these three models, as the application requires.

For instance, a common use of a database system is to track information about users, their name, login information, various addresses and phone numbers. In the navigational approach all of these data would be placed in a single record, and unused items would simply not be placed in the database. In the relational approach, the data would be normalized into a user table, an address table and a phone number table (for instance). Records would be created in these optional tables only if the address or phone numbers were actually provided.

Linking the information back together is the key to this system. In the relational model, some bit of information was used as a “key“, uniquely defining a particular record. When information was being collected about a user, information stored in the optional tables would be found by searching for this key. For instance, if the login name of a user is unique, addresses and phone numbers for that user would be recorded with the login name as its key. This simple “re-linking” of related data back into a single collection is something that traditional computer languages are not designed for.

Just as the navigational approach would require programs to loop in order to collect records, the relational approach would require loops to collect information about anyone record. Codd’s solution to the necessary looping was a set-oriented language, a suggestion that would later spawn the ubiquitous SQL. Using a branch of mathematics known as tuple calculus, he demonstrated that such a system could support all the operations of normal databases (inserting, updating etc.) as well as providing a simple system for finding and returning sets of data in a single operation.

Codd’s paper was picked up by two people at Berkeley, Eugene Wong and Michael Stonebraker. They started a project known as INGRES using funding that had already been allocated for a geographical database project and student programmers to produce code. Beginning in 1973, INGRES delivered its first test products which were generally ready for widespread use in 1979. INGRES was similar to System R in a number of ways, including the use of a “language” for data access, known as QUEL. Over time, INGRES moved to the emerging SQL standard.

IBM itself did one test implementation of the relational model, PRTV, and a production one,Business System 12, both now discontinued. Honeywell wrote MRDS forMultics, and now there are two new implementations: Alphora Dataphor and Rel. Most other DBMS implementations usually called relational are actually SQL DBMSs.

IBM Peterlee Relational Test Vehicle (PRTV)

PRTV (Peterlee Relational Test Vehicle) was the world’s first relational database management system that could handle significant data volumes.

It was a relational query system with powerful query facilities, but very limited update facility and no simultaneous multiuser facility. PRTV was a follow-on from the very first relational implementation, IS1.

Features

PRTV included several firsts in the relational database area:

  • implemented relational optimizer [1]
  • implemented cost based relational optimizer [2]
  • handle tables of 1000 rows up to 10,000,000 rows[3]
  • user defined functions (UDFs) within an RDB (also a large suite of built-in functions such as trigonometric and statistical)[4]
  • geographic information system based on an RDB (using UDFs such as point-in-polygon).[5]

PRTV was based on a relational algebra, Information Systems Base Language (ISBL) and followed the relational model very strictly. Even features such as user defined functions were formalized within that model.[6] The PRTV team also introduced surrogates to the relational model[4] to help formalize relational update operations; and a formalisation for updating through views.[7] However neither of these was implemented within PRTV. PRTV emphatically did not implement NULL values, because this conception was introduced only in 1979.[8]

PRTV was itself never available as a product, but the Urban Management System[9] built on it was available as a limited IBM product.

In 1970, the University of Michigan began development of the MICRO Information Management System[9] based on D.L. Childs’ Set-Theoretic Data model.[10][11][12] Micro was used to manage very large data sets by the US Department of Labor, the U.S. Environmental Protection Agency, and researchers from the University of Alberta, theUniversity of Michigan, and Wayne State University. It ran on IBM mainframe computers using the Michigan Terminal System.[13] The system remained in production until 1998.

Integrated approach

Main article: Database machine

In the 1970s and 1980s attempts were made to build database systems with integrated hardware and software. The underlying philosophy was that such integration would provide higher performance at lower cost. Examples were IBM System/38, the early offering of Teradata, and the Britton Lee, Inc. database machine.

Another approach to hardware support for database management was ICL‘s CAFS accelerator, a hardware disk controller with programmable search capabilities. In the long term, these efforts were generally unsuccessful because specialized database machines could not keep pace with the rapid development and progress of general-purpose computers. Thus most database systems nowadays are software systems running on general-purpose hardware, using general-purpose computer data storage. However this idea is still pursued for certain applications by some companies like Netezza and Oracle (Exadata).

Late 1970s, SQL DBMS

IBM started working on a prototype system loosely based on Codd’s concepts as System R in the early 1970s. The first version was ready in 1974/5, and work then started on multi-table systems in which the data could be split so that all of the data for a record (some of which is optional) did not have to be stored in a single large “chunk”. Subsequent multi-user versions were tested by customers in 1978 and 1979, by which time a standardized query language – SQL[citation needed] – had been added. Codd’s ideas were establishing themselves as both workable and superior to CODASYL, pushing IBM to develop a true production version of System R, known asSQL/DS, and, later, Database 2 (DB2).

Larry Ellison‘s Oracle started from a different chain, based on IBM’s papers on System R, and beat IBM to market when the first version was released in 1978.[citation needed]

Stonebraker went on to apply the lessons from INGRES to develop a new database, Postgres, which is now known as PostgreSQL. PostgreSQL is often used for global mission critical applications (the .org and .info domain name registries use it as their primary data store, as do many large companies and financial institutions).

In Sweden, Codd’s paper was also read and Mimer SQL was developed from the mid-1970s atUppsala University. In 1984, this project was consolidated into an independent enterprise. In the early 1980s, Mimer introduced transaction handling for high robustness in applications, an idea that was subsequently implemented on most other DBMSs.

Another data model, the entity-relationship model, emerged in 1976 and gained popularity fordatabase design as it emphasized a more familiar description than the earlier relational model. Later on, entity-relationship constructs were retrofitted as a data modeling construct for the relational model, and the difference between the two have become irrelevant.[citation needed]

1980s, on the desktop

The 1980s ushered in the age of desktop computing. The new computers empowered their users with spreadsheets like Lotus 1,2,3 and database software like dBASE. The dBASE product was lightweight and easy for any computer user to understand out of the box. C. Wayne Ratliffthe creator of dBASE stated: “dBASE was different from programs like BASIC, C, FORTRAN, and COBOL in that a lot of the dirty work had already been done. The data manipulation is done by dBASE instead of by the user, so the user can concentrate on what he is doing, rather than having to mess with the dirty details of opening, reading, and closing files, and managing space allocation.“ [14] dBASE was one of the top selling software titles in the 1980s and early 1990s.

1980s, object-oriented

The 1980s, along with a rise in object-oriented programming, saw a growth in how data in various databases were handled.

Programmers and designers began to treat the data in their databases as objects.

That is to say that if a person’s data were in a database, that person’s attributes, such as their address, phone number, and age, were now considered to belong to that person instead of being extraneous data. T

his allows for relations between data to be relations to objects and their attributes and not to individual fields.[15]

The term “object-relational impedance mismatch” described the inconvenience of translating between programmed objects and database tables. Object databases and object-relational databases attempt to solve this problem by providing an object-oriented language (sometimes as extensions to SQL) that programmers can use as alternative to purely relational SQL. On the programming side, libraries known as object-relational mappings (ORMs) attempt to solve the same problem.

2000s, NoSQL and NewSQL

Main articles: NoSQL and NewSQL

The next generation of post-relational databases in the 2000s became known as NoSQL databases, including fast key-value stores and document-oriented databases.XML databases are a type of structured document-oriented database that allows querying based onXML document attributes.

XML databases are mostly used in enterprise database management, where XML is being used as the machine-to-machine data interoperability standard. XML databases are mostly commercial software systems, that include Clusterpoint,[16] MarkLogic[17]and Oracle XML DB.[18]

NoSQL databases are often very fast, do not require fixed table schemas, avoid join operations by storing denormalized data, and are designed to scale horizontally. The most popular NoSQL systems include MongoDBCouchbaseRiakmemcachedRedisCouchDBHazelcastApache Cassandra and HBase,[19] which are all open-source software products.

In recent years there was a high demand for massively distributed databases with high partition tolerance but according to the CAP theorem it is impossible for adistributed systemto simultaneously provide consistency, availability and partition tolerance guarantees. A distributed system can satisfy any two of these guarantees at the same time, but not all three. For that reason many NoSQL databases are using what is called eventual consistency to provide both availability and partition tolerance guarantees with a maximum level of data consistency.

NewSQL is a class of modern relational databases that aims to provide the same scalable performance of NoSQL systems for online transaction processing (read-write) workloads while still using SQL and maintaining the ACID guarantees of a traditional database system. Such databases include ClustrixEnterpriseDBNuoDB[20] andVoltDB.

Research

Database technology has been an active research topic since the 1960s, both in academia and in the research and development groups of companies (for example IBM Research). Research activity includes theory and development of prototypes. Notable research topics have included models, the atomic transaction concept and relatedconcurrency control techniques, query languages and query optimization methods, RAID, and more.

The database research area has several dedicated academic journals (for example, ACM Transactions on Database Systems-TODS, Data and Knowledge Engineering-DKE) and annualconferences (e.g., ACM SIGMOD, ACM PODSVLDBIEEE ICDE).

Examples

One way to classify databases involves the type of their contents, for example: bibliographic, document-text, statistical, or multimedia objects. Another way is by their application area, for example: accounting, music compositions, movies, banking, manufacturing, or insurance. A third way is by some technical aspect, such as the database structure or interface type. This section lists a few of the adjectives used to characterize different kinds of databases.

  • An in-memory database is a database that primarily resides in main memory, but is typically backed-up by non-volatile computer data storage. Main memory databases are faster than disk databases, and so are often used where response time is critical, such as in telecommunications network equipment.[21]SAP HANAplatform is a very hot topic for in-memory database. By May 2012, HANA was able to run on servers with 100TB main memory powered by IBM. The co founder of the company claimed that the system was big enough to run the 8 largest SAP customers.
  • An active database includes an event-driven architecture which can respond to conditions both inside and outside the database. Possible uses include security monitoring, alerting, statistics gathering and authorization. Many databases provide active database features in the form of database triggers.
  • cloud database relies on cloud technology. Both the database and most of its DBMS reside remotely, “in the cloud”, while its applications are both developed by programmers and later maintained and utilized by (application’s) end-users through aweb browser and Open APIs.
  • Data warehouses archive data from operational databases and often from external sources such as market research firms. The warehouse becomes the central source of data for use by managers and other end-users who may not have access to operational data. For example, sales data might be aggregated to weekly totals and converted from internal product codes to use UPCs so that they can be compared with ACNielsen data. Some basic and essential components of data warehousing include retrieving, analyzing, and mining data, transforming, loading and managing data so as to make them available for further use.
  • A document-oriented database is designed for storing, retrieving, and managing document-oriented, or semi structured data, information. Document-oriented databases are one of the main categories of NoSQL databases.
  • An embedded database system is a DBMS which is tightly integrated with an application software that requires access to stored data in such a way that the DBMS is hidden from the application’s end-users and requires little or no ongoing maintenance.[22]
  • End-user databases consist of data developed by individual end-users. Examples of these are collections of documents, spreadsheets, presentations, multimedia, and other files. Several products exist to support such databases. Some of them are much simpler than full fledged DBMSs, with more elementary DBMS functionality.
  • federated database system comprises several distinct databases, each with its own DBMS. It is handled as a single database by a federated database management system (FDBMS), which transparently integrates multiple autonomous DBMSs, possibly of different types (in which case it would also be a heterogeneous database system), and provides them with an integrated conceptual view.
  • Sometimes the term multi-database is used as a synonym to federated database, though it may refer to a less integrated (e.g., without an FDBMS and a managed integrated schema) group of databases that cooperate in a single application. In this case typicallymiddleware is used for distribution, which typically includes an atomic commit protocol (ACP), e.g., the two-phase commit protocol, to allow distributed (global) transactionsacross the participating databases.
  • In a hypertext or hypermedia database, any word or a piece of text representing an object, e.g., another piece of text, an article, a picture, or a film, can be hyperlinkedto that object. Hypertext databases are particularly useful for organizing large amounts of disparate information. For example, they are useful for organizing online encyclopedias, where users can conveniently jump around the text. The World Wide Web is thus a large distributed hypertext database.
  • knowledge base (abbreviated KBkb or Δ[23][24]) is a special kind of database forknowledge management, providing the means for the computerized collection, organization, and retrieval of knowledge. Also a collection of data representing problems with their solutions and related experiences.
  • Operational databases store detailed data about the operations of an organization. They typically process relatively high volumes of updates using transactions. Examples includecustomer databases that record contact, credit, and demographic information about a business’ customers, personnel databases that hold information such as salary, benefits, skills data about employees, enterprise resource planning systems that record details about product components, parts inventory, and financial databases that keep track of the organization’s money, accounting and financial dealings.
The major parallel DBMS architectures which are induced by the underlyinghardware architecture are:

  • Shared memory architecture, where multiple processors share the main memory space, as well as other data storage.
  • Shared disk architecture, where each processing unit (typically consisting of multiple processors) has its own main memory, but all units share the other storage.
  • Shared nothing architecture, where each processing unit has its own main memory and other storage.
  • Real-time databases process transactions fast enough for the result to come back and be acted on right away.
  • spatial database can store the data with multidimensional features. The queries on such data include location based queries, like “Where is the closest hotel in my area?”.
  • temporal database has built-in time aspects, for example a temporal data model and a temporal version of SQL. More specifically the temporal aspects usually include valid-time and transaction-time.
  • An unstructured data database is intended to store in a manageable and protected way diverse objects that do not fit naturally and conveniently in common databases. It may include email messages, documents, journals, multimedia objects, etc. The name may be misleading since some objects can be highly structured. However, the entire possible object collection does not fit into a predefined structured framework. Most established DBMSs now support unstructured data in various ways, and new dedicated DBMSs are emerging.

Design and modeling

Main article: Database design

The first task of a database designer is to produce a conceptual data model that reflects the structure of the information to be held in the database. A common approach to this is to develop an entity-relationship model, often with the aid of drawing tools. Another popular approach is the Unified Modeling Language. A successful data model will accurately reflect the possible state of the external world being modeled: for example, if people can have more than one phone number, it will allow this information to be captured. Designing a good conceptual data model requires a good understanding of the application domain; it typically involves asking deep questions about the things of interest to an organisation, like “can a customer also be a supplier?”, or “if a product is sold with two different forms of packaging, are those the same product or different products?”, or “if a plane flies from New York to Dubai via Frankfurt, is that one flight or two (or maybe even three)?”. The answers to these questions establish definitions of the terminology used for entities (customers, products, flights, flight segments) and their relationships and attributes.

Producing the conceptual data model sometimes involves input from business processes, or the analysis of workflow in the organization. This can help to establish what information is needed in the database, and what can be left out. For example, it can help when deciding whether the database needs to hold historic data as well as current data.

Having produced a conceptual data model that users are happy with, the next stage is to translate this into a schema that implements the relevant data structures within the database. This process is often called logical database design, and the output is a logical data modelexpressed in the form of a schema. Whereas the conceptual data model is (in theory at least) independent of the choice of database technology, the logical data model will be expressed in terms of a particular database model supported by the chosen DBMS. (The termsdata modeland database model are often used interchangeably, but in this article we use data model for the design of a specific database, and database model for the modelling notation used to express that design.)

The most popular database model for general-purpose databases is the relational model, or more precisely, the relational model as represented by the SQL language. The process of creating a logical database design using this model uses a methodical approach known as normalization. The goal of normalization is to ensure that each elementary “fact” is only recorded in one place, so that insertions, updates, and deletions automatically maintain consistency.

The final stage of database design is to make the decisions that affect performance, scalability, recovery, security, and the like. This is often called physical database design. A key goal during this stage is data independence, meaning that the decisions made for performance optimization purposes should be invisible to end-users and applications. Physical design is driven mainly by performance requirements, and requires a good knowledge of the expected workload and access patterns, and a deep understanding of the features offered by the chosen DBMS.

Another aspect of physical database design is security. It involves both defining access controlto database objects as well as defining security levels and methods for the data itself.

Models

Collage of five types of database models.

A database model is a type of data model that determines the logical structure of a database and fundamentally determines in which manner data can be stored, organized, and manipulated. The most popular example of a database model is the relational model (or the SQL approximation of relational), which uses a table-based format.

Common logical data models for databases include:

An object-relational database combines the two related structures.

Physical data models include:

Other models include:

External, conceptual, and internal views

A database management system provides three views of the database data:

  • The external level defines how each group of end-users sees the organization of data in the database. A single database can have any number of views at the external level.
  • The conceptual level unifies the various external views into a compatible global view.[26]It provides the synthesis of all the external views. It is out of the scope of the various database end-users, and is rather of interest to database application developers and database administrators.
  • The internal level (or physical level) is the internal organization of data inside a DBMS (see Implementation section below). It is concerned with cost, performance, scalability and other operational matters. It deals with storage layout of the data, using storage structures such as indexes to enhance performance. Occasionally it stores data of individual views (materialized views), computed from generic data, if performance justification exists for such redundancy. It balances all the external views’ performance requirements, possibly conflicting, in an attempt to optimize overall performance across all activities.

The three-level database architecture relates to the concept of data independence which was one of the major initial driving forces of the relational model. The idea is that changes made at a certain level do not affect the view at a higher level. For example, changes in the internal level do not affect application programs written using conceptual level interfaces, which reduces the impact of making physical changes to improve performance.

In practice usually a given DBMS uses the same data model for both the external and the conceptual levels (e.g., relational model). The internal level, which is hidden inside the DBMS and depends on its implementation (see Implementation section below), requires a different level of detail and uses its own types of data structure types.

Separating the externalconceptual and internal levels was a major feature of the relational database model implementations that dominate 21st century databases.[26]

Languages

Database languages are special-purpose languages, which do one or more of the following:

Database languages are specific to a particular data model. Notable examples include:

  • XQuery is a standard XML query language implemented by XML database systems such as MarkLogic and eXist, by relational databases with XML capability such as Oracle and DB2, and also by in-memory XML processors such as Saxon.

A database language may also incorporate features like:

  • DBMS-specific Configuration and storage engine management
  • Computations to modify query results, like counting, summing, averaging, sorting, grouping, and cross-referencing
  • Constraint enforcement (e.g. in an automotive database, only allowing one engine type per car)
  • Application programming interface version of the query language, for programmer convenience

Storage

Database storage is the container of the physical materialization of a database. It comprises the internal (physical) level in the database architecture. It also contains all the information needed (e.g., metadata, “data about the data”, and internal data structures) to reconstruct theconceptual level and external level from the internal level when needed. Putting data into permanent storage is generally the responsibility of the database engine a.k.a. “storage engine”. Though typically accessed by a DBMS through the underlying operating system (and often utilizing the operating systems’ file systems as intermediates for storage layout), storage properties and configuration setting are extremely important for the efficient operation of the DBMS, and thus are closely maintained by database administrators. A DBMS, while in operation, always has its database residing in several types of storage (e.g., memory and external storage). The database data and the additional needed information, possibly in very large amounts, are coded into bits. Data typically reside in the storage in structures that look completely different from the way the data look in the conceptual and external levels, but in ways that attempt to optimize (the best possible) these levels’ reconstruction when needed by users and programs, as well as for computing additional types of needed information from the data (e.g., when querying the database).

Some DBMSs support specifying which character encoding was used to store data, so multiple encodings can be used in the same database.

Various low-level database storage structures are used by the storage engine to serialize the data model so it can be written to the medium of choice. Techniques such as indexing may be used to improve performance. Conventional storage is row-oriented, but there are alsocolumn-oriented and correlation databases.

The acronym ACID describes some ideal properties of a database transaction:

Further information: Concurrency control

Migration

See also section Database migration in article Data migration

After the database is created, initialised and populated it needs to be maintained. Various database parameters may need changing and the database may need to be tuned (tuning) for better performance; application’s data structures may be changed or added, new related application programs may be written to add to the application’s functionality, etc.


Backup and restore

To achieve this a backup operation is done occasionally or continuously, where each desired database state (i.e., the values of its data and their embedding in database’s data structures) is kept within dedicated backup files (many techniques exist to do this effectively) (e.g., by specifying this state by a desired point in time when the database was in this state), these files are utilized to restore that state.

Other

Other DBMS features might include:

  • Database logs
  • Graphics component for producing graphs and charts, especially in a data warehouse system
  • Query optimizer – Performs query optimization on every query to choose for it the most efficient query plan (a partial order (tree) of operations) to be executed to compute the query result. May be specific to a particular storage engine.
  • Tools or hooks for database design, application programming, application program maintenance, database performance analysis and monitoring, database configuration monitoring, DBMS hardware configuration (a DBMS and related database may span computers, networks, and storage units) and related database mapping (especially for a distributed DBMS), storage allocation and database layout monitoring, storage migration, etc.

Alpha (programming language)

The Alpha language was the original database language proposed by Edgar F. Codd, the inventor of the relational database approach. It was defined in Codd’s 1971 paper “A Data Base Sublanguage Founded on the Relational Calculus”.[1] Alpha influenced the design of QUEL.[2] It was eventually supplanted by SQL (which is however based on the relational algebra defined by Codd in “Relational Completeness of Data Base Sublanguages”[3]), which IBM developed for its first commercial relational database product.

QUEL query languages

QUEL is a relational databasequery language, based on tuple relational calculus, with some similarities to SQL. It was created as a part of the IngresDBMS effort at University of California, Berkeley, based on Codd‘s earlier suggested but not implemented Data Sub-Language ALPHA. QUEL was used for a short time in most products based on the freely available Ingres source code, most notably PostgreSQL. As Oracle and DB2 gained market share in the early 1980s, most companies then supporting QUEL moved to SQL instead.[1] QUEL continues to be available as a part of the Ingres DBMS, although no QUEL-specific language enhancements have been added for many years[when?].

Usage

QUEL statements are always defined by tuple variables, which can be used to limit queries or return result sets. Consider this example, taken from one of the first original Ingres papers:[2]

Example 1.1. Compute salary divided by age-18 for employee Jones.

range of E is EMPLOYEE
retrieve into W
(COMP = E.Salary / (E.Age – 18))
where E.Name = “Jones”

Here E is a tuple variable which ranges over the EMPLOYEE relation, and all tuples in that relation are found which satisfy the qualification E.Name = “Jones.” The result of the query is a new relation W, which has a single domain COMP that has been calculated for each qualifying tuple.

An equivalent SQL statement is:

QUEL is generally more “normalized” than SQL.[citation needed] Whereas every major SQL command has a format that is at least somewhat different from the others, in QUEL a single syntax is used for all commands.[citation needed]

For instance, here is a sample of a simple session that creates a table, inserts a row into it, and then retrieves and modifies the data inside it and finally deletes the row that was added (assuming that name is a unique field).

Here is a similar set of SQL statements:

Note that syntax varies significantly between commands, and that even similar commands likeinsert and update use different styles.

Another feature of QUEL was a built-in system for moving records en-masse into and out of the system. Consider this command:

which creates a comma-delimited file of all the records in the student table. The d1 indicates a delimiter, as opposed to a data type. Changing the into to a fromreverses the process. Similar commands are available in many SQL systems, but usually as external tools, as opposed to being internal to the SQL language. This makes them unavailable to stored procedures.

QUEL has an extremely powerful aggregation capability. Aggregates can be nested, and different aggregates can have independent by-lists and/or restriction clauses. For example:

This example illustrates one of the arguably less desirable quirks of QUEL, namely that all string comparisons are potentially pattern matches. y.str = "ii*" matches all y.str values starting with ii.

SQL

SQL (/ˈɛs kjuː ˈɛl/,[4] or /ˈskwəl/Structured Query Language[5][6][7][8]) is a special-purpose programming languagedesigned for managing data held in a relational database management system (RDBMS).

Originally based upon relational algebra and tuple relational calculus, SQL consists of a data definition language and a data manipulation language. The scope of SQL includes data insert, query, update and deleteschema creation and modification, and data access control. Although SQL is often described as, and to a great extent is, a declarative language(4GL), it also includes procedural elements.

SQL was one of the first commercial languages for Edgar F. Codd‘s relational model, as described in his influential 1970 paper, “A Relational Model of Data for Large Shared Data Banks.”[9] Despite not entirely adhering to the relational model as described by Codd, it became the most widely used database language.[10][11]

SQL became a standard of the American National Standards Institute (ANSI) in 1986, and of theInternational Organization for Standardization (ISO) in 1987.[12] Since then, the standard has been enhanced several times with added features. Despite these standards, code is not completely portable among different database systems, which can lead to vendor lock-in. The different makers do not perfectly adhere to the standard, for instance by adding extensions, and the standard itself is sometimes ambiguous.

History

SQL was initially developed at IBM by Donald D. Chamberlin and Raymond F. Boyce in the early 1970s.[13] This version, initially called SEQUEL (Structured English Query Language), was designed to manipulate and retrieve data stored in IBM’s original quasi-relational database management system, System R, which a group at IBM San Jose Research Laboratory had developed during the 1970s.[13] The acronym SEQUEL was later changed to SQL because “SEQUEL” was a trademark of the UK-based Hawker Siddeley aircraft company.[14]

In the late 1970s, Relational Software, Inc. (now Oracle Corporation) saw the potential of the concepts described by Codd, Chamberlin, and Boyce and developed their own SQL-basedRDBMS with aspirations of selling it to the U.S. NavyCentral Intelligence Agency, and other U.S. government agencies. In June 1979, Relational Software, Inc. introduced the first commercially available implementation of SQL, Oracle V2 (Version2) for VAX computers.

After testing SQL at customer test sites to determine the usefulness and practicality of the system, IBM began developing commercial products based on their System R prototype including System/38, SQL/DS, and DB2, which were commercially available in 1979, 1981, and 1983, respectively.[15]

SQL is designed for a specific purpose: to query data contained in a relational database. SQL is a set-based, declarative query language, not an imperative language like Cor BASIC. However, there are extensions to Standard SQL which add procedural programming languagefunctionality, such as control-of-flow constructs. These include:

SOURCE COMMON NAME FULL NAME
ANSI/ISO Standard SQL/PSM SQL/Persistent Stored Modules
Interbase /Firebird PSQL Procedural SQL
IBM DB2 SQL PL SQL Procedural Language (implements SQL/PSM)
IBM Informix SPL Stored Procedural Language
IBM Netezza NZPLSQL[1] (based on Postgres PL/pgSQL)
Microsoft / Sybase T-SQL Transact-SQL
Mimer SQL SQL/PSM SQL/Persistent Stored Module (implements SQL/PSM)
MySQL SQL/PSM SQL/Persistent Stored Module (implements SQL/PSM)
Oracle PL/SQL Procedural Language/SQL (based on Ada)
PostgreSQL PL/pgSQL Procedural Language/PostgreSQL (based on Oracle PL/SQL)
PostgreSQL PL/PSM Procedural Language/Persistent Stored Modules (implements SQL/PSM)
Sybase Watcom-SQL SQL Anywhere Watcom-SQL Dialect
Teradata SPL Stored Procedural Language

In addition to the standard SQL/PSM extensions and proprietary SQL extensions, procedural and object-oriented programmability is available on many SQL platforms via DBMS integration with other languages. The SQL standard defines SQL/JRT extensions (SQL Routines and Types for the Java Programming Language) to support Javacode in SQL databases. SQL Server 2005uses the SQLCLR (SQL Server Common Language Runtime) to host managed .NETassemblies in the database, while prior versions of SQL Server were restricted to using unmanaged extended stored procedures that were primarily written in C. PostgreSQL allows functions to be written in a wide variety of languages including PerlPythonTcl, and C.[27]

Criticism[edit]

SQL deviates in several ways from its theoretical foundation, the relational model and its tuple calculus. In that model, a table is a set of tuples, while in SQL, tables and query results arelistsof rows: the same row may occur multiple times, and the order of rows can be employed in queries (e.g. in the LIMIT clause). Furthermore, additional features (such as NULL and views) were introduced without founding them directly on the relational model, which makes them more difficult to interpret.

Critics argue that SQL should be replaced with a language that strictly returns to the original foundation: for example, see The Third Manifesto. Other critics suggest that Datalog has two advantages over SQL: it has a cleaner semantics which facilitates program understanding and maintenance, and it is more expressive, in particular for recursive queries.[28]

Another criticism is that SQL implementations are incompatible between vendors. In particular date and time syntax, string concatenation, NULLs, and comparison case sensitivity vary from vendor to vendor. A particular exception is PostgreSQL, which strives for compliance.[29]

Popular implementations of SQL commonly omit support for basic features of Standard SQL, such as the DATE or TIME data types. The most obvious such examples, and incidentally the most popular commercial and proprietary SQL DBMSs, are Oracle (whose DATE behaves asDATETIME,[30][31] and lacks a TIME type)[32] and MS SQL Server (before the 2008 version). As a result, SQL code can rarely be ported between database systems without modifications.

There are several reasons for this lack of portability between database systems:

  • The complexity and size of the SQL standard means that most implementors do not support the entire standard.
  • The standard does not specify database behavior in several important areas (e.g. indexes, file storage…), leaving implementations to decide how to behave.
  • The SQL standard precisely specifies the syntax that a conforming database system must implement. However, the standard’s specification of the semantics of language constructs is less well-defined, leading to ambiguity.
  • Many database vendors have large existing customer bases; where the SQL standard conflicts with the prior behavior of the vendor’s database, the vendor may be unwilling to break backward compatibility.
  • There is little commercial incentive for vendors to make it easier for users to change database suppliers (see vendor lock-in).
  • Users evaluating database software tend to place other factors such as performance higher in their priorities than standards conformance.

Standardization

SQL was adopted as a standard by the American National Standards Institute (ANSI) in 1986 as SQL-86[33] and the International Organization for Standardization (ISO) in 1987. Nowadays the standard is subject to continuous improvement by the Joint Technical Committee ISO/IEC JTC 1, Information technology, Subcommittee SC 32, Data management and interchange which affiliate to ISO as well as IEC. It is commonly denoted by the pattern: ISO/IEC 9075-n:yyyy Part n: title, or, as a shortcut,ISO/IEC 9075.

ISO/IEC 9075 is complemented by ISO/IEC 13249: SQL Multimedia and Application Packages(SQL/MM) which defines SQL based interfaces and packages to widely spread applications like video, audio and spatial data.

Until 1996, the National Institute of Standards and Technology (NIST) data management standards program certified SQL DBMS compliance with the SQL standard. Vendors now self-certify the compliance of their products.[34]

The original standard declared the official pronunciation for “SQL” to be as an initialism/ˈɛskjuːˈɛl/ (“es queue el”).[10] Regardless, many English-speaking database professionals (includingDonald Chamberlin himself[35]) use the acronym-like pronunciation of /ˈskwəl/(“sequel”),[36]mirroring the language’s pre-release development name of “SEQUEL”.[14][13]

The SQL standard has gone through a number of revisions:

YEAR NAME ALIAS COMMENTS
1986 SQL-86 SQL-87 First formalized by ANSI.
1989 SQL-89 FIPS127-1 Minor revision, in which the major addition were integrity constraints. Adopted as FIPS 127-1.
1992 SQL-92 SQL2, FIPS 127-2 Major revision (ISO 9075), Entry Level SQL-92 adopted as FIPS 127-2.
1999 SQL:1999 SQL3 Added regular expression matching, recursive queries (e.g. transitive closure),triggers, support for procedural and control-of-flow statements, non-scalar types, and some object-oriented features (e.g. structured types). Support for embedding SQL in Java (SQL/OLB) and vice-versa (SQL/JRT).
2003 SQL:2003 SQL 2003 Introduced XML-related features (SQL/XML), window functions, standardized sequences, and columns with auto-generated values (including identity-columns).
2006 SQL:2006 SQL 2006 ISO/IEC 9075-14:2006 defines ways in which SQL can be used in conjunction with XML. It defines ways of importing and storing XML data in an SQL database, manipulating it within the database and publishing both XML and conventional SQL-data in XML form. In addition, it enables applications to integrate into their SQL code the use of XQuery, the XML Query Language published by the World Wide Web Consortium (W3C), to concurrently access ordinary SQL-data and XML documents.[37]
2008 SQL:2008 SQL 2008 Legalizes ORDER BY outside cursor definitions. Adds INSTEAD OF triggers. Adds the TRUNCATE statement.[38]
2011 SQL:2011

Interested parties may purchase SQL standards documents from ISO, IEC or ANSI. A draft of SQL:2008 is freely available as a zip archive.[39]

The SQL standard is divided into nine parts.

  • ISO/IEC 9075-1:2011 Part 1: Framework (SQL/Framework). It provides logical concepts.
  • ISO/IEC 9075-2:2011 Part 2: Foundation (SQL/Foundation). It contains the most central elements of the language and consists of both mandatory and optionalfeatures.
  • ISO/IEC 9075-3:2008 Part 3: Call-Level Interface (SQL/CLI). It defines interfacing components (structures, procedures, variable bindings) that can be used to execute SQL statements from applications written in Ada, C respectively C++, COBOL, Fortran, MUMPS, Pascal or PL/I. (For Java see part 10.) SQL/CLI is defined in such a way that SQL statements and SQL/CLI procedure calls are treated as separate from the calling application’s source code. Open Database Connectivity is a well-known superset of SQL/CLI. This part of the standard consists solely of mandatory features.
  • ISO/IEC 9075-4:2011 Part 4: Persistent Stored Modules (SQL/PSM) It standardizes procedural extensions for SQL, including flow of control, condition handling, statement condition signals and resignals, cursors and local variables, and assignment of expressions to variables and parameters. In addition, SQL/PSM formalizes declaration and maintenance of persistent database language routines (e.g., “stored procedures”). This part of the standard consists solely of optional features.
  • ISO/IEC 9075-9:2008 Part 9: Management of External Data (SQL/MED). It provides extensions to SQL that define foreign-data wrappers and datalink types to allow SQL to manage external data. External data is data that is accessible to, but not managed by, an SQL-based DBMS. This part of the standard consists solely ofoptional features.
  • ISO/IEC 9075-10:2008 Part 10: Object Language Bindings (SQL/OLB). It defines the syntax and semantics of SQLJ, which is SQL embedded in Java (see also part 3). The standard also describes mechanisms to ensure binary portability of SQLJ applications, and specifies various Java packages and their contained classes. This part of the standard consists solely of optional features, as opposed to SQL/OLB JDBC, which is not part of the SQL standard, which defines an API.[citation needed]
  • ISO/IEC 9075-11:2011 Part 11: Information and Definition Schemas (SQL/Schemata). It defines the Information Schema and Definition Schema, providing a common set of tools to make SQL databases and objects self-describing. These tools include the SQL object identifier, structure and integrity constraints, security and authorization specifications, features and packages of ISO/IEC 9075, support of features provided by SQL-based DBMS implementations, SQL-based DBMS implementation information and sizing items, and the values supported by the DBMS implementations.[40] This part of the standard contains both mandatory and optional features.
  • ISO/IEC 9075-13:2008 Part 13: SQL Routines and Types Using the Java Programming Language (SQL/JRT). It specifies the ability to invoke static Java methods as routines from within SQL applications (‘Java-in-the-database’). It also calls for the ability to use Java classes as SQL structured user-defined types. This part of the standard consists solely ofoptional features.
  • ISO/IEC 9075-14:2011 Part 14: XML-Related Specifications (SQL/XML). It specifies SQL-based extensions for using XML in conjunction with SQL. The XML data type is introduced, as well as several routines, functions, and XML-to-SQL data type mappings to support manipulation and storage of XML in an SQL database.[37] This part of the standard consists solely of optional features.[citation needed]

ISO/IEC 9075 is complemented by ISO/IEC 13249 SQL Multimedia and Application Packages. This closely related but separate standard is developed by the same committee. It defines interfaces and packages which are based on SQL. The aim is a unified access to typical database applications like text, pictures, data mining orspatial data.

  • ISO/IEC 13249-1:2007 Part 1: Framework
  • ISO/IEC 13249-2:2003 Part 2: Full-Text
  • ISO/IEC 13249-3:2011 Part 3: Spatial
  • ISO/IEC 13249-5:2003 Part 5: Still image
  • ISO/IEC 13249-6:2006 Part 6: Data mining
  • ISO/IEC 13249-8:xxxx Part 8: Metadata registries (MDR) (work in progress)

Alternatives

A distinction should be made between alternatives to SQL as a language, and alternatives to the relational model itself. Below are proposed relational alternatives to the SQL language. Seenavigational database and NoSQL for alternatives to the relational model.

D (data language specification)

From Wikipedia, the free encyclopedia

D is a set of prescriptions for what Christopher J. Date and Hugh Darwen believe a relational database management system ought to be like. It is proposed in their paperThe Third Manifesto, first published in 1994 and elaborated on in several books since then.

Overview[edit]

D by itself is an abstract language specification. It does not specify language syntax. Instead, it specifies desirable and undesirable language characteristics in terms of prescriptions and proscriptions. Thus, D is not a language but a family of both implemented and future languages. A “valid D” must have a certain set of features, and exclude a different set of features which Date and Darwen consider unwise and contrary to the relational modelproposed by E. F. Codd in 1970. A valid D may have additional features which are outside the scope of relational databases.

Tutorial D[edit]

Tutorial D is a specific D which is defined and used for illustration in The Third Manifesto. Implementations of D need not have the same syntax as Tutorial D. The purpose of Tutorial D is both educational and to show what a D might be like. Rel is an implementation of Tutorial D.

Implementations[edit]

D’s first implementation is D4, written in C#. D4 is the flagship language of Alphora‘s Dataphor. Others include Rel (see above), OpusDuro, and Dee.

Dataphor

Dataphor is an open-source truly-relational database management system (RDBMS) and its accompanying user interface technologies, which together are designed to provide highly declarative software application development. The Dataphor Server has its own storage engine or it can be a virtual, or federated, DBMS, meaning that it can utilize other database engines for storage.

Dataphor has been praised for its adherence to relational principles, more closely so than any SQL product.[1]

Overview

The stated purpose of Dataphor is to attempt to raise the bar of automation when building and maintaining complex software applications. Originally referred to as a framework, Dataphor provides more of a software development platform, complete with its own programming and user interface paradigms.

Dataphor is broadly divided into two components: the Dataphor Server, and the Dataphor Frontend. The purpose of the Dataphor Server is to provide a standardized language and runtime for the definition, manipulation, and integrity of application data. The Frontend is concerned with the dynamic derivation of user interfaces and the presentation thereof in either the Windows or Web thin client.

Dataphor does not employ SQL as its primary database language since SQL purportedly violates important principles of the relational model. Dataphor’s D4 language is based on the principles of Christopher J Date‘s and Hugh Darwen‘s Tutorial D, but with a Pascal-like imperative syntax.

Though Dataphor espouses to be truly relational, it does incorporate the concept of NULLs as found in SQL, which many claim to be contraindicated by the Relational Model. NULLs and the matter of managing missing information, however, continue to be debated.

In addition to the data management focus of the Dataphor Server, Dataphor includes tools which allow the presentation of user interfaces through Windows and Web “thin” clients. Dataphor takes advantage of the relational inference capabilities of the Dataphor compiler in order to allow complete GUI forms to be derived directly from the data model. The unique aspect of Dataphor’s user interface “derivation” is that it may be based on any relational expression (query) rather than merely base tables.

Truly Relational

Dataphor strives for theoretical compliance to relational principles. While they try to adhere to the principles in The Third Manifesto, they deviated in a few places from what the Third Manifesto strived for, but not in places that were violations of Codd’s 12 rules. E.g. they included nulls, but they claim to have a systematic treatment of them.[2]

While many systems built on SQL fail miserably with respect to Codd’s rule 9 “Logical data independence”, Dataphor applications can automatically change when the logical layer change. E.g. when a new column is added to the system, no additional development is needed to have that be a new field visible to the users for viewing or editing.

Expert opinions on Dataphor

Hugh Darwen has referred to D4, as a notable project in his talk entitled The Askew Wall.[3] Chris Date refers to Dataphor as a product that attempts to implement the Third Manifesto.[4]Fabian Pascal calls Dataphor “Truly Relational”,[5] and “superior to SQL”[1]

History

In 1999, point of sale systems developer Softwise Inc, found they were writing much of the same code over and over again, and looked for a tool to automate their database applications. They didn’t find an application which did what they want, so they created a division of their company, called it Alphora, and set some of their developers to build such a tool. That tool became Dataphor. It is said to be the first truly relational DBMS since IBM Business System 12. Development of Dataphor began shortly before 2000, with a 1.0 release in 2001.

In early 2008, the Alphora name and the Dataphor product were acquired by Database Consulting Group, which was founded by the original architects of Dataphor, who left Softwise in 2007. After the acquisition, Dataphor was re-licensed as open source under the BSD license.

Technology

Dataphor utilizes the Microsoft .NET Framework and is written entirely in C#. The following is a summary of the various technology components of Dataphor:

Dataphor Server

The Dataphor Server has several components including:

  • Call-level interface – session management, process scheduler, etc.
  • Data Dictionary Catalog – containing all of the Tables, Views, Operators, Constraints, References, and other schema objects.
  • D4 Scanner, Parser, Emitter, and Compiler.
  • D4 Runtime – including relational, and scalar processing
  • Storage Integration layer – real-time translation to various dialects of SQL

Languages

While Dataphor supports a SQL flavor they call “RealSQL”,[6] D4 is the preferred language for use within Dataphor, D4 supports DDL and DML statements. D4 queries tend to look likeRelational Algebra expressions with written out names of operators. For example:

SQL STATEMENT EQUIVALENT D4 STATEMENT
SELECT * FROM USER select User
SELECT * FROM USER JOIN Department ON USER.DepartmentId = Department.DepartmentId select User natural joinDepartment
SELECT Name FROM USER select User restrict (Name)
Syntax

D4 has a Pascal-like syntax. D4 sample code is usually written in UpperCamelCase, which is also widely used in Pascal and Delphi systems.

Like most query languages, D4 has a Data Definition Language (DDL) and a Data Manipulation Language (DML). D4 also has an Imperative Language for procedural code.

Data Definition Language

The DDL for Dataphor bears many similarities to other DBMSs, but with an obviously Pascal-like twist. Many of the allowed DDL operations, like constraints, allow relational declarative statements to be used, which many believe is superior to the procedural style operations used in SQL.

Data Manipulation Language

The DML syntax at first glance may appear to be similar to SQLs syntax, but because of D4’s closer ties to relational algebra, the syntax has a cleaner definition, and most users greatly prefer it over SQL.[citation needed]

Imperative language

The Imperative Language in D4 is remarkably similar to Pascal in many respects. The largest distinction being that D4 also allows DDL and DML statements to be run in regular procedural code.

History

D4 was named after the similar sounding Dataphor, the system that uses the language. It was some time after these names were decided that it’s creators discoveredTutorial D, and the coincidence it had with that name. Since discovering Tutorial D and The Third Manifesto, the creators have used The Third Manifesto as a guide in making Dataphor and D4. Since then,Hugh Darwen has referred to D4, as a notable project in his talk entitled The Askew Wall.[3]

Federated Storage Engine

While Dataphor has a storage engine of its own, it can also connect to other RDBMSes, and use them as a storage engine. Dataphor can use the following DBMSes as storage engines:

Dataphor can access Oracle, DB2, SQL Server, Postgres, MySQL and any other storage engine with a single unified language.[8]

Frontend Library

The Dataphor Frontend library provides for the delivery of dynamically derived, or pre-designed static forms. The library is exposed as a standard set of D4 functions (called operators in D4) such as Form(‘<library>’, ‘<name>’) and Derive(‘<D4 expession>’, ‘<form type>’). The resulting forms are described in an XML dialect called a Dataphor Form Document (DFD). The form description is high-level, consisting of a general description of the user interface aspects as they apply independent of client platform.

Dataphoria IDE

Dataphoria is a development environment for:

  • Editing D4
  • Ad hoc execution of D4
  • Creating, Editing, and Customizing (inherited) forms
  • Managing libraries
  • Analyzing execution plans

Windows Client

The Dataphor Windows Client is a thin client in the sense that it is not pre-programmed for a particular application. The Windows client establishes a connection to a Dataphor Server, from which it (through D4) requests form definitions and coordinates the manipulation of application data. The DFD documents are interpreted into concrete Windows Forms controls, but while maintaining the conceptual DOM of the DFD.

Web Client

The Dataphor Web Client is a basic implementation of a Dataphor client, which is manifest as an ASP.NET web application. Like the Windows Client, the Web Client connects to and requests forms and data from and instance of the Dataphor Server. Rather than synchronizing a DFD to Windows controls, however, the Web Client renders HTML which is displayed in a browser. In this way, the Web “Client” is a client relative to the Dataphor Server, but a server relative to the end web browser.

MUMPS (software)

For the programming language of the same name, see MUMPS. For other uses of the word MUMPS, see Mumps (disambiguation).
MUMPS
STABLE RELEASE 4.10.0 / 2011-05-10
WRITTEN IN C/Fortran 90
OPERATING SYSTEM Most UNIX and Windows (through WinMUMPS)
PLATFORM PC, Mac, NEC, SGI, IBM, etc.
AVAILABLE IN C/Fortran 77/Fortran 90
LICENSE Public domain
WEBSITE http://graal.ens-lyon.fr/MUMPS/ andhttp://mumps.enseeiht.fr

MUMPS (MUltifrontal Massively Parallel sparse direct Solver) is a software application for the solution of large sparsesystems of linear algebraic equations on distributed memory parallel computers. It was developed in European projectPARASOL (1996–1999) by CERFACSIRITENSEEIHT and RAL. The software implements the multifrontal method, which is a version of Gaussian elimination for large sparse systems of equations, especially those arising from the finite element method. It is written in Fortran 90 with parallelism by MPI and it uses BLASandScaLAPACK kernels for dense matrix computations. Since 1999, MUMPS has been supported by CERFACS, IRITENSEEIHT, and INRIA.

The importance of MUMPS lies in the fact that it is a supported public domain implementation of the multifrontal method.

The System/38 also has the distinction of being the first commercially available IBM Midrange computer to have a RDBMS integrated into the operating system.

Oracle Corporation

Oracle Corporation is an American multinational computer technology corporation headquartered in Redwood City, California, United States. The company specializes in developing and marketing computer hardware systems andenterprise software products – particularly its own brands of database management systems. Oracle is the second-largest software maker by revenue, after Microsoft.[3]

The company also builds tools for database development and systems of middle-tier software,enterprise resource planning (ERP) software, customer relationship management (CRM) software and supply chain management (SCM) software.

Larry Ellison, a co-founder of Oracle, has served as Oracle’s CEO throughout its history. He also served as the Chairman of the Board until his replacement by Jeffrey O. Henley in 2004. On August 22, 2008, the Associated Press ranked Ellison as the top-paid chief executive in the world.[4][5]

Ellison took inspiration[6] from the 1970 paper written by Edgar F. Codd on relational database management systems (RDBMS) named “A Relational Model of Data for Large Shared Data Banks.”[7] He heard about the IBM System R database from an article in the IBM Research Journal provided by Ed Oates (a future co-founder of Oracle). System R also derived from Codd’s theories, and Ellison wanted to make Oracle’s product compatible with System R, but IBM stopped this by keeping the error codes for their DBMS secret. Ellison co-founded Oracle Corporation in 1977 with Bob Miner and Ed Oates under the name Software Development Laboratories (SDL). In 1979 SDL changed its name to Relational Software, Inc. (RSI).[8] In 1982, RSI renamed itself Oracle Systems Corporation [9] to align itself more closely with its flagship product Oracle Database. At this stage Bob Miner served as the company’s senior programmer. In 1995, Oracle Systems Corporation changed its name to Oracle Corporation.[10] The company is officially named Oracle, but sometimes referred to as Oracle Corporation, which is in fact the name of the holding company.[11]

Part of Oracle Corporation’s early success arose from using the C programming language to implement its products. This eased porting to different operating systems(most of which support C).

Technology timeline

  • 1979: offers the first commercial SQL RDBMS[41]
  • 1983: offers a VAX-mode database
  • 1984: offers the first database with read-consistency
  • 1986: offers a client-server DBMS
  • 1987: introduces UNIX-based Oracle applications
  • 1988: introduces PL/SQL
  • 1992: offers full applications implementation methodology
  • 1995: offers the first 64-bit RDBMS
  • 1996: moves towards an open standards-based, web-enabled architecture
  • 1999: offers its first DBMS with XML support
  • 2001: becomes the first to complete 3 terabyte TPC-H world record
  • 2002: offers the first database to pass 15 industry standard security evaluations
  • 2003: introduces what it calls “Enterprise Grid Computing” with Oracle10g
  • 2005: releases its first free database, Oracle Database 10g Express Edition (XE)
  • 2008: Smart scans in software improve query-response in HP Oracle Database Machine / Exadata storage.
  • 2013: Now they are using Oracle 12C which is compatible of providing cloud services with Oracle Database.

Manageability claim

Oracle stated in its product announcements that manageability for DBAs had improved from Oracle9i to 10g. Lungu & Vatuiu[42] assessed the relative manageability by performing common DBA tasks and measuring the time. They performed their tests on a single Pentium CPU (1.7 GHz) with 512 MB RAM,running Windows Server 2000.Summarizing the results from Oracle9i to 10g: Installation improved 36%,Day-to-day Administration 63%, Backup and Recovery 63%, and Performance Diagnostics and Tuning 74% for a weighted Total of 56%. The researcher’s conclusion is that “Oracle10g represents a giant step forward from Oracle9i in making the database easier to use and manage.”

Products and services

Oracle designs, manufactures and sells both software and hardware products, as well as offers services complementing them (such as financing, training, consulting and hosting services). Many of the products have been added to Oracle’s portfolio through acquisitions.

  • Oracle Database

In 2004 Oracle Corporation shipped release 10g (g standing for “grid”) as the then latest version of Oracle Database. (Oracle Application Server 10g using Java EEintegrates with the server part of that version of the database, making it possible to deploy web-technology applications. The application server comprises the first middle-tier software designed for grid computing.[citation needed] The interrelationship between Oracle 10g and Java allows developers to set up stored procedures written in the Java language, as well as those written in the traditional Oracle database programming language, PL/SQL.) – Release 11g became the current Oracle Database version in 2007. Oracle Database 11g Release 2 is the database version currently available, available since September 2009. This version is available in four commercial editions – Enterprise Edition, Standard Edition, Standard Edition One, Personal Edition – and one free edition – the Express Edition. The licensing of these editions shows various restrictions and obligations and is complex.[43] The Enterprise Edition (DB EE), as it is the most expensive of the Database Editions, has the least restrictions – but nevertheless has a complex licensing. The Standard Edition (DB SE) and Standard Edition One (SE1), are constrained by more licensing restrictions, which reflects their lower price. Release 12c has been made available on the first of July, 2013.[44]

The following are additional database technologies that have been acquired and developed by the Oracle Corporation:

  • TimesTen features in-memory database operations.
  • Oracle NoSQL Database, a scalable, distributed key-value NoSQL database[45]

Middleware

Oracle Fusion Middleware is a family of middleware software products, including for instanceapplication serversystem integrationbusiness process management(BPM), user interaction,content managementidentity management and business intelligence (BI) products.

Oracle Secure Enterprise Search

Oracle Secure Enterprise Search (SES), Oracle’s enterprise-search offering, gives users the ability to search for content across multiple locations, including websites, file servers, content management systems, enterprise resource planning systems, customer relationship management systems, business intelligence systems, and databases.

Oracle Beehive

Released in 2008, the Oracle Beehive collaboration software provides team workspaces (including wikis, team calendaring and file sharing), email, calendar, instant messaging, and conferencing on a single platform. Customers can use Beehive as licensed software or assoftware as a service (“SaaS”).[46]

Applications

Oracle also sells a suite of business applications. The Oracle E-Business Suite includes software to perform various enterprise functions related to, for instance, financials, manufacturing,customer relationship management (CRM), enterprise resource planning (ERP) and human resource management. The Oracle Retail Suite covers the retail industry vertical providing merchandise management, price management, invoice matching, allocations, store operations management, warehouse management, demand forecasting, merchandise financial planning, assortment planning and category management. Users can access these facilities through a browser interface over the Internet or via a corporate intranet.

Following a number of high-value acquisitions beginning in 2003, especially in the area of applications, Oracle Corporation currently maintains a number of product lines:

  • Oracle Fusion Applications
  • Oracle Social Engagement and Monitoring System

-Oracle has developed a Social Engagement and Monitoring Cloud service that allows businesses to capture relevant brand conversation from global web and social channels to understand what is being said about their product. The Social Engagement and Monitoring cloud provides the most effective and efficient responses across social and customer experience channels. SEM is able to route correct responses to the right team, member, or customer experience channel to ensure the best customer service. The analysis helps to understand what is important to a business’s customers. It identifies trends, spikes, and anomalies to make real time course corrections. It also can identify brand advocates. The SEM cloud identifies customer intention and interests by analyzing the common ways customers talk about a product or a service.[47]

Development software

Oracle Corporation’s tools for developing applications include (amongst others):

Many external and third-party tools make the Oracle database administrator‘s tasks easier.

Operating systems

Oracle develops two operating systems: Oracle Solaris and Oracle Linux.

Hardware

…….

During the 1970s, after a brief stint at Amdahl Corporation, Ellison began working for Ampex Corporation. His projects included a database for the CIA, which he named “Oracle”. Ellison was inspired by a paper written by Edgar F. Codd on relational database systems called “A Relational Model of Data for Large Shared Data Banks”.[10] In 1977, he founded Software Development Laboratories (SDL) with two partners and an investment of $2,000; $1,200 of the money was his.

In 1979 the company renamed itself Relational Software Inc., and in 1982 officially became Oracle Systems Corporation after its flagship product, the Oracle database. Ellison had heard about the IBM System R database, also based on Codd’s theories, and wanted Oracle to achieve compatibility with it, but IBM made this impossible by refusing to share System R’s code. The initial release of Oracle[when?] was called Oracle 2; there was no Oracle 1.[citation needed]

In 1990, Oracle laid off 10% of its work force (about 400 people) because it was losing money. This crisis, which almost resulted in Oracle’s bankruptcy, came about because of Oracle’s “up-front” marketing strategy, in which sales people urged potential customers to buy the largest possible amount of software all at once. The sales people then booked the value of future license sales in the current quarter, thereby increasing their bonuses. This became a problem when the future sales subsequently failed to materialize. Oracle eventually had to restate its earnings twice, and had to settle class-action lawsuits arising from its having overstated its earnings. Ellison would later say that Oracle had made “an incredible business mistake”.[citation needed]

Although IBM dominated the mainframe relational database market with its DB2 and SQL/DS database products, it delayed entering the market for a relational database onUNIX and Windows operating-systems. This left the door open for Sybase, Oracle, Informix, and eventually Microsoft to dominate mid-range systems and microcomputers.

Around this time, Oracle fell behind Sybase. From 1990 to 1993, Sybase was the fastest growing database company and the database industry’s darling vendor, but soon fell victim to merger mania. Sybase’s 1996 merger with Powersoft resulted in a loss of focus on its core database technology. In 1993, Sybase sold the rights to its database software running under the Windows operating system to Microsoft Corporation, which now markets it under the name “SQL Server”.

In 1994, Informix overtook Sybase and became Oracle’s most important rival. The intense war between Informix CEO Phil White and Ellison was front page Silicon Valley news for three years. In April 1997, Informix announced a major revenue shortfall and earnings restatements. Phil White eventually landed in jail, and IBM absorbed Informix in 2001. Also in 1997, Ellison was made a director of Apple Computer after Steve Jobs returned to the company. Ellison resigned in 2002, saying “my schedule does not currently allow me to attend enough of the formal board meetings to warrant a role as a director”.[11]

With the defeat of Informix and of Sybase, Oracle enjoyed years of industry dominance until the rise of Microsoft SQL Server in the late 1990s and IBM’s acquisition of Informix Software in 2001 to complement their DB2 database. As of 2013 Oracle’s main competition for new database licenses on UNIX, Linux, and Windows operating systems comes from IBM’s DB2 and from Microsoft SQL Server, which only runs on Windows. IBM’s DB2 still dominates the mainframe database market.

Sybase

Sybase, an SAP company, is an enterprise software and services company offering software to manage, analyze, and mobilize information, using relational databases, analytics and data warehousing solutions and mobile applications development platforms.

History

Sybase was founded in 1984 by Mark Hoffman, Bob Epstein, Jane Doughty and Tom Haggin in Epstein’s home in Berkeley, California. Together, they set out to create a relational database management system (RDBMS), which would organize information and make it available to many computers in a network.

In late 1986, Sybase shipped its first test programs, and in May 1987 formally released the SYBASE system, the first high-performance RDBMS for online applications. Rather than having a vast central bank of data stored in a large mainframe computer, the SYBASE System provided for a client/server computer architecture. Sybase was the first to market with a client/server relational database, providing the Human Genome Project with licenses for the first generation of client/server relational databases.

At the time, Sybase called the database server “Sybase SQL Server” and made a deal with Microsoft to share the source code for Microsoft to remarket on the OS/2 platform as “SQL Server”. Sybase, Microsoft and Ashton-Tate formed a consortium to port the SQL Server onto OS/2. When Microsoft left the consortium, the terms of the agreement gave Microsoft a license to Sybase’s SQL Server code. The Sybase SQL Server version 4.9, and Microsoft SQL Server 6.0 and 6.5, were virtually identical. Due to disagreements between the two companies over revenue sharing (or lack thereof), Sybase and Microsoft decided to split the code-lines and went their own way, although the shared heritage is very evident in the Transact-SQL (T-SQL) procedural language as well as the basic process architecture. The big difference is that Sybase has a Unix heritage, while Microsoft SQL Server was adapted and optimized only for the Microsoft Windows NT operating system. Sybase continues to offer versions for Windows, several varieties of Unix, and for Linux.

In October 1989, Sybase released additional products, introducing the SYBASE Open Client/Server Interfaces—new software programs that provided generic client/server communication, allowing for greater connectivity within computer networks. With these new offerings, and its earlier system, Sybase achieved sales of $56 million in 1989. Two years later, in August 1991, Sybase made its initial public offering of stock.

In June 1992, Sybase announced its latest generation of software. Dubbed the System 10 product family, these programs were designed to provide a framework for companies to switch over their computer operations from older mainframe models to client/server systems. In April 1993, Sybase introduced the first component of System 10, called OmniSQL Gateway. This program connected the various parts of a computer network, enabling users at any point to gain access to changes being made anywhere on the system. Later that year, Sybase completed its rollout of the System 10 components, which included SQL Server 10 and Back-up Server; Open Client/Server APIs; and SQL Monitor and SA Companion, which were used to manage computer systems.

In 1994, Sybase acquired Powersoft, the leading maker of development tools for client-server computing, with 40 percent of that market. Through the deal, Sybase acquired PowerBuilder, a rapid application development (RAD) tool and Powersoft’s leading product. The acquisition also marked the basis of Sybase’s entry into the enterprise mobility market with Watcom SQL, which Sybase renamed SQL Anywhere. When Sybase launched its mobility subsidiary, Sybase iAnywhere, in 2000, SQL Anywhere became its flagship relational database management system (RDBMS) and helped the company to become the leader of the mobile database market.

In January, 1998, Sybase announced that the financial results for the company in the last three quarters of 1997 would have to be restated, as it found inconsistencies in profits reporting from its Japanese division.[2] Five executives in Sybase’s Japanese subsidiary were found to have used side letters to artificially inflate the profits from their operations. Following a class-action lawsuit,[3] the five executives involved were fired.

Following a downturn in the late 1990s, Sybase returned to profitability under the management of John Chen in 2000, has maintained profitability since then and continues to reinvent itself with a new ‘Unwired Enterprise’ strategy. The ‘Unwired Enterprise’ vision is about allowing companies to deliver data to mobile devices in the field as well as traditional desktops, and combines technology from Sybase’s existing data management products with its new mobility products. Sybase has expanded into the mobile space through a series of acquisitions of enterprise and mobile software companies. In 2006, Sybase completed the acquisition of Mobile 365, later renamed Sybase 365, allowed Sybase to enter the mobile messaging and mobile commerce market. Sybase has maintained a strong foothold in its data management products. It makes a number of data management products including Adaptive Server EnterpriseSybase IQ, a data analytics warehouse system, and Replication Server, a vendor-neutral data movement system that helps address ever-growing data distribution and management requirements. Sybase has a strong presence in the financial services,[4]telecommunications, technology and government markets.[5]

Sybase now works with other industry leaders in infrastructure, data storage and virtualization to optimize technologies for delivery into public and virtual private cloudenvironments that provide greater technology availability and flexibility to Sybase customers looking to unwire their enterprise.

Sybase crossed the $1 billion mark in 2007.[6]

In May 2008, the Sybase IQ analytics server set a new Guinness World Record by powering the world’s largest data warehouse.[7] In 2008, Sybase also launched RAP – The Trading Edition, an analytics platform for Wall Street. In August of the same year, Sybase promoted the Sybase Unwired Platform (SUP), a platform for developing mobile applications across a heterogeneous environment. In September 2008, Sybase 365 expanded its messaging interoperability with the launch of its global Multimedia Messaging Exchange, MMX 365.

On January 21, 2009, Sybase acquired mPayment solutions provider paybox.[8] In March 2009, Sybase and SAP partnered to deliver the new SAP Business Suite software to iPhone, Windows Mobile, BlackBerry and other devices. In September 2009, Sybase and Verizon partnered to manage mobility solutions for enterprises worldwide through Verizon’s Managed Mobility Solutions, which utilizes Sybase’s enterprise device management platform. Gartner reported that Sybase gained market share in the database industry in 2009.

In May 2010, SAP announced that it would be acquiring Sybase for $5.8 billion.[9]

The company remains a standalone subsidiary of SAP and headed by co-CEOs Bill McDermott and Jim Hagemann Snabe.[10]

Sybase 365 is one of the largest independent (non-telco) exchanges for text (SMS) and multimedia (MMS) messages. By September 2010, it had delivered more than 1 trillion messages – equivalent to 32,000 per second for an entire year.[11]

In November 2010, Sybase and Verizon delivered a managed mobility service to reduce the complexity for enterprises to develop and deploy mobile apps, even if they have diverse back-end software and user devices (i.e. multiple brands and platforms of smartphones and tablets).[12] The need appears to be there: 90% of IT managers plan to deploy new mobile apps and one in two believe that successfully managing mobile apps will top their priority list, according to a January survey sponsored by Sybase.[13]

Sybase remains committed to its data management and analytics products. Sybase IQ was positioned in the Leaders quadrant of Gartner’s 2011 Data Ware House Database Management System Magic Quadrant.[14]

Timeline

  • 1984: Sybase (initially called Systemware) is founded by Mark Hoffman, Bob Epstein, Jane Doughty and Tom Haggin out of Epstein’s home in California.
  • 1988: Sybase goes into partnership with Microsoft to port SQL Server to Windows and OS/2
  • August 1991: Sybase goes public at a split adjusted price of $6.75.
  • 1993: Sybase and Microsoft dissolve their partnership. Microsoft bought the SQL Server code base from Sybase.
  • 1993: Sybase launches Replication Server, a data replication technology that moves and synchronizes data across the enterprise.
  • November 14, 1994: Sybase acquires Powersoft, which bought Watcom earlier that year.
  • 1995: Sybase launches PowerDesigner, a modeling and metadata management solution, following its acquisition of PowerAMC.
  • 1995 Sybase renames the main product SQL Server to its current name Adaptive Server Enterprise (ASE) for version 11.5. Anywhere 5 is released. It includes SQL Remote, SQL Central, Transact SQL syntax, and support for the Sybase Replication Server.
  • 1996: Mitchell Kertzman, Powersoft CEO, is appointed CEO.
  • 1996: Sybase launches Sybase IQ, the first column-based analytics platform.
  • 1997: Sybase’s flagship database product Adaptive Server Enterprise (ASE) starts its life in the mid-eighties as “Sybase SQL Server”. Microsoft is a Sybase distributor, reselling the Sybase product for OS/2 and NT under the name “Microsoft SQL Server.” Around 1994, Microsoft begins independently developing its own product. When Sybase releases version 11.5 in 1997, Sybase renames its product to “Adaptive Server Enterprise” to better distinguish itself from MS SQL Server.
  • October 1998: John Chen is appointed Chairman, CEO and President.[15]
  • 1998: SQL Anywhere 6 is released, with new names “Adaptive Server Anywhere” as the engine and part of the “SQL Anywhere Studio” which now includes SQL Modeler (later PowerDesigner), Java is introduced to the database.
  • 2000: iAnywhere Solutions, Inc. is founded as a subsidiary of Sybase.
  • June 20, 2001: Sybase acquires New Era of Networks, a leading application integration company that produces the e-Biz Integrator middleware, though it stops offering this product in 2004.
  • November 2003: Sybase launches the “Unwired Enterprise” strategy.[16]
  • 2004: Sybase acquires XcelleNet, frontline device management software based in Georgia, to enhance its Unwired Enterprise strategy.
  • September 12, 2005: Sybase releases ASE 15.0.
  • August 7, 2006: iAnywhere announces release of SQL Anywhere 10.
  • November 8, 2006: Sybase acquires Mobile 365, a mobile data and messaging company, and renames it Sybase 365.
  • February 2008: Sybase releases Adaptive Server Enterprise, Cluster Edition, with Oracle RAC-like shared-everything clusterability, but based on an open architecture and less costly.
  • May 2008: Sybase IQ analytics database sets a new Guinness World Record by powering the world’s largest data warehouse.
  • May 2008: Sybase launches RAP – The Trading Edition, an analytics platform for Wall Street.
  • August 2008: Sybase unveils the Sybase Unwired Platform (SUP), a platform for developing mobile applications across a heterogeneous environment.
  • September 2008: Sybase 365 expands its messaging interoperability with the launch of its global Multi-media Messaging Exchange, MMX 365.
  • March 2009: Sybase and SAP partner to deliver the SAP Business Suite software to iPhone, Windows Mobile, BlackBerry and other devices.
  • May 2009: Sybase begins packaging MicroStrategy business intelligence software with its Sybase IQ server.[17]
  • September 2009: Sybase and Verizon partner to manage mobility solutions for enterprises worldwide through Verizon’s Managed Mobility Solutions, which utilizes Sybase’s enterprise device management platform.
  • May 2010: SAP makes an offer to acquire Sybase for $5.8 billion.[9]
  • July 2010: SAP completes tender offer for shares of Sybase and its acquisition.[18]
  • August 2010: Sybase partners with telecommunications partners to offer the world’s first fully operational IPX Voice hub.[19]
  • November 2010: Sybase and Verizon deliver Mobile Services Enablement Platform, a new managed mobility offering, to reduce the complexity in developing and deploying mobile applications running across a variety of back-end enterprise systems and mobile device types.[12]
  • February 2011: Sybase is positioned in Leaders quadrant in Gartner’s 2011 Data Warehouse Database Management System Magic Quadrant.[14]
  • July 2011: Sybase awarded Best CEP Provider and Best Enterprise Data Management Provider at the Water Rankings Awards.[20]
  • Sept 2011 : Sybase Celebrates 20-Year PowerBuilder Milestone.[21]
  • Nov 2011: Sybase integrates statistical programming language R in Sybase RAP for Capital Markets Analytics.[22]
  • Feb 2012 : Sybase 365 Recognized by Juniper Research among leaders in the Future of Mobile Commerce.[23]
  • Feb 2012 : Sybase a Leader in the 2012 Gartner Magic Quadrant for Data Warehouse DBMS.[24]
  • April – October 2012 : All Sybase employees are incorporated into SAP’s workforce
  • November 2012 : After leading Sybase for 15 years, Sybase CEO and President John Chen leaves Sybase following the completion of Sybase’s integration into SAP

Products

Sybase’s main products include:

Data Management Products

Analytics Products

Mobility Products

  • SQL Anywhere – RDBMS with a small footprint designed for mobility
  • Afaria – Mobile device management and security software
  • Sybase Unwired Platform (SUP) – a framework for developing mobile applications that SAP is using both as a development platform and a device management system[25]
  • SMS Ad Exchange – an SMS mobile advertising service.
  • GRX 365 – network performance and security
  • mBanking 365 – a mobile banking product
  • MMS 365 – a content delivery gateway
  • MMX 365 – a messaging exchange
  • Sybase 365 mCommerce Solution – an end-to-end solution for mBanking, mPayments and mRemittance

Tools

Teradata

Teradata Corporation is an American computer company that sells analytic data platforms, applications and related services. Its products are meant to consolidate data from different sources and make the data available for analysis. Formerly a division of NCR Corporation, Teradata was incorporated in 1979 and separated from NCR in October 2007.[2]Michael Koehler became president and CEO of Teradata after its 2007 spin off from NCR.[3] Teradata’s headquarters is located in MiamisburgOhio.

Corporate Overview

Teradata is an enterprise software company that develops and sells a relational database management system (RDBMS) with the same name. Teradata is publicly traded on the New York Stock Exchange (NYSE) NYSE under the stock symbol TDC. [4]

The Teradata product is referred to as a “data warehouse system” and stores and manages data. The data warehouses use a “shared nothing” architecture which means that each server node has its own memory and processing power.[5] Adding more servers and nodes increases the amount of data that can be stored. The database software sits on top of the servers and spreads the workload among them.[6] Teradata sells applications and software to process different types of data. In 2010, Teradata added text analytics to track unstructured data, such as word processor documents, and semi-structured data, such as spreadsheets.[7]

Teradata’s product can be used for business analysis. Data warehouses can track company data, such as sales, customer preferences, product placement, etc.[6]

Teradata has a supplier diversity program that designates a minimum of 3 to 5% of spending on minority, women, veteran, or small business vendors.[8]

In 2013, the Ethisphere Institute named Teradata as one of the “World’s Most Ethical Companies.”[9]

Technology and products

Teradata is a massively parallel processing system running a shared nothing architecture.[39] Its technology consists of hardwaresoftware, database, and consulting. The system moves data to a data warehouse where it can be recalled and analyzed.[40]

The systems can be used as back-up for one another during downtime, and in normal operation balance the work load across themselves.[41]

 

Teradata is the most popular data warehouse DBMS in the DB-Engines database ranking.[45]

In 2010, Teradata was listed in Fortune’s annual list of Most Admired Companies.[46]

Active enterprise data warehouse

Teradata Active Enterprise Data Warehouse is the platform that runs the Teradata Database, with added data management tools and data mining software.

The data warehouse differentiates between “hot and cold” data – meaning that the warehouse puts data that is not often used in a slower storage section.[47] As of October 2010, Teradata uses Xeon 5600 processors for the server nodes.[48]

Teradata Database 13.10 was announced in 2010 as the company’s database software for storing and processing data.[49][50]

Teradata Database 14 was sold as the upgrade to 13.10 in 2011 and runs multiple data warehouse workloads at the same time.[51] It includes column-store analyses.[52]

Teradata Integrated Analytics is a set of tools for data analysis that resides inside the data warehouse.[53]

Teradata and Big Data

Teradata began to associate itself with the term, “Big Data” in 2010. CTO, Stephen Brobst, attributes the rise of big data to “new media sources, such as social media.”[75] The increase in semi-structured and unstructured data gathered from online interactions prompted Teradata to form the “Petabyte club” in 2011 for its heaviest big data users.[76]

The rise of big data resulted in many traditional data warehousing companies updating their products and technology.[77] For Teradata, big data prompted the acquisition of Aster Data Systems in 2011 for the company’s MapReduce capabilities and ability to store and analyze semi-structured data.[78]

Public interest in big data resulted in a 13% increase in Teradata’s global sales.[76]

PostgreSQL

PostgreSQL evolved from the Ingres project at the University of California, Berkeley. In 1982, the project leader, Michael Stonebraker, left Berkeley to make a proprietary version of Ingres.[15] He returned to Berkeley in 1985 and started a post-Ingres project to address the problems with contemporary database systems that had become increasingly clear during the early 1980s. The new project, POSTGRES, aimed to add the fewest features needed to completely support types.[17] These features included the ability to define types and to fully describe relationships – something used widely before but maintained entirely by the user. In Postgres, the database “understood” relationships, and could retrieve information in related tables in a natural way using rules. Postgres used many of the ideas of Ingres, but not its code.[18]

Starting in 1986, the team published a number of papers describing the basis of the system, and by 1988 had a prototype version. The team released version 1 to a small number of users in June 1989, then version 2 with a re-written rules system in June 1990. Version 3, released in 1991, again re-wrote the rules system, and added support for multiple storage managers and an improved query engine. By 1993 the great number of users began to overwhelm the project with requests for support and features. After releasing version 4—primarily a cleanup—the project ended.

But open source software developers could obtain copies and develop the system further, because Berkeley had released Postgres under an MIT-style license. In 1994, Berkeley graduate students Andrew Yu and Jolly Chen replaced the Ingres-based QUEL query language interpreter with one for the SQL query language, creating Postgres95. The code was released on the web.

In July 1996, Marc Fournier at Hub.org Networking Services provided the first non-university development server for the open-source development effort. Along with Bruce Momjian and Vadim B. Mikheev, work began to stabilize the code inherited from Berkeley. The first open-source version was released on August 1, 1996.

In 1996, the project was renamed to PostgreSQL to reflect its support for SQL. The first PostgreSQL release formed version 6.0 in January 1997. Since then, the software has been maintained by a group of database developers and volunteers around the world, coordinating via the Internet.

The PostgreSQL project continues to make major releases (approximately annually) and minor “bugfix” releases, all available under its free and open-source softwarePostgreSQL License. Code comes from contributions from proprietary vendors, support companies, and open-source programmers at large.

Multiversion concurrency control (MVCC)

PostgreSQL manages concurrency through a system known as multiversion concurrency control (MVCC), which gives each transaction a “snapshot” of the database, allowing changes to be made without being visible to other transactions until the changes are committed. This largely eliminates the need for read locks, and ensures the database maintains the ACID (atomicity, consistency, isolation, durability) principles in an efficient manner. PostgreSQL offers three levels of transaction isolation: Read Committed, Repeatable Read and Serializable. Because PostgreSQL is immune to dirty reads, requesting a Read Uncommitted transaction isolation level provides read committed instead. Prior to PostgreSQL 9.1, requesting Serializable provided the same isolation level as Repeatable Read. PostgreSQL 9.1 and later support full serializability via the serializable snapshot isolation (SSI) technique.[19]

Storage and replication

Replication

PostgreSQL, beginning from version 9.0, includes built-in binary replication, based on shipping the changes (write-ahead logs) to slave systems asynchronously.

Version 9.0 also introduced the ability to run read-only queries against these replicated slaves, where earlier versions would only allow that after promoting them to be a new master. This allows splitting read traffic among multiple nodes efficiently. Earlier replication software that allowed similar read scaling normally relied on adding replication triggers to the master, introducing additional load onto it.

Beginning from version 9.1, PostgreSQL also includes built-in synchronous replication[20] that ensures that, for each write transaction, the master waits until at least one slave node has written the data to its transaction log. Unlike other database systems, the durability of a transaction (whether it is asynchronous or synchronous) can be specified per-database, per-user, per-session or even per-transaction. This can be useful for work loads that do not require such guarantees, and may not be wanted for all data as it will have some negative effect on performance due to the requirement of the confirmation of the transaction reaching the synchronous standby.

There can be a mixture of synchronous and asynchronous standby servers. A list of synchronous standby servers can be specified in the configuration which determines which servers are candidates for synchronous replication. The first in the list which is currently connected and actively streaming is the one that will be used as the current synchronous server. When this fails, it falls to the next in line.

Synchronous multi-master replication is currently not included in the PostgreSQL core. Postgres-XC which is based on PostgreSQL provides scalable synchronous multi-master replication,[21] available in version 1.1 is licensed under the same license as PostgreSQL.

The community has also written some tools to make managing replication clusters easier, such as repmgr.

There are also several asynchronous trigger-based replication packages for PostgreSQL. These remain useful even after introduction of the expanded core capabilities, for situations where binary replication of an entire database cluster is not the appropriate approach:

Indexes

PostgreSQL includes built-in support for regular B-tree and hash indexes, and two types of inverted indexes: generalized search trees (GiST) and generalized inverted indexes (GIN). Hash indexes are implemented, but discouraged because they cannot be recovered after a crash or power loss. In addition, user-defined index methods can be created, although this is quite an involved process. Indexes in PostgreSQL also support the following features:

  • Expression indexes can be created with an index of the result of an expression or function, instead of simply the value of a column.
  • Partial indexes, which only index part of a table, can be created by adding a WHERE clause to the end of the CREATE INDEX statement. This allows a smaller index to be created.
  • The planner is capable of using multiple indexes together to satisfy complex queries, using temporary in-memory bitmap index operations.
  • As of PostgreSQL 9.1, k-nearest neighbors (k-NN) indexing (also referred to KNN-GiST) provides efficient searching of “closest values” to that specified, useful to finding similar words, or close objects or locations with geospatial data. This is achieved without exhaustive matching of values.
  • In PostgreSQL 9.2 and above, index-only scans often allow the system to fetch data from indexes without ever having to access the main table.

Schemas

In PostgreSQL, all objects (with the exception of roles and tablespaces) are held within a schema. Schemas effectively act like namespaces, allowing objects of the same name to co-exist in the same database. Schemas are analogous to directories in a file system, except that they cannot be nested, nor is it possible to create a “symbolic link” pointing to another schema or object.

By default, databases are created with the “public” schema, but any additional schemas can be added, and the public schema isn’t mandatory. A “search_path” determines the order in which schemas are checked on unqualified objects (those without a prefixed schema), which can be configured on a database or role level. The search path, by default, contains the special schema name of “$user”, which first looks for a schema named after the connected database user (e.g. if the user “dave” were connected, it would first look for a schema also named “dave” when referring to any objects). If such a schema is not found, it then proceeds to the next schema. New objects are created in whichever valid schema (one that presently exists) is listed first in the search path.

Data types

A wide variety of native data types are supported, including:

  • Boolean
  • Arbitrary precision numerics
  • Character (text, varchar, char)
  • Binary
  • Date/time (timestamp/time with/without timezone, date, interval)
  • Money
  • Enum
  • Bit strings
  • Text search type
  • Composite
  • HStore (an extension enabled key-value store within Postgres)
  • Arrays (variable length and can be of any data type, including text and composite types) up to 1 GB in total storage size.
  • Geometric primitives
  • IPv4 and IPv6 addresses
  • CIDR blocks and MAC addresses
  • XML supporting XPath queries
  • UUID
  • JSON (versions 9.2 and up)

In addition, users can create their own data types which can usually be made fully indexable via PostgreSQL’s GiST infrastructure. Examples of these include thegeographic information system (GIS) data types from the PostGIS project for PostgreSQL.

There is also a data type called a “domain”, which is the same as any other data type but with optional constraints defined by the creator of that domain. This means any data entered into a column using the domain will have to conform to whichever constraints were defined as part of the domain.

Starting with PostgreSQL 9.2, a data type that represents a range of data can be used which are called range types. These can be discrete ranges (e.g. all integer values 1 to 10) or continuous ranges (e.g. any point in time between 10:00 am and 11:00 am). The built-in range types available include ranges of integers, big integers, decimal numbers, time stamps (with and without time zone) and dates.

Custom range types can be created to make new types of ranges available, such as IP address ranges using the inet type as a base, or float ranges using the float data type as a base. Range types support inclusive and exclusive range boundaries using the [] and () characters respectively. (e.g. ‘[4,9)’ represents all integers starting from and including 4 up to but not including 9.) Range types are also compatible with existing operators used to check for overlap, containment, right of etc.

User-defined objects

New types of almost all objects inside the database can be created, including:

  • Casts
  • Conversions
  • Data types
  • Domains
  • Functions, including aggregate functions and window functions
  • Indexes including custom indexes for custom types
  • Operators (existing ones can be overloaded)
  • Procedural languages

Inheritance

Tables can be set to inherit their characteristics from a “parent” table. Data in child tables will appear to exist in the parent tables, unless data is selected from the parent table using the ONLY keyword, i.e. SELECT * FROM ONLY parent_table. Adding a column in the parent table will cause that column to appear in the child table.

Inheritance can be used to implement table partitioning, using either triggers or rules to direct inserts to the parent table into the proper child tables.

As of 2010 this feature is not fully supported yet—in particular, table constraints are not currently inheritable. All check constraints and not-null constraints on a parent table are automatically inherited by its children. Other types of constraints (unique, primary key, and foreign key constraints) are not inherited.

Inheritance provides a way to map the features of generalization hierarchies depicted in Entity Relationship Diagrams (ERD) directly into the PostgreSQL database.

ISAM

ISAM stands for Indexed Sequential Access Method, a method for indexing data for fast retrieval. ISAM was originally developed by IBM for mainframe computers. Today the term is used for several related concepts:
  • Specifically, the IBM ISAM product and the algorithm it employs.
  • database system where an application developer directly uses an application programming interface to search indexes in order to locate records in data files. In contrast, a relational database uses a query optimizer which automatically selects indexes.
  • An indexing algorithm that allows both sequential and keyed access to data. Most databases now[when?] use some variation of the B-Tree for this purpose, although the original IBM ISAM and VSAM implementations did not do so.
  • Most generally, any index for a database. Indexes are used by almost all databases, both relational and otherwise.

In an ISAM system, data is organized into records which are composed of fixed length fields. Records are stored sequentially, originally to speed access on a tape system. A secondary set of hash tables known as indexes contain “pointers” into the tables, allowing individual records to be retrieved without having to search the entire data set. This is a departure from the contemporaneous navigational databases, in which the pointers to other data were stored inside the records themselves. The key improvement in ISAM is that the indexes are small and can be searched quickly, thereby allowing the database to access only the records it needs. Additionally modifications to the data do not require changes to other data, only the table and indexes in question.

When an ISAM file is created, index nodes are fixed, and their pointers do not change during inserts and deletes that occur later (only content of leaf nodes change afterwards). As a consequence of this, if inserts to some leaf node exceed the node’s capacity, new records are stored in overflow chains. If there are many more inserts than deletions from a table, these overflow chains can gradually become very large, and this affects the time required for retrieval of a record.[1]

Relational databases can easily be built on an ISAM framework with the addition of logic to maintain the validity of the links between the tables. Typically the field being used as the link, the foreign key, will be indexed for quick lookup. While this is slower than simply storing the pointer to the related data directly in the records, it also means that changes to the physical layout of the data do not require any updating of the pointers—the entry will still be valid.

ISAM is very simple to understand and implement, as it primarily consists of direct, sequential access to a database file. It is also very inexpensive. The tradeoff is that each client machine must manage its own connection to each file it accesses. This, in turn, leads to the possibility of conflicting inserts into those files, leading to an inconsistent database state. This is typically solved with the addition of a client-server framework which marshals client requests and maintains ordering. This is the basic concept behind a database management system (DBMS), which is a client layer over the underlying data store.

ISAM was replaced at IBM with a methodology called VSAM (Virtual Storage Access Method). Still later, IBM developed DB2 which, as of 2004, IBM promotes as their primary database management system. VSAM is the physical access method used in DB2.

The OpenVMS operating system uses the Files-11 file system in conjunction with RMS (Record Management Services). RMS provides an additional layer between the application and the files on disk that provides a consistent method of data organization and access across multiple 3GL and 4GL languages. RMS provides 4 different methods of accessing data; Sequential, Relative Record Number Access, Record File Address Access, and Indexed Access.

The Indexed Access method of reading or writing data only provides the desired outcome if in fact the file is organized as an ISAM file with the appropriate, previously defined keys. Access to data via the previously defined key(s) is extremely fast. Multiple keys, overlapping keys and key compression within the hash tables are supported. A utility to define/redefine keys in existing files is provided. Records can be deleted, although “garbage collection” is done via a separate utility.

ISAM-style Implementations

The Third Manifesto

The Third Manifesto (1995) is Christopher J. Date‘s and Hugh Darwen‘s proposal for future database management systems, a response to two earlier Manifestos with the same purpose. The theme of the manifestos is how to avoid the ‘object-relational impedance mismatch‘ between object-oriented programming languages andrelational database management systems. The Third Manifesto proposes to maintain the relational model for databases and to support objects as user-defined types.

A major theme of the manifesto is to explain how the inadequacies of existing relational database management systems are not shortcomings of the relational database model per se, but rather, of implementation decisions in those systems, and of the SQL query language that most of these systems use.

The manifesto describes an alternative to SQL, named D. D is a specification of desirable characteristics of a database language, rather than a specific syntax or grammar. As such, it describes a family of languages rather than any particular language. However, as an example, a particular member of the hypothetical D “family” called Tutorial D is described in detail, including significant portions of its grammar.

Several partial implementations of D exist, including:

IBM Business System 12

Business System 12, or simply BS12, was one of the first fully relational database management systems, designed and implemented by IBM‘s Bureau Service subsidiary at the company’s international development centre in UithoornNetherlands. Programming started in 1978 and the first version was delivered in 1982. It was never widely used and essentially disappeared soon after the division was shut down in 1985, possibly because IBM and other companies settled on SQL as the standard.

BS12’s lasting contribution to history was the use of a new query language based on ISBL, created at IBM’s UK Scientific Centre. Developers of the famous System Runderway in the US at the same time were also consulted on certain matters concerning the engine, but the BS12 team rejected SQL unequivocally, being convinced that this apparently unsound and difficult-to-use language (which at that time was also relationally incomplete) would never catch on.

BS12 included a number of interesting features that have yet to appear on most SQL-based systems, some a consequence of following the ISBL precedent, others due to deliberate design. For instance, a view could be parameterised and parameters could be of type TABLE. Thus, a view could in effect be a new relational operator defined in terms of the existing operators. Codd‘s DIVIDE operator was in fact implemented that way.

Another feature that could have easily been included in SQL systems was the support for update operations on the catalog tables (system tables describing the structure of the database, as in SQL). A new table could be created by inserting a row into the TABLES catalog, and then columns added to it by inserting into COLUMNS.

In addition, BS12 was ahead of SQL in supporting user-defined functions and procedures, using a computationally complete sublanguage, triggers, and a simple “call” interface for use by application programs, all in its very first release in 1982.

Example

Sample query from BS12 article on System R website for determining which departments are over their salary budgets:

Note the “natural join” on the common column, DEPTNUM. Although some SQL dialects support natural joins, for familiarity, the example will show only a “traditional” join. Here is the equivalent SQL for comparison:

Cellular Automata

construction of a wireworld computron: http://www.quinapalus.com/wi-index.html

Tim Hutton’s paper on making Codd’s automata

http://www.sq3.org.uk/papers/Hutton_CoddsSelfReplicatingComputer_2010.pdf

http://www.wolframscience.com/reference/notes/876b

From: Stephen Wolfram, A New Kind of Science
Notes for Chapter 2: The Crucial Experiment
Section: Why These Discoveries Were Not Made Before
Page 876

History of cellular automata. Despite their very simple construction, nothing like general cellular automata appear to have been considered before about the 1950s. Yet in the 1950s – inspired in various ways by the advent of electronic computers – several different kinds of systems equivalent to cellular automata were independently introduced. A variety of precursors can be identified. Operations on sequences of digits had been used since antiquity in doing arithmetic. Finite difference approximations to differential equations began to emerge in the early 1900s and were fairly well known by the 1930s. And Turing machines invented in 1936 were based on thinking about arbitrary operations on sequences of discrete elements. (Notions in physics like the Ising model do not appear to have had a direct influence.)

The best-known way in which cellular automata were introduced (and which eventually led to their name) was through work by John von Neumann in trying to develop an abstract model of self-reproduction in biology – a topic which had emerged from investigations in cybernetics. Around 1947 – perhaps based on chemical engineering – von Neumann began by thinking about models based on 3D factories described by partial differential equations. Soon he changed to thinking about robotics and imagined perhaps implementing an example using a toy construction set. By analogy to electronic circuit layouts he realized however that 2D should be enough. And following a 1951 suggestion from Stanislaw Ulam (who may have already independently considered the problem) he simplified his model and ended up with a 2D cellular automaton (he apparently hoped later to convert the results back to differential equations). The particular cellular automaton he constructed in 1952-3 had 29 possible colors for each cell, and complicated rules specifically set up to emulate the operations of components of an electronic computer and various mechanical devices. To give a mathematical proof of the possibility of self-reproduction, von Neumann then outlined the construction of a 200,000 cell configuration which would reproduce itself (details were filled in by Arthur Burks in the early 1960s). Von Neumann appears to have believed – presumably in part from seeing the complexity of actual biological organisms and electronic computers – that something like this level of complexity would inevitably be necessary for a system to exhibit sophisticated capabilities such as self-reproduction. In this book I show that this is absolutely not the case, but with the intuition he had from existing mathematics and engineering von Neumann presumably never imagined this.

Two immediate threads emerged from von Neumann’s work. The first, mostly in the 1960s, was increasingly whimsical discussion of building actual self-reproducing automata – often in the form of spacecraft. The second was an attempt to capture more of the essence of self-reproduction by mathematical studies of detailed properties of cellular automata. Over the course of the 1960s constructions were found for progressively simpler cellular automata capable of self-reproduction (see page 1186) and universal computation (see page 1121). Starting in the early 1960s a few rather simple general features of cellular automata thought to be relevant to self-reproduction were noticed – and were studied with increasingly elaborate technical formalism. (An example was the so-called Garden of Eden result that there can be configurations in cellular automata that arise only as initial conditions; see page 963.) There were also various explicit constructions done of cellular automata whose behavior showed particular simple features perhaps relevant to self-reproduction (such as so-called firing squad synchronization, as on page 1039).

By the end of the 1950s it had been noted that cellular automata could be viewed as parallel computers, and particularly in the 1960s a sequence of increasingly detailed and technical theorems – often analogous to ones about Turing machines – were proved about their formal computational capabilities. At the end of the 1960s there then began to be attempts to connect cellular automata to mathematical discussions of dynamical systems – although as discussed below this had in fact already been done a decade earlier, with different terminology. And by the mid-1970s work on cellular automata had mostly become quite esoteric, and interest in it largely waned. (Some work nevertheless continued, particularly in Russia and Japan.) Note that even in computer science various names for cellular automata were used, including tessellation automata, cellular spaces, iterative automata, homogeneous structures and universal spaces.

As mentioned in the main text, there were by the late 1950s already all sorts of general-purpose computers on which simulations of cellular automata would have been easy to perform. But for the most part these computers were used to study traditional much more complicated systems such as partial differential equations. Around 1960, however, there were a couple of simulations related to 2D cellular automata done. Stanislaw Ulam and others used computers at Los Alamos to produce a handful of examples of what they called recursively defined geometrical objects – essentially the results of evolving generalized 2D cellular automata from single black cells (see page 930). Especially after obtaining larger pictures in 1967, Ulam noted that in at least one case fairly simple growth rules generated a complicated pattern, and mentioned that this might be relevant to biology. But perhaps because almost no progress was made on this with traditional mathematical methods, the result was not widely known, and was never pursued. (Ulam tried to construct a 1D analog, but ended up not with a cellular automaton, but instead with the sequences based on numbers discussed on page 910.) Around 1961 Edward Fredkin simulated the 2D analog of rule 90 on a PDP-1 computer, and noted its self-reproduction properties (see page 1186), but was generally more interested in finding simple physics-like features.

Despite the lack of investigation in science, one example of a cellular automaton did enter recreational computing in a major way in the early 1970s. Apparently motivated in part by questions in mathematical logic, and in part by work on “simulation games” by Ulam and others, John Conway in 1968 began doing experiments (mostly by hand, but later on a PDP-7 computer) with a variety of different 2D cellular automaton rules, and by 1970 had come up with a simple set of rules he called “The Game of Life”, that exhibit a range of complex behavior (see page 249). Largely through popularization in Scientific American by Martin Gardner, Life became widely known. An immense amount of effort was spent finding special initial conditions that give particular forms of repetitive or other behavior, but virtually no systematic scientific work was done (perhaps in part because even Conway treated the system largely as a recreation), and almost without exception only the very specific rules of Life were ever investigated. (In 1978 as a possible 1D analog of Life easier to implement on early personal computers Jonathan Millen did however briefly consider what turns out to be the code 20 k=2, r=2 totalistic rule from page 283.)

Quite disconnected from all this, even in the 1950s, specific types of 2D and 1D cellular automata were already being used in various electronic devices and special-purpose computers. In fact, when digital image processing began to be done in the mid-1950s (for such applications as optical character recognition and microscopic particle counting) 2D cellular automaton rules were usually what was used to remove noise. And for several decades starting in 1960 a long line of so-called cellular logic systems were built to implement 2D cellular automata, mainly for image processing. Most of the rules used were specifically set up to have simple behavior, but occasionally it was noted as a largely recreational matter that for example patterns of alternating stripes (“custering”) could be generated.

In the late 1950s and early 1960s schemes for electronic miniaturization and early integrated circuits were often based on having identical logical elements laid out on lines or grids to form so-called cellular arrays. In the early 1960s there was for a time interest in iterative arrays in which data would be run repeatedly through such systems. But few design principles emerged, and the technology for making chips with more elaborate and less uniform circuits developed rapidly. Ever since the 1960s the idea of making array or parallel computers has nevertheless resurfaced repeatedly, notably in systems like the ILLIAC IV from the 1960s and 1970s, and systolic arrays and various massively parallel computers from the 1980s. Typically the rules imagined for each element of such systems are however immensely more complicated than for any of the simple cellular automata I consider.

From at least the early 1940s, electronic or other digital delay lines or shift registers were a common way to store data such as digits of numbers, and by the late 1940s it had been noted that so-called linear feedback shift registers (see page 976) could generate complicated output sequences. These systems turn out to be essentially 1D additive cellular automata (like rule 90) with a limited number of cells (compare page 259). Extensive algebraic analysis of their behavior was done starting in the mid-1950s, but most of it concentrated on issues like repetition periods, and did not even explicitly uncover nested patterns. (Related analysis of linear recurrences over finite fields had been done in a few cases in the 1800s, and in some detail in the 1930s.) General 1D cellular automata are related to nonlinear feedback shift registers, and some explorations of these – including ones surprisingly close to rule 30 (see page 1093) – were made using special-purpose hardware by Solomon Golomb in 1956-9 for applications in jamming-resistant radio control – though again concentrating on issues like repetition periods. Linear feedback shift registers quickly became widely used in communications applications. Nonlinear feedback shift registers seem to have been used extensively for military cryptography, but despite persistent rumors the details of what was done continue to be secret.

In pure mathematics, infinite sequences of 0’s and 1’s have been considered in various forms since at least the late 1800s. Starting in the 1930s the development of symbolic dynamics (see page 963) led to the investigation of mappings of such sequences to themselves. And by the mid-1950s studies were being made (notably by Gustav Hedlund) of so-called shift-commuting block maps – which turn out to be exactly 1D cellular automata (see page 963). In the 1950s and early 1960s there was work in this area (at least in the U.S.) by a number of distinguished pure mathematicians, but since it was in large part for application to cryptography, much of it was kept secret. And what was published was mostly abstract theorems about features too global to reveal any of the kind of complexity I discuss.

Specific types of cellular automata have also arisen – usually under different names – in a vast range of situations. In the late 1950s and early 1960s what were essentially 1D cellular automata were studied as a way to optimize circuits for arithmetic and other operations. From the 1960s onward simulations of idealized neural networks sometimes had neurons connected to neighbors on a grid, yielding a 2D cellular automaton. Similarly, various models of active media – particularly heart and other muscles – and reaction-diffusion processes used a discrete grid and discrete excitation states, corresponding to a 2D cellular automaton. (In physics, discrete idealizations of statistical mechanics and dynamic versions of systems like the Ising model were sometimes close to cellular automata, except for the crucial difference of having randomness built into their underlying rules.) Additive cellular automata such as rule 90 had implicitly arisen in studies of binomial coefficient modulo primes in the 1800s (see page 870), but also appeared in various settings such as the “forests of stunted trees” studied around 1970.

Yet by the late 1970s, despite all these different directions, research on systems equivalent to cellular automata had largely petered out. That this should have happened just around the time when computers were first becoming widely available for exploratory work is ironic. But in a sense it was fortunate, because it allowed me when I started working on cellular automata in 1981 to define the field in a new way (though somewhat to my later regret I chose – in an attempt to recognize history – to use the name “cellular automata” for the systems I was studying). The publication of my first paper on cellular automata in 1983 (see page 881) led to a rapid increase of interest in the field, and over the years since then a steadily increasing number of papers (as indicated by the number of source documents in the Science Citation Index shown below) have been published on cellular automata – almost all following the directions I defined.

Image

Von Neumann cellular automaton

A simple configuration in von Neumann’s cellular automaton. A binary signal is passed repeatedly around the blue wire loop, using excited and quiescent ordinary transmission states. A confluent cell duplicates the signal onto a length of red wire consisting of special transmission states. The signal passes down this wire and constructs a new cell at the end. This particular signal (1011) codes for an east-directed special transmission state, thus extending the red wire by one cell each time. During construction, the new cell passes through several sensitised states, directed by the binary sequence.

Von Neumann cellular automata are the original expression of cellular automata, the development of which were prompted by suggestions made to John von Neumann by his close friend and fellow mathematician Stanislaw Ulam. Their original purpose was to provide insight into the logical requirements for machine self-replication and were used in von Neumann’s universal constructor.

Nobili’s cellular automaton is a variation of von Neumann’s cellular automaton, augmented with the ability for confluent cells to cross signals and store information. The former requires an extra three states, hence Nobili’s cellular automaton has 32 states, rather than 29. Hutton’s cellular automaton is yet another variation, which allows a loop of data, analogous to Langton’s loops, to replicate.

Definition

Configuration

In general, cellular automata (CA) constitute an arrangement of finite state automata (FSA) that sit in positional relationships between one another, each FSA exchanging information with those other FSAs to which it is positionally adjacent. In von Neumann’s cellular automaton, the finite state machines (or cells) are arranged in a two-dimensional Cartesian grid, and interface with the surrounding four cells. As von Neumann’s cellular automaton was the first example to use this arrangement, it is known as the von Neumann neighbourhood.

The set of FSAs define a cell space of infinite size. All FSAs are identical in terms of state-transition function, or rule-set.

The neighborhood (a grouping function) is part of the state-transition function, and defines for any cell the set of other cells upon which the state of that cell depends.

All cells transition state synchronously, in step with a universal “clock” as in a synchronous digital circuit.

States

Each FSA of the von Neumann cell space can accept any of the 29 states of the rule-set. The rule-set is grouped into five orthogonal subsets. Each state includes the colour of the cell in the cellular automata program Golly in (red, green, blue). They are

  1. ground state U (48, 48, 48)
  2. the transition or sensitised states (in 8 substates)
    1. S (newly sensitised) (255, 0, 0)
    2. S0 – (sensitised, having received no input for one cycle) (255, 125, 0)
    3. S00 – (sensitised, having received no input for two cycles) (255, 175, 50)
    4. S01 – (sensitised, having received no input for one cycle and then an input for one cycle) (255, 200, 75)
    5. S000 – (sensitised, having received no input for three cycles) (251, 255, 0)
    6. S1 – (sensitised, having received an input for one cycle) (255, 150, 25)
    7. S10 – (sensitised, having received an input for one cycle and then no input for one cycle) (255, 255, 100)
    8. S11 – (sensitised, having received input for two cycles) (255, 250, 125)
  3. the confluent states (in 4 states of excitation)
    1. C00 – quiescent (and will also be quiescent next cycle) (0, 255, 128)
    2. C10 – excited (but will be quiescent next cycle) (255, 255, 128)
    3. C01 – next-excited (now quiescent, but will be excited next cycle) (33, 215, 215)
    4. C11 – excited next-excited (currently excited and will be excited next cycle) (255, 128, 64)
  4. the ordinary transmission states (in 4 directions, excited or quiescent, making 8 states)
    1. North-directed (excited and quiescent) (36, 200, 36) (106, 106, 255)
    2. South-directed (excited and quiescent) (106, 255, 106) (139, 139, 255)
    3. West-directed (excited and quiescent) (73, 255, 73) (122, 122, 255)
    4. East-directed (excited and quiescent) (27, 176, 27) (89, 89, 255)
  5. the special transmission states (in 4 directions, excited or quiescent, making 8 states)
    1. North-directed (excited and quiescent) (191, 73, 255) (255, 56, 56)
    2. South-directed (excited and quiescent) (203, 106, 255) (255, 89, 89)
    3. West-directed (excited and quiescent) (197, 89, 255) (255, 73, 73)
    4. East-directed (excited and quiescent) (185, 56, 255) (235, 36, 36)

“Excited” states carry data, at the rate of one bit per state transition step.

Note that confluent states have the property of a one-cycle delay, thus effectively holding two bits of data at any given time.

Transmission state rules

The flow of bits between cells is indicated by the direction property. The following rules apply:

  • Transmission states apply the OR operator to inputs, meaning a cell in a transmission state (ordinary or special) will be excited at time t+1 if any of the inputs pointing to it is excited at time t
  • Data passes from cells in the ordinary transmission states to other cells in the ordinary transmission states, according to the direction property (in the cell’s direction only)
  • Data passes from cells in the special transmission states to other cells in the special transmission states, according to the direction property (in the cell’s direction only)
  • The two subsets of transmission states, ordinary and special, are mutually antagonistic:
    • Given a cell A at time t in the excited ordinary transmission state
    • pointing to a cell B in any special transmission state
    • at time t+1 cell B will become the ground state. The special transmission cell has been “destroyed.”
    • a similar sequence will occur in the case of a cell in the special transmission state “pointing” to a cell in the ordinary transmission state

Confluent state rules

The following specific rules apply to confluent states:

  • Confluent states do not pass data between each other.
  • Confluent states take input from one or more ordinary transmission states, and deliver output to transmission states, ordinary and special, that are not directed toward the confluent state.
  • Data are not transmitted against the transmission state direction property.
  • Data held by a confluent state is lost if that state has no adjacent transmission state that is also not pointed at the confluent state.
  • Thus, confluent-state cells are used as “bridges” from transmission lines of ordinary- to special-transmission state cells.
  • The confluent state applies the AND operator to inputs, only “saving” an excited input if all potential inputs are excited simultaneously.
  • Confluent cells delay signals by one generation more than OTS cells; this is necessary due to parity constraints.

Construction rules

The nine cell types that can be constructed in von Neumann’s CA. Here, binary signals pass along nine ordinary transmission lines, constructing a new cell when they encounter a ground state at the end. For example, the binary string 1011 is shown on the fifth line, and constructs the east-directed special transmission state – this is the same process as used in the automaton at the top of this page. Note that there is no interaction between neighbouring wires, unlike in Wireworld for example, allowing for a compact packing of components.

Initially, much of the cell-space, the universe of the cellular automaton, is “blank,” consisting of cells in the ground state U. When given an input excitation from a neighboring ordinary- or special transmission state, the cell in the ground state becomes “sensitised,” transitioning through a series of states before finally “resting” at a quiescent transmission or confluent state.

The choice of which destination state the cell will reach is determined by the sequence of input signals. Therefore, the transition/sensitised states can be thought of as the nodes of a bifurcation tree leading from the ground-state to each of the quiescent transmission and confluent states.

In the following tree, the sequence of inputs is shown as a binary string after each step:

  • a cell in the ground state U, given an input, will transition to the S (newly sensitised) state in the next cycle (1)
  • a cell in the S state, given no input, will transition into the S0 state (10)
    • a cell in the S0 state, given no input, will transition into the S00 state (100)
      • a cell in the S00 state, given no input, will transition into the S000 state (1000)
        • a cell in the S000 state, given no input, will transition into the east-directed ordinary transmission state (10000)
        • a cell in the S000 state, given an input, will transition into the north-directed ordinary transmission state (10001)
      • a cell in the S00 state, given an input, will transition into the west-directed ordinary transmission state (1001)
    • a cell in the S0 state, given an input, will transition into the S01 state (101)
      • a cell in the S01 state, given no input, will transition into the south-directed ordinary transmission state (1010)
      • a cell in the S01 state, given an input, will transition into the east-directed special transmission state (1011)
  • a cell in the S state, given an input, will transition into the S1 state (11)
    • a cell in the S1 state, given no input, will transition into the S10 state (110)
      • a cell in the S10 state, given no input, will transition into the north-directed special transmission state (1100)
      • a cell in the S10 state, given an input, will transition into the west-directed special transmission state (1101)
    • a cell in the S1 state, given an input, will transition into the S11 state (111)
      • a cell in the S11 state, given no input, will transition into the south-directed special transmission state (1110)
      • a cell in the S11 state, given an input, will transition into the quiescent confluent state C00 (1111)

Note that:

  • one more cycle of input (four after the initial sensitization) is required to build the east- or north-directed ordinary transmission state than any of the other states (which require three cycles of input after the initial sensitization),
  • the “default” quiescent state resulting in construction is the east-directed ordinary transmission state- which requires an initial sensitization input, and then four cycles of no input.

Destruction rule

Roughly 4000 bits of data in a curled up “tape” constructing a complex pattern. This uses a 32-state variation of von Neumann cellular automata known as Hutton32.

  • An input into a confluent-state cell from a special-transmission state cell will result in the confluent state cell being reduced back to the ground state.
  • Likewise, an input into an ordinary transmission-state cell from a special-transmission state cell will result in the ordinary-transmission state cell being reduced back to the ground state.
  • Conversely, an input into a special transmission-state cell from an ordinary-transmission state cell will result in the special-transmission state cell being reduced back to the ground state.

Sample cpp code from Golly latest version:

generationsalgo.cpp

Codd’s cellular automaton

A simple configuration in Codd’s cellular automaton. Signals pass along wire made of cells in state 1 (blue) sheathed by cells in state 2 (red). Two signal trains circulate around a loop and are duplicated at a T-junction onto an open-ended section of wire. The first (7-0) causes the sheathed end of the wire to become exposed. The second (6-0) re-sheathes the exposed end, leaving the wire one cell longer than before.

Codd’s cellular automaton is a cellular automaton (CA) devised by the British computer scientist Edgar F. Codd in 1968. It was designed to recreate the computation- and construction-universality of von Neumann’s CA but with fewer states: 8 instead of 29. Codd showed that it was possible to make a self-reproducing machine in his CA, in a similar way to von Neumann’s universal constructor but never gave a complete implementation.

History

In the 1940s and 50’s, John von Neumann posed the following problem:[1]

  • What kind of logical organization is sufficient for an automaton to be able to reproduce itself?

He was able to construct a cellular automaton with 29 states, and with it a universal constructor. Codd, building on von Neumann’s work, found a simpler machine with eight states.[2] This modified von Neumann’s question:

  • What kind of logical organization is necessary for an automaton to be able to reproduce itself?

Three years after Codd’s work, Edwin Roger Banks showed a 4-state CA in his PhD thesis that was also capable of universal computation and construction, but again did not implement a self-reproducing machine.[3] John Devore, in his 1973 masters thesis, tweaked Codd’s rules to greatly reduce the size of Codd’s design, to the extent that it could be implemented in the computers of that time. However, the data tape for self-replication was too long; Devore’s original design was later able to complete replication using GollyChristopher Langton made another tweak to Codd’s cellular automaton in 1984 to create Langton’s loops, exhibiting self-replication with far fewer cells than that needed for self-reproduction in previous rules, at the cost of removing the ability for universal computation and construction.[4]

Comparison of CA rulesets

CA NUMBER OF STATES SYMMETRIES COMPUTATION- AND CONSTRUCTION-UNIVERSAL SIZE OF SELF-REPRODUCING MACHINE
von Neumann 29 none yes 130,622 cells
Codd 8 rotations yes 283,126,588 cells[5]
Devore 8 rotations yes 94,794 cells
Banks-IV 4 rotations and reflections yes unknown
Langton’s loops 8 rotations no 86 cells

Specification

The construction arm in Codd’s CA can be moved into position using the commands listed in the text. Here the arm turns left, then right, then writes a cell before retracting along the same path.

Codd’s CA has eight states determined by a von Neumann neighborhood with rotational symmetry.

The table below shows the signal-trains needed to accomplish different tasks. Some of the signal trains need to be separated by two blanks (state 1) on the wire to avoid interference, so the ‘extend’ signal-train used in the image at the top appears here as ‘70116011’.

PURPOSE SIGNAL TRAIN
extend 70116011
extend_left 4011401150116011
extend_right 5011501140116011
retract 4011501160116011
retract_left 5011601160116011
retract_right 4011601160116011
mark 701160114011501170116011
erase 601170114011501160116011
sense 70117011
cap 40116011
inject_sheath 701150116011
inject_trigger 60117011701160116011

Universal computer-constructor

Codd designed a self-replicating computer in the cellular automaton, based on Wang’s W-machine. However, the design was so colossal that it evaded implementation until 2009, when Tim Hutton constructed an explicit configuration.[5] There were some minor errors in Codd’s design, so Hutton’s implementation differs slightly, in both the configuration and the ruleset.

Langton’s loops

Langton’s Loop, in the starting configuration.

Langton’s loops are a particular “species” of artificial life in a cellular automaton created in 1984 by Christopher Langton. They consist of a loop of cells containing genetic information, which flows continuously around the loop and out along an “arm” (or pseudopod), which will become the daughter loop. The “genes” instruct it to make three left turns, completing the loop, which then disconnects from its parent.

History

In 1952 John von Neumann created the first cellular automaton (CA) with the goal of creating a self-replicating machine.[1] This automaton was necessarily very complex due to its computation- and construction-universality. In 1968 Edgar F. Codd reduced the number of states from 29 in von Neumann’s CA to 8 in his.[2] When Christopher Langton did away with the universality condition, he was able to significantly reduce the automaton’s complexity. Its self-replicating loops are based on one of the simplest elements in Codd’s automaton, the periodic emitter.

Specification

Langton’s Loops run in a CA that has 8 states, and uses the von Neumann neighborhood with rotational symmetry. The transition table can be found here: [1].

As with Codd’s CA, Langton’s Loops consist of sheathed wires. The signals travel passively along the wires until they reach the open ends, when the command they carry is executed.

A colony of loops. The ones in the centre are “dead”.

Colonies

Because of a particular property of the loops’ “pseudopodia”, they are unable to reproduce into the space occupied by another loop. Thus, once a loop is surrounded, it is incapable of reproducing, resulting in a coral-like colony with a thin layer of reproducing organisms surrounding a core of inactive “dead” organisms. Unless provided unbounded space, the colony’s size will be limited. The maximum population will be asymptotic to left lfloor frac{A}{121} right rfloor, where A is the total area of the space in cells.

Encoding of the genome

The loops’ genetic code is stored as a series of nonzero-zero state pairs. The standard loop’s genome is illustrated in the picture at the top, and may be stated as a series of numbered states starting from the T-junction and running clockwise: 70-70-70-70-70-70-40-40. The ’70’ command advances the end of the wire by one cell, while the ’40-40′ sequence causes the left turn. State 3 is used as a temporary marker for several stages.

While the roles of states 0,1,2,3,4 and 7 are similar to Codd’s CA, the remaining states 5 and 6 are used instead to mediate the loop replication process. After the loop has completed, state 5 travels counter-clockwise along the sheath of the parent loop to the next corner, causing the next arm to be produced in a different direction. State 6 temporarily joins the genome of the daughter loop and initialises the growing arm at the next corner it reaches.

The genome is used a total of six times: once to extend the pseudopod to the desired location, four times to complete the loop, and again to transfer the genome into the daughter loop. Clearly, this is dependent on the fourfold rotational symmetry of the loop; without it, the loop would be incapable of containing the information required to describe it. The same use of symmetry for genome compression is used in many biological viruses, such as the icosahedral adenovirus.

Comparison of related CA loops

CA NUMBER OF STATES NEIGHBORHOOD NUMBER OF CELLS (TYPICAL) REPLICATION PERIOD (TYPICAL) THUMBNAIL
Langton’s loops[3] (1984): The original self-reproducing loop. 8 von Neumann 86 151 Langtons Loop.png
Byl’s loop[4] (1989): By removing the inner sheath, Byl reduced the size of the loop. 6 von Neumann 12 25 Byl Loop.png
Chou-Reggia loop[5] (1993): A further reduction of the loop by removing all sheaths. 8 von Neumann 5 15 Chou-Reggia Loop.png
Tempesti loop[6] (1995): Tempesti added construction capabilities to his loop, allowing patterns to be written inside the loop after reproduction. 10 Moore 148 304 Tempesti Loop.png
Perrier loop[7] (1996): Perrier added a program stack and an extensible data tape to Langton’s loop, allowing it to compute anything computable. 64 von Neumann 158 235 Perrier Loop.png
SDSR loop[8] (1998): With an extra structure-dissolving state added to Langton’s loops, the SDSR loop has a limited lifetime and dissolves at the end of its life cycle. This allows continuous growth and turn-over of generations. 9 von Neumann 86 151 SDSR Loop.png
Evoloop[9] (1999): An extension of the SDSR loop, Evoloop is capable of interaction with neighboring loops as well as of evolution. Often, the greatest selection pressure in a colony of Evoloops is the competition for space, and natural selection favors the smallest functional loop present. Further studies demonstrated more complexity than originally thought in the Evoloop system.[10] 9 von Neumann 149 363 Evoloop closeup.png
Sexyloop[11] (2007): Sexyloop is a modification of Evoloop in which collisions often result in transfer of genetic material between loops. This increases diversity in evolution of new species of loops. 10 von Neumann  ?  ?

The Byl’s loop is an artificial lifeform similar in concept to Langton’s loop. It is a two-dimensional, 5-neighbor cellular automaton with 6 states per cell, and was developed in 1989 by John Byl, from the Department of Mathematical Sciences of Trinity Western University.

Byl’s loop

The Byl’s loop was developed just a few years after Langton’s simplification of Codd’s automaton, which produced a simpler automaton that would reproduce itself in 151 time-steps. John Byl simpli­fied Langton’s automaton further, with an even smaller automaton that reproduced in just 25 time-steps. Byl’s automaton consisted of an array of 12 chips — of which 4 or 5 could be counted as the instruction tape — and 43 transition rules, while Langton’s device consisted of some 10×15 chips, including an instruction tape of 33 chips, plus some 190 transition rules.

Essentially, the simplification consisted in using less cellular states (6 as compared with Langton’s 8) and a smaller replicating loop (12 cells as compared with Langton’s 86).

Wang B-machine

As presented by Hao Wang (1954, 1957), his basic machine B is an extremely simple computational model equivalent to the Turing machine. It is “the first formulation of a Turing-machine theory in terms of computer-like models” (Minsky (1967) p. 200). With only 4 sequential instructions it is very similar to, but even simpler than, the 7 sequential instructions of the Post–Turing machine. In the same paper, Wang introduced a variety of equivalent machines, including what he called the W-machine, which is the B-machine with an “erase” instruction added to the instruction set.

Description

As defined by Wang (1954) the B-machine has at its command only 4 instructions:

  • (1)  : Move tape-scanning head one tape square to the right (or move tape one square left), then continue to next instruction in numerical sequence;
  • (2)  : Move tape-scanning head one tape square to the left (or move tape one square right), then continue to next instruction in numerical sequence;
  • (3) * : In scanned tape-square print mark * then go to next instruction in numerical sequence;
  • (4) Cn: Conditional “transfer” (jump, branch) to instruction “n”: If scanned tape-square is marked then go to instruction “n” else (if scanned square is blank) continue to next instruction in numerical sequence.

A sample of a simple B-machine instruction is his example (p. 65):

1. *, 2. →, 3. C2, 4. →, 5. ←

He rewrites this as a collection of ordered pairs:

{ ( 1, * ), ( 2, → ), ( 3, C2 ), ( 4, → ), ( 5, ← ) }

Wang’s W-machine is simply the B-machine with the one additional instruction

  • (5) E : In scanned tape-square erase the mark * (if there is one) then go to next instruction in numerical sequence.

XML

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in aformat that is both human-readable and machine-readable. It is defined in the XML 1.0 Specification[3] produced by theW3C, and several other related specifications,[4] all free open standards.[5]

The design goals of XML emphasize simplicity, generality, and usability over the Internet.[6] It is a textual data format with strong support via Unicode for the languages of the world. Although the design of XML focuses on documents, it is widely used for the representation of arbitrary data structures,[7] for example in web services.

Many application programming interfaces (APIs) have been developed to aid software developers with processing XML data, and several schema systems exist to aid in the definition of XML-based languages.

Applications of XML

As of 2009, hundreds of document formats using XML syntax have been developed,[8] including RSSAtomSOAP, and XHTML. XML-based formats have become the default for many office-productivity tools, including Microsoft Office (Office Open XML), OpenOffice.org and LibreOffice (OpenDocument), and Apple‘s iWork. XML has also been employed as the base language for communication protocols, such as XMPP. Applications for the Microsoft .NET Framework use XML files for configuration. Apple has an implementation of a registry based on XML.[9]

XML has come into common use for the interchange of data over the Internet. RFC 3023 gives rules for the construction of Internet Media Types for use when sending XML. It also defines the media types application/xml and text/xml, which say only that the data are in XML, and nothing about its semantics. The use of text/xml has been criticized as a potential source of encoding problems and it has been suggested that it should be deprecated.[10]

RFC 3023 also recommends that XML-based languages be given media types ending in +xml; for example image/svg+xml for SVG.

Further guidelines for the use of XML in a networked context may be found in RFC 3470, also known as IETF BCP 70; this document is very wide-ranging and covers many aspects of designing and deploying an XML-based language.

Key terminology

The material in this section is based on the XML Specification. This is not an exhaustive list of all the constructs that appear in XML; it provides an introduction to the key constructs most often encountered in day-to-day use.

(Unicode) character
By definition, an XML document is a string of characters. Almost every legal Unicode character may appear in an XML document.
Processor and application
The processor analyzes the markup and passes structured information to an application. The specification places requirements on what an XML processor must do and not do, but the application is outside its scope. The processor (as the specification calls it) is often referred to colloquially as an XML parser.
Markup and content
The characters making up an XML document are divided into markup and content, which may be distinguished by the application of simple syntactic rules. Generally, strings that constitute markup either begin with the character < and end with a >, or they begin with the character & and end with a ;. Strings of characters that are not markup are content. However, in a CDATA section, the delimiters <![CDATA[ and ]]> are classified as markup, while the text between them is classified as content. In addition, whitespace before and after the outermost element is classified as markup.
Tag
A markup construct that begins with < and ends with >. Tags come in three flavors:

  • start-tags; for example: <section>
  • end-tags; for example: </section>
  • empty-element tags; for example: <line-break />
Element
A logical document component which either begins with a start-tag and ends with a matching end-tag or consists only of an empty-element tag. The characters between the start- and end-tags, if any, are the element’s content, and may contain markup, including other elements, which are called child elements. An example of an element is <Greeting>Hello, world.</Greeting> (see hello world). Another is <line-break />.
Attribute
A markup construct consisting of a name/value pair that exists within a start-tag or empty-element tag. In the example (below) the element img has two attributes, srcand alt:
Another example would be
where the name of the attribute is “number” and the value is “3”. 

An XML attribute can only have a single value and each attribute can appear at most once on each element. In the common situation where a list of multiple values is desired, this must be done by encoding the list into a well-formed XML attribute[note 1] with some format beyond what XML defines itself. Usually this is either a comma or semi-colon delimited list or, if the individual values are known not to contain spaces,[note 2] a space-delimited list can be used.

where the attribute “class” has both the value “inner greeting-box” and also indicates the two CSS class names “inner” and “greeting-box”. 
XML declaration
XML documents may begin by declaring some information about themselves, as in the following example:

Characters and escaping

XML documents consist entirely of characters from the Unicode repertoire. Except for a small number of specifically excluded control characters, any character defined by Unicode may appear within the content of an XML document.

XML includes facilities for identifying the encoding of the Unicode characters that make up the document, and for expressing characters that, for one reason or another, cannot be used directly.

Valid characters

Unicode code points in the following ranges are valid in XML 1.0 documents:[11]

  • U+0009, U+000A, U+000D: these are the only C0 controls accepted in XML 1.0;
  • U+0020–U+D7FF, U+E000–U+FFFD: this excludes some (not all) non-characters in the BMP (all surrogates, U+FFFE and U+FFFF are forbidden);
  • U+10000–U+10FFFF: this includes all code points in supplementary planes, including non-characters.

XML 1.1[12] extends the set of allowed characters to include all the above, plus the remaining characters in the range U+0001–U+001F. At the same time, however, it restricts the use of C0 and C1 control characters other than U+0009, U+000A, U+000D, and U+0085 by requiring them to be written in escaped form (for example U+0001 must be written as &#x01; or its equivalent). In the case of C1 characters, this restriction is a backwards incompatibility; it was introduced to allow common encoding errors to be detected.

The code point U+0000 is the only character that is not permitted in any XML 1.0 or 1.1 document.

Encoding detection

The Unicode character set can be encoded into bytes for storage or transmission in a variety of different ways, called “encodings”. Unicode itself defines encodings that cover the entire repertoire; well-known ones include UTF-8 and UTF-16.[13] There are many other text encodings that predate Unicode, such as ASCII and ISO/IEC 8859; their character repertoires in almost every case are subsets of the Unicode character set.

XML allows the use of any of the Unicode-defined encodings, and any other encodings whose characters also appear in Unicode. XML also provides a mechanism whereby an XML processor can reliably, without any prior knowledge, determine which encoding is being used.[14] Encodings other than UTF-8 and UTF-16 will not necessarily be recognized by every XML parser.

Escaping

XML provides escape facilities for including characters which are problematic to include directly. For example:

  • The characters “<” and “&” are key syntax markers and may never appear in content outside a CDATA section.[15]
  • Some character encodings support only a subset of Unicode. For example, it is legal to encode an XML document in ASCII, but ASCII lacks code points for Unicode characters such as “é”.
  • It might not be possible to type the character on the author’s machine.
  • Some characters have glyphs that cannot be visually distinguished from other characters: examples are

There are five predefined entities:

  • &lt; represents “<“
  • &gt; represents “>”
  • &amp; represents “&”
  • &apos; represents ‘
  • &quot; represents “

All permitted Unicode characters may be represented with a numeric character reference. Consider the Chinese character “中”, whose numeric code in Unicode is hexadecimal 4E2D, or decimal 20,013. A user whose keyboard offers no method for entering this character could still insert it in an XML document encoded either as or . Similarly, the string “I <3 Jörg” could be encoded for inclusion in an XML document as “I &lt;3 Jörg“.

” is not permitted, however, because the null character is one of the control characters excluded from XML, even when using a numeric character reference.[16] An alternative encoding mechanism such as Base64 is needed to represent such characters.

Comments

Comments may appear anywhere in a document outside other markup. Comments cannot appear before the XML declaration. Comments start with “<!--” and end with “-->“. For compatability with SGML, the string “--” (double-hyphen) is not allowed inside comments[17]; this means comments cannot be nested. The ampersand has no special significance within comments, so entity and character references are not recognized as such, and there is no way to represent characters outside the character set of the document encoding.

An example of a valid comment: “<!--no need to escape <code> & such in comments-->

International use

XML 1.0 (Fifth Edition) and XML 1.1 support the direct use of almost any Unicode character in element names, attributes, comments, character data, and processing instructions (other than the ones that have special symbolic meaning in XML itself, such as the less-than sign, “<“). The following is a well-formed XML document including both Chinese and Cyrillic characters:

Well-formedness and error-handling

Main article: Well-formed document

The XML specification defines an XML document as a well-formed text – meaning that it satisfies a list of syntax rules provided in the specification. Some key points in the fairly lengthy list include:

  • The document contains only properly encoded legal Unicode characters
  • None of the special syntax characters such as < and & appear except when performing their markup-delineation roles
  • The begin, end, and empty-element tags that delimit the elements are correctly nested, with none missing and none overlapping
  • The element tags are case-sensitive; the beginning and end tags must match exactly. Tag names cannot contain any of the characters !"#$%&'()*+,/;<=>?@[]^`{|}~, nor a space character, and cannot start with -., or a numeric digit.
  • A single “root” element contains all the other elements

The definition of an XML document excludes texts that contain violations of well-formedness rules; they are simply not XML. An XML processor that encounters such a violation is required to report such errors and to cease normal processing. This policy, occasionally referred to as “draconian error handling,” stands in notable contrast to the behavior of programs that process HTML, which are designed to produce a reasonable result even in the presence of severe markup errors.[18] XML’s policy in this area has been criticized as a violation of Postel’s law (“Be conservative in what you send; be liberal in what you accept”).[19]

The XML specification defines a valid XML document as a well-formed XML document which also conforms to the rules of a Document Type Definition (DTD).

Schemas and validation

In addition to being well-formed, an XML document may be valid. This means that it contains a reference to a Document Type Definition (DTD), and that its elements and attributes are declared in that DTD and follow the grammatical rules for them that the DTD specifies.

XML processors are classified as validating or non-validating depending on whether or not they check XML documents for validity. A processor that discovers a validity error must be able to report it, but may continue normal processing.

A DTD is an example of a schema or grammar. Since the initial publication of XML 1.0, there has been substantial work in the area of schema languages for XML. Such schema languages typically constrain the set of elements that may be used in a document, which attributes may be applied to them, the order in which they may appear, and the allowable parent/child relationships.

Document Type Definition

The oldest schema language for XML is the Document Type Definition (DTD), inherited from SGML.

DTDs have the following benefits:

  • DTD support is ubiquitous due to its inclusion in the XML 1.0 standard.
  • DTDs are terse compared to element-based schema languages and consequently present more information in a single screen.
  • DTDs allow the declaration of standard public entity sets for publishing characters.
  • DTDs define a document type rather than the types used by a namespace, thus grouping all constraints for a document in a single collection.

DTDs have the following limitations:

  • They have no explicit support for newer features of XML, most importantly namespaces.
  • They lack expressiveness. XML DTDs are simpler than SGML DTDs and there are certain structures that cannot be expressed with regular grammars. DTDs only support rudimentary datatypes.
  • They lack readability. DTD designers typically make heavy use of parameter entities (which behave essentially as textual macros), which make it easier to define complex grammars, but at the expense of clarity.
  • They use a syntax based on regular expression syntax, inherited from SGML, to describe the schema. Typical XML APIs such as SAX do not attempt to offer applications a structured representation of the syntax, so it is less accessible to programmers than an element-based syntax may be.

Two peculiar features that distinguish DTDs from other schema types are the syntactic support for embedding a DTD within XML documents and for defining entities, which are arbitrary fragments of text and/or markup that the XML processor inserts in the DTD itself and in the XML document wherever they are referenced, like character escapes.

DTD technology is still used in many applications because of its ubiquity.

XML Schema

XML Schema (W3C)

A newer schema language, described by the W3C as the successor of DTDs, is XML Schema, often referred to by the initialism for XML Schema instances, XSD (XML Schema Definition). XSDs are far more powerful than DTDs in describing XML languages. They use a rich datatyping system and allow for more detailed constraints on an XML document’s logical structure. XSDs also use an XML-based format, which makes it possible to use ordinary XML tools to help process them.

xs:schema element that defines a schema:

RELAX NG

RELAX NG was initially specified by OASIS and is now also an ISO/IEC International Standard (as part of DSDL). RELAX NG schemas may be written in either an XML based syntax or a more compact non-XML syntax; the two syntaxes are isomorphic and James Clark‘s conversion tool – ‘Trang‘, can convert between them without loss of information. RELAX NG has a simpler definition and validation framework than XML Schema, making it easier to use and implement. It also has the ability to use datatype framework plug-ins; a RELAX NG schema author, for example, can require values in an XML document to conform to definitions in XML Schema Datatypes.

Schematron

Schematron is a language for making assertions about the presence or absence of patterns in an XML document. It typically uses XPath expressions.

ISO DSDL and other schema languages

The ISO DSDL (Document Schema Description Languages) standard brings together a comprehensive set of small schema languages, each targeted at specific problems. DSDL includes RELAX NG full and compact syntax, Schematron assertion language, and languages for defining datatypes, character repertoire constraints, renaming and entity expansion, and namespace-based routing of document fragments to different validators. DSDL schema languages do not have the vendor support of XML Schemas yet, and are to some extent a grassroots reaction of industrial publishers to the lack of utility of XML Schemas for publishing.

Some schema languages not only describe the structure of a particular XML format but also offer limited facilities to influence processing of individual XML files that conform to this format. DTDs and XSDs both have this ability; they can for instance provide the infoset augmentation facility and attribute defaults. RELAX NG and Schematron intentionally do not provide these.

Related specifications

A cluster of specifications closely related to XML have been developed, starting soon after the initial publication of XML 1.0. It is frequently the case that the term “XML” is used to refer to XML together with one or more of these other technologies which have come to be seen as part of the XML core.

  • XML Namespaces enable the same document to contain XML elements and attributes taken from different vocabularies, without any naming collisions occurring. Although XML Namespaces are not part of the XML specification itself, virtually all XML software also supports XML Namespaces.
  • XML Base defines the xml:base attribute, which may be used to set the base for resolution of relative URI references within the scope of a single XML element.
  • The XML Information Set or XML infoset describes an abstract data model for XML documents in terms of information items. The infoset is commonly used in the specifications of XML languages, for convenience in describing constraints on the XML constructs those languages allow.
  • xml:id Version 1.0 asserts that an attribute named xml:id functions as an “ID attribute” in the sense used in a DTD.
  • XPath defines a syntax named XPath expressions which identifies one or more of the internal components (elements, attributes, and so on) included in an XML document. XPath is widely used in other core-XML specifications and in programming libraries for accessing XML-encoded data.
  • XSLT is a language with an XML-based syntax that is used to transform XML documents into other XML documents, HTML, or other, unstructured formats such as plain text or RTF. XSLT is very tightly coupled with XPath, which it uses to address components of the input XML document, mainly elements and attributes.
  • XSL Formatting Objects, or XSL-FO, is a markup language for XML document formatting which is most often used to generate PDFs.
  • XQuery is an XML-oriented query language strongly rooted in XPath and XML Schema. It provides methods to access, manipulate and return XML, and is mainly conceived as a query language for XML databases.
  • XML Signature defines syntax and processing rules for creating digital signatures on XML content.
  • XML Encryption defines syntax and processing rules for encrypting XML content.

Some other specifications conceived as part of the “XML Core” have failed to find wide adoption, including XIncludeXLink, and XPointer.

Programming interfaces

The design goals of XML include, “It shall be easy to write programs which process XML documents.”[6] Despite this, the XML specification contains almost no information about how programmers might go about doing such processing. The XML Infoset specification provides a vocabulary to refer to the constructs within an XML document, but also does not provide any guidance on how to access this information. A variety of APIs for accessing XML have been developed and used, and some have been standardized.

Existing APIs for XML processing tend to fall into these categories:

  • Stream-oriented APIs accessible from a programming language, for example SAX and StAX.
  • Tree-traversal APIs accessible from a programming language, for example DOM.
  • XML data binding, which provides an automated translation between an XML document and programming-language objects.
  • Declarative transformation languages such as XSLT and XQuery.

Stream-oriented facilities require less memory and, for certain tasks which are based on a linear traversal of an XML document, are faster and simpler than other alternatives. Tree-traversal and data-binding APIs typically require the use of much more memory, but are often found more convenient for use by programmers; some include declarative retrieval of document components via the use of XPath expressions.

XSLT is designed for declarative description of XML document transformations, and has been widely implemented both in server-side packages and Web browsers. XQuery overlaps XSLT in its functionality, but is designed more for searching of large XML databases.

Simple API for XML

Simple API for XML (SAX) is a lexicalevent-driven interface in which a document is read serially and its contents are reported as callbacks to various methods on ahandler object of the user’s design. SAX is fast and efficient to implement, but difficult to use for extracting information at random from the XML, since it tends to burden the application author with keeping track of what part of the document is being processed. It is better suited to situations in which certain types of information are always handled the same way, no matter where they occur in the document.

Pull parsing

Pull parsing[20] treats the document as a series of items which are read in sequence using the Iterator design pattern. This allows for writing of recursive-descent parsersin which the structure of the code performing the parsing mirrors the structure of the XML being parsed, and intermediate parsed results can be used and accessed as local variables within the methods performing the parsing, or passed down (as method parameters) into lower-level methods, or returned (as method return values) to higher-level methods. Examples of pull parsers include StAX in the Java programming language, XMLReader in PHP, ElementTree.iterparse in Python, System.Xml.XmlReader in the .NET Framework, and the DOM traversal API (NodeIterator and TreeWalker).

A pull parser creates an iterator that sequentially visits the various elements, attributes, and data in an XML document. Code which uses this iterator can test the current item (to tell, for example, whether it is a start or end element, or text), and inspect its attributes (local name, namespace, values of XML attributes, value of text, etc.), and can also move the iterator to the next item. The code can thus extract information from the document as it traverses it. The recursive-descent approach tends to lend itself to keeping data as typed local variables in the code doing the parsing, while SAX, for instance, typically requires a parser to manually maintain intermediate data within a stack of elements which are parent elements of the element being parsed. Pull-parsing code can be more straightforward to understand and maintain than SAX parsing code.

Document Object Model

The Document Object Model (DOM) is an interface-oriented application programming interface that allows for navigation of the entire document as if it were a tree of nodeobjects representing the document’s contents. A DOM document can be created by a parser, or can be generated manually by users (with limitations). Data types in DOM nodes are abstract; implementations provide their own programming language-specific bindings. DOM implementations tend to be memory intensive, as they generally require the entire document to be loaded into memory and constructed as a tree of objects before access is allowed.

Data binding

Another form of XML processing API is XML data binding, where XML data are made available as a hierarchy of custom, strongly typed classes, in contrast to the generic objects created by a Document Object Model parser. This approach simplifies code development, and in many cases allows problems to be identified at compile time rather than run-time. Example data binding systems include the Java Architecture for XML Binding (JAXB) and XML Serialization in .NET.[21]

XML as data type

XML has appeared as a first-class data type in other languages. The ECMAScript for XML (E4X) extension to the ECMAScript/JavaScript language explicitly defines two specific objects (XML and XMLList) for JavaScript, which support XML document nodes and XML node lists as distinct objects and use a dot-notation specifying parent-child relationships.[22] E4X is supported by the Mozilla 2.5+ browsers (though now deprecated) and Adobe Actionscript, but has not been adopted more universally. Similar notations are used in Microsoft’s LINQ implementation for Microsoft .NET 3.5 and above, and in Scala (which uses the Java VM). The open-source xmlsh application, which provides a Linux-like shell with special features for XML manipulation, similarly treats XML as a data type, using the <[ ]> notation.[23] The Resource Description Framework defines a data type rdf:XMLLiteral to hold wrapped, canonical XML.[24]

History

XML is an application profile of SGML (ISO 8879).[25]

The versatility of SGML for dynamic information display was understood by early digital media publishers in the late 1980s prior to the rise of the Internet.[26][27] By the mid-1990s some practitioners of SGML had gained experience with the then-new World Wide Web, and believed that SGML offered solutions to some of the problems the Web was likely to face as it grew. Dan Connolly added SGML to the list of W3C’s activities when he joined the staff in 1995; work began in mid-1996 when Sun Microsystems engineer Jon Bosak developed a charter and recruited collaborators. Bosak was well connected in the small community of people who had experience both in SGML and the Web.[28]

XML was compiled by a working group of eleven members,[29] supported by a (roughly) 150-member Interest Group. Technical debate took place on the Interest Group mailing list and issues were resolved by consensus or, when that failed, majority vote of the Working Group. A record of design decisions and their rationales was compiled by Michael Sperberg-McQueen on December 4, 1997.[30] James Clark served as Technical Lead of the Working Group, notably contributing the empty-element “<empty />” syntax and the name “XML”. Other names that had been put forward for consideration included “MAGMA” (Minimal Architecture for Generalized Markup Applications), “SLIM” (Structured Language for Internet Markup) and “MGML” (Minimal Generalized Markup Language). The co-editors of the specification were originally Tim Bray and Michael Sperberg-McQueen. Halfway through the project Bray accepted a consulting engagement with Netscape, provoking vociferous protests from Microsoft. Bray was temporarily asked to resign the editorship. This led to intense dispute in the Working Group, eventually solved by the appointment of Microsoft’s Jean Paoli as a third co-editor.

The XML Working Group never met face-to-face; the design was accomplished using a combination of email and weekly teleconferences. The major design decisions were reached in a short burst of intense work between August and November 1996,[31] when the first Working Draft of an XML specification was published.[32] Further design work continued through 1997, and XML 1.0 became a W3C Recommendation on February 10, 1998.

Sources

XML is a profile of an ISO standard SGML, and most of XML comes from SGML unchanged. From SGML comes the separation of logical and physical structures (elements and entities), the availability of grammar-based validation (DTDs), the separation of data and metadata (elements and attributes), mixed content, the separation of processing from representation (processing instructions), and the default angle-bracket syntax. Removed were the SGML declaration (XML has a fixed delimiter set and adopts Unicode as the document character set).

Other sources of technology for XML were the Text Encoding Initiative (TEI), which defined a profile of SGML for use as a “transfer syntax”; and HTML, in which elements were synchronous with their resource, document character sets were separate from resource encoding, the xml:lang attribute was invented, and (like HTTP) metadata accompanied the resource rather than being needed at the declaration of a link. The Extended Reference Concrete Syntax (ERCS) project of the SPREAD (Standardization Project Regarding East Asian Documents) project of the ISO-related China/Japan/Korea Document Processing expert group was the basis of XML 1.0’s naming rules; SPREAD also introduced hexadecimal numeric character references and the concept of references to make available all Unicode characters. To support ERCS, XML and HTML better, the SGML standard IS 8879 was revised in 1996 and 1998 with WebSGML Adaptations. The XML header followed that of ISO HyTime.

Ideas that developed during discussion which were novel in XML included the algorithm for encoding detection and the encoding header, the processing instruction target, the xml:space attribute, and the new close delimiter for empty-element tags. The notion of well-formedness as opposed to validity (which enables parsing without a schema) was first formalized in XML, although it had been implemented successfully in the Electronic Book Technology “Dynatext” software;[33] the software from the University of Waterloo New Oxford English Dictionary Project; the RISP LISP SGML text processor at Uniscope, Tokyo; the US Army Missile Command IADS hypertext system; Mentor Graphics Context; Interleaf and Xerox Publishing System.

Versions

There are two current versions of XML. The first (XML 1.0) was initially defined in 1998. It has undergone minor revisions since then, without being given a new version number, and is currently in its fifth edition, as published on November 26, 2008. It is widely implemented and still recommended for general use.

The second (XML 1.1) was initially published on February 4, 2004, the same day as XML 1.0 Third Edition,[34] and is currently in its second edition, as published on August 16, 2006. It contains features (some contentious) that are intended to make XML easier to use in certain cases.[35] The main changes are to enable the use of line-ending characters used on EBCDIC platforms, and the use of scripts and characters absent from Unicode 3.2. XML 1.1 is not very widely implemented and is recommended for use only by those who need its unique features.[36]

Prior to its fifth edition release, XML 1.0 differed from XML 1.1 in having stricter requirements for characters available for use in element and attribute names and unique identifiers: in the first four editions of XML 1.0 the characters were exclusively enumerated using a specific version of the Unicode standard (Unicode 2.0 to Unicode 3.2.) The fifth edition substitutes the mechanism of XML 1.1, which is more future-proof but reduces redundancy. The approach taken in the fifth edition of XML 1.0 and in all editions of XML 1.1 is that only certain characters are forbidden in names, and everything else is allowed, in order to accommodate the use of suitable name characters in future versions of Unicode. In the fifth edition, XML names may contain characters in the BalineseCham, or Phoenician scripts among many others which have been added to Unicode since Unicode 3.2.[35]

Almost any Unicode code point can be used in the character data and attribute values of an XML 1.0 or 1.1 document, even if the character corresponding to the code point is not defined in the current version of Unicode. In character data and attribute values, XML 1.1 allows the use of more control characters than XML 1.0, but, for “robustness”, most of the control characters introduced in XML 1.1 must be expressed as numeric character references (and #x7F through #x9F, which had been allowed in XML 1.0, are in XML 1.1 even required to be expressed as numeric character references[37]). Among the supported control characters in XML 1.1 are two line break codes that must be treated as whitespace. Whitespace characters are the only control codes that can be written directly.

There has been discussion of an XML 2.0, although no organization has announced plans for work on such a project. XML-SW (SW for skunkworks), written by one of the original developers of XML,[38] contains some proposals for what an XML 2.0 might look like: elimination of DTDs from syntax, integration of namespacesXML Base andXML Information Set (infoset) into the base standard.

The World Wide Web Consortium also has an XML Binary Characterization Working Group doing preliminary research into use cases and properties for a binary encoding of the XML infoset. The working group is not chartered to produce any official standards. Since XML is by definition text-based, ITU-T and ISO are using the name Fast Infoset for their own binary infoset to avoid confusion (see ITU-T Rec. X.891 | ISO/IEC 24824-1).

Criticism

XML and its extensions have regularly been criticized for verbosity and complexity.[39] Mapping the basic tree model of XML to type systems of programming languages or databases can be difficult, especially when XML is used for exchanging highly structured data between applications, which was not its primary design goal. Other criticisms attempt to refute the claim that XML is a self-describing language[40] (though the XML specification itself makes no such claim). JSONYAML, and S-Expressions are frequently proposed as alternatives (see Comparison of data serialization formats);[41] which focus on representing highly structured data rather than documents, which may contain both highly structured and relatively unstructured content.

from http://www.w3.org/TR/WD-xml-961114.html

2.8 Prolog and Document Type Declaration

The function of the markup in an XML document is to describe its storage and logical structures, and associate attribute-value pairs with the logical structure. XML provides a mechanism, the document type declaration, to define constraints on that logical structure, and to support the use of predefined storage units. An XML document is said to be valid if there is an associated document type declaration and if the document complies with the constraints expressed in it.

The document type declaration must appear before the first element in the document.

For example, the following is a complete XML document, well-formed but not valid:

The XML document type declaration may include a pointer to an external entity containing a subset of the necessary markup declarations, and may also directly include another, internal, subset of the necessary markup declarations.

These two subsets, taken together, are properly referred to as the document type definition, abbreviated DTD. However, it is a common practice for the bulk of the markup declarations to appear in the external subset, and for this subset, usually contained in a file, to be referred to as “the DTD” for a class of documents. For example:

The system identifier hello.dtd indicates the location of a full DTD for the document.

The declarations can also be given locally, as in this slightly larger example:

The character-set label <?XML encoding="UTF-8"> indicates that the document entity is encoded using the UTF-8 transformation of ISO 10646. The legal values of the character set code are given in the discussion of character encodings.

Individual markup declaration types are described elsewhere in this specification.