SciQL, Bridging the Gap between Science and Relational ?· SciQL, Bridging the Gap between Science and…

  • Published on
    24-Sep-2018

  • View
    212

  • Download
    0

Transcript

  • SciQL, Bridging the Gap between Science and RelationalDBMS

    Ying Zhang Martin Kersten Milena Ivanova Niels NesCentrum Wiskunde & Informatica

    Amsterdam, The NetherlandsYing.Zhang, Martin.Kersten, Milena.Ivanova, Niels.Nes@cwi.nl

    ABSTRACTScientific discoveries increasingly rely on the ability to efficientlygrind massive amounts of experimental data using database tech-nologies. To bridge the gap between the needs of the Data-IntensiveResearch fields and the current DBMS technologies, we proposeSciQL (pronounced as cycle), the first SQL-based query languagefor scientific applications with both tables and arrays as first classcitizens. It provides a seamless symbiosis of array-, set- and sequence-interpretations. A key innovation is the extension of value-basedgrouping of SQL:2003 with structural grouping, i.e., fixed-sizedand unbounded groups based on explicit relationships between el-ements positions. This leads to a generalisation of window-basedquery processing with wide applicability in science domains. Thispaper describes the main language features of SciQL and illustratesit using time-series concepts.

    Categories and Subject DescriptorsE.1 [Data Structures]: Arrays; H.2.3 [Languages]: Query lan-guages; H.2.8 [Database Applications]: Scientific databases

    General TermsLanguage

    KeywordsSciQL, array query language, array database, scientific databases,time series

    1. INTRODUCTIONThe array computational paradigm is prevalent in most sciences

    and it has drawn attention from the database research communityfor many years. The object-oriented database systems of the 90sallowed any collection type to be used recursively [4] and multi-dimensional database systems took it as the starting point for theirdesign [21]. The hooks provided in relational systems for user de-fined functions and data types create a stepping stone towards inter-action with array-based libraries, i.e. RasDaMan [6] is one of the

    Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.IDEAS11 2011, September 21-23, Lisbon [Portugal]Editors: Bernardino, Cruz, DesaiCopyright c2011 ACM 978-1-4503-0627-0/11/09 ...$10.00.

    few systems in this area that have matured beyond the laboratorystage. Nevertheless, the array paradigm taken in isolation is insuf-ficient to create a full-fledged scientific information system. Sucha system should blend measurements with static and derived meta-data about the instruments and observations. It therefore calls fora strong symbiosis of the relational paradigm and array paradigm.The SciQL language presented in this paper fills this gap.

    The mismatch between application needs and database technol-ogy has a long history, e.g., [22, 8, 55, 20, 23, 21]. The main prob-lems encountered with relational systems in science can be summedup as i) the impedance mismatch between query language and ar-ray manipulation, ii) the difficulty to write complex array-basedexpressions in SQL, iii) ARRAYs are not first class citizens, and iv)ingestion of terabytes of data is too slow. The traditional DBMSsimply carries too much overhead. Moreover, much of the scienceprocessing involves use of standard libraries, e.g., LINPACK, andstatistics tools, e.g., R. Their interaction with a database is oftenconfined to a simplified data import/export facility. The proposedstandard for management of external data (SQL3/MED) [42] hasnot materialised as a component in contemporary system offerings.

    A query language is needed that achieves a true symbiosis of theTABLE and ARRAY semantics in the context of existing externalsoftware libraries. This led to the design of SciQL, where arraysare made first class citizens by enhancing the SQL:2003 frameworkalong three innovative lines:

    Seamless integration of array-, set-, and sequence- seman-tics.

    Named dimensions with constraints as a declarative meansfor indexed access to array cells.

    Structural grouping to generalize the value-based groupingtowards selective access to groups of cells based on posi-tional relationships for aggregation.

    A TABLE and an ARRAY differ semantically in a straightforwardmanner. A TABLE denotes a (multi-) set of tuples, while an AR-RAY denotes a (sparsely) indexed collection of tuples called cells.All cells covered by an arrays dimensions always exist conceptu-ally and their non-dimensional attributes are initialised to a defaultvalue, while in a TABLE tuples only come into existence after anexplicit insert operation. Arrays may appear wherever tables areallowed in an SQL expression, producing an array if the columnlist of a SELECT statement contains dimensional expressions. TheSQL iterator semantics associated with TABLEs carry over to AR-RAYs, but iteration is confined to cells whose non-dimensional at-tributes are not NULL.

    An important operation is to carve out an array slab for furtherprocessing. The windowing scheme in SQL:2003 is a step into this

  • 0.0 0.0 0.0 0.00.0 0.0 0.0 0.00.0 0.0 0.0 0.00.0 0.0 0.0 0.00 1

    null

    nullnull

    null

    0

    1

    2 3

    2

    3

    x

    y

    (a) matrix

    0.0 0.0 0.0 0.0null null null null0.0 0.0 0.0 0.0null null null null0 1

    null

    null

    null

    null

    0

    1

    2 3

    2

    3

    x

    y

    (b) stripes

    null null null 0.0null null 0.0 nullnull 0.0 null null0.0 null null null0 1

    null

    null

    null

    null

    0

    1

    2 3

    2

    3

    x

    y

    (c) diagonal (d) sparse

    0.0 0.0 0.0 0.00.0 0.0 0.0 0.00.0 0.0 0.0 0.00.0 0.0 0.0 0.00 1

    null

    null

    null

    null

    0

    1

    2 3

    2

    3

    x

    y

    Figure 1: SciQL fixed arrays with different forms

    -3.0 -2.0 -1.0 0.0-2.0 -1.0 0.0 5.0-1.0 0.0 3.0 4.00.0 1.0 2.0 3.00 1

    null

    null

    null

    null

    0

    1

    2 3

    2

    3

    x

    y

    (a) matrix

    3.0 4.0 5.0 6.0null null null null1.0 2.0 3.0 4.0null null null null0 1

    null

    null

    null

    null

    0

    1

    2 3

    2

    3

    x

    y

    (b) stripes

    null null null 10.0null null 10.0 nullnull 10.0 null null10.0 null null null0 1

    null

    null

    null

    null

    0

    1

    2 3

    2

    3

    x

    y

    (c) diagonal (d) sparse

    6.0 7.0 4.0 0.00.0 1.0 4.0 1.02.0 6.0 0.0 5.09.0 0.0 3.0 8.00 1

    null

    null

    null

    null

    0

    1

    2 3

    2

    3

    x

    y

    Figure 2: Results of updating the four fixed arrays by the firstfour queries in Section 2.2.

    direction. It was primarily introduced to better handle time series inbusiness data warehouses and data mining. In SciQL, we take it astep further by providing an easy to use language feature to identifygroups of cells based on their positional relationships. Each groupforms a pattern, called a tile, which can be subsequently used toderive all possible incarnations for, e.g., statistical aggregation.

    One way to evaluate SciQL is to confront the language designwith a functional benchmark. Unfortunately, the area of array- andtime series databases is still too immature to expect a (commer-cially) endorsed and crystallised benchmark. Instead, we exercisethe SciQL design using well-chosen use cases in various scientificdomains. In our previous paper [33], we have shown the expres-siveness of SciQL in Landsat and astronomy image processing. Inthis paper, we use SciQL on a different class of array problems,namely time series. Time series play a significant role in many sci-ence areas, such as statistics, econometrics, mathematical finance,(digital) signal processing and all kinds of sensor data.

    This paper is further organised as follows. Section 2 introducesSciQL through a series of examples. Section 3 demonstrates queryfunctionality. Section 4 evaluates the language with a functionalbenchmark using seismological time series data. Section 5 dis-cusses related work. Section 6 concludes the paper with a summaryand an outlook on the open issues.

    2. LANGUAGE MODELIn this section we summarize the features offered in SciQL con-

    cerning ARRAY definition, instantiation and modification, as wellas coercions between TABLE and ARRAY.

    2.1 Array DefinitionsWe purposely stay as close as possible to the syntax and seman-

    tics of SQL:2003. An ARRAY object definition reuses the syn-tax of TABLE with a few minor additions. An array has one-or-more dimensional attributes (for short: dimensions) and zero-or-more non-dimensional attributes. A dimension is a measurementof the size of the array in a particular named direction, e.g., x,y, z or time. A dimensional attribute is denoted by the key-word DIMENSION with optional constraints describing the dimen-sion range. The data type of a dimension can be any of the basicscalar data types, including TIMESTAMP, FLOAT and VARCHAR.The non-dimensional attributes of an array can be of any data typesa normal table column can be and they may use a DEFAULT clauseto initialize their values. The default value may be arbitrarily takenfrom a scalar expression, a cells dimensional value(s) (i.e., thecells coordinates on the array dimensions) or a side-effect freefunction. Omission of the default or assignment of a NULL-valueproduces a hole, which is ignored by the built-in aggregationfunctions.

    Arrays are either fixed or unbounded. An array is fixed iff allits dimensions are fixed, otherwise it is unbounded. The range andsize of a fixed dimension are exactly specified using the sequencepattern [::], which is composed outof expressions each producing one scalar value. The interval be-tween start and stop has an open end-point, i.e., stop is notincluded. For integer dimensions, the traditional syntax using aninteger upper bound [] is allowed as a shortcut of thesequence pattern [0:1:]. Figure 1 shows four fixed ar-rays with different forms. In addition to the most common C-stylerectangular arrays (Fig.1-a), stripes (Fig.1-b) can be defined as onewhere the default value of some rows is indistinguishable from outof bound access, i.e., those cells are explicitly excluded by carry-ing NULL values in their non-dimensional attributes. A diagonalarray (Fig.1-c) is easily formulated using a predicate over the di-mensions involved. It is even possible to carve out an array basedon its content (Fig.1-d), thereby effectively nullifying all cells out-side the domain of validity and producing a sparse array. This fea-ture is of particular interest to remove outliers using an integrityconstraint. Evidently, different array forms can lead to very dif-ferent considerations with respect to their physical representation,a topic discussed in a companion paper. The following statementsshow how the four arrays in Figure 1 are created in SciQL:

    CREATE ARRAY matrix (x INT DIMENSION[4],y INT DIMENSION[4],v FLOAT DEFAULT 0.0

    );

    CREATE ARRAY stripes (x INT DIMENSION[4],y INT DIMENSION[4] CHECK(MOD(y,2) = 1),v FLOAT DEFAULT 0.0

    );

    CREATE ARRAY diagonal (x INT DIMENSION[4],y INT DIMENSION[4] CHECK(x = y),v FLOAT DEFAULT 0.0

    );

    CREATE ARRAY sparse (x INT DIMENSION[4],y INT DIMENSION[4],v FLOAT DEFAULT 0.0 CHECK(v BETWEEN 0 AND 10)

    );

    A dimension is unbounded if any of its start, step, or stopexpressions is identified by the pseudo expression *. A DIMEN-SION clause without a sequence pattern implies the most open pat-tern [*:*:*]. Cells in an unbounded array can be modified us-ing the INSERT and DELETE statements carried over from the tablesemantics. An unbounded array has an implicitly defined actualsize derived from the minimal bounding rectangle that enclosesall cells with an explicitly inserted non-NULL value in the array.When walking through an array instance, cells outside the minimalbounding rectangle are ignored. However, direct access to any cellswithin the arrays dimension bounds is guaranteed to produce thedefault value. The effect is that listing an array with unboundeddimensions still produces a finite result, but it may be huge. An un-bounded dimension is typically used for an n-dimensional spatialarray where only part of the dimension range designates a non-empty array cell. Time series are also prototypical examples ofarrays with unbounded dimensions.

    2.2 Array ModificationsThe SQL update semantics is extended towards arrays in a straight-

    forward manner. The array cells are initialised upon creation with

  • the default values. A cell is given a new value through an ordinarySQL UPDATE statement. A dimension can be used as a bound vari-able, which takes on all its dimension values (i.e., valid values ofthis dimension) successively. A convenient shortcut is to combinemultiple updates into a single guarded statement. The evaluationorder ensures that the first predicate that holds dictates the cell val-ues. The refinement of the array matrix is shown in the first querybelow. The cells receive a zero only in the case x = y. The remain-ing queries demonstrate setting cell values in the arrays stripes,diagonal and sparse, respectively. The results are shown in Fig-ure 2.

    UPDATE matrix SET v =CASE WHEN x > y THEN x + y WHEN x < y THEN x - y ELSE 0 END;

    UPDATE stripes SET v = x + y;

    UPDATE diagonal SET v = v + 10;

    UPDATE sparse SET v = MOD(RAND(),16);

    Assignment of a NULL value to an array cell leads to a holein the array, a place indistinguishable from the out of bounds area.Such assignments overrule any predefined DEFAULT clause attachedto the array definition. For convenience, the built-in array aggregateoperations SUM(), COUNT(), AVG(), MIN() and MAX() are appliedto non-NULL values only.

    -3.0 -2.0 0.0 0.0-2.0 -1.0 5.0 0.0-1.0 0.0 4.0 0.00.0 1.0 3.0 0.00 1

    null

    null

    null

    null

    0

    1

    2 3

    2

    3

    x

    y

    Figure 3: Result of shift-ing and zero filling thelast column of matrix.

    Arrays can also be updated us-ing INSERT and DELETE state-ments. Since all cells seman-tically exists by definition, bothoperations effectively turn intoupdate statements. The DELETEstatement creates holes by assign-ing a NULL value for all qualifiedcells. The INSERT statement sim-ply overwrites the cells at posi-tions as specified by the input columns with new values. Notethat although the UPDATE, INSERT and DELETE statements do notchange the existence of array cells, for unbounded arrays they mayresult in scaling the minimal bounding rectangle up/down. Thethree queries below together illustrate how to delete a column inthe array matrix where x = 2, then shift the remaining columns,and (manually) set the last column of matrix to its default value.In the second and third queries, the x and y dimensions of the arraymatrix are matched against the projection columns of the SELECTstatements. Cells at matching positions are assigned new values(see Figure 3).

    DELETE FROM matrix WHERE x = 2;

    INSERT INTO matrix SELECT x-1, y, v FROM matrix WHERE x > 2;

    INSERT INTO matrix SELECT x, y, 0 FROM matrix WHERE x = 3;

    2.3 Array and Table CoercionsOne of the strong features of SciQL is to switch easily between

    a TABLE and an ARRAY perspective. Any array is turned into acorresponding table by simply selecting its attributes. The dimen-sions then form a compound primary key. For example, the matrixdefined earlier becomes a table using the expression SELECT x,y, v FROM matrix or using a CAST operation like CAST(...

Recommended

View more >