Design principles of the Metadata Querying Language (MQL) implemented in the ATLAS Metadata Interface (AMI) ecosystem

. This document describes the design principles of the Metadata Querying Language (MQL) implemented in ATLAS Metadata Interface (AMI), a metadata-oriented domain-specific language allowing to query databases without knowing the relation between tables. With this simplified yet generic grammar, MQL permits writing complex queries more simply than with Structured Query Language (SQL) .


Introduction
ATLAS Metadata Interface (AMI) is a generic ecosystem for metadata aggregation, transformation and cataloguing which benefits from about 20 years of feedback in the Large Hadron Collider (LHC) context. It is the official metadata repository for datasets and production parameters of the ATLAS [1] experiment. It implements Metadata Query Language (MQL), a metadata-oriented Domain-Specific Language (DSL) allowing to query databases without specifying all the relation between tables. With its simplified and generic grammar, MQL permits writing complex queries more simply than with SQL. This document describes how AMI compiles MQL into SQL queries using the underlying table relations graph automatically extracted through a reflection mechanism. A detailed description of the AMI framework design principles is available in earlier CHEP proceedings [2,3,4,5,6].

Concepts and benefits
MQL queries are similar to SQL queries but easier to write. MQL queries are more user oriented than expert oriented. Even if the MQL syntax is simpler than the SQL one, it permits to execute the same kind of queries, with full control on the generated SQL. In particular, it will always be possible to write an MQL query reproducing the result of a given SQL query even if it is trading verbosity for more assumption about JOIN type. For example, consider a physicist wants to list the names of the datasets that have files with size > 0. In MQL, The MQL language does not contain any FROM clause nor join (on foreign keys). Nevertheless, even if joins are not necessary in MQL queries, the end-user can provide constraints on the relation paths. If no constraint is provided, the MQL to SQL algorithm follows all the possible paths of the relation graph. As a result, for complex database structures, it could generate slow or irrelevant SQL query. The way of specifying path constraints is described Section 2.3. Figure 1. Sample database structure. In this schema, each table contains a unique auto-incremented identifier called "id". The "PROJECT" table contains projects information. This table doesn't have any foreign key (relation). The "DATASET" table contains entities representing a set of data. "DATASET" is linked to "PROJECT" with the foreign key constraint (DATASET.PROJECTFK = PROJECT.ID). The "DATASET_PARAM" table contains metadata parameters linked to the records of "DATASET". Those parameters have a name and a typed value. The "FILE" table is linked to "DATASET" thought the bridge to the "DATASET_FILE_BRIDGE" table. This table has two foreign key dependencies on both "DATASET" and "FILE". The "DATASET_TYPE" and "FILE_TYPE" tables introduce cycles in the relational dependency graph. Each of them have a foreign key dependency to the "PROJECT" table and to their related entity.

Relational model
In the following Figure 1, the relational model of a concrete use case is described, in order to expose various aspects of the MQL language and the resulting consequences on the generated SQL.

MQL specification
The MQL specification provides an interface to interact with any relational data source. It allows one to perform generic selection, insertion, modification and deletion operations, keeping benefits of the underlying relational model, but with a syntax less verbose of SQL. MQL introduces the notion of basic Qualified Identifier (QId) for representing either an entity (aka a table), or a field. Its syntax is [database.]entity to a table and [database.entity.]field or [entity.]field for a field.
As previously explained, there is neither a FROM clause nor join expression in MQL. In most cases, it means that if there is no cycle in the relation graph, MQL is able to autogenerate both the FROM clause and join expressions. If there are cycles, it is necessary to indicate the paths to be included or excluded by specifying between braces, at QId level, a set of constraint QId (tables or fields). See Figure 2.

Rule QId
Rule constraintQId MQL also introduces an isolation mechanism. It permits producing SQL embedded subqueries in order to perform complex relational operations on the data source. This is why, in addition to normal expression groups, delimited with parentheses, MQL introduces isolated groups delimited with square brackets. See Figure 4. The effect is to isolate the condition from the main query and thus to restrict the paths used to reach the QIds contained in the condition to only this condition.  According to Figure 1, the query represented in Figure 5 searches for dataset names, having both "Xsection" > "x" and "Luminosity" < "y" and linked to non-empty files only through the "DATASET_FILE_BRIDGE" table (the paths reaching PROJECT.id are excluded by the {!PROJECT.id} constraint). Note that the two isolated conditions point to the same table.

SQL generation in AMI
After server-side processing, MQL queries are converted to optimized SQL queries before being executed on the data source by the appropriate JDBC [6] driver. This section describes the steps for generating SQL from MQL.

Reflection and paths resolution
The AMI framework has a JDBC based reflexion subsystem permitting the extraction of the database structure (tables, fields, foreign keys). It allows one to build a graph of relations which is put in a cache inside AMI. On top of this subsystem, AMI provides a mechanism to automatically resolve all the possible paths linking the QIds in a MQL query.
As an example, from Figure 1

MQL parsing and QIds resolution
The AMI MQL has its dedicated Java tokenizer and parser, autogenerated from a LL(*) grammar (top-down parsing) by using the Another Tool for Language Recognition (ANTLR) framework [8].
The parsing of an MQL query produces an Abstract Syntax Tree (AST) object containing all the necessary information to, later, generate a SQL query. During the parsing, all the QIds in the query are resolved (catalogs, tables, fields, relations) using the AMI reflexion sub-system taking into account the provided QId path constraints.

SQL Generation
Using the resolved query AST, AMI is able to build both the FROM clauses and SQL joins in the WHERE clauses. The isolated expressions are encapsulated into nested sud-SQL queries. Figure 6 shows the generated SQL query for the MQL query. It is more verbose than the initial MQL, and this length ratio is even higher when the query becomes more complex. Note that for performance reason AMI has an AST optimizer, transforming isolated expressions to non-isolated expressions if there is no ambiguity.

Conclusion
Both database experts and end-users can take advantage of the AMI MQL language. It provides the same features of SQL but with a lightweight syntax. For non-expert users, it can totally mask the database relations and gives the possibility to easily perform complex queries. The AMI Core Framework takes full benefit of MQL with almost no overhead compared to SQL, especially thanks to cache usage. The new version of the AMI framework is in production in ATLAS and has already been chosen by other experiments for their metadata workflow.