Bibliography, catalogs, pixel data: Management of heterogeneous Big Data at CDS by the documentalists

High-speed Internet and the evolution of data storage space in terms of cost-effectiveness have changed the way data are managed today. Large amounts of heterogeneous data can now be visualized easily and rapidly using interactive applications such as “Google Maps”. In this respect, the Hierarchical Progressive Survey (HiPS) method has been developed by the Centre de Données astronomiques de Strasbourg (CDS) since 2009. HiPS uses the hierarchical sky tessellation called HEALPix to describe and organize images, data cubes or source catalogs. These HiPS can be accessed and visualized using applications such as Aladin. We show that structuring the data using HiPS enables easy and quick access to large and complex sets of astronomical data. As with bibliographic and catalog data, full documentation and comprehensive metadata are absolutely required for pertinent usage of these data. Hence the role of documentalists in the process of producing HiPS is essential. We present the interaction between documentalists and other specialists who are all part of the CDS team and support this process. More precisely, we describe the tools used by the documentalists to generate HiPS or to update the Virtual Observatory standardized descriptive information (the “metadata”). We also present the challenges faced by the documentalists processing such heterogeneous data on scales from megabytes up to petabytes. On the one hand, documentalists at CDS manage small textual or numerical data for one or a few astronomical objects. On the other hand, they process large data sets such as big catalogs containing heterogeneous data like spectra, images or data cubes, for millions of astronomical objects. Finally, by participating in the development of an interactive visualization of images or three-dimensional data cubes using the HiPS method, documentalists contribute to the long-term management of complex, large astronomical data.

The more you zoom in on a particular area, the more details show up large, enormous. Hence, managing astronomical data has always been a matter of Big Data. But how big? Present or future missions bring astronomy into the era of petabyte surveys, with large volumes of high-quality images and catalog data. The quantity of data is continuously growing. To illustrate this, we take as an example the future LSST project. Owing to the high speed at which the telescope will map the sky, and to the depth it can reach, LSST will produce about 15 TB of data per night, for a total of 60 PB of collected data. Processing these data will produce a 15 PB catalog database. But already today the Gaia satellite, launched in 2013, is observing one billion stars several times in order to measure their brightness, proper motions, distances and positions over time. A team of CDS documentalists, astronomers and IT specialists made the Gaia Data Release 1 (DR1) available to the scientific community; it can be accessed in various ways from the Gaia@CDS page: either through the Web interface of the VizieR catalog service, the Cone Search or TAP tools, or by cross-identifying the Gaia sources with other catalogs using the X-Match service. Additionally, Gaia identifiers are today present for almost 1.8 million SIMBAD sources.
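As an illustration of the Cone Search access mentioned above, the following minimal sketch builds a request URL using the three query parameters of the IVOA Simple Cone Search protocol (RA, DEC and SR, all in decimal degrees). The function name and the base address are hypothetical placeholders; the actual address of the Gaia service at CDS is not given in this paper and must be taken from the service documentation.

```python
from urllib.parse import urlencode


def cone_search_url(base_url: str, ra_deg: float, dec_deg: float,
                    radius_deg: float) -> str:
    """Build a Simple Cone Search request URL.

    RA, DEC and SR are the protocol's query parameters,
    all expressed in decimal degrees.
    """
    if not 0.0 <= ra_deg < 360.0:
        raise ValueError("RA must be in [0, 360) degrees")
    if not -90.0 <= dec_deg <= 90.0:
        raise ValueError("DEC must be in [-90, 90] degrees")
    if radius_deg <= 0.0:
        raise ValueError("search radius must be positive")

    query = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})

    # Append to the base address whether or not it already has a query part.
    if "?" not in base_url:
        separator = "?"
    elif base_url.endswith(("?", "&")):
        separator = ""
    else:
        separator = "&"
    return f"{base_url}{separator}{query}"


# Hypothetical service address, used for illustration only.
EXAMPLE_BASE_URL = "https://example.org/conesearch/gaia-dr1"

url = cone_search_url(EXAMPLE_BASE_URL, ra_deg=56.75, dec_deg=24.12,
                      radius_deg=0.1)
print(url)
```

The resulting URL can be passed to any HTTP client; a compliant service answers with a VOTable listing the sources inside the requested cone.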
Another way of interactively accessing large data sets is via Aladin: pixel or catalog data structured using the Hierarchical Progressive Surveys (HiPS) [1] method can be displayed by panning or zooming, just as in Google Maps, but looking at the sky and using sky survey data. HiPS is an IVOA standard (since May 2017) for packaging, storing, querying and describing astronomical data. It has been developed at CDS since 2009 and has been very successful: various data centers such as IRAP, IAS, SSC-XMM, CADC, JAXA or ESAC use this method to provide hundreds of terabytes of data representing hundreds of HiPS. Fig. 1 shows an Aladin view of a large data set structured in HiPS, accessed at finer and finer resolution by zooming in on a wide area until reaching a detailed view of the survey. Restructuring images into HiPS involves resampling the original pixels and preserving the associated metadata, as described in sections 2 and 3. This paper describes how documentalists structure original pixel data into HiPS. Section 2 outlines the mechanism behind the HiPS method. Section 3 shows the main steps documentalists follow to organize data using the HiPS method. In section 4 we discuss the challenges faced by the documentalists and provide our conclusions.

Documentalists at CDS
Documentalists, or in other words 'information scientists', manage content coming from electronic publications of scientific literature, dedicated web sites, and internal or external servers. Documentalists sometimes also receive data directly from the authors. Once obtained, the data are processed using various internal tools developed by the IT specialists, such as DJIN [3], COSIM [4] or HiPSgen. The astronomers' scientific expertise complements the documentalists' knowledge in order to provide pertinent and useful information to the scientific community. Namely, documentalists update the following CDS services:
• the SIMBAD database [6]: cross-identifications, basic data, measurements and bibliography for individual astronomical objects outside the solar system;
• the VizieR catalog service [7]: astronomical sources, published tables, observation logs, spectra, light curves, polarization data, models, statistical analyses, compilations, etc., grouped in a collection of astronomical catalogs with associated data [5];
• the interactive sky atlas Aladin [2]: images and data cubes structured in HiPS, accessed and visualized interactively.
Therefore, in addition to the burst of data volume in astronomy, documentalists at CDS also face data complexity. Indeed, CDS hosts numerical or text string data, related to a single SIMBAD object or organized in VizieR tables and catalogs; associated data such as spectra, time series, etc.; and images or data cubes in their original format or structured according to IVOA standards such as HiPS and MOC (Multi-Order Coverage map). Since 2016, documentalists have been part of the team of IT specialists and astronomers handling big pixel data. Documentalists structure images and data cubes into HiPS, allowing Aladin or other specialized sky browsing tools developed by data centers, such as JUDO, ESASky or MIZAR, to give a progressive view of surveys covering part of, or the entire, sky at various spatial scales.
Data reorganization into HiPS uses a hierarchical multi-resolution division of the sky called HEALPix, which we present next.

HEALPix tessellation technique
Tessellation is used by architects for brickwork, by artists for decoration, and by bees to build their honeycombs. In astronomy we use tessellation to map data on the sky. Different hierarchical multi-resolution divisions of the sky are used in astronomy: WWT uses a triangular partition of the sphere and Google a cylindrical one (second and third upper panels in Fig. 2). The CDS Hierarchical Progressive Survey method uses a curvilinear partitioning called HEALPix (Hierarchical Equal Area isoLatitude Pixelization) [8] (lower panel in Fig. 2). The HEALPix way of dividing the sphere is widely used in astronomy and was chosen by CDS as a balance between performance and quality. At its base resolution, HEALPix divides the sphere into 12 equal-area cells, called the order 0 HEALPix map (leftmost frame of the lower panel in Fig. 2). Higher resolution is reached by subdividing each cell into 4 equal cells, recursively, until the maximum survey resolution is reached (order 1 HEALPix map, ..., maximum order HEALPix map). The lower panel in Fig. 2 shows this hierarchical structure: each pixel is divided into four similar pixels at each successive order.

HiPSgen: http://aladin.u-strasbg.fr/hips/HipsIn10Steps.gml
Big source catalogs can also be structured into HiPS for a progressive view in Aladin (subsets of the catalog are displayed as a function of the angular resolution), but this is done by other members of the team, not by documentalists.

[...] used locally by Aladin or other dedicated clients by using its root directory name. To enable online use of HiPS, the directories and files are simply copied to an HTTP server.
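The recursive subdivision described in this section, and the directory tree that results from it, can be sketched in a few lines. This is a minimal illustration, not the HiPSgen implementation: the function names are hypothetical, the child numbering assumes the HEALPix NESTED scheme (not stated in the text), and the tile naming pattern (Norder, Dir, Npix) is taken from the IVOA HiPS standard rather than from this paper.

```python
import math
from typing import List, Tuple


def healpix_npix(order: int) -> int:
    """Number of equal-area cells covering the sphere at a given order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    # 12 base cells, each split into 4 at every successive order.
    return 12 * 4 ** order


def cell_size_deg(order: int) -> float:
    """Approximate angular size of a cell (square root of its area)."""
    area_sr = 4.0 * math.pi / healpix_npix(order)
    return math.degrees(math.sqrt(area_sr))


def children(order: int, ipix: int) -> Tuple[int, List[int]]:
    """Four sub-cells of a cell, assuming NESTED numbering."""
    return order + 1, [4 * ipix + k for k in range(4)]


def hips_tile_path(order: int, ipix: int, ext: str = "fits") -> str:
    """Relative path of a tile below the HiPS root directory.

    Tiles are grouped in sub-directories of 10000 tiles each
    (naming pattern assumed from the IVOA HiPS standard).
    """
    if not 0 <= ipix < healpix_npix(order):
        raise ValueError("tile index out of range for this order")
    directory = (ipix // 10000) * 10000
    return f"Norder{order}/Dir{directory}/Npix{ipix}.{ext}"


if __name__ == "__main__":
    for order in range(4):
        print(order, healpix_npix(order), round(cell_size_deg(order), 2))
    print(children(0, 5))
    print(hips_tile_path(3, 334))
```

Because a tile path depends only on the order and the cell index, a client can request exactly the tiles needed for the current view, which is why copying the directory tree to an HTTP server is enough to publish a HiPS.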

HiPS creation process and metadata update

Challenges for documentalists
The diversification of astronomical data types and sizes changes the way documentalists at CDS deal with database content. Next, we discuss how documentalists handle issues related to big data volumes, heterogeneity, incomplete or incorrect associated metadata, and the complexity of big astronomical surveys.

Figure 5. Aladin all-sky view of DSS blue images structured into HiPS.
The HiPS process can take a very long time, depending on the size of the original images, not to mention the time needed to transfer the final result to the server. During this time, the documentalist either switches to updating SIMBAD with bibliography, cross-identifications, basic data and measurements, or updates the metadata of other existing HiPS. Working with large pixel data implies switching between servers, disks or partitions, checking the currently running processes, and handling new tools and Linux console commands (wget, rsync, get, grep, gunzip, HiPSgen, etc.). Multitasking skills have thus become an important requirement for documentalists in today's Big Data era.
An astronomical survey may contain images with randomly distributed bad pixels and with different sky backgrounds, resolutions and filters. Given this heterogeneity of the input data, tests on a small number of images need to be performed before launching the computation for the whole image survey. Furthermore, applying the same parameters to heterogeneous data may be a big challenge. Sometimes the HiPS generation process has to be restarted partially or entirely because of irregularities created by the rich diversity of images, or because images are missing for a certain location on the sky. The more surveys we handle, the more we learn about the heterogeneity problems that may arise, and the faster we can properly structure images and data cubes into HiPS.
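One possible way to choose the small test set mentioned above is to draw a few images from each group of the survey (for example each filter), so that the trial run sees the same variety as the full data set. The sketch below is only an illustration under that assumption; the function name, the grouping key and the file names are hypothetical and do not describe the procedure actually used at CDS.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pick_pilot_sample(images: Iterable[Tuple[str, str]],
                      per_group: int = 2,
                      seed: int = 0) -> Dict[str, List[str]]:
    """Select a few image paths from every group for a trial run.

    `images` yields (path, group) pairs, where `group` is any label
    describing the heterogeneity of interest (filter, resolution, ...).
    A fixed seed keeps the selection reproducible between runs.
    """
    if per_group <= 0:
        raise ValueError("per_group must be positive")

    groups: Dict[str, List[str]] = defaultdict(list)
    for path, group in images:
        groups[group].append(path)

    rng = random.Random(seed)
    sample: Dict[str, List[str]] = {}
    for group in sorted(groups):
        paths = groups[group]
        count = min(per_group, len(paths))
        sample[group] = sorted(rng.sample(paths, count))
    return sample


if __name__ == "__main__":
    # Hypothetical survey listing: five blue images and one red image.
    listing = [(f"blue_{i}.fits", "B") for i in range(5)]
    listing.append(("red_0.fits", "R"))
    for group, paths in pick_pilot_sample(listing).items():
        print(group, paths)
```

Running the trial on such a sample first makes it cheaper to discover that one set of parameters does not suit every group of images.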
Missing or inaccurate mandatory keywords in the original FITS files (e.g., the input is a HEALPix map that does not respect the HEALPix conventions) may slow down the HiPS creation process. In that case, the data providers are contacted in order to obtain files with complete and correct metadata headers. Updating the information associated with the resulting HiPS is necessary for further usage. This may imply discussions with members of the team concerning new formulas for converting various measurements into IVOA standard units. Although in most cases full metadata is provided for the new HiPS, aspects such as the limits of the visualization tools in loading very large images may prevent the display of the HiPS original data.
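A check of this kind can be run before starting a long computation. The sketch below inspects a header that has already been read into a plain dictionary and reports what is missing or inconsistent. It assumes the three keywords commonly used by the HEALPix FITS conventions (PIXTYPE, ORDERING, NSIDE) and requires NSIDE to be a power of two, as a hierarchical subdivision needs; the exact list of keywords required by HiPSgen is not given in this paper, and the function name is hypothetical.

```python
from typing import Any, List, Mapping

# Keywords assumed mandatory for a HEALPix map (HEALPix FITS conventions).
REQUIRED_KEYWORDS = ("PIXTYPE", "ORDERING", "NSIDE")


def check_healpix_header(header: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in a HEALPix FITS header.

    An empty list means that the checks of this sketch passed.
    """
    problems: List[str] = []

    for key in REQUIRED_KEYWORDS:
        if key not in header:
            problems.append(f"missing keyword {key}")

    pixtype = header.get("PIXTYPE")
    if pixtype is not None and str(pixtype).strip().upper() != "HEALPIX":
        problems.append(f"PIXTYPE is {pixtype!r}, expected 'HEALPIX'")

    ordering = header.get("ORDERING")
    if ordering is not None and \
            str(ordering).strip().upper() not in ("RING", "NESTED"):
        problems.append(f"ORDERING is {ordering!r}, expected RING or NESTED")

    nside = header.get("NSIDE")
    if nside is not None:
        is_int = isinstance(nside, int) and not isinstance(nside, bool)
        # A power of two has exactly one bit set.
        if not is_int or nside < 1 or (nside & (nside - 1)) != 0:
            problems.append(f"NSIDE is {nside!r}, expected a power of two")

    return problems


if __name__ == "__main__":
    incomplete = {"PIXTYPE": "HEALPIX", "NSIDE": 500}
    for problem in check_healpix_header(incomplete):
        print(problem)
```

The list of problems can be sent as is to the data provider when asking for corrected files.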
Handling pixel data coming from various astronomical surveys implies communicating directly or by email with astronomers and IT specialists. HiPS validation, a better understanding of the input data, and the choice of the pertinent metadata or information to be displayed in advanced HiPS clients are just a few of the issues that require close collaboration between astronomers, IT specialists and documentalists.