Distributed Data Collection for the Next Generation ATLAS EventIndex project

The ATLAS EventIndex currently runs in production, building a complete catalogue of events for experiments with large amounts of data. The current approach indexes all final data files produced at the CERN Tier-0 and at hundreds of grid sites, using a distributed data collection architecture in which Object Stores temporarily hold the conveyed information while references to it are sent through a Messaging System. The final backend for all indexed data is a central Hadoop infrastructure at CERN; an Oracle relational database provides faster access to a subset of this information. In the future of ATLAS, the event, rather than the file, should be the atomic unit of metadata information, in order to accommodate future data processing and storage technologies. Files will no longer be static quantities: they may aggregate data dynamically, allowing event-level granularity of processing in heavily parallel computing environments. This also simplifies the handling of loss and/or extension of data. In this sense the EventIndex may evolve towards a generalized whiteboard, with the ability to build collections and virtual datasets for end users. These proceedings describe the current distributed data collection architecture of the ATLAS EventIndex project, with details of the Producer, Consumer and Supervisor entities, the protocol, and the information temporarily stored in the Object Store. They also show the data flow rates and performance achieved since the Object-Store-as-temporary-store approach was put in production in July 2017. We review the challenges imposed by the expected increasing rates, which will reach 35 billion new real events per year in Run 3 and 100 billion new real events per year in Run 4. For simulated events the numbers are even higher, with 100 billion events/year in Run 3 and 300 billion events/year in Run 4. We also outline the challenges we face in accommodating future use cases in the EventIndex.

• A catalog of data (all events in all processing stages) is needed to serve multiple use cases and search criteria. A small quantity of data per event is indexed.
• Events are stored in files (identified by GUID); files are grouped into datasets.
• Wanted EventIndex information, ~300 bytes to 1 KB per event:
  • Event identifiers (run/event numbers, trigger stream, luminosity block)
  • Online trigger pattern (L1, L2, EF)
  • References (pointers) to the events at each processing step (RAW, ESD, AOD, DAOD) in all permanent files on storage
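As an illustration only, a minimal Python sketch of the kind of per-event record this list describes; the field names and pointer format are assumptions for this example, not the actual EventIndex schema:

    from dataclasses import dataclass, field
    from typing import Dict, Tuple

    @dataclass
    class EventIndexRecord:
        """Illustrative per-event record (~300 bytes to 1 KB when encoded)."""
        run_number: int            # event identifiers
        event_number: int
        trigger_stream: str
        lumi_block: int
        trigger_pattern: bytes     # online trigger bits (L1, L2, EF)
        # processing step (RAW, ESD, AOD, DAOD) -> (file GUID, object inside file)
        refs: Dict[str, Tuple[str, int]] = field(default_factory=dict)

    rec = EventIndexRecord(358031, 1234567, "physics_Main", 42, b"\x01\x02",
                           {"RAW": ("guid-raw", 0), "AOD": ("guid-aod", 17)})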

[Figure: per-event records with references (RAWref, ESDref, …) to each processing step.]

• We are indexing billions of events, stored in millions of files replicated at CERN and at hundreds of grid sites worldwide, amounting to 100 petabytes of data → a complex big data distributed system.

Use Cases

1) Event picking: users must be able to select single events according to given constraints. On the order of hundreds of concurrent users, with requests ranging from 1 event (the common case) to 30k events (occasional). (A minimal picking sketch appears after this list.)

2) Production consistency checks
• Duplicate event checking: events with the same ID appearing in the same or in different files/datasets.

• Overlap detection in the derivation framework: construct the overlap matrix identifying common events across the different files.
3) Trigger checks and event skimming: count events or provide an event list based on a trigger selection.
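Use case 1, event picking, amounts to a keyed lookup on the catalogue. A minimal, self-contained Python sketch under the assumption of a simple in-memory mapping; the helper lookup_event and the pointer format are illustrative, not the real EventIndex API:

    # Hypothetical event-picking helper: given run and event numbers,
    # return the pointer to the event at the requested processing step.
    def lookup_event(catalogue, run_number, event_number, step="AOD"):
        """catalogue maps (run, event) -> {step: (file GUID, object id)}."""
        refs = catalogue.get((run_number, event_number))
        if refs is None:
            raise KeyError(f"event {run_number}/{event_number} not indexed")
        return refs[step]                       # e.g. ("guid-aod", 17)

    catalogue = {(358031, 1234567): {"AOD": ("guid-aod", 17)},
                 (358031, 1234568): {"AOD": ("guid-aod", 18)}}

    # A picking request may carry from 1 (common) to ~30k (occasional) events:
    wanted = [(358031, 1234567), (358031, 1234568)]
    pointers = [lookup_event(catalogue, r, e) for r, e in wanted]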
Object creation rates:
-Mean of 5600 objects created/day, with peaks of 44K objects.
-A total of 2M objects residing on the CERN Ceph Object Store.

Size creation rates:
-Mean of 15 GiB stored/day, with peaks of 100 GiB.
-A total of ~5 TiB stored on the CERN Ceph Object Store.

Object sizes:
-Mean object size: 2.6 MiB (99% of objects are less than 15 MiB).
-Some larger objects exist; the biggest one is 670 MiB.
-A factor of 10 compression with respect to the original EventIndex data.

• Consumers retrieve the EventIndex files from the Object Store and write them into HDFS (Hadoop).
• Granularity at the dataset (tid) level.

[Plots: object creation rates and size creation rates.]
• 40% of datasets are contained in a single object.
• Most of the datasets are < 75 MiB; the biggest one has 8000 objects (~7 GiB).
• Single-Consumer event throughput improved from 1K events/s (messaging only) to 15K events/s (Object Store). Having overcome the messaging brokers bottleneck, we can now also scale horizontally.
• Data are stored in a single HDFS file per dataset (tid), in a directory named after the container. This reduces the number of HDFS files compared with the previous approach of one HDFS file per indexed GUID. (A minimal consumer sketch follows.)
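A minimal sketch in Python of what a Consumer does, assuming the Ceph Object Store is reached through its S3-compatible gateway (boto3) and HDFS through a WebHDFS client (the hdfs package); the endpoints, bucket and path layout are made up for illustration:

    import boto3
    from hdfs import InsecureClient

    s3 = boto3.client("s3", endpoint_url="https://ceph-gw.example.cern.ch")  # assumed endpoint
    hdfs = InsecureClient("http://hadoop-nn.example.cern.ch:50070")          # assumed endpoint

    def consume(bucket, object_keys, container, tid):
        """Fetch the EventIndex objects of one dataset (tid) from the
        Object Store and concatenate them into a single HDFS file."""
        hdfs_path = f"/eventindex/{container}/{tid}"        # one HDFS file per tid
        for i, key in enumerate(object_keys):
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            hdfs.write(hdfs_path, data=body,
                       overwrite=(i == 0), append=(i > 0))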
• Current (all years) EventIndex data in Hadoop:
-Future: one and only one logical record per event, containing:
  • Event identification
  • Immutable information (trigger, lumiblock, …)
  • For each processing step:
    • Link to the algorithm (processing task configuration)
    • Pointer(s) to the output(s)
    • Flags for offline selections (derivations)
(A sketch of such a record follows.)
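A hedged sketch of what a single logical record per event could look like, serialized as JSON from Python; the key names are assumptions for illustration, not a defined schema:

    import json

    record = {
        "event_id": {"run": 358031, "event": 1234567},
        "immutable": {"trigger": "0x1f2e", "lumiblock": 42},
        "processing_steps": [
            {
                "step": "AOD",
                "task_config": "link-to-processing-task",      # link to algorithm
                "outputs": [{"guid": "guid-aod", "oid": 17}],  # pointer(s) to output(s)
                "derivation_flags": {"DAOD_PHYS": True},       # offline selections
            },
        ],
    }
    print(json.dumps(record, indent=2))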

Support Virtual Datasets:
-A logical collection of events:
  • Created either explicitly (giving a collection of event IDs) or implicitly (a selection based on some other collection or on event attributes).
  • Labelling of individual events by a process or a user with attributes (key:value).

• Evolve EventIndex technologies to future demanding rates:
-Currently: ALL ATLAS processes, ~30 billion events/day (up to 350 Hz on average) → the update rate throughout the whole system (all years, real and simulated data). We read 8 M files/day and produce 3 M files.
-See the poster in this conference for more information.

• JSON-encoded trigger bits allow versatile operations: fewer columns than the binary approach, but without the possibility of bitwise Impala operations. (A small sketch of the encoding follows.)
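A minimal sketch of the idea behind JSON-encoded trigger bits, with made-up chain names; this illustrates the encoding trade-off, not the actual EventIndex format:

    import json

    # Binary approach: one bitmask column per trigger level.
    l1_bits = 0b0000_0101                 # bit i set -> L1 item i passed

    # JSON approach: store only the names/indices of the passing chains.
    trigger_json = json.dumps({"L1": [0, 2], "HLT": ["HLT_mu26_ivarmedium"]})

    # Versatile string/JSON operations become possible, but a bitwise
    # test such as (l1_bits & mask) cannot be pushed down to Impala.
    passed_l1_item2 = 2 in json.loads(trigger_json)["L1"]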

Event location (db, oid1, oid2): the GUID of the file where to find this event, and the object ID inside that file.
Provenance (JSON): where to find this event in previous processing versions.
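For illustration only, a Python sketch of how such a location triplet could be resolved to a concrete event reference; the lookup table, the helper and the provenance layout are assumptions, not the actual encoding:

    # Hypothetical resolution: the triplet is an opaque pointer into storage.
    def resolve_location(loc, guid_index):
        """Map (db, oid1) to a file GUID; oid2 is the object inside that file."""
        guid = guid_index[(loc["db"], loc["oid1"])]   # assumed lookup table
        return guid, loc["oid2"]

    guid_index = {(3, 112): "A1B2C3D4-0000-0000-0000-000000000001"}
    location = {"db": 3, "oid1": 112, "oid2": 17}     # as stored per event
    print(resolve_location(location, guid_index))

    # Provenance, JSON-encoded: the same event in previous processing versions.
    provenance = [{"processing": "r9364", "db": 2, "oid1": 98, "oid2": 5}]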

-Future: due to the expected trigger rates, we need to scale for the next ATLAS runs: at least half an order of magnitude for Run 3 (2021-2023), with 35 B new real events/year and 100 B new MC events/year; for Run 4, 100 B new real events and 300 B new MC events per year. Replicas and reprocessings then add on top of these numbers.

Exploring new backend storage solutions: optimization and unification of data storage for the EventIndex.

Apache Kudu: a new column-oriented storage engine that allows fast insertions and retrieval.
-Tables with a defined schema, primary keys and partitions; no foreign keys.
-Ingestion and query scanning are distributed among the servers holding the partitions (tablets).
-Partition pruning and projection/predicate pushdown (see the scan sketch below).

Benefits for the EventIndex:
-Unify the data for all use cases (random access + analytics).
-Related data (reprocessings) sit close to each other on disk, reducing redundancies and improving navigation.
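A minimal sketch with the kudu-python client of a scan that benefits from projection and predicate pushdown; the master host, table name and column names are assumptions for illustration:

    import kudu

    client = kudu.connect(host="kudu-master.example.cern.ch", port=7051)  # assumed master
    table = client.table("eventindex")                  # assumed table name

    # Only the projected columns are read, and the predicate is pushed
    # down to the tablet servers, which prune non-matching partitions.
    scanner = table.scanner()
    scanner.set_projected_column_names(["runnumber", "eventnumber"])
    scanner.add_predicate(table["dsname"] == "data17_13TeV")  # assumed column/value
    scanner.open()
    rows = scanner.read_all_tuples()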

EventIndex Architecture

Data Production: a task ensuring that all EventIndex grid production performs correctly.
• Trigger overlap detection: the number of events in a real-data run/stream satisfying trigger X which also satisfy trigger Y. Requirement: storing and accessing thousands of files and millions of events in reasonable time. (A toy counting sketch follows.)
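A toy Python sketch of the trigger-overlap computation over decoded per-trigger event sets; the event IDs and chain names are invented for the example:

    from itertools import product

    # trigger -> set of event IDs in one run/stream that passed it (toy data)
    passed = {
        "HLT_e26": {1, 2, 3, 5},
        "HLT_mu26": {2, 3, 8},
        "HLT_xe110": {3, 5},
    }

    # Overlap matrix: entry (X, Y) counts events satisfying both X and Y.
    overlap = {(x, y): len(passed[x] & passed[y])
               for x, y in product(passed, repeat=2)}

    print(overlap[("HLT_e26", "HLT_mu26")])   # -> 2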
2. "A Prototype for the Evolution of the ATLAS EventIndex Based on Apache Kudu Storage", CHEP2018.

Kudu events table schema (see the sketch below):
• Event uniqueness is enforced across all tids and files.
• The key keeps events that belong to the same dataset close together.
• Scans for events in the same dataset are concentrated.
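A minimal kudu-python sketch of a table whose primary key leads with the dataset, so that a dataset's events sit close together and dataset scans stay concentrated; the column names, bucket count and partitioning are assumptions, not the prototype's actual schema:

    import kudu
    from kudu.client import Partitioning

    builder = kudu.schema_builder()
    builder.add_column("dsname").type(kudu.string).nullable(False)    # dataset first
    builder.add_column("runnumber").type(kudu.int64).nullable(False)
    builder.add_column("eventnumber").type(kudu.int64).nullable(False)
    # Composite primary key: enforces uniqueness of (dataset, run, event),
    # and its leading dataset component clusters a dataset's rows together.
    builder.set_primary_keys(["dsname", "runnumber", "eventnumber"])
    schema = builder.build()

    # Hash partitioning on the dataset spreads ingestion over tablets
    # while keeping each dataset's rows in a bounded set of tablets.
    partitioning = Partitioning().add_hash_partitions(
        column_names=["dsname"], num_buckets=16)

    client = kudu.connect(host="kudu-master.example.cern.ch", port=7051)
    client.create_table("eventindex", schema, partitioning)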