IT Service Management at CERN: Data Centre and Service monitoring and status

The Information Technology department at CERN has been using ITIL Service Management methodologies [1] and ServiceNow since early 2011. In recent years, several developments have been accomplished regarding data centre and service monitoring, as well as service status reporting. The CERN Service Portal, built on top of ServiceNow, hosts the CERN Service Status Board, which informs end users and supporters of ongoing service incidents, planned interventions and service changes. The Service Portal also includes the Service Availability Dashboard, which displays the technical status of CERN computing services. Finally, ServiceNow has been integrated with the data centre monitoring infrastructure, via the GNI (General Notification Infrastructure), in order to implement event management and generate incidents from hardware, network, operating system and application alarms. We detail how these developments have been implemented, how they help supporters monitor and solve issues, and how they keep users informed of service status. We also highlight the lessons learnt from the implementation. Finally, possible future improvements are discussed.


Introduction
With the planned start of the LHC in 2008, the number of users at CERN started to grow at a very quick pace, while the number of CERN Staff decreased [2]. Moreover, the computing, storage and network capacity of the CERN Data Centres (Meyrin and Wigner) was increased so as to adapt to the surge in data produced by the LHC experiments. In February 2011, the CERN Service Management tool, ServiceNow, and the Service Desk went live [3].
Even in the first years of the project, it was a priority to report the current and future status of the various CERN services to end users. The CERN Service Status Board [4] and Service Availability Overview [5] websites fulfil this goal. We present the motivation to develop these tools, their history, current functionality and benefits, and future related tasks.
Monitoring the devices of the CERN Data Centres and integrating the monitoring infrastructure with the Service Management process have been another important priority of IT Service Management at CERN [6]. We also present the history of this integration, its current functionality and benefits, and future related tasks.
The contents of this paper were also presented at the CHEP 2018 conference in the form of a poster [7], which provides a more visual presentation of the information.

Service Status Board
The Service Status Board, also known as SSB, is a website [4] which informs end users and supporters of ongoing service incidents, planned interventions and service changes.

History
The first ancestor of the SSB was implemented in 1980 on top of the IBM Mainframe used by CERN [8]. It was then called "VM News", as the operating system of the mainframe was VM/CMS [9]. In 1995, a new web version was developed with Macromedia Dreamweaver [10], already under the name "Service Status Board". The website was maintained by webmasters, who published announcements at the request of service managers. In 2011, a new version of the SSB was implemented on Drupal [11]. In this interactive version of the SSB, service managers could create the announcements themselves, and non-IT services started using it as well.
However, the Drupal SSB was only a temporary solution preceding the current version, which is hosted in the CERN Service Portal [12], developed on top of the ServiceNow platform [13] and integrated with the CERN Service Catalogue. The current version was deployed in October 2013, after the Service Catalogue and the ServiceNow implementation were considered mature enough. In 2018, more than 1800 announcements were posted.

Functionality
The current Service Status Board allows any CERN supporter (i.e. any of the nearly 2000 persons helping to provide a service to CERN users) to post an announcement. As Table 1 shows, announcements have one of three types: Service Incident, Planned Intervention or Service Change; and one of three levels of impact: Down, Degraded or No Impact. The SSB Items are displayed on the SSB homepage as a list, as shown in Figure 1. They are organised in sections: Summary, Service Incidents, Interventions and Service Changes.
The Summary section shows the most relevant service incidents, interventions and changes; in particular, those ongoing, happening the same day, or upcoming soon. When clicking on one of the items from the list, more details are displayed via the "SSB Item View" page, also shown in Figure 1. This always includes a detailed "Description" section, and can optionally include the sections "Updates" and "Root Cause" for service incidents, or the "Communication Plan" section for planned interventions and service changes. All SSB Items also belong to one or more Service Elements, which represent the service(s) affected, and to one Functional Element, which represents the team and/or technical infrastructure supporting the service(s). The Functional Element team is responsible for the announcement and for the resolution of the incident, intervention or change. By clicking on the "+" button, users can also configure custom sections, defined via ServiceNow filters, that will appear on the right of the tab list of the homepage. For example: any recent announcements related to any service provided by the IT-CS group.
From the SSB Item View page, or from ServiceNow, supporters can edit the SSB Item, via a standard ServiceNow record editing form. SSB Items are stored in the ServiceNow "Outage" table. For simplicity, any supporter can edit any SSB Item, although all edits are tracked. This was found to be a good balance between convenience and security.
SSB Items also have a Visibility attribute, which can be "Public" or "CERN". Public items are fully visible to any user in the world. For "CERN" items, however, the detailed description requires logging in, although the title, type, impact, begin and end dates, and related service(s) are visible even before logging in.
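As an illustration, the SSB Item structure described above can be sketched as follows; the field and class names are ours for illustration, not the actual schema of the ServiceNow "Outage" table:

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class SSBType(Enum):
    SERVICE_INCIDENT = "Service Incident"
    PLANNED_INTERVENTION = "Planned Intervention"
    SERVICE_CHANGE = "Service Change"

class Impact(Enum):
    DOWN = "Down"
    DEGRADED = "Degraded"
    NO_IMPACT = "No Impact"

class Visibility(Enum):
    PUBLIC = "Public"
    CERN = "CERN"

@dataclass
class SSBItem:
    """Illustrative model of an SSB Item (stored in the ServiceNow 'Outage' table)."""
    title: str
    item_type: SSBType
    impact: Impact
    begin: datetime
    end: datetime
    service_elements: list   # affected service(s)
    functional_element: str  # team / technical infrastructure supporting the service(s)
    visibility: Visibility = Visibility.PUBLIC
    description: str = ""    # detailed description

def anonymous_view(item: SSBItem) -> dict:
    """Fields visible without logging in: metadata is always shown,
    while the detailed description of CERN-only items is hidden."""
    view = {
        "title": item.title,
        "type": item.item_type.value,
        "impact": item.impact.value,
        "begin": item.begin,
        "end": item.end,
        "services": item.service_elements,
    }
    if item.visibility is Visibility.PUBLIC:
        view["description"] = item.description
    return view
```

The `anonymous_view` helper mirrors the visibility rule above: title, type, impact, dates and related services are always visible, while the description of "CERN" items requires logging in.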
When editing an intervention or change, supporters can check whether it would collide with other interventions or changes happening on the same dates, as shown in Figure 2. The SSB also provides RSS feeds for each of its sections and subsections. Users can use RSS clients to more easily follow when new items are added, although most RSS clients are not good at detecting changes in existing RSS items. A dedicated channel in the Mattermost [14] chat system used at CERN also automatically receives messages whenever there is a new service incident, a planned intervention, or a service change happening soon.

Benefits
First of all, the Service Status Board provides a live view of current incidents and issues. This increases the transparency between service providers and service users for current, past and future service status changes, resulting in a better relationship with users and greater trust.
Furthermore, service incidents and planned interventions posted in the SSB significantly reduce the number of incident tickets submitted by end users. For this reason, the SSB is displayed in screens present in the control rooms of several LHC experiments.
The SSB also supports the planning of service maintenance and transition by helping avoid interventions or changes that collide in time. Some SSB Items can also be included in the weekly C5 report (CERN Computer Centre Coordination Committee, IT Department) and C3 Report (Coordination Communication and Coffee, Site Management and Buildings department). A weekly meeting reviews past and upcoming items, in order to both check the planning of interventions and changes and analyse recent incidents.
Integrated with the CERN Service Catalogue, the SSB is also a common announcement board for IT and non-IT services, helping common service management processes to be used across the Organisation.
Finally, in ServiceNow, a warning icon is displayed in Incident and Request tickets if the Service or Functional Element has a currently ongoing service incident or intervention.

Next steps
Various improvements are planned for the Service Status Board. First of all, a new version of the Service Portal, which hosts the SSB, is planned for 2019. This will allow the user experience of the website to be reviewed and improved.
After that, a subscription mechanism is planned as a specific improvement for the SSB. End users will be able to receive notifications when an SSB Item changes (e.g. incident resolution or intervention end), as well as when new SSB Items are posted for services of interest to them. Users will be able to configure from which SSB Items and services they wish to receive notifications.
Finally, a new "Major" type of Impact will be introduced. Major service incidents, interventions or changes will be highlighted in the SSB homepage and in the Service Portal.

Service Availability Overview
The Service Availability Overview (SAO), also known as Service Level Status (SLS), is another website [5] which informs end users and supporters of the current technical status of various services provided by the IT Department.

History
The initial version was called Service Level Status. It was synchronised with the Service Database (SDB) and displayed an entry for each service in the SDB.
After the CERN Service Catalogue, ServiceNow and the Service Portal were deployed in production in 2011, the Service Level Status was migrated to the CERN Service Portal in 2014 and the website title was changed to Service Availability Overview.
In 2015, a new layout was developed for public screens [15] installed in several public places at CERN and in the control rooms of LHC experiments. The same year, the architecture was migrated from a "pull" mechanism, where ServiceNow downloaded the status from the various services, to a "push" mechanism, where updates are received from the IT Monitoring Infrastructure.

Functionality
The Service Availability Dashboard displays one entry per Service Element that has been registered in it. Entries are organised in groups following the CERN Service Catalogue: one per Customer Service. Each entry has a different icon and colour depending on its current status: Available (green), Degraded (orange), Unavailable (red) or Unknown (grey).
When hovering the mouse over an entry, additional information is displayed, showing ongoing incidents and interventions published in the Service Status Board, and any additional details provided by the monitoring systems. A link to the monitoring dashboard implemented in Grafana [16] by the IT Monitoring team is also shown. The public screens view displays the availability of services in a grid layout. Both layouts are shown in Figure 3. The entry displays the technical status of the service, which may depend, for example, on how many servers are online in the underlying infrastructure, or on the response time of the service. As shown in Figure 4, this information is received via the XSLS system (part of the IT Monitoring infrastructure), which in turn receives status messages from the various services. Each service team is responsible for implementing a system that periodically pushes information to XSLS: the current status and, optionally, additional information in HTML format. When no information has been received for more than one hour, the entry's state becomes Unknown.
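As a sketch, a service team's periodic push to XSLS could look like the following; the endpoint URL and the XML element names are illustrative assumptions, not the actual XSLS schema:

```python
import xml.etree.ElementTree as ET

XSLS_ENDPOINT = "https://xsls.cern.ch/post"  # hypothetical endpoint URL

def build_status_update(service_id: str, status: str, info_html: str = "") -> str:
    """Build an XML status update: the current status plus optional
    additional information in HTML format (illustrative schema)."""
    if status not in ("available", "degraded", "unavailable"):
        raise ValueError("unknown status: " + status)
    root = ET.Element("serviceupdate")
    ET.SubElement(root, "id").text = service_id
    ET.SubElement(root, "status").text = status
    if info_html:
        ET.SubElement(root, "info").text = info_html
    return ET.tostring(root, encoding="unicode")

# A cron job (or similar) would then push the update periodically, e.g.:
#   import requests
#   requests.post(XSLS_ENDPOINT, data=build_status_update("batch", "available"))
# If nothing is pushed for more than one hour, the entry turns Unknown.
```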
Finally, supporters can also register to receive email notifications, sent from ServiceNow, when the status of a service degrades or becomes unknown.

Fig. 4. Flow of information reaching the Service Availability Overview system

Benefits
In a similar way to the Service Status Board, the Service Availability Overview page and screen increase the transparency between the providers of IT services and their users, building up trust. It also makes it easier for users of technical services, such as the CERN experiments, to diagnose issues.
It also offers an easy alerting mechanism for the teams responsible for a service when its availability status drops.

Data Centre Monitoring
The CERN Data Centres are monitored at various levels: hardware, network, operating system and application, with different teams responsible for each type of monitoring. In particular, hardware and software sensors produce alarms when situations requiring attention occur. These alarms are forwarded to the IT General Notification Infrastructure (GNI). From there, in many cases the alarms are forwarded to ServiceNow, where they are stored and can create Incidents automatically assigned to the appropriate support group.

History
The first instance of the Data Centre monitoring system was implemented in 2006, when the IT Computing Facilities group started using the BMC Remedy [17] product.
In early 2013, the implementation of the IT Computing Management workflow (ITCM) was migrated from Remedy into ServiceNow. ServiceNow was also integrated with the Lemon Alarm System (LAS, developed at CERN) [6]. In late 2013, ServiceNow was also integrated with the IT General Notification Infrastructure (GNI).
In 2014, a mechanism to group alarms into incidents was introduced, in order to reduce the number of different incidents by grouping related alarms in the same ticket. In 2016, LAS was phased out. The grouping mechanism was enhanced in order to further reduce the number of different incidents.

Functionality
ServiceNow receives alarms from the GNI infrastructure and stores them in its "Monitoring Events" table. The attributes of a Monitoring Event include among others: alarm name; alarm type (Hardware, Network, Operating System, Application, No Contact or Grafana); additional alarm information; troubleshooting recommendations; affected device(s), including their hostgroup in Puppet [18], environment and essential yes/no flag; suggested responsible users; and assignment information, such as Functional Element, Service Element and Assignment group level.
The first time an alarm concerning a device arrives in ServiceNow, it always creates a new Incident ticket, which is assigned automatically following the included assignment information. However, if an alarm of the same type arrives later, with its "Grouping flag" attribute set to true, and the alarm concerns the same device or a related device, the alarm is added to the existing Incident ticket.
As of the end of 2018, Hardware alarms are only added to existing incidents if they concern the same individual device (host) as previous alarms. Network, Operating System, Application and No Contact alarms can be added to incidents created by previous alarms concerning another device, if the two devices are related: same Puppet hostgroup or same network "cluster". Grafana-generated alarms are grouped based on their name only.
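These grouping rules can be summarised in a short sketch; the dictionary keys are illustrative, as the actual logic is implemented inside ServiceNow:

```python
def groups_together(new_alarm: dict, incident_alarm: dict) -> bool:
    """Decide whether new_alarm should be added to the incident created by
    incident_alarm, following the rules in force at the end of 2018."""
    if not new_alarm.get("grouping_flag"):
        return False                           # grouping disabled for this alarm
    if new_alarm["type"] != incident_alarm["type"]:
        return False                           # only alarms of the same type group
    t = new_alarm["type"]
    if t == "Grafana":                         # Grafana alarms: group on name only
        return new_alarm["name"] == incident_alarm["name"]
    if t == "Hardware":                        # Hardware: same individual host only
        return new_alarm["host"] == incident_alarm["host"]
    # Network, Operating System, Application, No Contact: same host,
    # or related devices (same Puppet hostgroup or same network "cluster")
    return (new_alarm["host"] == incident_alarm["host"]
            or new_alarm["hostgroup"] == incident_alarm["hostgroup"]
            or new_alarm["cluster"] == incident_alarm["cluster"])
```

For example, two Hardware alarms on different hosts would open two incidents, while two Network alarms on different switches in the same hostgroup would share one.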
When an alarm arrives, information concerning the device is pulled from the ServiceNow Configuration Management Database (CMDB) tables. Part of this information is also refreshed in real time via web service queries to the CERN Network Database (LanDB).
The resulting incident has a description with all the above details. The description is modified every time a new event is added to the incident. All the related alarms and affected devices are listed in the incident form as well.
The supporters can act normally on the incidents. Typically, the workflow is as follows:
- Take the incident in progress, to declare that work has started.
- Resolve the underlying issue in the device(s).
- Resolve the incident ticket, writing the details of the solution.
It is also possible for supporters to act on many tickets via "list actions". Special actions are also available to the Data Centre Operators team, such as logging calls to experts. Figure 5 illustrates the flow of alarm information from various devices to ServiceNow.

Benefits
Integrating the monitoring of the Data Centre devices with ServiceNow has resulted in many benefits. The cost of support is reduced by providing automatic assignment of issues and a common system to track them, which also facilitates the diagnosis of underlying root causes. The automatic assignment also allows for direct routing of incidents to suppliers where appropriate, for example for hardware repairs. This further reduces support costs for CERN, as the internal CERN team no longer has to create or reassign those incidents manually, while still keeping an overview of the situation and being able to report on the data.
ServiceNow also has out-of-the-box APIs which allow teams to automatically read new alarms, act on the underlying infrastructure to attempt to solve the issue automatically, and even resolve the ServiceNow ticket automatically as well.
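For example, a team's automation could poll new alarms through the standard ServiceNow REST Table API; the instance URL and the table name used here are placeholders for illustration, not the actual names used at CERN:

```python
# Sketch of polling new monitoring events via the ServiceNow REST Table API.
INSTANCE = "https://example.service-now.com"  # hypothetical instance URL

def build_event_poll(since: str, limit: int = 100) -> tuple:
    """Return (url, params) for fetching monitoring events created after
    `since` (format: YYYY-MM-DD HH:MM:SS), ordered oldest first."""
    url = INSTANCE + "/api/now/table/u_monitoring_event"  # placeholder table name
    params = {
        "sysparm_query": "sys_created_on>" + since + "^ORDERBYsys_created_on",
        "sysparm_limit": str(limit),
    }
    return url, params

# A team automation could then fetch the events and act on them:
#   import requests
#   url, params = build_event_poll("2018-12-01 00:00:00")
#   events = requests.get(url, params=params, auth=(user, pwd)).json()["result"]
#   ...attempt an automated fix, then update the related incident to Resolved
#   via a PATCH on the incident record.
```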
Finally, automatically-generated support cases are treated in the same tool as human-generated tickets. This enables measuring, reporting and trending on the workload for both types of tickets, allowing better management and planning of the required resources.

Next steps
One possible improvement is to add new grouping rules that adapt better to different teams' needs; for example, grouping certain alarms by Functional Element, or by root Puppet hostgroup instead of full hostgroup. This would result in fewer incidents with higher-quality information, reducing the cost of support even further.
Another improvement currently under discussion is for the GNI system to automatically resolve incidents in ServiceNow when certain alarms stop occurring for a given device. This would enable fully automatic handling of alarms in cases where the service managers can automate both the alarm resolution and the ServiceNow ticket resolution.