Gentoo Prefix as a Physics Software Manager

Gentoo Prefix is explored as a manager of sophisticated physics software stacks. It is shown to be an advantageous package management solution for big physics experiments because of its reusability on heterogeneous host environments, its vast collection of ebuild recipes, its extensibility to future computing architectures and its deep roots in an open, diverse community inside and outside science.


Introduction
In big physics experiments, as simulation, reconstruction and analysis become more sophisticated, scientific reproducibility is not a trivial task [1][2][3]. Open data alone are not enough to make results replicable. With the advancement of data analysis and detector simulations, the physics software stack is getting deeper and becoming one of the biggest challenges.
Modularity is a common-sense practice in software engineering that facilitates quality and reusability of code. However, it often introduces nested dependencies that are not easy for physicists to work with. Package managers were invented by GNU/Linux distributions [4] and are regarded as the single biggest advancement Linux has brought to the industry. They are the widely practised solution for organizing dependencies systematically. Many package managers oriented towards science and high performance computing (HPC) exist with the goal of handling software complexity, such as EasyBuild [5], Spack [6,7], Conda [8], Nix [9,10] and GNU Guix [11].
Gentoo Linux has been a full GNU/Linux distribution since 1999. It is general purpose, is used for daily and scientific tasks, and runs on the majority of instruction set architectures (ISA), including x86/amd64, sparc, arm/64, alpha, mips32/64 and riscv. It is regarded as a meta-distribution for its flexibility to be repurposed for specific needs. Development of Gentoo packages is open and distributed, and the packages represent the shared wisdom of the community.
Portage, the package manager of Gentoo Linux, is both robust and flexible, and is highly regarded by the free operating system community. In the form of Gentoo Prefix [12][13][14], portage can be deployed by a normal user into a directory prefix on a workstation, a cloud instance or a supercomputing node. Software is described by build recipes, called ebuilds, along with its dependency relations.

Dependency Handling
The basic function of a package manager is to automatically satisfy dependency graphs by recursively installing needed software. Figure 1 shows the example of the Pythia [15]

Gentoo Prefix as a Universal Environment
Gentoo Prefix installs a complete Gentoo userspace into a directory, represented by the shell variable EPREFIX. Prefix extends the host operating system with almost the full collection of software from Gentoo. A bootstrap helper script is provided to compile Prefix from scratch on many existing POSIX environments, including macOS, Solaris, Windows/Cygwin and most variants of GNU/Linux. Restricting our discussion to GNU/Linux, the script installs Prefix in 3 stages: compile python and portage in EPREFIX/tmp; install gcc in EPREFIX/tmp with portage, referred to as Stage 2; build Gentoo in EPREFIX with portage and the Stage 2 gcc. The flow is shown schematically in Figure 2. From Figure 2, it is evident that Prefix depends only on the kernel of the host and provides a full GNU userspace. Dynamic linking is performed within the Prefix. Neither LD_LIBRARY_PATH nor ELF DT_RPATH is needed. A Gentoo Prefix is independent of the host userspace, but unlike containers [16], it shares the file system view of the host and does not rely on any special privilege or configuration assumptions.
This flexibility makes Gentoo Prefix redistributable to another host provided the kernel APIs are compatible, which is straightforwardly achieved if Prefix is tuned to the lowest possible kernel version and the largest common CPU instruction subset. Combined with CernVM-FS [17], a single Gentoo Prefix can be mounted and shared by a heterogeneous set of client hosts, regardless of which GNU/Linux distribution they run, be it Red Hat Enterprise Linux (RHEL), openSUSE, Arch Linux, Debian or Ubuntu, to name only a few. This facilitates collaboration and software reproducibility across the world.
As shown in Figure 3, the Filesystem Hierarchy Standard (FHS) [18] is followed inside EPREFIX. The ELF program header INTERP points to the dynamic loader in EPREFIX. The dynamic loader uses libraries from EPREFIX/lib, etc. Everything is the same as on the host, except for the directory offset EPREFIX.
    /cvmfs/gentoo $ ls
    bin etc lib lib64 run sbin tmp usr var
    $ ldd `which cp`
    linux-vdso.so.1 (0x00007ffe245e2000)
    libacl.so.1 => /cvmfs/gentoo/usr/lib64/libacl.so.1 (0x00007f0f7ec29000)
    libc.so.6 => /cvmfs/gentoo/lib64/libc.so.6 (0x00007f0f7ea59000)
    /cvmfs/gentoo/lib64/ld-linux-x86-64.so.2 (0x00007f0f7ec59000)
Widely used build systems support specifying a directory offset, which is how they accommodate Prefix. For autotools-based build systems, this is achieved by ./configure --prefix="${EPREFIX}" ... and for CMake-based build systems, by cmake -DCMAKE_INSTALL_PREFIX="${EPREFIX}" .... Language-specific packages inherit EPREFIX from their language interpreters, such as Python, Perl, R and Haskell. In the toolchain, gcc and binutils search for headers and libraries in EPREFIX and write the EPREFIX dynamic loader into the ELF INTERP header in the final step of linking. glibc instructs applications and itself to look for configuration files in EPREFIX/etc.
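The claim that binaries built inside a Prefix request the dynamic loader under EPREFIX can be checked by hand. The following is a minimal sketch, not part of Gentoo Prefix itself: the function names path_in_prefix, elf_interp and check_binary are hypothetical, the default /cvmfs/gentoo is taken from the listing above, and readelf from binutils is assumed to be installed.

```shell
#!/bin/sh
# Return success if path $1 lies inside directory prefix $2.
path_in_prefix() {
    case "$1" in
        "${2%/}"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the ELF program interpreter (INTERP header) of binary $1.
# Relies on the English output of readelf, hence LC_ALL=C.
elf_interp() {
    LC_ALL=C readelf -l "$1" 2>/dev/null |
        sed -n 's/.*\[Requesting program interpreter: \(.*\)]/\1/p'
}

# Report whether binary $1 uses a dynamic loader inside prefix $2.
check_binary() {
    interp=$(elf_interp "$1")
    if path_in_prefix "$interp" "$2"; then
        echo "inside prefix: $interp"
    else
        echo "outside prefix: ${interp:-unknown}"
    fi
}

# Example run; skipped quietly when readelf is not available.
if command -v readelf >/dev/null 2>&1; then
    check_binary "$(command -v cp)" "${EPREFIX:-/cvmfs/gentoo}"
fi
```

On a plain host the cp binary would be expected to report a loader outside the prefix, while the same check run from within a Prefix shell should report one inside EPREFIX.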

Vast Collection of Packages
Gentoo is general purpose and is used by a diverse community of users. It features a big official package repository, consisting of 19437 packages as of November 2019 for all kinds of purposes. In the collection, the scientific categories include sci-astronomy, sci-biology, sci-calculators, sci-chemistry, sci-electronics, sci-geosciences, sci-mathematics, sci-physics and sci-visualization.
Thanks to the adaptable design of portage, it is interoperable with other software package managers. One good example is the GNU R [19] ecosystem offered by the R Overlay, with overlay being a Gentoo term for a package repository other than the main one [20]. As of November 2019, 18993 packages are automatically generated from R packages. Ebuild generators also exist for other ecosystems, such as PyPI, Emacs ELPA, Octave Forge, Java Maven and Rust Cargo, making Gentoo extensible through the available packaging systems.

Case Studies
This section discusses some typical use cases of Gentoo Prefix.

Salvage of a System More than 10 Years Old
As of 2020, there may still be some HPC systems running end-of-life products like RHEL 5, initially released in 2007, for various reasons, most notably the need to support legacy scientific software and a lack of funding for upgrades. Security consequences aside, the biggest drawback is that the old tools usually lack useful features.
Gentoo Prefix is independent of the host userspace. It selects the newest glibc that still supports the old running Linux kernel. The rest of Prefix is then established smoothly with few patches. Taking RHEL 5 in November 2019 as an example, without modifying the host, a normal user can upgrade Python from 2.4 to 3.6, GCC from 4.1 to 8.2, ld from 2.17 to 2.30 and glibc from 2.5 to 2.19. That makes the outdated system as usable as its cutting-edge counterparts.
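The selection of the newest glibc that matches an old kernel can be sketched as a lookup over minimum kernel requirements. This is only an illustration of the idea, not the logic of the bootstrap script: the functions version_ge and newest_glibc_for are hypothetical, and the three table rows are assumptions recalled from glibc release notes for x86_64 that should be verified before being relied on.

```shell
#!/bin/sh
# Return success if dotted numeric version $1 >= $2.
version_ge() {
    a=$1 b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        x=${a%%.*}; y=${b%%.*}
        [ "${x:-0}" -gt "${y:-0}" ] && return 0
        [ "${x:-0}" -lt "${y:-0}" ] && return 1
        # Drop the leading field; stop when no dot is left.
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    return 0
}

# Print the newest glibc in the table whose minimum kernel is satisfied.
# Table columns: glibc version, minimum Linux kernel (assumed values).
newest_glibc_for() {
    kernel=${1%%-*}   # strip suffixes such as "-398.el5"
    best=
    while read -r glibc min_kernel; do
        if version_ge "$kernel" "$min_kernel"; then best=$glibc; fi
    done <<EOF
2.19 2.6.16
2.25 2.6.32
2.30 3.2
EOF
    echo "$best"
}

# Example: a RHEL 5 style kernel release string.
newest_glibc_for 2.6.18-398.el5
```

Under these assumed table values, a 2.6.18 kernel resolves to glibc 2.19, in line with the RHEL 5 figures quoted above.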

Coexistence of Multiple Versions by Stacked Prefix
Conventional GNU/Linux distributions, including Gentoo, value deduplication of code in a single system. They hold well-tested versions of software in stable branches or the newest versions in development branches, but not both. By default, this does not align well with physics computing workloads, because scientific tasks value reproducibility the most, which sometimes means that a specific version of software has to be used throughout the entire lifespan of an experiment, for decades. Gentoo portage offers SLOTs to allow coexistence of multiple versions of a single piece of software [20], but those are only enabled in carefully selected packages.
A more flexible solution exists for Gentoo Prefix. Stacked Prefix makes thin instances based on a common parent. Build-time dependencies are reused to avoid code duplication. In the child Prefix, any version of needed software can be installed, masking those in the parent Prefix. With overlays, out-of-tree ebuilds can be added and maintained specifically for a certain experiment. Figure 4 shows coexistence of software stacks required by 3 underground experiments, Super-Kamiokande [21], XMASS [22] and Jinping Neutrino Experiment [23].
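The masking behaviour of a child Prefix can be pictured as an ordered lookup: the child is consulted first and the parent only when the child has nothing to offer. The sketch below is a toy model of that ordering and not the actual Stacked Prefix implementation; the directory layout, the resolve function and the tool names analysis-tool and helper-tool are all hypothetical.

```shell
#!/bin/sh
# Two throwaway directories stand in for a parent and a child Prefix.
demo=$(mktemp -d)
mkdir -p "$demo/parent/usr/bin" "$demo/child/usr/bin"

# The parent provides both tools; the child overrides only one of them.
touch "$demo/parent/usr/bin/analysis-tool" "$demo/parent/usr/bin/helper-tool"
touch "$demo/child/usr/bin/analysis-tool"

# Print the first match for tool $1 among the prefixes given after it.
resolve() {
    name=$1; shift
    for prefix in "$@"; do
        if [ -e "$prefix/usr/bin/$name" ]; then
            echo "$prefix/usr/bin/$name"
            return 0
        fi
    done
    return 1
}

# Child first, then parent: the child's copy masks the parent's,
# while tools missing from the child fall through to the parent.
masked=$(resolve analysis-tool "$demo/child" "$demo/parent")
inherited=$(resolve helper-tool "$demo/child" "$demo/parent")
echo "$masked"
echo "$inherited"

rm -rf "$demo"
```

Each experiment would get its own child in this picture, so several children can pin different versions while sharing one parent.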

Portability to Future ISA
In the field of HPC, new ISAs are being developed for better power efficiency and scalability, for example the Sunway TaihuLight [24] with an extended alpha ISA and the Post-K [25] with the arm64 ISA. The portability of Gentoo allows a smooth migration path to a different ISA without much extra effort. Since 2013, Gentoo Prefix has run on Android-based mobile devices in the form of Gentoo on Android, giving a native GNU userspace environment to handheld users. The same technology applies to arm64-based supercomputing facilities. It is remarkable that the most portable and the most powerful devices are powered by the same ISA, and that both can be enhanced by Gentoo Prefix.

Institutional Facilities
Large institutions host many physics experiments. The computing infrastructure is often shared by people from different groups. In this scenario, Gentoo Prefix offers functions complementary to the existing software managers. The authors of this article have deployed Gentoo Prefix at CERN (European Organization for Nuclear Research) at /cvmfs/sft.cern.ch/lcg/contrib/gentoo/startprefix and at IHEP (Institute of High Energy Physics) at /cvmfs/juno.ihep.ac.cn/ci/gentoo/startprefix.

Ebuild Example: ROOT
Gentoo ebuilds are written as carefully organized bash scripts, following the package manager specification [20]. A code snippet of an ebuild for CERN ROOT [26,27] is reproduced below.

Conclusion
Ever since its birth in 1999, Gentoo has been a paradise for geek users and developers. Its package manager portage is a comprehensive solution meeting all the needs of software management in big physics experiments, and performs especially well with CernVM-FS [17]. The physics use case of Gentoo is a natural consequence of its flexibility and expertise in building operating systems. It is delightful to discover that systems developed by the community are so useful for science. By examining real world use cases of Gentoo Prefix, it has been shown that physicists could benefit from existing tools of proven superiority to guarantee reproducibility in simulation, reconstruction and analysis of big physics experiments.