Quickly create subsets from large, scattered datasets
On 10 December, MARIS hosted a DigiShape technical webinar on Beacon: an open-source technology that makes it easier to open up large collections of observational data for research, modelling and data analysis.
During the webinar, Peter Thijsse, Robin Kooyman and Tjerk Krijger gave insight into the underlying technology, how a Beacon instance can be set up, how Beacon handles raw data, and how users can work with it from notebooks and a graphical interface.
Watch the Beacon webinar
Download the webinar slides (PDF)
Summary
Why this technology is relevant
In many organizations, measurement data is spread over thousands to millions of individual files. These are findable and usable, but in practice difficult to search. Compiling subsets – consisting of pieces of the original datasets – often requires complex workflows, considerable time and custom tooling: scripts that open each file separately, temporary conversions or additional intermediate steps. At the same time, there is a growing need to make subsets of data collections readily available for analyses, notebooks, modelling and digital twins.
Many organizations are therefore building data-lake-like solutions in which different data streams are made accessible in a consistent way. Beacon fits this approach: it requires no data restructuring, treats existing folder structures and object stores as if they were one coherent source, and is also very fast. This makes it easier to query and combine scattered files in different formats in a uniform way.
How Beacon Works
Beacon runs on top of existing file systems on physical servers or S3 buckets in the cloud. The files do not need to be pre-loaded or converted. The engine can directly create subsets from formats such as NetCDF, Zarr, Parquet, CSV, and Arrow. In doing so, Beacon automatically performs a number of operations: it fills in missing columns, converts data types into a usable form, and can convert units where needed. The result is one output file that can be used directly in notebooks or applications.
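Beacon's harmonization pipeline is internal to the engine, but its effect can be sketched in a few lines of pandas. The column names, units and fill values below are illustrative assumptions, not Beacon's actual behaviour:

```python
import pandas as pd

# Two source "files" with inconsistent schemas, as might occur in a
# scattered collection: one reports temperature in Kelvin and lacks
# a depth column, the other uses Celsius.
a = pd.DataFrame({"time": ["2024-01-01"], "temp_K": [288.15]})
b = pd.DataFrame({"time": ["2024-01-02"], "temp_C": [14.0], "depth": [5.0]})

# Harmonize: convert units to a common one, fill in the missing
# column, and normalize dtypes so both frames line up as one output.
a["temp_C"] = a.pop("temp_K") - 273.15
merged = pd.concat([a, b], ignore_index=True)
merged["depth"] = merged["depth"].fillna(0.0)    # fill missing column
merged["time"] = pd.to_datetime(merged["time"])  # normalize dtype

print(merged)
```

The single harmonized frame stands in for the "one output file" Beacon returns to notebooks or applications.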
In the webinar, Robin Kooyman showed how Beacon is structured. The technology is written in Rust and combines a REST API with a set of core libraries responsible for managing collections, reading different formats, and executing queries. Under the hood, Beacon uses Apache Arrow and DataFusion for planning and executing queries. This allows relevant columns to be selected, filters to be pushed down to the source, and files to be processed in parallel. The effect is that subsets from large collections of files can be retrieved very quickly, without the need for a heavy infrastructure in advance.
An important component is the Beacon Binary Format (BBF). This allows large numbers of small files – such as NetCDFs – to be combined into a single container format with an index. This is especially useful for datasets that are traditionally not efficiently readable in parallel. BBF increases the accessibility of such files, especially for exploration and analysis.
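The BBF layout itself is not documented in this summary, so the following is only a generic illustration of the underlying idea: many small payloads packed into one file, with an index that lets a single record be read back without scanning the rest. None of the byte layout shown here is the actual BBF specification:

```python
import io
import json
import struct

# Pack several small "files" into one container, recording each
# payload's offset and length in an index.
payloads = {"a.nc": b"alpha-bytes", "b.nc": b"beta", "c.nc": b"gamma-gamma"}
buf = io.BytesIO()
index = {}
for name, data in payloads.items():
    index[name] = (buf.tell(), len(data))
    buf.write(data)

# Append the index as JSON, followed by its length, so a reader can
# locate the index from the end of the file.
index_bytes = json.dumps(index).encode()
buf.write(index_bytes)
buf.write(struct.pack("<I", len(index_bytes)))

# Random-access read of one entry via the index, without a full scan.
raw = buf.getvalue()
(index_len,) = struct.unpack("<I", raw[-4:])
idx = json.loads(raw[-4 - index_len:-4])
off, length = idx["b.nc"]
print(raw[off:off + length])
```

An index of this kind is what makes a container of many small files cheap to query selectively, which is otherwise the weak point of large NetCDF collections.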
Set up a Beacon instance
In the second part, Robin showed how to set up a Beacon instance. With a sample repository and a Docker configuration, a provider can start an instance, expose files and define collections within minutes. This makes it possible for organizations to experiment with the technology with relatively little effort and to explore how it fits within their own data streams.
Working with Beacon in notebooks and through the Studio
For users who analyze or model, a Python library has been developed that allows queries to be executed directly from notebooks. Tjerk Krijger demonstrated how filters on time, space and parameters are built up, and how subsets are retrieved as pandas or xarray objects. In addition, Beacon Studio has been developed: a graphical interface linked to Beacon instances that makes it easy to explore and download datasets, including map and graph views.
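The actual beacon-api calls were shown in the webinar and are documented on PyPI; they are not reproduced here. As a generic sketch of the kind of time, space and parameter filtering such a query expresses, in plain pandas with illustrative column names:

```python
import pandas as pd

# A toy observation table standing in for a Beacon collection.
obs = pd.DataFrame({
    "time": pd.to_datetime(["2023-06-01", "2023-07-15", "2023-08-20"]),
    "lat": [52.0, 53.5, 51.2],
    "lon": [4.1, 5.0, 3.8],
    "temperature": [17.5, 19.1, 20.3],
})

# Subset on a time window, a bounding box, and a parameter threshold:
# the three filter axes demonstrated in the webinar.
subset = obs[
    (obs["time"] >= "2023-07-01")
    & obs["lat"].between(51.0, 54.0)
    & obs["lon"].between(3.0, 5.5)
    & (obs["temperature"] > 18.0)
]
print(len(subset))
```

With the real library, a query along these axes is sent to a Beacon instance and the result comes back as a pandas or xarray object ready for analysis.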
Sources and background material
- Beacon is freely available under an open-source license at https://beacon.maris.nl/
- Docs: https://maris-development.github.io/beacon/
- GitHub: https://github.com/maris-development/beacon
- Studio (user interface for IHM POC): https://beacon-ihm.maris.nl/studio
- Python library: https://pypi.org/project/beacon-api/
What's next
Beacon continues to develop. Support for geotypes will be further expanded, and federation – querying multiple Beacon instances as a single source – is on the roadmap as work in progress. Organizations that work with large observational collections can already investigate whether parts of their data streams can be made more accessible or queried more efficiently with Beacon.
Contact
For questions about Beacon or ideas for collaboration:
Relevance for DigiShape
Interest in artificial intelligence, advanced modelling and digital twins is growing rapidly. But each of these applications ultimately stands or falls with whether the underlying data is available, reliable and can be brought together. This webinar showed that technologies like Beacon can help strengthen that foundation. It offers a practical way to make dispersed measurement collections accessible, without preparatory conversion steps or complex architectures.
Within the DigiShape community, there is an increasing need for concrete examples and working solutions that help organizations make better use of data. Beacon is one such example. The technology shows that relatively simple configuration can already lead to a more accessible dataset for analyses and experiments.