New data system for DOE’s SLAC X-ray laser will process one million frames per second

The lab’s data experts are finding ways to manage this massive amount of information as upgrades to the Linac Coherent Light Source (LCLS) come online over the next few years.

Credit: Greg Stewart/SLAC National Accelerator Lab

LCLS accelerates electrons to nearly the speed of light to generate extremely bright X-ray beams. These X-rays probe a sample, such as a protein or a quantum material, and a detector captures a series of images that reveal the sample’s atomic motion in real time. By stringing these images together, chemists, biologists and materials scientists can create molecular movies of events such as how plants absorb sunlight or how drugs fight disease.

As the LCLS is upgraded, scientists are moving from 120 pulses per second to 1 million pulses per second. This will produce an X-ray beam that is 10,000 times brighter, enabling studies of systems that could not be examined before. But it also brings a huge data challenge: the X-ray laser will produce hundreds to thousands of times more data in a given period than before.

To manage this data, a group of scientists led by Jana Thayer, director of the LCLS Data Systems Division, is developing new computational tools, including computer algorithms and connections to supercomputers. Thayer’s group uses a combination of computing, data analysis and machine learning to identify patterns in the X-ray images and string them together into a molecular movie.

Go with the flow

At LCLS, data flows continuously. “When scientists have access to conduct an experiment, it’s either a 12-hour day or a 12-hour night, limited to a few shifts before the next team arrives,” says SLAC scientist Ryan Coffee. To make effective use of that valuable experiment time, bottlenecks in data flow and analysis must be avoided.

Streaming and storing the data places significant demands on network and computing resources, and to monitor data quality in near real time, the data must be processed immediately. An essential step is to reduce the volume of data as much as possible before it is stored for further analysis.

To achieve this, Thayer’s team implemented on-the-fly data reduction, using several types of compression to shrink the recorded data without affecting the quality of the scientific output. One form of compression, called a veto, discards unwanted data, such as images where the X-rays missed their target. Another, called feature extraction, records only the scientifically important information, such as the location and brightness of a spot in an X-ray image.
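As a rough illustration of those two ideas, the sketch below (Python with NumPy, using made-up thresholds and simulated detector frames rather than anything from the actual LCLS pipeline) vetoes frames with too little signal and keeps only the position and brightness of bright spots from the rest.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical thresholds for illustration only -- not actual LCLS values.
VETO_TOTAL = 5000.0   # frames with less integrated signal than this are treated as misses
PEAK_CUT = 500.0      # pixels brighter than this count as scientifically interesting spots

def reduce_frame(frame: np.ndarray):
    """Veto empty frames; otherwise keep only (row, col, intensity) of bright spots."""
    if frame.sum() < VETO_TOTAL:             # veto: the X-ray pulse missed the target
        return None
    ys, xs = np.nonzero(frame > PEAK_CUT)    # feature extraction: locate bright spots
    return np.column_stack([ys, xs, frame[ys, xs]])

# Simulated stream: mostly empty frames, every 5th frame carries a few bright peaks.
def frames(n=100, shape=(512, 512)):
    for i in range(n):
        f = rng.normal(0.0, 0.5, shape)
        if i % 5 == 0:
            idx = rng.integers(0, shape[0], size=(20, 2))
            f[idx[:, 0], idx[:, 1]] += 1000.0
        yield f

kept = [r for r in (reduce_frame(f) for f in frames()) if r is not None]
print(f"kept {len(kept)} feature lists out of 100 frames")
```

Each kept frame shrinks from a full detector image to a short list of spot coordinates and intensities, which is the kind of saving that makes storing the reduced stream tractable.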

“If we backed up all the raw data, as we’ve done so far, it would cost us a quarter of a billion dollars a year,” Thayer says. “Our mission is to understand how to reduce data before writing it. One of the really cool and innovative parts of the new data system we’ve developed is the data reduction pipeline, which removes irrelevant information and reduces the data that needs to be transferred and stored.”

Coffee says, “You save a lot of energy, but more importantly, you save on throughput. If you had to send the raw data over the network, you would completely overwhelm it trying to send frames every microsecond.”

The group also created an intermediate place to hold the data before it is moved to permanent storage. Thayer explains, “We can’t write directly to storage, because if there’s a problem in the system, it has to stop and wait. Or if there’s a network problem, you could lose data entirely. So we have a small but reliable buffer we can write to; we can then move the data to permanent storage.”
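The sketch below is a toy stand-in for that idea, assuming a bounded in-memory queue and a background thread that drains it to permanent storage; the sizes, timings and names are hypothetical, not taken from SLAC’s system.

```python
import queue
import threading
import time

buffer = queue.Queue(maxsize=1000)   # the intermediate buffer the acquisition side writes to
archive = []                         # stand-in for permanent storage

def drain():
    """Move data from the buffer to permanent storage, tolerating slow or paused storage."""
    while True:
        item = buffer.get()
        if item is None:             # sentinel: shut down cleanly
            break
        time.sleep(0.001)            # pretend the storage system is slower than acquisition
        archive.append(item)

writer = threading.Thread(target=drain, daemon=True)
writer.start()

# The acquisition side writes to the buffer, not to disk, so a slow storage
# system does not stall data taking as long as the buffer has room.
for i in range(100):
    buffer.put(f"reduced frame {i}")

buffer.put(None)
writer.join()
print(f"{len(archive)} items safely moved to permanent storage")
```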

Spurring innovation

Thayer points out that the data system is designed to deliver results to researchers as quickly as the current system does, so they still get real-time feedback. It is also built to accommodate the expansion of LCLS science over the next 10 years. The big challenge is keeping up with the huge jump in data throughput.

“If you imagine going from analyzing 120 frames per second to 1 million per second, that requires a lot more parallel processing,” she says. “Computing isn’t magic – it always works the same way – we just increase the number of brains working on each of the images.”
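A minimal sketch of that idea, using Python’s standard multiprocessing module and a placeholder per-frame analysis function (not the LCLS analysis code), might look like this:

```python
import numpy as np
from multiprocessing import Pool

def analyze(frame: np.ndarray) -> float:
    """Placeholder per-frame analysis: here, just the integrated detector signal."""
    return float(frame.sum())

def main():
    rng = np.random.default_rng(0)
    frames = [rng.random((512, 512)) for _ in range(64)]

    # Serial: one "brain" handles every frame in turn.
    serial = [analyze(f) for f in frames]

    # Parallel: the same work divided among many worker processes.
    with Pool(processes=8) as pool:
        parallel = pool.map(analyze, frames)

    assert serial == parallel   # same results, just spread across more workers

if __name__ == "__main__":
    main()
```

The computation on each image is unchanged; the throughput comes from how many workers can run that same computation at once.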

Supported by a recent DOE award and working with colleagues across the DOE national laboratory complex, the team is also looking to integrate artificial intelligence and machine learning techniques to further reduce the amount of data to process and to flag interesting features in the data as they appear.
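One hypothetical way such triage could look, sketched here with off-the-shelf tools (scikit-learn’s IsolationForest on a few cheap per-frame statistics) rather than anything the LCLS team has described, is to flag statistically unusual frames for closer analysis:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

def summarize(frame: np.ndarray) -> list:
    """Cheap per-frame features: total signal, brightest pixel, and spread."""
    return [float(frame.sum()), float(frame.max()), float(frame.std())]

# Mostly routine frames, plus a handful with unusually strong scattering.
frames = [rng.normal(0.0, 1.0, (128, 128)) for _ in range(200)]
for f in frames[::50]:
    f[60:68, 60:68] += 50.0

X = np.array([summarize(f) for f in frames])
detector = IsolationForest(contamination=0.02, random_state=0).fit(X)
flags = detector.predict(X)               # -1 marks frames worth a closer look
print("flagged frames:", np.flatnonzero(flags == -1))
```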

To convey the challenge of the LCLS data, Coffee draws an analogy with self-driving cars: “They have to compute in real time: they can’t analyze a batch of images that were just recorded and then say, ‘We predict you should have turned left at picture number ten.’ SLAC’s data rate is much higher than any of these cars, but the problem is the same: researchers must steer their experiment toward the most exciting destinations!”

The upgrades driving this massive increase in data throughput and performance will occur in two phases over the next few years, including LCLS-II and a high-energy upgrade that follows. The work of data experts will ensure that scientists can take full advantage of both. “Ultimately, this will have a dramatic effect on the kind of science we can do, opening up opportunities that aren’t possible today,” says Coffee.


Source: SLAC National Accelerator Laboratory
