Solving UC Berkeley SETI’s Data Challenge: Green Bank Telescope
In 2016, Breakthrough Listen, the largest ever search for extraterrestrial intelligence, commenced. This is a 10-year, $100M program, funded as an act of philanthropy by tech investor Yuri Milner. Not only is the scope of the scientific effort completely unprecedented, so too is the dataset that has been made publicly available to citizen scientists. To date, nearly 2 PB of raw data has been made available on the Open Data Archive. Here you can search for and download data from three of the program’s most sensitive instruments. However, files within the dataset are typically multiple gigabytes in size, and are stored in technical formats that require scientific software to process. Therefore, some amount of technical knowledge is required for a user to meaningfully handle the data.
In order to get users up and running, the team of scientists and engineers at UC Berkeley SETI has provided some background material on Breakthrough Listen which can be found here. This is a series of five pages providing an overview of the program, high level details of their research, technical details illustrating how their instruments are used, and information about the data itself. At the very end are two open challenges posed to readers, one of which uses data from GBT: the Green Bank Telescope. This isn’t an optical telescope, but rather a radio telescope. And the GBT isn’t just any radio telescope. It’s the largest fully steerable object on the planet, located in the middle of the National Radio Quiet Zone, making it a highly appropriate instrument for attempting to receive celestial signals from both natural and technological sources. The challenge asks users to post results to Twitter with the hashtag #BreakthroughGBTData.
Back in 2020, I solved the challenge. But I didn’t do so simply by downloading the data, processing it on my computer, and using conventional data analysis to get the answers. Instead, I created a new type of database that could serve GBT data from the cloud at scale, along with an Android app that could interactively query that database using typical touchscreen gestures. This allowed me to then solve the challenge just by messing around on my phone, and doing some simple math on paper. I tweeted about it in hopes that others might find it cool to solve the challenge on their own, without needing to download all that data, process it, and solve the different engineering challenges along the way. It’s served as a neat conversation piece, but two and a half years later, there’s still only my one tweet about it. I figured it’s better late than never to share the solution more broadly, so here are the details. The remainder of this article is just from the perspective of a user of Radwave, the Android app I made.
Challenge Question 1:
Reproduce the waterfall plot
You might be asking “what’s a waterfall plot?”. Great first question. This is a 2D color plot of power as a function of time and frequency. It’s also called a spectrogram. With radio signals, this shows the energy that exists at different times and different frequencies. Before delving straight into the GBT data, let’s look at something more familiar: FM radio. Here’s an example of what the RF spectrum looks like around 95.5 FM:
Let’s take a second to analyze this more closely. The top part is the waterfall plot. The color indicates the power of the signals, where brighter means more powerful. The vertical axis is time (seconds), and the horizontal is frequency (MHz). As you can see, centered directly at 95.5 MHz is a bright squiggly looking analog FM radio signal. We can also see weaker signals at 95.3 and 95.9 MHz. As you might expect, those signals would sound noisier. You can also see a tall rectangular signal between 95.6 MHz and 95.7 MHz. That is the HD FM signal for the same station. The bottom plot is called a Power Spectral Density (PSD) plot, and specifically shows the max (green), mean (cyan), min (blue) and most recent (red) energy at each frequency in the spectrogram. The vertical axis of the PSD is relative energy, measured in dB. As we can see, this particular spectrogram covers about 1 MHz of bandwidth and is almost 200 seconds in duration. This gives a Time-Bandwidth Product (TBP) of about 200 million, and is approximately the number of radio data samples within that span. For a data size comparison, this would be on the same order as an uncompressed 200 megapixel image, which is pretty sizable for a smartphone.
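To make the two plots concrete, here is a minimal sketch of how a waterfall and its PSD summary can be computed from raw samples. This is not how Radwave does it; the function names, the tiny synthetic tone, and the 32-point block size are all assumptions chosen for illustration, and the plain DFT is only practical for toy inputs (a real tool would use an FFT library).

```python
import cmath
import math


def dft_power(samples):
    """Linear power in each DFT bin for one block of complex samples."""
    n = len(samples)
    powers = []
    for k in range(n):
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        powers.append(abs(acc) ** 2 / n)
    return powers


def waterfall(samples, fft_size):
    """Split samples into consecutive blocks; each block becomes one time row."""
    return [dft_power(samples[start:start + fft_size])
            for start in range(0, len(samples) - fft_size + 1, fft_size)]


def to_db(power):
    """Convert linear power to dB, with a small floor to avoid log(0)."""
    return 10 * math.log10(power + 1e-12)


def psd_summary(rows):
    """Per-frequency max, mean, min, and most recent power, all in dB."""
    columns = list(zip(*rows))  # one tuple per frequency bin, across time
    return {
        "max": [to_db(max(col)) for col in columns],
        "mean": [to_db(sum(col) / len(col)) for col in columns],
        "min": [to_db(min(col)) for col in columns],
        "latest": [to_db(p) for p in rows[-1]],
    }


# Synthetic input (an assumption for illustration): a pure tone in bin 5.
fft_size = 32
tone_bin = 5
samples = [cmath.exp(2j * math.pi * tone_bin * i / fft_size)
           for i in range(fft_size * 4)]

rows = waterfall(samples, fft_size)
summary = psd_summary(rows)
peak_bin = summary["max"].index(max(summary["max"]))
print(len(rows), "time rows; strongest bin:", peak_bin)
```

Each row of `rows` is one horizontal line of the waterfall, and the four lists in `summary` correspond to the green, cyan, blue, and red traces in the PSD plot.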
Now let’s look at the waterfall plot of some of the Voyager 1 data from the GBT.
Looking at this, we can see that we’re looking at 294 seconds of data spanning 8400 MHz to 8587.5 MHz. That’s 187.5 MHz wide, yielding a TBP of 55.125 billion. This is comparable in size to a 55 billion pixel image. Rendering such an image directly would at the very least require that much RAM, which neither my smartphone nor my computer has. Rendering really only becomes possible using some non-trivial processing. And processing the data to obtain a renderable image first requires unpacking 832 GB of raw radio samples, which is more hard drive space than my computer has. I don’t state this lightly: it’s a hard image to create.
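The time-bandwidth arithmetic for both plots is simple enough to check in a few lines. This is just a sketch of the numbers quoted above; the one-byte-per-pixel figure is my own assumption for a rough memory estimate, not something taken from the data format.

```python
def time_bandwidth_product(duration_s, bandwidth_hz):
    """Approximate number of spectrogram points: seconds times hertz."""
    return duration_s * bandwidth_hz


# FM example: about 200 seconds across about 1 MHz.
fm_tbp = time_bandwidth_product(200, 1e6)

# GBT Voyager 1 example: 294 seconds across 8400 MHz to 8587.5 MHz.
gbt_bandwidth_hz = (8587.5 - 8400.0) * 1e6
gbt_tbp = time_bandwidth_product(294, gbt_bandwidth_hz)

# Assumption: one byte per rendered pixel, to get a rough RAM figure.
gbt_image_gb = gbt_tbp / 1e9

print(f"FM example:  {fm_tbp:.3e} points")
print(f"GBT example: {gbt_tbp:.5e} points, roughly {gbt_image_gb:.1f} GB at 1 byte/pixel")
```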
Challenge Question 2:
Determine the observed frequency (in GHz) of the coherent downlink carrier (main signal in the center of the waterfall plot), and the offset (in kHz) of the modulated subcarrier with the telemetry data.
I believe that this question assumed that different data was being used, where Voyager 1 was more clearly visible, and was actually in the center of the waterfall plot. As we look at the waterfall plot above, we can see several strong “signals” (bright vertical lines), the strongest being near 8567 MHz. You can also see that there are darker vertical lines that are regularly spaced. These darker vertical lines are due to RF filters that are applied to the received signal before digitization (converting the RF signal from analog to digital so that it can be processed by a computer). Those darker vertical lines separate bands (small portions of the overall bandwidth) that are individually digitized. As I understand it, GBT operates this way because digitizing the received signal at the full bandwidth would require much more expensive hardware, especially at the time this hardware was last updated at GBT.
In this dataset, one of the strong signals is from Voyager 1. We’re going to ignore the other strong signals for the moment (but I’ll give you a hint: the others are located at the exact center of those smaller bands). The best thing with Radwave is that using intuitive pan and stretch/pinch gestures, we can explore the entire 55.125 billion point spectrogram interactively, ultimately finding the Voyager 1 telemetry signal near 8420.2 MHz. As you do so, Radwave is pulling data from AWS in the background, simultaneously minimizing the latency observed by the user and the data downloaded from the cloud. This makes it possible for even novice users to manually detect signals using just their eyes, without the burden of handling the full 832 GB of raw data. Not only that, but it allows a scalable number of concurrent users to independently explore the data. As far as I know, this is a completely unique capability. All other systems that I’m aware of that allow for interactive exploration of wideband RF data have the data locally available on a server. A user must log directly into that server to see it. Commonly, this means only a single user can use the system at a time. Some systems might allow multiple users to log in at the same time, but then you’re limited by the CPU, GPU, and RAM constraints of the server. Radwave instead uses scalable cloud-based techniques to host the data, and leverages each user’s device to handle rendering, which should make the backend much more difficult to crash due to excessive numbers of users.
Here we see the three signal components that comprise the Voyager 1 telemetry signal. In the center is the coherent downlink carrier. On either side are the modulated subcarriers. The modulated subcarriers are equally spaced from the carrier. This is done to balance the signal, which aids in its efficient transmission from the space probe. If we zoom in more, we can get a better estimate of their exact frequencies:
Now we can see the carrier (middle picture) is at about 8420.21645 MHz, the lower subcarrier (left picture) is centered at about 8420.1939 MHz, and the upper subcarrier (right picture) is centered at about 8420.23895 MHz.
If I’m being honest, I don’t quite understand the phrasing of the second part of this question. It asks about the offset (in kHz) between the modulated subcarrier and the telemetry data. So maybe it’s asking for half the separation between the peaks in each of the subcarriers. I usually call that the frequency deviation, especially in this case where the data looks like an FSK (Frequency Shift Keying) type of signal. But maybe they mean the offset between the carrier and subcarriers. These are both easy enough to determine by just looking at the plots though, so let’s do both of them.
First, the offset between the carrier and each subcarrier is found by subtracting the carrier and subcarrier frequencies that we already found. For the lower subcarrier, we get 8420.21645 MHz - 8420.1939 MHz = 0.02255 MHz = 22.55 kHz. For the upper subcarrier, we get 8420.23895 MHz - 8420.21645 MHz = 0.0225 MHz = 22.5 kHz, which (as expected) matches the lower subcarrier to within the precision of reading frequencies off the plot.
Second, let’s calculate the frequency deviation of the telemetry signals. This is determined by looking at the frequencies of the peaks in the left and right image. In the left image (lower subcarrier), the lower frequency peak is at 8420.1937 MHz, and the upper frequency peak is at 8420.1941 MHz. Subtracting these gives 0.0004 MHz = 0.4 kHz. The subcarrier itself is typically in the exact center of a signal, so the offset of the peaks from the subcarrier would be 0.4 kHz / 2 = 0.2 kHz. We get the same offset when looking at the upper subcarrier.
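Both interpretations of the question reduce to subtracting frequencies read off the plots, which can be sketched as below. The variable names are my own, and the inputs are only as precise as my eyeballed readings, so the two carrier-to-subcarrier offsets should be expected to agree only to within a few hundredths of a kHz.

```python
# Frequencies read off the zoomed-in plots, in MHz.
carrier_mhz = 8420.21645
lower_subcarrier_mhz = 8420.1939
upper_subcarrier_mhz = 8420.23895

# Peaks inside the lower subcarrier, in MHz.
lower_peak_low_mhz = 8420.1937
lower_peak_high_mhz = 8420.1941

KHZ_PER_MHZ = 1e3

# Interpretation 1: offset between the carrier and each subcarrier.
lower_offset_khz = (carrier_mhz - lower_subcarrier_mhz) * KHZ_PER_MHZ
upper_offset_khz = (upper_subcarrier_mhz - carrier_mhz) * KHZ_PER_MHZ

# Interpretation 2: frequency deviation, half the separation of the peaks.
peak_separation_khz = (lower_peak_high_mhz - lower_peak_low_mhz) * KHZ_PER_MHZ
deviation_khz = peak_separation_khz / 2

print(f"lower offset: {lower_offset_khz:.2f} kHz")
print(f"upper offset: {upper_offset_khz:.2f} kHz")
print(f"deviation:    {deviation_khz:.2f} kHz")
```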
Challenge Question 3
Determine the drift rate of the above signals with time.
The drift rate is the rate at which a signal changes frequency. As you may have noticed in Figure 4 above, I zoomed in on each component in both time and frequency, highlighting the initial portion of the data collection. This was done to illustrate that if our observation is too short in duration, we cannot observe drift rates well. So let’s look at the components over the full duration of the collection.
Now we can clearly see how the signal slightly drifts in frequency across time. This is most apparent in the carrier since it is so narrowband and strong, but is still observable in the subcarriers. The subcarriers just appear less slanted because we have to zoom out in frequency more to view them. Since the slant is most clearly observed in the carrier, we’ll use it to answer this question.
The drift rate is the slope of the carrier line. The carrier starts at time 0 seconds at a frequency of about 8420.21645 MHz, and ends at a time of about 290 seconds at a frequency of about 8420.21625 MHz. To calculate the slope, we divide the difference of frequencies by the difference of times: (8420.21625 MHz - 8420.21645 MHz) / (290 s - 0 s) = -200 Hz / 290 s ≈ -0.69 Hz/second.
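The same two-point slope can be written as a small helper. This is only a sketch using the start and end points I read off the plot; the function name is hypothetical, and a more careful estimate would fit a line through many points along the carrier.

```python
def drift_rate_hz_per_s(f_start_mhz, f_end_mhz, t_start_s, t_end_s):
    """Two-point slope of frequency versus time, returned in Hz per second."""
    delta_f_hz = (f_end_mhz - f_start_mhz) * 1e6
    delta_t_s = t_end_s - t_start_s
    return delta_f_hz / delta_t_s


# Carrier endpoints as read from the full-duration plot.
drift = drift_rate_hz_per_s(8420.21645, 8420.21625, 0.0, 290.0)
print(f"drift rate: {drift:.2f} Hz/s")
```

The negative sign means the observed carrier frequency is slowly decreasing over the observation.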
Challenge Question 4
Calculate the Doppler velocity of Voyager with respect to Earth at the time of the observations.
Doppler is given by

$$f_o = \frac{c}{c + v} f_e$$

where $f_o$ is the observed frequency, $f_e$ is the emitted frequency, $c$ is the speed of light, and $v$ is the Doppler velocity. We know the observed frequency from the spectrum, and we know the speed of light, but we don't yet know the emitted frequency. After a bit of googling, I found this paper on Voyager Telecommunications. Searching for "downlink carrier" brings us to Table 2-1. That table says that the coherent downlink frequency for Voyager 1 is 8420.432097 MHz. That's pretty close to our observed frequencies, so it should be the right one. Solving for $v$ above, we get

$$c + v = \frac{c f_e}{f_o}$$

$$v = \frac{c f_e}{f_o} - c$$
Substituting in our values, we get a Doppler velocity of +7677 meters/second. That is seriously fast!!
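For anyone who wants to plug in their own carrier reading, the substitution can be sketched as follows. It assumes the carrier frequency I read off the plot (8420.21645 MHz) as the observed frequency, and uses the defined value of the speed of light; the function and constant names are my own.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def doppler_velocity_m_per_s(f_observed_mhz, f_emitted_mhz):
    """Solve f_o = c / (c + v) * f_e for v; positive v means moving away."""
    c = SPEED_OF_LIGHT_M_PER_S
    return c * f_emitted_mhz / f_observed_mhz - c


f_emitted_mhz = 8420.432097   # coherent downlink frequency from the table
f_observed_mhz = 8420.21645   # carrier frequency read off the plot

velocity = doppler_velocity_m_per_s(f_observed_mhz, f_emitted_mhz)
print(f"Doppler velocity: {velocity:.0f} m/s")
```

Because the observed frequency is lower than the emitted one, the result is positive, meaning Voyager 1 and Earth were moving apart at the time of the observation.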
Bonus: Calculate the Doppler range
We know that Earth is orbiting the sun in a stable orbit at a known rate. Let’s assume that Voyager 1 is coasting through space, traveling at a constant velocity directly away** from the sun. We should then be able to calculate the expected range of Doppler shifts for Voyager 1 throughout the year. This can help us know what range of frequencies to look for at any point in the year.
With some more googling, we can learn that the Earth is traveling at 107,000 km/hr = 29722 m/s as it goes around the sun. And Voyager 1 is traveling at 17000 m/s. Since we’re using a two body Doppler model, the equation changes slightly:
$$f_o = \frac{c - v_o}{c + v_e} f_e$$

where $v_e$ is the emitter velocity (Voyager 1) and $v_o$ is the observer velocity (Earth), with positive values meaning motion away from the other body. Depending on the time of year, Earth is traveling either toward or away from Voyager 1, so its minimum velocity is -29722 m/s and its maximum is 29722 m/s. Substituting everything in, we get a maximum observed frequency of about 8420.789 MHz (Earth approaching) and a minimum observed frequency of about 8419.120 MHz (Earth receding).
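The bounds can be reproduced with a short sketch of the two-body formula. It uses only the velocities and emitted frequency quoted above and keeps the simplifying assumption that all motion is along the line between Earth and Voyager 1; the names are my own.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0
F_EMITTED_MHZ = 8420.432097
V_VOYAGER_M_PER_S = 17_000.0   # emitter, moving away from the sun
V_EARTH_M_PER_S = 29_722.0     # observer, orbital speed


def observed_frequency_mhz(f_emitted_mhz, v_observer, v_emitter):
    """f_o = (c - v_o) / (c + v_e) * f_e, with positive velocities receding."""
    c = SPEED_OF_LIGHT_M_PER_S
    return (c - v_observer) / (c + v_emitter) * f_emitted_mhz


# Earth moving toward Voyager 1 (negative v_o) gives the highest frequency.
f_max_mhz = observed_frequency_mhz(F_EMITTED_MHZ, -V_EARTH_M_PER_S, V_VOYAGER_M_PER_S)
# Earth moving away from Voyager 1 (positive v_o) gives the lowest frequency.
f_min_mhz = observed_frequency_mhz(F_EMITTED_MHZ, V_EARTH_M_PER_S, V_VOYAGER_M_PER_S)

print(f"max observed: {f_max_mhz:.6f} MHz")
print(f"min observed: {f_min_mhz:.6f} MHz")
```

Swapping in a different Earth or Voyager 1 speed is a one-line change, which makes it easy to see how sensitive the search window is to those inputs.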
** Note: I’m making a false assumption about Voyager 1’s current trajectory, where I assume it’s traveling on the planetary orbital plane. There’s some great content on this page showing the trajectories of Voyager 1 and 2. Voyager 1 is shown slingshotting from Saturn off the orbital plane, so my answers here aren’t perfect, but I believe are still valid bounds of the true answer. It’d be really cool to see someone give a more thorough answer.