Bug Fixes: Unlocking Health Secrets from Insect Genomes on Thelio Astra

On the System76 Transmission Log podcast, hosts Emma and Alex sat down with RJ Nowling, Associate Professor of Computer Science at the Milwaukee School of Engineering, to discuss his work in bioinformatics on System76’s Thelio Astra Ampere workstation.

Introductions

My research is focused on genomic data science. I'm also the Program Director for our Master's in Machine Learning. I work with biologists to process the raw sequencing data into a usable form, and then analyze it using data science techniques. We collect data during an experiment with a particular research question in mind, and often I'm helping determine what evidence the data might provide to help answer that question.

What’s the goal of your research?

To better understand the biological systems. We hope that by studying these insects, we'll also learn things that are transferable to other organisms. For example, one of my collaborators is trying to better understand why some of the Anopheles mosquitoes can fight off the malaria parasite, when others can't. Even if they're from the same parents, there's differences.

A lot of my collaborators work on mosquitoes such as Anopheles gambiae, the vector of malaria, as well as Aedes aegypti, which is a mosquito vector of dengue fever. So these organisms are medically important to public health.

On my end, the work I'm doing isn't necessarily specific to the mosquito; I primarily work with insect genomes. The data I have is specific to the mosquito, but the techniques I'm using are general.

So in the case of malaria, that could lead to breeding mosquitoes to eventually eliminate the vulnerability to malaria if there's a gene associated?

Yeah, you're right. There's been some experiments, what they call gene drive, to try to produce sterile males and such. Other groups have been developing techniques to release some genetically modified mosquitoes into a population and take it over. Once they do take over, most of that population will no longer have those traits.

In this case, we can find what blocks the mosquito’s immune system from destroying the malaria parasite. We can then use gene drive techniques to ensure that the mosquitoes can fight off the parasite, so that they don't transmit it to humans.

How did you first get into this field?

When I was in high school, I interned at the University of Connecticut Health Center, making a simple website with some data from a paper, and I really fell in love with it. I loved that I could apply my computing skills to understanding something in the real world. Data processing is interesting to me. I made it a focus in college, and then I went and got a PhD.

During college, while many of my colleagues were doing internships with companies, I was doing research with professors. I did spend a couple of years in industry after my PhD: I was at Red Hat for two years, and then at an online advertising company called AdRoll for two years. I started my faculty position at the Milwaukee School of Engineering in fall 2018.

When along that journey did you discover Linux? Was it at Red Hat or was it before then?

It involved Red Hat, but it was not while I was at Red Hat. In middle school, someone gave me a box copy of Red Hat Linux 6.2.

I want to clarify, I did not omit the word enterprise. This was before it became Red Hat Enterprise Linux. This was old Red Hat. If I remember correctly, the version I had ran GNOME desktop version one. Then I started using Slackware in high school, and I later switched to Debian. So now I use Debian for basically all of my workstations and servers.

With regards to your workflow, what software tools are you using, mainly?

When we get the raw data, we clean it with something like Trimmomatic, and then we align it to genomes using something like BWA. We'll use a software package called GATK to try to identify where individual samples differ from a reference genome. We might also use a tool called MACS for what's called peak calling.

There's certain experiments that basically highlight certain parts of the genome that have activity of interest. Once we’ve produced the output, then we can switch to more traditional data science tools like using pandas and Jupyter notebook, scikit-learn and things like that.
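
To make the variant-calling step mentioned above a little more concrete, here is a minimal Python sketch of the underlying idea: compare a sample's sequence to a reference and list the positions where they differ. It is only a toy illustration, not how GATK works. Real variant callers operate on many aligned reads with quality scores and statistical models. The function name and both sequences are made up for this example.

```python
from typing import List, Tuple


def find_differences(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch.

    Assumption for this toy: the sample is already aligned to the reference
    and both sequences have the same length (no insertions or deletions).
    """
    if len(reference) != len(sample):
        raise ValueError("this toy comparison assumes sequences of equal length")
    return [
        (pos, ref_base, sample_base)
        for pos, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Hypothetical sequences, invented for illustration only.
reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"

print(find_differences(reference, sample))
# prints [(5, 'A', 'T'), (10, 'C', 'A')]
```

A table of positions like this, one per sample, is roughly the kind of output that can then be loaded into pandas for the downstream analysis.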

How has working with Thelio Astra affected your workflow and your project?

It's massively increased my throughput. The data sets I work with have a lot of samples; the biologist will sequence the DNA from a thousand mosquitoes. The ability to have 120 cores instead of 24 cores means I can process a lot more samples at once. The work also tends to be very memory-intensive, so the fact that I can have 512 GB of ECC RAM is an important component of scaling up that pipeline. And these pipelines often take weeks to run, so it can cut the run time down from a month to a week or a few days.
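
As a back-of-the-envelope check on that scaling, the sketch below estimates how run time shrinks when more samples can be processed at once. The 1,000 samples and the 24 versus 120 cores come from the answer above; everything else is an assumption made for illustration: one core per sample, every sample taking the same time, no memory or I/O limits, and a 30-day baseline. The function names are hypothetical.

```python
def batches_needed(samples: int, cores: int) -> int:
    """Number of rounds needed if each core handles one sample at a time."""
    if samples < 0 or cores <= 0:
        raise ValueError("samples must be >= 0 and cores must be > 0")
    return (samples + cores - 1) // cores  # integer ceiling division


def scaled_runtime_days(baseline_days: float, samples: int,
                        old_cores: int, new_cores: int) -> float:
    """Scale a baseline run time by the ratio of batches on new vs. old hardware."""
    old_batches = batches_needed(samples, old_cores)
    new_batches = batches_needed(samples, new_cores)
    return baseline_days * new_batches / old_batches


samples = 1000            # from the interview
old_cores, new_cores = 24, 120  # from the interview
baseline_days = 30.0      # assumed "a month" on the old machine

print(batches_needed(samples, old_cores))   # prints 42
print(batches_needed(samples, new_cores))   # prints 9
print(round(scaled_runtime_days(baseline_days, samples, old_cores, new_cores), 1))
# prints 6.4
```

Under these simplified assumptions, a month-long run drops to about a week, which is in line with the estimate given in the answer. Real pipelines will deviate from this because stages differ in how well they parallelize and how much memory each sample needs.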

What led to your decision to go with System76 and the Thelio Astra?

I've been wanting an Ampere Altra system for a while. My current workstation had a processor with 24 cores and 128 GB RAM that I got around five or six years ago. That's been my main research machine. It was time for an upgrade.

Whenever I would look at other systems with similar processors, I just couldn't get more cores or much more memory within my budget. The Ampere Altra platform offers a really nice price point for getting a lot of cores and a lot of RAM at the same time—and I need both to do my work. System76 was the only company I could find that was really supporting a workstation option for Ampere.

I'm glad I went with System76 because support has been super helpful. I haven't run into any major problems; the one issue was just my ignorance. I didn't realize that because it's running a server motherboard, it takes a really long time to POST before the boot screen shows. I thought something was wrong with it when I got it. I was like, why isn't this showing anything after five minutes? And it turns out I just needed to wait longer.

But the support was super helpful, and I really appreciate my interactions with the company. I think that has potentially turned me into a repeat buyer in the future.

Do you have advice for people who want to get into genomic data science?

Most of the people I know doing this are at universities. There are some companies in industry, like Illumina, which provides some of the sequencing technology and also hires a number of bioinformaticians, data scientists, and software engineers, but most of the people in this field are in universities. So you pretty much need a PhD.

That said, I have worked with people with PhDs in a wide range of fields, from computer science to biology to statistics, physics, chemistry, applied math, whatever, and we all work on these projects together. And so one thing I had to figure out is that you have to pick a home field.

In my particular case, I'm very comfortable teaching computer science and data science courses. Getting a PhD in computer science was the right fit for me, even if my research itself is very interdisciplinary. The trick is to basically find a faculty member who's working in bioinformatics who can advise students in a PhD program that you would be successful in.

And for college students, you need to start talking to professors and seeing if you can get opportunities to do research with them in undergrad. That'll tell you if you like the work, and that's how you build up your resume to increase your chances of getting accepted into a PhD program.

How do you foresee the combination of machine learning and genomic data science creating further innovations?

That’s some of the work that I find most interesting recently. One thing to know is our genome doesn't just have genes, it has other non-coding elements that play an important role. One example of these is enhancers. Enhancers help determine how many copies of the gene are produced, and we're actually just learning that a lot of diseases are caused not by changes in the genes themselves, but in the enhancers. So your body just may not produce as many copies of the gene.

However, the enhancers are hard to find computationally. There's been some really interesting work lately using deep learning to identify enhancers accurately. Since it's finally pretty accurate, we could actually go through all the publicly available genomes in the NCBI database provided by the NIH and find the enhancers for all of them. And then that would allow us to do cross-species comparisons.

Another thing is that enhancers are generally not located near the genes they interact with. They're pretty far apart. So there's been some work to try to computationally determine which enhancers act on which genes, which will help us better understand what genes they impact and thus what potential diseases. The other thing is trying to understand how genetic variation within the enhancers has an impact. Sometimes the DNA changes and it does nothing. Sometimes it increases the number of copies of a gene, sometimes it decreases it, but it's hard to predict.

If we can build accurate machine learning models to do this, we can vastly speed up research and reduce costs, and study organisms that we can't study right now because it may be infeasible to do the actual experimental lab work for whatever reason.

What are you most excited about for the future of your field?

I get really excited when I can understand the mechanisms of the biology. In some ways, I'm a biologist who uses a computer as my instrument. I'm excited for advancements that will allow us to better understand how the biology works. That's my main motivation.

Is there anything else you'd like to share with our readers?

Just that I've really enjoyed my experience with the Thelio Astra so far. It's been a great machine worth every dollar, and it really does offer a lot for the price. Highly recommend it!

How can people get in touch with you if they want to learn more or connect with you?

They can check out my website, or they can send me an email at nowling@msoe.edu.

