About me

I am a machine learning scientist with 5+ years of experience adapting and advancing the latest ML techniques to solve applied problems at scale. With a further 4 years in high-performance computing, data analysis, and visualization, I enjoy tackling complex challenges and extracting insights from large datasets.

Below you will find a few of my recent research projects and published works.

Contact details and resume follow - please reach out to chat!

Millions of images, most without labelled information... So how do we learn?

Sky surveys image hundreds of millions of galaxies, but most will never be examined by an expert or assigned a label. This poses a number of hurdles for off-the-shelf supervised machine learning methods aiming to classify galaxy types and discover rare objects.

First, you need the data! So I compiled nearly 100 million galaxy images from the Dark Energy Spectroscopic Instrument (DESI) Legacy Imaging Surveys in a machine-learning-ready format.
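For a flavour of what this involves, here is a minimal sketch of assembling such a dataset: query a cutout service for each catalogued position and stack the images into one machine-learning-ready file. The endpoint, query parameters, and toy catalogue below are my assumptions for illustration, not the actual pipeline.

```python
# Minimal sketch: download JPEG cutouts and stack them into an HDF5 file.
# The endpoint and query parameters are assumptions for illustration.
import io

import h5py
import numpy as np
import requests
from PIL import Image

CUTOUT_URL = "https://www.legacysurvey.org/viewer/cutout.jpg"  # assumed endpoint

def fetch_cutout(ra, dec, size_pix=152, pixscale=0.262):
    """Download a single JPEG cutout centred on (ra, dec), in degrees."""
    params = {"ra": ra, "dec": dec, "size": size_pix,
              "pixscale": pixscale, "layer": "ls-dr9"}
    resp = requests.get(CUTOUT_URL, params=params, timeout=30)
    resp.raise_for_status()
    return np.asarray(Image.open(io.BytesIO(resp.content)))  # (H, W, 3) uint8

# Hypothetical sky positions; the real catalogue holds ~100 million entries.
catalogue = [(150.1, 2.2), (150.3, 2.4)]

with h5py.File("galaxies.h5", "w") as f:
    images = f.create_dataset("images", (len(catalogue), 152, 152, 3), dtype="uint8")
    for i, (ra, dec) in enumerate(catalogue):
        images[i] = fetch_cutout(ra, dec)
```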

Then, I utilized cutting-edge techniques from self-supervised learning - an emerging paradigm in computer vision capable of learning without labelled information - to train convolutional neural networks that perform well across a range of tasks.
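The core idea is contrastive: two augmented views of the same galaxy should map to nearby representations, while views of different galaxies are pushed apart. Below is a minimal sketch of a SimCLR-style contrastive loss - an illustration of the paradigm, not the exact recipe from the papers.

```python
# Minimal sketch of a SimCLR-style (NT-Xent) contrastive objective.
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """z1, z2: (N, D) embeddings of two augmented views of the same N images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                     # (2N, D)
    sim = z @ z.t() / temperature                      # pairwise similarities
    sim.fill_diagonal_(float("-inf"))                  # exclude self-pairs
    n = z1.shape[0]
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)               # pull positives together

# Usage: embed two randomly augmented batches with the same (toy) encoder.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 128))
views = torch.rand(2, 32, 3, 64, 64)                   # stand-in augmented views
loss = contrastive_loss(encoder(views[0]), encoder(views[1]))
```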

I used these networks to develop tools for exploring this massive dataset, discover 1600 extremely rare strong gravitational lenses, and construct models to automatically classify millions of galaxies. Please check out a few of my works below!
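One such exploration tool is similarity search: embed every image with the trained network, then look up the nearest neighbours of any query galaxy. A minimal sketch, using random features as a stand-in for the CNN embeddings:

```python
# Minimal sketch of similarity search over learned representations.
import numpy as np
from sklearn.neighbors import NearestNeighbors

features = np.random.rand(100_000, 128)       # stand-in for CNN embeddings
index = NearestNeighbors(n_neighbors=8).fit(features)
query = features[42:43]                       # embedding of a chosen galaxy
dist, idx = index.kneighbors(query)           # most visually similar galaxies
```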

Try out my interactive data discovery app here
See it in the press
Access the code and data

Or take a look at some published works!

Mining for strong gravitational lenses with self-supervised learning. Stein et al., 2022
Self-Supervised Representation Learning for Astronomical Images. Hayat & Stein et al., 2021

Finding a needle in a haystack

Searching for rare events is hard enough when you know what they look like, but what about when you have no idea?

The LHC Olympics (LHCO) data challenge was created to develop methods for this purpose. Participants were provided with a set of "black-box" datasets and charged with finding any anomalous events - without having any idea what these events might look like!

While anomaly detection is a key application of machine learning, most methods focus on detecting outlying samples in the low-probability-density regions of the data. I instead developed a method for unsupervised in-distribution anomaly detection using a conditional density estimator, designed to find unique, yet completely unknown, sets of samples residing in high-probability-density regions.
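To make the distinction concrete, here is a toy version of the idea (my simplified illustration, not the competition entry): fit a flexible density estimate and a deliberately smooth one to the same events, and flag regions where their ratio is large. A narrow resonance then stands out even though it sits in a high-density region.

```python
# Toy in-distribution anomaly detection via a density ratio. The signal
# fraction is exaggerated here so the bump is visible in a small example.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
background = rng.exponential(scale=1.0, size=20_000)   # smooth falling spectrum
signal = rng.normal(loc=1.5, scale=0.02, size=300)     # narrow resonance bump
data = np.concatenate([background, signal])

flexible = gaussian_kde(data, bw_method=0.01)          # tracks fine structure
smooth = gaussian_kde(data, bw_method=0.5)             # deliberately oversmoothed

grid = np.linspace(0.5, 4.0, 500)                      # stay away from the edge
scores = flexible(grid) / smooth(grid)                 # local density ratio
print(f"largest overdensity at ~{grid[np.argmax(scores)]:.2f}")  # near 1.5
```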

Outperforming 12 other teams, we managed to detect a new particle appearing in only 0.08% of 1 million collision events, and took home the gold medal.

See us in the press
Read my writeup
Or read about the competition in detail

Additional Projects

Machine learning in cosmology

The rapid pace of machine learning is difficult to keep on top of, yet recognizing trends and analysing state-of-the-art methods is essential to achieving the best results for your problem.

Since 2018 I have curated a (now popular) comprehensive public archive of ML applications to the study of cosmology, in order to help scientists and domain experts identify AI solutions applicable to novel problems and rapidly implement them. Take a look to see what's been going on in the field or to design your own project!

github.com/georgestein/ml-in-cosmology

Segmentation of Satellite Imagery

Multispectral satellite imagery is crucial for geospatial applications such as monitoring wildfires, mapping deforestation, and tracking erupting volcanoes. But the presence of clouds introduces noise and inaccuracy, so they need to be identified and removed.

Competing as part of a two-person team in a popular data science competition, I trained an ensemble of segmentation models to identify cloud cover in satellite imagery. Our unique approach leveraged public APIs to increase the public dataset size 10-fold, and allowed us to design a custom set of physically motivated augmentations that nearly eliminated overfitting. We ultimately finished in the top 3% of 850 participants.
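For a flavour of what "physically motivated" can mean for multispectral data, here is a minimal sketch of two such augmentations - per-band gain/offset jitter mimicking sensor and illumination differences, and a smooth additive brightness field as a crude stand-in for thin haze. These are illustrative guesses, not the exact set we used.

```python
# Minimal sketch of physically motivated multispectral augmentations.
import numpy as np

def augment(img, rng):
    """img: (H, W, C) float array of reflectances in [0, 1]."""
    h, w, c = img.shape
    gain = rng.uniform(0.9, 1.1, size=c)        # per-band sensor gain jitter
    offset = rng.uniform(-0.02, 0.02, size=c)   # per-band calibration offset
    img = img * gain + offset
    # Smooth additive brightness field as a crude stand-in for thin haze.
    coarse = rng.uniform(0.0, 0.15, size=(4, 4, 1))
    haze = np.kron(coarse, np.ones((h // 4, w // 4, 1)))
    return np.clip(img + haze, 0.0, 1.0)

rng = np.random.default_rng(0)
patch = rng.random((64, 64, 4))                 # toy 4-band image patch
augmented = augment(patch, rng)
```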

Find a writeup of our approach here!

A probabilistic autoencoder for type Ia supernovae

Type Ia supernovae are rare yet extremely useful objects for mapping the expansion history of our universe. Yet their complexity cannot be modelled from first principles, so we must understand them in a data-driven manner.

To do so, I developed a Probabilistic Autoencoder to study the diversity of the spectra we observe from them, to identify objects that do not fit average trends, and to minimize the error that propagates into cosmological analyses.
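The essence of a probabilistic autoencoder is to pair a compressive autoencoder with a density model over its latent space, so every spectrum receives a likelihood. In the sketch below a single Gaussian stands in for the normalizing flow used in the real model, and the data are random stand-ins for observed spectra.

```python
# Minimal sketch of the probabilistic autoencoder idea: compress, then score
# each object by the likelihood of its latent code.
import torch
import torch.nn as nn

n_wave, n_latent = 1000, 8                       # toy spectrum length, latent size
encoder = nn.Sequential(nn.Linear(n_wave, 128), nn.ReLU(), nn.Linear(128, n_latent))
decoder = nn.Sequential(nn.Linear(n_latent, 128), nn.ReLU(), nn.Linear(128, n_wave))

spectra = torch.rand(256, n_wave)                # stand-in for observed spectra
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
for _ in range(100):                             # reconstruction training
    loss = ((decoder(encoder(spectra)) - spectra) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Density model over the latents (a Gaussian here; a flow in the real model).
with torch.no_grad():
    z = encoder(spectra)
latent_density = torch.distributions.MultivariateNormal(
    z.mean(0), covariance_matrix=torch.cov(z.T) + 1e-4 * torch.eye(n_latent))
log_likelihood = latent_density.log_prob(z)      # low values flag unusual spectra
```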

Stein et al., 2022

The Websky Suite of Extragalactic CMB simulations

Realistic simulations of the universe are necessary to develop and test scientific analyses, but are a key technical challenge due to the large dynamic range and accuracy required.

I worked to modernize, parallelize, and optimize the performance of the mass-Peak Patch simulation pipeline to run on high-performance computing clusters. After completing testing and development of the package, I constructed and released some of the most advanced simulations to date - the Websky suite.

Websky paper: Stein et al., 2020
Methods paper: Stein et al., 2019

A volumetric deep Convolutional Neural Network for simulation of mock dark matter halo catalogues

Simulations of the universe from the initial conditions to the present day are computationally expensive. To speed things up, we trained a U-net on 3D volumes of the universe to predict the final structure given the initial conditions. This allowed us to skip the whole simulation and jump right to the final answer.
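A toy version of such a network, assuming a PyTorch 3D U-Net with a single down/up level and illustrative channel counts rather than the published architecture:

```python
# Minimal sketch of a 3D U-Net mapping an initial density field to the final one.
import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(nn.Conv3d(c_in, c_out, 3, padding=1), nn.ReLU(),
                         nn.Conv3d(c_out, c_out, 3, padding=1), nn.ReLU())

class TinyUNet3D(nn.Module):
    def __init__(self):
        super().__init__()
        self.down = block(1, 16)
        self.pool = nn.MaxPool3d(2)
        self.mid = block(16, 32)
        self.up = nn.ConvTranspose3d(32, 16, 2, stride=2)
        self.out = nn.Sequential(block(32, 16), nn.Conv3d(16, 1, 1))

    def forward(self, x):
        skip = self.down(x)                       # full-resolution features
        x = self.mid(self.pool(skip))             # coarse features
        x = self.up(x)                            # back to full resolution
        return self.out(torch.cat([x, skip], 1))  # skip connection, then predict

net = TinyUNet3D()
initial = torch.rand(2, 1, 32, 32, 32)            # toy initial-condition volumes
final = net(initial)                              # predicted final structure
```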

Berger & Stein, 2019

COVID-19 in the USA: A state-by-state investigation

The early stages of the pandemic were marked by extreme uncertainty, so beginning in March 2020 I compiled data from the CDC, NYT, and a variety of other outlets to visualize and study mortality trends. By comparing 2020 mortality from all causes to the official coronavirus death toll, I uncovered clear evidence in a few states of an excess of deaths above those officially attributed to the virus.
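The core computation is simple: estimate expected deaths from pre-pandemic years, subtract them from observed 2020 all-cause deaths, and compare the remainder to the official COVID-19 toll. A minimal sketch, assuming a hypothetical CSV of weekly state-level counts (the column names are illustrative, not the actual CDC/NYT schemas):

```python
# Minimal sketch of an excess-mortality comparison on a hypothetical dataset.
import pandas as pd

df = pd.read_csv("state_weekly_deaths.csv")      # hypothetical input file

# Expected deaths per state and week, averaged over pre-pandemic years.
baseline = (df[df.year.between(2015, 2019)]
            .groupby(["state", "week"], as_index=False)
            .all_cause_deaths.mean()
            .rename(columns={"all_cause_deaths": "expected_deaths"}))

d2020 = df[df.year == 2020].merge(baseline, on=["state", "week"])
d2020["excess"] = d2020.all_cause_deaths - d2020.expected_deaths
d2020["unattributed"] = d2020.excess - d2020.covid_deaths

# States where excess deaths most exceed the official COVID-19 toll.
print(d2020.groupby("state").unattributed.sum().sort_values(ascending=False))
```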

My analysis at Medium