About me

I am a machine learning scientist with 5+ years of experience adapting and advancing the latest ML techniques to solve applied problems at scale. With a further 4 years in high-performance computing, data analysis, and visualization, I enjoy tackling complex challenges and extracting insights from large datasets.

Below you will find a few of my recent research projects and published works.

Contact details and resume follow - please reach out to chat!

Millions of images, most without labelled information... So how do we learn?

Sky surveys image hundreds of millions of galaxies, but most will never be examined by an expert or assigned a label. This poses a number of hurdles for off-the-shelf supervised machine learning methods aiming to classify galaxy types and discover rare objects.

First, you need the data! So I compiled nearly 100 million galaxy images from the Dark Energy Spectroscopic Instrument (DESI) Legacy Imaging Surveys in a machine-learning-ready format.
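For a flavour of what this involves, here is a minimal sketch of assembling such a dataset: query a cutout service for each catalogued position and stack the images into one machine-learning-ready file. The endpoint, query parameters, and toy catalogue below are my assumptions for illustration, not the actual pipeline.

```python
# Minimal sketch: download JPEG cutouts and stack them into an HDF5 file.
# The endpoint and query parameters are assumptions for illustration.
import io

import h5py
import numpy as np
import requests
from PIL import Image

CUTOUT_URL = "https://www.legacysurvey.org/viewer/cutout.jpg"  # assumed endpoint

def fetch_cutout(ra, dec, size_pix=152, pixscale=0.262):
    """Download a single JPEG cutout centred on (ra, dec), in degrees."""
    params = {"ra": ra, "dec": dec, "size": size_pix,
              "pixscale": pixscale, "layer": "ls-dr9"}
    resp = requests.get(CUTOUT_URL, params=params, timeout=30)
    resp.raise_for_status()
    return np.asarray(Image.open(io.BytesIO(resp.content)))  # (H, W, 3) uint8

# Hypothetical sky positions; the real catalogue holds ~100 million entries.
catalogue = [(150.1, 2.2), (150.3, 2.4)]

with h5py.File("galaxies.h5", "w") as f:
    images = f.create_dataset("images", (len(catalogue), 152, 152, 3), dtype="uint8")
    for i, (ra, dec) in enumerate(catalogue):
        images[i] = fetch_cutout(ra, dec)
```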

Then, I utilized cutting-edge techniques from self-supervised learning - an emerging paradigm in computer vision capable of learning without labelled information - to train convolutional neural networks that perform well across a range of tasks.
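The core idea is contrastive: two augmented views of the same galaxy should map to nearby representations, while views of different galaxies are pushed apart. Below is a minimal sketch of a SimCLR-style contrastive loss - an illustration of the paradigm, not the exact recipe from the papers.

```python
# Minimal sketch of a SimCLR-style (NT-Xent) contrastive objective.
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """z1, z2: (N, D) embeddings of two augmented views of the same N images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                     # (2N, D)
    sim = z @ z.t() / temperature                      # pairwise similarities
    sim.fill_diagonal_(float("-inf"))                  # exclude self-pairs
    n = z1.shape[0]
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)               # pull positives together

# Usage: embed two randomly augmented batches with the same (toy) encoder.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 128))
views = torch.rand(2, 32, 3, 64, 64)                   # stand-in augmented views
loss = contrastive_loss(encoder(views[0]), encoder(views[1]))
```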

I used these networks to develop tools for exploring this massive dataset, discover 1600 extremely rare strong gravitational lenses, and construct models to automatically classify millions of galaxies. Please check out a few of my works below!
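One such exploration tool is similarity search: embed every image with the trained network, then look up the nearest neighbours of any query galaxy. A minimal sketch, using random features as a stand-in for the CNN embeddings:

```python
# Minimal sketch of similarity search over learned representations.
import numpy as np
from sklearn.neighbors import NearestNeighbors

features = np.random.rand(100_000, 128)       # stand-in for CNN embeddings
index = NearestNeighbors(n_neighbors=8).fit(features)
query = features[42:43]                       # embedding of a chosen galaxy
dist, idx = index.kneighbors(query)           # most visually similar galaxies
```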

Try out my interactive data discovery app here
See it in the press
Access the code and data

Or take a look at some published works!

Mining for strong gravitational lenses with self-supervised learning. Stein et al., 2022
Self-Supervised Representation Learning for Astronomical Images. Hayat & Stein et al., 2021

Finding a needle in a haystack

Searching for rare events is hard enough when you know what they look like, but what about when you have no idea?

The LHC Olympics (LHCO) data challenge was created to develop methods for this purpose. Participants were provided with a set of "black-box" datasets and charged with finding any anomalous events - without having any idea what these events might look like!

While anomaly detection is a key application of machine learning, most methods focus on detecting outlying samples in the low-probability-density regions of the data. I instead developed a method for unsupervised in-distribution anomaly detection using a conditional density estimator, designed to find unique, yet completely unknown, sets of samples residing in high-probability-density regions.
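To make the distinction concrete, here is a toy version of the idea (my simplified illustration, not the competition entry): fit a flexible density estimate and a deliberately smooth one to the same events, and flag regions where their ratio is large. A narrow resonance then stands out even though it sits in a high-density region.

```python
# Toy in-distribution anomaly detection via a density ratio. The signal
# fraction is exaggerated here so the bump is visible in a small example.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
background = rng.exponential(scale=1.0, size=20_000)   # smooth falling spectrum
signal = rng.normal(loc=1.5, scale=0.02, size=300)     # narrow resonance bump
data = np.concatenate([background, signal])

flexible = gaussian_kde(data, bw_method=0.01)          # tracks fine structure
smooth = gaussian_kde(data, bw_method=0.5)             # deliberately oversmoothed

grid = np.linspace(0.5, 4.0, 500)                      # stay away from the edge
scores = flexible(grid) / smooth(grid)                 # local density ratio
print(f"largest overdensity at ~{grid[np.argmax(scores)]:.2f}")  # near 1.5
```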

Outperforming 12 other teams, we managed to detect a new particle appearing in only 0.08% of 1 million collision events, and took home the gold medal.

See us in the press
Read my writeup
Or read about the competition in detail

Additional Projects

Machine learning in cosmology

The rapid pace of machine learning is difficult to keep on top of, yet recognizing trends and analysing state-of-the-art methods is essential to achieving the best results for your problem.

Since 2018 I have curated a (now popular) comprehensive public archive of ML applications to the study of cosmology, in order to help scientists and domain experts identify AI solutions applicable to novel problems and rapidly implement them. Take a look to see what's been going on in the field or to design your own project!

github.com/georgestein/ml-in-cosmology

Segmentation of Satellite Imagery

Multispectral satellite imagery is crucial for geospatial applications such as monitoring wildfires, mapping deforestation, and tracking erupting volcanoes. But the presence of clouds introduces noise and inaccuracy, so they need to be identified and removed.

Competing as part of a two-person team in a popular data science competition, I trained an ensemble of segmentation models to identify cloud cover in satellite imagery. Our unique approach leveraged public APIs to increase the public dataset size 10-fold, and allowed us to design a custom set of physically motivated augmentations that nearly eliminated overfitting. We ultimately finished in the top 3% of 850 participants.
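For a flavour of what "physically motivated" can mean for multispectral data, here is a minimal sketch of two such augmentations - per-band gain/offset jitter mimicking sensor and illumination differences, and a smooth additive brightness field as a crude stand-in for thin haze. These are illustrative guesses, not the exact set we used.

```python
# Minimal sketch of physically motivated multispectral augmentations.
import numpy as np

def augment(img, rng):
    """img: (H, W, C) float array of reflectances in [0, 1]."""
    h, w, c = img.shape
    gain = rng.uniform(0.9, 1.1, size=c)        # per-band sensor gain jitter
    offset = rng.uniform(-0.02, 0.02, size=c)   # per-band calibration offset
    img = img * gain + offset
    # Smooth additive brightness field as a crude stand-in for thin haze.
    coarse = rng.uniform(0.0, 0.15, size=(4, 4, 1))
    haze = np.kron(coarse, np.ones((h // 4, w // 4, 1)))
    return np.clip(img + haze, 0.0, 1.0)

rng = np.random.default_rng(0)
patch = rng.random((64, 64, 4))                 # toy 4-band image patch
augmented = augment(patch, rng)
```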

Find a writeup of our approach here!

A probabilistic autoencoder for type Ia supernovae

Type Ia supernovae are rare yet extremely useful objects for mapping the expansion history of our universe. Yet their complexity cannot be modelled from first principles, so we must understand them in a data-driven manner.

To do so, I developed a Probabilistic Autoencoder to study the diversity of the spectra we observe from them, to identify objects that do not fit average trends, and to minimize the error that propagates into cosmological analyses.
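The essence of a probabilistic autoencoder is to pair a compressive autoencoder with a density model over its latent space, so every spectrum receives a likelihood. In the sketch below a single Gaussian stands in for the normalizing flow used in the real model, and the data are random stand-ins for observed spectra.

```python
# Minimal sketch of the probabilistic autoencoder idea: compress, then score
# each object by the likelihood of its latent code.
import torch
import torch.nn as nn

n_wave, n_latent = 1000, 8                       # toy spectrum length, latent size
encoder = nn.Sequential(nn.Linear(n_wave, 128), nn.ReLU(), nn.Linear(128, n_latent))
decoder = nn.Sequential(nn.Linear(n_latent, 128), nn.ReLU(), nn.Linear(128, n_wave))

spectra = torch.rand(256, n_wave)                # stand-in for observed spectra
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
for _ in range(100):                             # reconstruction training
    loss = ((decoder(encoder(spectra)) - spectra) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Density model over the latents (a Gaussian here; a flow in the real model).
with torch.no_grad():
    z = encoder(spectra)
latent_density = torch.distributions.MultivariateNormal(
    z.mean(0), covariance_matrix=torch.cov(z.T) + 1e-4 * torch.eye(n_latent))
log_likelihood = latent_density.log_prob(z)      # low values flag unusual spectra
```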

Stein et al., 2022

The Websky Suite of Extragalactic CMB simulations

Realistic simulations of the universe are necessary to develop and test scientific analyses, but are a key technical challenge due to the large dynamic range and accuracy required.

I worked to modernize, parallelize, and optimize the performance of the mass-Peak Patch simulation pipeline to run on high-performance computing clusters. After completing testing and development of the package, I constructed and released some of the most advanced simulations to date - the Websky suite.

Websky paper: Stein et al., 2020
Methods paper: Stein et al., 2019

A volumetric deep Convolutional Neural Network for simulation of mock dark matter halo catalogues

Simulations of the universe from the initial conditions to the present day are computationally expensive. To speed things up, we trained a U-net on 3D volumes of the universe to predict the final structure given the initial conditions. This allowed us to skip the whole simulation and jump right to the final answer.
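A toy version of such a network, assuming a PyTorch 3D U-Net with a single down/up level and illustrative channel counts rather than the published architecture:

```python
# Minimal sketch of a 3D U-Net mapping an initial density field to the final one.
import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(nn.Conv3d(c_in, c_out, 3, padding=1), nn.ReLU(),
                         nn.Conv3d(c_out, c_out, 3, padding=1), nn.ReLU())

class TinyUNet3D(nn.Module):
    def __init__(self):
        super().__init__()
        self.down = block(1, 16)
        self.pool = nn.MaxPool3d(2)
        self.mid = block(16, 32)
        self.up = nn.ConvTranspose3d(32, 16, 2, stride=2)
        self.out = nn.Sequential(block(32, 16), nn.Conv3d(16, 1, 1))

    def forward(self, x):
        skip = self.down(x)                       # full-resolution features
        x = self.mid(self.pool(skip))             # coarse features
        x = self.up(x)                            # back to full resolution
        return self.out(torch.cat([x, skip], 1))  # skip connection, then predict

net = TinyUNet3D()
initial = torch.rand(2, 1, 32, 32, 32)            # toy initial-condition volumes
final = net(initial)                              # predicted final structure
```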

Berger & Stein, 2019

COVID-19 in the USA: A state-by-state investigation

The early stages of the pandemic were marked by extreme uncertainty, so beginning in March 2020 I compiled data from the CDC, NYT, and a variety of other outlets to visualize and study mortality trends. By comparing 2020 mortality from all causes to the official coronavirus death toll, I uncovered clear evidence in a few states of an excess of deaths above those officially attributed to the virus.
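The core computation is simple: estimate expected deaths from pre-pandemic years, subtract them from observed 2020 all-cause deaths, and compare the remainder to the official COVID-19 toll. A minimal sketch, assuming a hypothetical CSV of weekly state-level counts (the column names are illustrative, not the actual CDC/NYT schemas):

```python
# Minimal sketch of an excess-mortality comparison on a hypothetical dataset.
import pandas as pd

df = pd.read_csv("state_weekly_deaths.csv")      # hypothetical input file

# Expected deaths per state and week, averaged over pre-pandemic years.
baseline = (df[df.year.between(2015, 2019)]
            .groupby(["state", "week"], as_index=False)
            .all_cause_deaths.mean()
            .rename(columns={"all_cause_deaths": "expected_deaths"}))

d2020 = df[df.year == 2020].merge(baseline, on=["state", "week"])
d2020["excess"] = d2020.all_cause_deaths - d2020.expected_deaths
d2020["unattributed"] = d2020.excess - d2020.covid_deaths

# States where excess deaths most exceed the official COVID-19 toll.
print(d2020.groupby("state").unattributed.sum().sort_values(ascending=False))
```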

My analysis at Medium