False Positives, Be Gone! SkyCURTAINs Clears the Skies for Galactic Stream Hunters


Too Long; Didn't Read

SkyCURTAINs uses weakly supervised ML and Gaia data to find stellar streams, improving purity and reducing false positives in galactic structure discovery.


Authors:

(1) Debajyoti Sengupta, Département de physique nucléaire et corpusculaire, University of Geneva, Switzerland ([email protected]);

(2) Stephen Mulligan, Département de physique nucléaire et corpusculaire, University of Geneva, Switzerland;

(3) David Shih, NHETC, Dept. of Physics and Astronomy, Rutgers, Piscataway, NJ 08854, USA;

(4) John Andrew Raine, Département de physique nucléaire et corpusculaire, University of Geneva, Switzerland;

(5) Tobias Golling, Département de physique nucléaire et corpusculaire, University of Geneva, Switzerland.

Abstract and 1. Introduction

2. Dataset

3. SkyCURTAINs Method and 3.1 CurtainsF4F

3.2. Line detection

4. Results

4.1. Metrics

4.2. Full GD-1 stream scan

5. Conclusion, Acknowledgments, Data Availability, and References


APPENDIX A: CurtainsF4F TRAINING AND HYPERPARAMETER TUNING DETAILS

A1. CurtainsF4F features preprocessing

A2. Hyperparameter tuning

ABSTRACT

We present SkyCURTAINs, a data-driven and model-agnostic method to search for stellar streams in the Milky Way galaxy using data from the Gaia telescope. SkyCURTAINs is a weakly supervised machine learning algorithm that builds a background-enriched template in the signal region by leveraging the correlation of the sources' characterising features with their proper motion in the sky. This allows for a more representative template of the background in the signal region, and reduces the false positives in the search for stellar streams. The minimal model assumptions in the SkyCURTAINs method allow for a flexible and efficient search for various kinds of anomalies, such as streams, globular clusters, or dwarf galaxies, directly from the data. We test the performance of SkyCURTAINs on the GD-1 stream and show that it is able to recover the stream with a purity of 75.4%, which is an improvement of over 10% over existing machine learning based methods, while retaining a signal efficiency of 37.9%.

1 INTRODUCTION

When smaller gravitationally bound systems, such as globular clusters or satellite dwarf galaxies, are disrupted by their host galaxy, the stars in these systems are tidally stripped off. This results in streams of stars, called stellar streams, which over time trace out the orbit of the progenitor system. Since the interactions between these large-scale gravitationally bound systems occur over very long timescales, real-time observations of these events are impossible. Stellar streams are therefore an excellent alternative probe into the merger history of these systems (Johnston 1998; Helmi & White 1999; Carlberg 2017; Vera-Casanova et al. 2022; Belokurov et al. 2006). Moreover, the orbits of these streams are sensitive to the gravitational potential of the host galaxy, and thus can be used to constrain the mass distribution in it (Johnston et al. 1999; Ibata et al. 2001; Koposov et al. 2010; Sanders & Binney 2013; Banik & Bovy 2019). Over time, due to gravitational interaction with the surrounding matter, the shape of these streams changes, and the density perturbations therein, such as gaps and spurs, can also provide insights into the dark matter distribution in the galaxy (Carlberg et al. 2012; Varghese et al. 2011; Sanders et al. 2016; Bonaca et al. 2019, 2020) and its properties (Purcell et al. 2012; Necib et al. 2019). The study of stellar streams is thus crucial to understanding the formation and evolution of galaxies, and the content thereof.


The Gaia mission (Gaia Collaboration et al. 2018) has provided an unprecedented dataset of stars in the Milky Way, with accurate astrometric and photometric measurements. This wealth of data has allowed the development of several techniques to detect stellar streams (Malhan & Ibata 2018; Malhan et al. 2018; Yuan et al. 2018; Meingast & Alves 2019; Borsato et al. 2019; Meingast et al. 2019; Ibata et al. 2021). In general, these methods leverage the astrophysics of stellar streams, such as their grouping in chemical composition and kinematics, to identify the stream candidates. For instance, the Streamfinder algorithm (Malhan & Ibata 2018; Malhan et al. 2018) assumes a specific model for the gravitational potential of the Milky Way galaxy, and searches for stars occupying the same hyperdimensional tubes through a six-dimensional position and velocity space.


More recently, several machine learning techniques have been employed to detect stellar streams. In particular, Via Machinae (Shih et al. 2021, 2023) and CWoLa (Pettee et al. 2023) are fully data-driven and make minimal model assumptions about the streams. These techniques were originally introduced in the context of High Energy Physics to find localised overdensities in the feature space. In the case of kinematically cold stellar streams, the member stars are expected to produce localised overdensities in the proper motion feature. One can define a signal region (SR) based on the proper motion, where there is an increased population of stellar stream stars, and sidebands (SB1, SB2) on either side of the SR, where the stream members are not expected to be present (or are present at a far lower rate than in the SR).
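As a rough illustration of the signal region and sideband definition described above, the following sketch splits a one-dimensional proper-motion array into SB1, SR and SB2 masks. The function name, the window edges and the toy data are all assumptions made for this example; they are not values or code from the paper.

```python
import numpy as np

def split_signal_and_sidebands(mu, sr_low, sr_high, sb_width):
    """Return boolean masks (sb1, sr, sb2) for a 1D proper-motion array.

    Hypothetical helper: the SR is the half-open window [sr_low, sr_high),
    and each sideband is a window of width sb_width adjacent to it.
    """
    mu = np.asarray(mu, dtype=float)
    sr = (mu >= sr_low) & (mu < sr_high)
    sb1 = (mu >= sr_low - sb_width) & (mu < sr_low)
    sb2 = (mu >= sr_high) & (mu < sr_high + sb_width)
    return sb1, sr, sb2

# Toy data (assumed): a flat background in proper motion plus a narrow,
# kinematically cold overdensity standing in for a stream.
rng = np.random.default_rng(0)
background = rng.uniform(-20.0, 20.0, size=5000)
stream = rng.normal(-13.0, 0.3, size=200)
mu = np.concatenate([background, stream])

# Window edges in mas/yr are arbitrary illustrative choices.
sb1, sr, sb2 = split_signal_and_sidebands(mu, -14.0, -12.0, 2.0)
print("SB1:", sb1.sum(), "SR:", sr.sum(), "SB2:", sb2.sum())
```

In this toy setup the SR count exceeds either sideband count because the narrow overdensity falls inside the SR window, which is the kind of localised excess the methods above are designed to pick up.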



It is possible to circumvent this bias if a suitable template of the background is constructed to be used in the CWoLa method. We propose SkyCURTAINs, which constructs a background-enriched template of the stars in the SR in a data-driven manner. SkyCURTAINs is based on CurtainsF4F, a method originally developed for anomaly detection in High Energy Physics (Raine et al. 2023; Sengupta et al. 2023). CurtainsF4F is a data-driven weakly supervised strategy that extends the CWoLa method to mitigate the problem of correlation of discriminatory features with the proper motion feature. We leverage the correlation of the features with the proper motion to generate a template in the signal region using the sidebands. This alleviates the need to sample data from the SB for CWoLa, and results in a template that is more representative of the background in the SR. One can then use the CWoLa method to tag the stars in the SR by training a classifier to separate the template from the SR data, followed by a line-finding algorithm to identify the stream.
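The CWoLa tagging step can be sketched as follows, under several assumptions: the template is labelled 0 and the SR data 1, and a plain logistic regression written in NumPy stands in for whatever classifier the paper actually uses. The background template here is simply drawn from the same toy distribution as the SR background; in SkyCURTAINs it would instead be produced by CurtainsF4F from the sidebands. All names and numbers are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_cwola(template, sr_data, n_steps=300, lr=0.5):
    """Fit a logistic regression separating template (0) from SR data (1).

    Returns a function mapping feature rows to scores in (0, 1).
    No truth labels are used: only the template-vs-data membership.
    """
    X = np.vstack([template, sr_data])
    y = np.concatenate([np.zeros(len(template)), np.ones(len(sr_data))])
    mean, std = X.mean(axis=0), X.std(axis=0)
    Xs = (X - mean) / std                 # standardise the features
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):              # plain batch gradient descent
        grad = sigmoid(Xs @ w + b) - y
        w -= lr * (Xs.T @ grad) / len(y)
        b -= lr * grad.mean()

    def score(features):
        return sigmoid(((np.asarray(features) - mean) / std) @ w + b)

    return score

# Toy data (assumed): two generic features per star.
rng = np.random.default_rng(1)
template = rng.normal(0.0, 1.0, size=(2000, 2))       # background-only template
sr_background = rng.normal(0.0, 1.0, size=(2000, 2))  # background in the SR
sr_stream = rng.normal(2.0, 0.2, size=(200, 2))       # localised overdensity
sr_data = np.vstack([sr_background, sr_stream])

score = train_cwola(template, sr_data)
scores = score(sr_data)
print("mean score, background:", scores[:2000].mean())
print("mean score, stream:    ", scores[2000:].mean())
```

Because the only systematic difference between the two samples is the overdensity, stars resembling it tend to receive higher scores, and a threshold on the score yields the tagged candidates that a subsequent line-finding step would consume.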


Constructing a background-enriched template significantly reduces false positives, which is a major advantage of the SkyCURTAINs method over the standalone CWoLa method. As we will see in Section 3, SkyCURTAINs has a modular design, and its data efficiency in training allows the method to scale efficiently to a larger number of patches.


2 DATASET

We demonstrate the SkyCURTAINs method on the Gaia Data Release 2 (GDR2) (Gaia Collaboration et al. 2018) dataset. GDR2 contains detailed astrometric and photometric information for over 1.3 billion sources in the Milky Way galaxy. The dataset characterises each source by the right ascension (𝛼) and declination (𝛿), the parallax (𝜛), proper motions in right ascension (𝜇𝛼) and declination (𝜇𝛿), the apparent magnitude (𝐺), and the colour information in the form of the GBP and GRP bands (𝐺BP − 𝐺RP). The newer Gaia Data Release 3 (GDR3) comes with improved measurements of radial velocities, but as the SkyCURTAINs method does not utilise this information, we use the GDR2 dataset. This allows for a direct comparison with the Via Machinae and CWoLa methods, which were developed using the GDR2 dataset.





The kinematic cuts are applied to reject distant stars, which produce an overdensity in the proper motion at ∼ 0 mas/yr and would otherwise reduce the sensitivity of the model to overdensities produced by stellar streams. The cut on magnitude removes stars that are too dim and ensures we have a uniform coverage of stars from the Gaia dataset. The cut on colour isolates older, low-metallicity stars, which are more likely to be stellar stream members.


This paper is available on arXiv under CC BY 4.0 DEED license.

