In this age of large astronomical surveys, one major scientific bottleneck is the analysis of enormous data sets. Traditionally, this task requires human input — but could computers eventually take over? A pair of scientists explore this question by testing whether computers can classify galaxies as well as humans.
Limits of Citizen Science
Galaxy Zoo is an internet-based citizen science project that uses non-astronomer volunteers to classify galaxy images. This is an innovative way to provide more manpower, but it’s still only practical for limited catalog sizes. How do we handle the data from upcoming surveys like the Large Synoptic Survey Telescope (LSST), which will produce billions of galaxy images when it comes online?
In a recent study, Evan Kuminski and Lior Shamir, two computer scientists at Lawrence Technological University in Michigan, used a machine learning algorithm known as Wndchrm to classify a dataset of Sloan Digital Sky Survey (SDSS) galaxies into ellipticals and spirals. The authors’ goal was to determine whether their algorithm can classify galaxies as accurately as the human volunteers for Galaxy Zoo.
Automatic Classification
After training their classifier on a small set of spiral and elliptical galaxies, Kuminski and Shamir set it loose on a catalog of ~3 million SDSS galaxies. The classifier first computes a set of 2,885 numerical descriptors (like textures, edges, and shapes) for each galaxy image, and then uses these descriptors to categorize the galaxy as spiral or elliptical.
In addition, the classifier calculates a certainty level for each classification, with the certainties adding to 100%: a galaxy categorized as spiral at 85% certainty is categorized as elliptical at 15% certainty. This provides a quantity/quality tradeoff, allowing for the creation of subcatalogs by cutting at specific certainty levels. Selecting for a high level of certainty decreases the sample size, but increases the sample’s classification accuracy.
Comparing the Outcome
To evaluate the accuracy of the algorithm’s findings, the authors examined SDSS galaxies that had also been classified by Galaxy Zoo. In particular, they used a 45,000-galaxy subset that consists only of “superclean” Galaxy Zoo galaxies — meaning the human volunteers who categorized them were in agreement at a level of 95% or higher.
In this set, Kuminski and Shamir found that if they drew a cut-off at the 54% certainty level for spiral galaxies and the 80% certainty level for ellipticals, they obtained 98% agreement between the computer classification of the galaxies and the human classification via Galaxy Zoo. Applying these cuts to the entire sample resulted in the identification of ~900,000 spiral galaxies and ~600,000 ellipticals, representing the largest catalog of its kind.
The authors acknowledge that completeness is a problem: half the data had to be cut to achieve this level of accuracy. Sacrificing some data can still result in very large catalogs, however, and as surveys become more powerful and large databases become more prevalent, algorithms such as this one will likely become critical to the scientific process.
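The certainty cuts and the agreement measure described above can be illustrated with a short Python sketch. It is not the authors' code: the function names, the `Galaxy` record, and the toy sample are made up for illustration, and treating each cut as inclusive (`>=`) is an assumption. Only the 54% and 80% thresholds, and the fact that the two certainties add to 100%, come from the article.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Cut-off certainties quoted in the article.
SPIRAL_CUT = 0.54
ELLIPTICAL_CUT = 0.80


@dataclass
class Galaxy:
    """Hypothetical record: one classifier certainty plus a human label."""
    p_spiral: float   # classifier certainty that the galaxy is a spiral (0 to 1)
    human_label: str  # stand-in for a Galaxy Zoo label: "spiral" or "elliptical"


def apply_cuts(p_spiral: float) -> Optional[str]:
    """Return the machine label, or None if the galaxy fails both cuts."""
    p_elliptical = 1.0 - p_spiral  # the two certainties add to 100%
    if p_spiral >= SPIRAL_CUT:          # assumption: cut is inclusive
        return "spiral"
    if p_elliptical >= ELLIPTICAL_CUT:  # assumption: cut is inclusive
        return "elliptical"
    return None


def summarize(galaxies: List[Galaxy]) -> Tuple[float, float]:
    """Return (fraction of galaxies kept, machine/human agreement among kept)."""
    labelled = [(apply_cuts(g.p_spiral), g.human_label) for g in galaxies]
    kept = [(machine, human) for machine, human in labelled if machine is not None]
    completeness = len(kept) / len(galaxies)
    if kept:
        agreement = sum(machine == human for machine, human in kept) / len(kept)
    else:
        agreement = float("nan")
    return completeness, agreement


# Made-up toy sample, only to exercise the functions above.
sample = [
    Galaxy(0.90, "spiral"),
    Galaxy(0.60, "spiral"),
    Galaxy(0.50, "spiral"),      # fails both cuts and is dropped
    Galaxy(0.30, "elliptical"),  # elliptical certainty 70% < 80%, dropped
    Galaxy(0.10, "elliptical"),
    Galaxy(0.05, "elliptical"),
    Galaxy(0.70, "elliptical"),  # kept as spiral, disagrees with the human label
    Galaxy(0.15, "elliptical"),
]

completeness, agreement = summarize(sample)
print(f"kept {completeness:.0%} of the sample, agreement {agreement:.0%}")
```

Raising either threshold in this sketch drops more galaxies into the unclassified group, which is the quantity/quality tradeoff the article describes: fewer galaxies survive the cut, but those that do are more likely to match the human labels.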
Citation
Evan Kuminski and Lior Shamir 2016 ApJS 223 20. doi:10.3847/0067-0049/223/2/20