A Dataset and Tools for Detection and Segmentation of Clouds in Optical Satellite Imagery
2021-10-01, 09:30–10:00, Aconcagua

Some 70% of Earth's surface is covered by clouds at any given time, according to NASA. The existence of clouds in optical satellite imagery limits its usefulness, blocking features of interest on the ground. We have developed a machine learning-based approach for detection and segmentation of clouds, and for production of cloud-free mosaics. Our efforts include development of an open dataset consisting of Sentinel-2 imagery and labeled clouds, creation of a lightweight ML architecture for cloud segmentation, and creation of open source tools for inexpensive production of cloud-free mosaics at continent scale. Our dataset and methods are applicable to many types of optical satellite imagery, not just Sentinel-2.

Since about 70% of the Earth’s surface is covered by clouds at any given time (according to NASA), the presence of clouds in optical satellite imagery is a common problem.

The focus of our cloud-detection and cloud-removal efforts has been threefold. First, we have created an open dataset consisting of Sentinel-2 images and labeled clouds that is suitable for training a machine learning-based cloud segmentation model. Second, we have developed a new model architecture that can be used inexpensively at scale (a lightweight architecture that can run efficiently on CPUs rather than requiring GPUs). Third, we have developed tools to apply our ML models at scale to produce continent-scale cloudless mosaics.
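To make the third point concrete, here is a minimal sketch of one common way to build a cloud-free composite: for each pixel, take the median of the observations that a cloud mask marks as clear. This is an illustrative assumption, not necessarily the method used by our tools. The function name `cloud_free_composite`, the flattened single-band layout, and the toy values are all hypothetical.

```python
from statistics import median
from typing import List, Optional, Sequence


def cloud_free_composite(
    observations: Sequence[Sequence[float]],
    cloud_masks: Sequence[Sequence[bool]],
) -> List[Optional[float]]:
    """Per-pixel median over clear observations (hypothetical helper).

    observations[d][i] is the value of pixel i on date d (one band, flattened).
    cloud_masks[d][i] is True where a segmentation model flagged cloud.
    Pixels that are cloudy on every date come back as None.
    """
    n_pixels = len(observations[0])
    composite: List[Optional[float]] = []
    for i in range(n_pixels):
        # Keep only the dates on which this pixel was not flagged as cloud.
        clear = [obs[i] for obs, mask in zip(observations, cloud_masks) if not mask[i]]
        composite.append(median(clear) if clear else None)
    return composite


# Toy, made-up data: three acquisition dates, four pixels each.
observations = [
    [0.10, 0.90, 0.30, 0.95],
    [0.12, 0.20, 0.85, 0.90],
    [0.11, 0.22, 0.32, 0.92],
]
cloud_masks = [
    [False, True, False, True],
    [False, False, True, True],
    [False, False, False, True],
]
result = cloud_free_composite(observations, cloud_masks)
print(result)
```

In this toy example the last pixel is cloudy on every date, so it stays unfilled; a real pipeline would need more acquisitions or a fallback for such pixels.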

Our lightweight architecture achieved an F1 score greater than 0.82 on our multi-continent, multi-biome validation set, and it can easily be biased in favor of either greater precision or greater recall if needed. We compared its results to those produced by a traditional, deep architecture (a Feature Pyramid Network with a ResNet-18 backbone) and found that the latter achieved an F1 score of around 0.91 on the same validation set. Although the deep architecture scores higher on this metric, the results are quite close on subjective inspection. If resources permit, one should consider using an ensemble containing both types of architectures, because each has its own strengths and weaknesses. Please see this blog post for a preview of some of our results.
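As an illustration of the precision/recall trade-off and the ensemble idea, the sketch below scores per-pixel cloud probabilities against binary labels at several decision thresholds, and averages the probabilities of two models. The helper names (`precision_recall_f1`, `ensemble_mean`), the thresholds, and the toy data are assumptions made for the example; they are not our evaluation code or results.

```python
from typing import List, Sequence, Tuple


def precision_recall_f1(
    probs: Sequence[float], labels: Sequence[int], threshold: float = 0.5
) -> Tuple[float, float, float]:
    """Score per-pixel cloud probabilities against binary labels (1 = cloud)."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred and y:
            tp += 1
        elif pred and not y:
            fp += 1
        elif not pred and y:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def ensemble_mean(probs_a: Sequence[float], probs_b: Sequence[float]) -> List[float]:
    """Average two models' per-pixel probabilities (one simple ensembling choice)."""
    return [(a + b) / 2 for a, b in zip(probs_a, probs_b)]


# Toy, made-up data: six flattened pixels.
probs = [0.9, 0.7, 0.4, 0.3, 0.6, 0.1]
labels = [1, 1, 1, 0, 0, 0]

for t in (0.35, 0.5, 0.65):
    p, r, f = precision_recall_f1(probs, labels, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

On this toy data, lowering the threshold to 0.35 gives recall 1.0 at precision 0.75, while raising it to 0.65 gives precision 1.0 at recall of about 0.67. The same lever applies to a real model's probability output.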

Our open cloud dataset consists of more than 32 unique scenes (32 unique Sentinel-2 tiles with L1-C and L2-A versions of each) from 25 unique locations. It spans all inhabited continents, contains many different biome types, and has scenes from all four seasons.

As previously mentioned, our dataset consists of Sentinel-2 imagery. Because Sentinel-2 imagery contains so many bands, we believe that this dataset provides a viable starting point for training models to detect clouds in other types of imagery, not just Sentinel-2. (For example, one can restrict to training on the bands that are most similar to those in the target imagery to produce a model which can then be fine-tuned to the target imagery.)
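The band-restriction idea can be sketched as follows. The 13 band names are the standard Sentinel-2 L1C bands; the assumption that the image is stored band-first in that order, the helper names, and the hypothetical four-band target sensor (blue, green, red, near-infrared) are illustrative only and do not describe the layout of our dataset.

```python
from typing import List, Sequence

# Standard Sentinel-2 L1C band names. Assumed storage order for this sketch.
S2_L1C_BANDS = [
    "B01", "B02", "B03", "B04", "B05", "B06", "B07",
    "B08", "B8A", "B09", "B10", "B11", "B12",
]

# Hypothetical target sensor offering only blue, green, red and near-infrared.
TARGET_LIKE_BANDS = ["B02", "B03", "B04", "B08"]


def band_indices(
    wanted: Sequence[str], available: Sequence[str] = S2_L1C_BANDS
) -> List[int]:
    """Return the positions of the wanted bands within the available bands."""
    missing = [b for b in wanted if b not in available]
    if missing:
        raise ValueError(f"bands not available: {missing}")
    return [list(available).index(b) for b in wanted]


def select_bands(image: Sequence[Sequence[float]], indices: Sequence[int]) -> List[List[float]]:
    """Keep only the chosen bands of a band-first image (flattened pixels per band)."""
    return [list(image[i]) for i in indices]


# Toy image: 13 bands, two pixels each; band k is filled with the value k.
toy_image = [[float(k), float(k)] for k in range(len(S2_L1C_BANDS))]
indices = band_indices(TARGET_LIKE_BANDS)
subset = select_bands(toy_image, indices)
print(indices, subset)
```

A model trained on the reduced band set would then serve as the starting point for fine-tuning on labeled samples from the target imagery.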

(At time of writing, our dataset has not yet been released, but we commit to doing that as soon as possible and well before October.)

Authors and Affiliations

McClain, James (1)

(1) Azavea


Open data


Data collection, data sharing, data science, open data, big data, data exploitation platforms


2 - Basic. General basic knowledge is required.

Language of the Presentation


James McClain is involved in R&D at Azavea. Most of his work concerns machine learning and adjacent topics, but he also does work related to algorithms more generally.