Behind the Scenes: Producing the SpaceNet 5 Roads Dataset

Maxar is excited to continue our support of SpaceNet, a non-profit organization we helped launch in August 2016 that’s dedicated to accelerating open source, artificial intelligence (AI), and machine learning (ML) applied research for geospatial applications. SpaceNet is a collaborative initiative between Maxar Technologies, CosmiQ Works, Intel® AI, Amazon Web Services, and Capella Space. We are especially excited about the release of the SpaceNet 5 dataset announced August 22, which revisits the challenge of automated road network detection and routing, while adding the complexity of estimating travel time based on distance and road type. Routing is an important use case since it is essential for many humanitarian, civil, military and commercial applications.

Keeping foundational maps up to date, including roads, remains a labor- and cost-intensive aspect of generating map data. Over the last three years, SpaceNet has shown how AI/ML computer vision algorithms applied to overhead imagery can automate tasks such as building footprint and road network extraction. That said, fully automated map production is still limited by the quality of algorithm outputs and by how well those outputs generalize to any area in the world.

Given Maxar’s role as part of the team to produce the SpaceNet 5 dataset, we thought this would be a good opportunity to share a glimpse “behind the scenes.” Generating labeled training datasets is one of the toughest aspects of SpaceNet. It’s worth explaining that the SpaceNet team creates hand-labeled datasets since they are intended to serve as AI/ML training datasets. Though semi-automated means of dataset creation are being researched, thus far we’ve found that hand-labeling by dedicated, expert labelers is the most effective approach.

Team members from Maxar’s AI/ML training datasets labeling team reviewing the SpaceNet 5 roads dataset.

As a prerequisite to labeling, the SpaceNet production team develops a detailed requirements document and production guide that outlines general requirements, topology rules and the attribute fields to label. This document provides examples of how and how not to label the main features, and addresses any unusual edge cases a labeler might encounter. After years of creating training datasets for AI/ML, we generally find that it’s a best practice to limit the number of features (or attributes) being labeled to 3-5 to avoid overloading the labelers. The labeling team for this dataset consisted of 15 team members split between digitizers and expert analysts. Depending on size and complexity, which typically relate to geographic area and number of features, it generally takes 3-6 months to create a training dataset.

To begin production, the SpaceNet team considers the areas that currently have training data coverage and evaluates what new areas of interest (AOIs) to label to expand the geographic diversity of SpaceNet. The SpaceNet partners aim to label new geographic areas to encourage the development of algorithms that can be more geographically generalizable. Once the candidate AOIs are set, suitable satellite imagery is gathered from Maxar's 100+ PB library and provisioned to dedicated labelers via a collaborative mapping and task manager tool that Maxar developed based on the iD editor. This tool allows us to divide the mapping workload across the labeling team and monitor progress along the way.

Over 1,951 km of roads were labeled for Mumbai including attributes for road type, surface, speed limits, and number of lanes. Speed limit is visualized in the example image above (red = 65 mph for motorways to green = 25 mph for residential).

The SpaceNet 5 dataset contains 2,879 square kilometers of Maxar 30 cm satellite imagery and 8,160 kilometers (~5,070 miles) of labeled roads. To give you a sense of how large the dataset is: With an average speed limit of 42 miles per hour across the AOIs, it would take about 5 days of nonstop driving to cover the entire dataset (5,070 miles ÷ 42 mph ≈ 121 hours). As announced in Adam Van Etten’s recent blog post, this data covers four new locations:

Moscow, Russia
Mumbai, India
San Juan, Puerto Rico
Mystery City

Yes – you read that correctly, SpaceNet 5 has a “Mystery City” whose name is being held back until after the challenge concludes to encourage participants to create more generalizable algorithms. Building on SpaceNet 3, the SpaceNet 5 challenge will assess algorithm performance using a modified version of the Average Path Length Similarity metric, tuned to optimize travel times between nodes of interest.
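To make the idea of a travel-time version of Average Path Length Similarity more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the challenge's scoring code: the function names (`build_graph`, `shortest_times`, `apls_time`), the tuple layout for road edges, and the toy network are all hypothetical. It weights each road by length divided by speed limit, compares the fastest route between each pair of nodes in the ground truth and in a proposed network, and caps the penalty at 1 when a route is missing or badly off. Steps a full scoring pipeline would also need, such as matching nodes between the two graphs, are left out.

```python
import heapq
from itertools import combinations


def build_graph(edges):
    """Build an undirected graph weighted by travel time in hours.

    edges: iterable of (node_u, node_v, length_miles, speed_mph) tuples.
    This tuple layout is an assumption made for the sketch.
    """
    graph = {}
    for u, v, miles, mph in edges:
        hours = miles / mph
        graph.setdefault(u, {})[v] = hours
        graph.setdefault(v, {})[u] = hours
    return graph


def shortest_times(graph, source):
    """Dijkstra's algorithm: fastest travel time from source to each reachable node."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        elapsed, node = heapq.heappop(queue)
        if elapsed > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, hours in graph.get(node, {}).items():
            candidate = elapsed + hours
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return best


def apls_time(truth, proposal, nodes):
    """Simplified travel-time path similarity: 1 is a perfect match, 0 is the worst."""
    penalties = []
    for a, b in combinations(nodes, 2):
        t_true = shortest_times(truth, a).get(b)
        if not t_true:
            continue  # no route in the ground truth, so the pair is not scored
        t_prop = shortest_times(proposal, a).get(b)
        if t_prop is None:
            penalties.append(1.0)  # route missing from the proposal: full penalty
        else:
            penalties.append(min(1.0, abs(t_true - t_prop) / t_true))
    return 1.0 - sum(penalties) / len(penalties) if penalties else 0.0


# Toy network: two residential links (25 mph) and one motorway link (65 mph).
truth = build_graph([("A", "B", 1.0, 25), ("B", "C", 1.0, 25), ("A", "C", 6.5, 65)])
# A proposal that missed the residential link between B and C.
proposal = build_graph([("A", "B", 1.0, 25), ("A", "C", 6.5, 65)])

print(round(apls_time(truth, proposal, ["A", "B", "C"]), 3))
```

Because the weights are travel times rather than distances, a proposal that finds a road but assigns it the wrong speed is penalized as well as one that misses the road entirely.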

Quality assurance and quality control (QA/QC) is a key aspect of creating any dataset, but especially AI/ML training datasets: because they are used to train algorithms to identify similar features, the algorithms are only as good as the input examples. Our goal is to make the datasets as complete, consistent and accurate as possible. The labeling team first runs semi-automated checks for consistent attribute labels and for correct geometries based on predetermined topology rules. For example, road lines that cross must be joined at an intersection if they are not a bridge or overpass. The dataset is also visually inspected along the way and is then fully inspected again by a team of data scientists. As a final step, the data is tested with SpaceNet baseline algorithm implementations to ensure it can be used properly as training data.

Once a dataset is finalized, the SpaceNet production team chips the imagery and uploads the vectors (in this case road centerline vectors) to an Amazon S3 bucket, which is hosted by our partners at AWS as part of their Public Dataset program. The SpaceNet team provides links to each dataset via the main SpaceNet website and challenge page.

Hopefully this gives you an interesting glimpse behind the scenes and some insight about what goes into SpaceNet training dataset production. The SpaceNet team looks forward to hearing your feedback on the SpaceNet 5 roads dataset. A special thanks to WovenWare for their continued collaboration with Maxar in creating training datasets. Thanks also to Adam Van Etten from CosmiQ Works and Kevin McGee from Maxar for their input on this blog post and all of their outstanding contributions to SpaceNet.

Interested in participating in the SpaceNet 5 challenge or downloading the dataset for your own research? Visit SpaceNet.ai for more information.
