The 5 Best Data Annotation Platforms & Tools for Machine Learning (2021)

Over 30 options considered for this Mega List. Map of Landscape and Buying Guide Included. For Training Data, Data Annotation, and Deep Learning.

Introduction — The Best Data Annotation Tools

We all know Artificial Intelligence (AI) is taking the world by storm. The technology still has significant limitations, but it is maturing rapidly, and it is critical to stay up to date on the latest trends.

Here I walk through the top Artificial Intelligence Data Annotation platforms. If you are an executive, technical person new to the area, or just curious, this article is for you!

Let’s dive in! :)

Map of Training Data Landscape 2021

This map covers the top open source and closed source software. Additional firms are listed in the text.

Overview

There are 3 main categories to be aware of: Open Source Software, Closed Source Software, and Outsourcing.

Open source, like Diffgram, means the source code of the software is available for inspection. It also often means higher quality software, because more people test it, review it, and contribute to it. Communities form around open source software, which tends to make long term support more reliable.

Some outsourcing firms are also starting to offer closed source software — but generally their core market is outsourcing. The focus of this article is on the software itself.

As the founder of Diffgram I am naturally biased. I encourage those who are curious to inspect the docs and source code, and to try the product and see for yourself. The Diffgram team is very open to feedback: create an issue here, or contact us. Connect with me personally or follow me on Medium.

Best Overall: Diffgram

Best Overall Data Annotation Tools (2021):

  1. Diffgram
  2. Label Studio (Heartex)
  3. CVAT (Intel)
  4. SuperAnnotate
  5. Datasaur

Executive Summary

Diffgram is the overall best solution because:

  1. Open source.
  2. Most flexible deployment and most integrations. Diffgram runs anywhere, in the way you want.
  3. Most scalable. Diffgram imposes the fewest limits and scales out of the box to large multi-user setups and large datasets, with the most flexible integrations.

I’ll unpack why in more detail in this article.

Diffgram

Vital Stats: Founded in 2018 based in Santa Clara, CA, USA.

Diffgram has deep support for images and videos. Diffgram is a complete system — including team collaboration and large scale use cases.

Diffgram is open source, making it one of the only “second generation” systems to be open. This means a community of people is working together to build the best possible software.

Diffgram has the most integrations, with deep connections for AWS, Azure, GCP and more. The team blogs publicly, including sharing lessons learned on architecture and end-to-end testing.

Diffgram has the most comprehensive vision — to support all media types, and all application needs.

Pros

  1. Flexible deploy and many integrations — Diffgram runs anywhere.
  2. Scale every aspect — from volume of data, to number of supervisors, to ML speed up approaches.
  3. Fully featured — ‘batteries included’.

Diffgram is 5 Star Rated on G2 Crowd

Selected Diffgram Reviews

Diffgram Features

Diffgram is Best for Images and Video Annotation

Diffgram supports all the spatial types including: Quadratic Curves, Cuboids, Segmentation, Box, Polygons, Lines, Keypoints, Classification Tags.

Diffgram has the most attribute support including: Radio buttons, Multiple select, Date pickers, Sliders, Conditional logic, Directional Vectors.

Diffgram is Best for Semantic Segmentation

With fully integrated auto bordering, easy “draw over”, and integrated auto annotation (for example for people).

Semantic Segmentation — AutoBordering Example

Diffgram is Easiest to integrate with customizable Webhooks

Plus Webhooks, Notifications, Reporting and more.
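To make the webhook idea concrete, here is a minimal sketch of a receiver using only the Python standard library. It is not Diffgram's documented payload format: the keys `event`, `file_id`, and `task_status`, the helper `summarize_event`, and the port are all assumptions chosen for illustration, so check the real payload before relying on any of them.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_event(payload: dict) -> str:
    """Turn a webhook payload into a one-line log message.

    The keys used here are hypothetical placeholders, not a documented schema.
    """
    event = payload.get("event", "unknown")
    file_id = payload.get("file_id", "?")
    status = payload.get("task_status", "n/a")
    return f"{event}: file {file_id} (status: {status})"


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the sender declared.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            payload = json.loads(body or b"{}")
        except json.JSONDecodeError:
            payload = None
        if not isinstance(payload, dict):
            # Reject anything that is not a JSON object.
            self.send_response(400)
            self.end_headers()
            return
        print(summarize_event(payload))
        # 204: accepted, nothing to send back.
        self.send_response(204)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start the receiver; call this explicitly, it blocks forever."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Keeping the payload handling in a plain function like `summarize_event` means the logic can be tested without starting a server.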

Overall Diffgram has one of the most balanced and complete feature sets. In general all features work as expected. There are few “golden paths” — in many parts of the system a user can accomplish their goal in many ways.

Diffgram is Best for Perpetual Data Improvement (Data Streaming)

Technical Example of Data Flow for a Single Pipeline

Data Science teams often have the goal of perpetually improving training data. Diffgram is the only vendor to have a patent-pending method to achieve perpetual model improvement. I have been innovating in this area since 2018.

It includes automatically streaming files as they are completed. This means you can continually improve your models and create training data pipelines.
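As a rough illustration of the streaming idea (not Diffgram's actual implementation or SDK), the sketch below filters files as they are marked complete and groups them into small training batches. The `status` and `id` fields and both function names are assumptions made up for this example.

```python
from typing import Iterable, Iterator, List


def stream_completed(files: Iterable[dict]) -> Iterator[dict]:
    """Yield only files whose annotation is finished (hypothetical 'status' field)."""
    for f in files:
        if f.get("status") == "complete":
            yield f


def batches(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group a stream into lists of at most `size` items, flushing any remainder."""
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


files = [
    {"id": 1, "status": "complete"},
    {"id": 2, "status": "in_progress"},
    {"id": 3, "status": "complete"},
    {"id": 4, "status": "complete"},
]
batch_ids = [[f["id"] for f in b] for b in batches(stream_completed(files), 2)]
print(batch_ids)  # [[1, 3], [4]]
```

Because both steps are generators, a training job can start on the first full batch while annotation of the remaining files continues.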

Diffgram has the Best Integrated Issue tracking, Resolution and Quality Assurance:

Diffgram Runs Anywhere

Diffgram offers a 2 minute Docker Compose install and a complete production install on Kubernetes (Helm). See an example step by step AWS guide.

Diffgram has the Deepest Support for AWS, GCP, and Azure

You can choose any cloud vendor as the storage provider for Diffgram itself, and import and export from any cloud vendor.

Essentially this means you can mix and match as desired, and it works with all of them right out of the box. For example, you can have your system of record on Azure, but export to Google for training. Or have everything on one provider. (This is much better than being forced to send the data to whichever cloud provider the vendor happens to run their system on.)
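The mix-and-match idea can be sketched as a small storage interface with interchangeable backends. This is a conceptual sketch only: `BlobStore`, `InMemoryStore`, and `export` are hypothetical names, and the in-memory stores merely stand in for real cloud buckets. No actual cloud SDK or Diffgram API is used.

```python
from typing import Dict, Iterable, Protocol


class BlobStore(Protocol):
    """Anything that can read and write bytes by key (hypothetical interface)."""

    def read(self, key: str) -> bytes: ...

    def write(self, key: str, data: bytes) -> None: ...


class InMemoryStore:
    """Stand-in for a cloud bucket; a real backend would wrap a provider SDK."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blobs: Dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        return self.blobs[key]

    def write(self, key: str, data: bytes) -> None:
        self.blobs[key] = data


def export(source: BlobStore, destination: BlobStore, keys: Iterable[str]) -> int:
    """Copy the given keys from one store to another and return how many were copied."""
    count = 0
    for key in keys:
        destination.write(key, source.read(key))
        count += 1
    return count


# System of record on one provider, training export on another.
system_of_record = InMemoryStore("azure")
training_bucket = InMemoryStore("gcp")
system_of_record.write("images/0001.jpg", b"raw-bytes")
copied = export(system_of_record, training_bucket, ["images/0001.jpg"])
```

Since `export` only depends on the interface, swapping either side for a different provider does not change the calling code.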

Diffgram has the most Scalable Speed Approach with Userscripts Framework

With Diffgram Userscripts Framework you get the latest AI advances instantly.

Your own team can customize and improve it using built-in primitives — all in normal JavaScript. Train custom models for speed up (automatically), or use your own pre-trained models. All for free — it runs on the local machine.

Essentially — you get the best of the 10x-100x speed up promised by other vendors — for free. Learn more about why userscripts.

See all Examples

Diffgram’s vision is:

  1. Application: Support all popular media types for raw data; all popular schema, label, and attribute needs; and all annotation assist speed up approaches
  2. Support all popular training data management and organizational needs
  3. Integrate with all popular 3rd party applications and related offerings
  4. Support modification of source code
  5. Run on any hardware, any cloud, and anywhere

Diffgram Caveats — For Text Annotation Use Cases:

Diffgram text annotation is not yet released. In the meantime Diffgram has a strong integration with Datasaur that can solve many needs. The team (yes I’m biased) is very friendly and encourages people to create tickets.

Diffgram History

Diffgram went open source in April 2021. Since then the response has been tremendously positive, with absolutely stunning growth as shown by the stars and downloads.

(Previously we had the SDK library here so the stars were lower, now we are growing fast!!)

Label Studio

Label Studio is largely based around a configurable interface.

The breadth of Label Studio’s supported data types is impressive — however the depth is often a little disappointing. The high level demo looks good — but many types of deeper use cases fall short. Essentially there’s a little less polish — actions sometimes behave unexpectedly.

Their vision appears to be to support all types of data and many types of annotations.

It is not as clear what their data management vision is.

Or how that interacts with their open source product vs their paid product.

It’s not clear what integrations they support, or how to get it running on docker and Kubernetes.

Their speed up approach is unclear — it appears you need to be an expert in the specific area.

Intel CVAT (Computer Vision Annotation Tool)

Vital stats: Intel funded team. Open source.

CVAT is Intel’s attempt to build a moat around their OpenVINO offering and sell more hardware. While I’m sure the maintainers are smart people and work hard, that’s a hard shadow to work from. Also, this is nothing against Intel; it’s just an odd mix of concerns to have a hardware giant in this space.

CVAT is designed only for full time annotators. In general, subject matter experts who are new to annotation find it difficult, unintuitive and hard to use.

CVAT Caveats

Older UI: CVAT is, naturally, focused on computer vision exclusively (not text, NLP, etc.). In general, CVAT appears more concerned with meeting specific use case needs than with shipping an all-around product that is friendly to new users.

Missing backend features: CVAT has the smallest backend. This means many data management features are missing. This means you will likely need another tool to do data management.

Intel favoritism for Speed Up Approaches? CVAT is backed by Intel and appears to overly favor the OpenVINO (Intel) approaches. CVAT’s approach to model speed up is fairly heavy and opinionated.

SuperAnnotate

Vital stats: Closed source. Based in Armenia and founded in 2018.

Focus area: Image segmentation

SuperAnnotate is very focused on images (no video support). While video support is in the pipeline, at the moment it’s really an image annotation tool.

This hyper focus has benefits and drawbacks. It means your team will certainly need more than one tool to cover all needed use cases. The main benefit is that they have some very interesting speed up tools.

Closed source:

While they have part of their annotation studio Open, the secret sauce and majority of back end code is Closed.

My take on SuperAnnotate:

If you are serious about image segmentation and don’t need video they are worth a look! Keep in mind you will likely need to integrate with other offerings to get the degree of data management needed. I think Diffgram’s Userscripts is a more scalable approach to faster annotation and custom quality control.

Datasaur

Vital stats:

Closed source. Based in the USA and founded in 2018.

Diffgram Integration: ✅ Docs

Datasaur and Diffgram have a great integration.

Focus area: Text

Datasaur is exclusively a text annotation tool. Their ease of use and modern product take is great. Datasaur has a deep integration with Diffgram and can be used seamlessly for the text annotation part.

Example use cases covered:

  1. Transcribing and classifying medical symptoms and diagnoses from audio recordings of physician encounters
  2. Scanning scientific journals and academic papers for promising new medical treatments
  3. Classifying and labeling medical claims and billing codes

Caveats

Being a newer offering some features are still in development.

Closed source.

As with other tools mentioned, you will need to integrate it with others for full data management and for other data types.

Vision

I believe they are interested in expanding to other types but it’s not quite clear yet.

Labelbox

Vital stats: Closed source. Based in San Francisco. Founded 2017

Diffgram Integration: ✅ Docs

Labelbox supports images. They claim to have “…full natural language processing…”. They also have a video interface that has recently come out of beta.

Caveats

Is Downtime a Regular Occurrence With Labelbox? Labelbox appears to have outsourced core technical items, such as login (Auth0).

For example recently the entire service was down (inaccessible login). Something that’s not possible with Diffgram’s fully integrated login.

Labelbox has Low Ratings on Glassdoor

Labelbox is the lowest Glassdoor-rated company in the competitive set.¹ According to LinkedIn, the median tenure is under 1 year.²

The primary technical Labelbox founder said “Goodbye” halfway through 2020.

Where is Labelbox’s technical vision? Who is the technical leader at the company?

Labelbox is Closed Source

Labelbox initially claimed the product was open source, but then promptly removed most of the code — tying it back to their closed source offering.

Labelbox Summary

While Labelbox deserves kudos for being an early mover in this space, it has been overtaken by stronger open source alternatives (Diffgram, Label Studio) and new closed source options (SuperAnnotate, Datasaur) that are moving faster.

Labelbox Alternatives

All of these alternatives have similar features and enterprise support levels, and some have many more features that Labelbox doesn’t offer.

  1. Diffgram — Open source
  2. Label Studio (open source)
  3. Datasaur (Text)

Honorable mentions

  1. V7 Labs — A medical focused offering — Closed Source
  2. BasicAI — Closed Source
  3. SuperbAI — Closed Source

I have also extensively researched and considered the following alternatives, including (for most) personally trying the software: AWS Sagemaker GroundTruth, Alegion, ClarifaiAI, Hive, Playment, Scale AI, Appen, Lionbridge, Cloud Factory, SixGill, iMerit, Kili-Technology, MindySupport, IsaHit, LinkedAI, Edgecase.ai, HastyAI, Dataloop, Neurala, Keymakr, Prodigy, Tagtog, Google AI Data Platform, Vott, UDT, Yedda, Lost and a few others. Keep in mind that some of those listed are outsourcing firms.

Summary

Buying guide below

I have covered Diffgram, Label Studio, CVAT, SuperAnnotate, and Datasaur as the 5 best annotation tools for 2021. This review was focused on pure software.

While I respectfully realize there are many talented people working at all of these companies, I truly believe Diffgram, Label Studio, CVAT, SuperAnnotate, and Datasaur stand out as having the best overall approach and vision, with Diffgram and Label Studio being the best overall choices because of their open source nature.

Annotation Data Tool Buying Guide (2021)

Are you exclusively doing images (not video)? This can include object detection, semantic segmentation and more.

Best Annotation tool for Images:

  1. Diffgram
  2. SuperAnnotate
  3. CVAT

Are you exclusively doing video?

Best Annotation tool for Videos:

  1. Diffgram
  2. CVAT
  3. Label Studio

Are you using text data?

Best Annotation tool for Text (NLP):

  1. Datasaur + Diffgram
  2. Label Studio
  3. Labelbox + Diffgram

Diffgram is working on text! Upvote the discussion today for your needs.

Are you working with a large team, or multiple teams?

Best for Team Collaboration:

  1. Diffgram — Enterprise Edition
  2. Heartex (Enterprise Label Studio)

Are you looking for the best overall tool no matter the cost?

Best Overall Annotation Tools:

  1. Diffgram — Enterprise Edition
  2. Heartex (Enterprise Label Studio)

Are you looking to avoid vendor lock-in?

Best Open Source Annotation Tools:

  1. Diffgram
  2. Label Studio
  3. CVAT

Do you need version control?

Best version controlled Annotation Systems:

  1. Diffgram Versioning

Do you need speed up approaches?

Fastest Annotation Tools:

  1. Diffgram Userscripts Framework (Examples)
  2. CVAT
  3. Label Studio

Do you need streaming data or data pipelines? Do you want to perpetually improve your AI Models?

Best Data Pipeline Annotation Systems:

Diffgram Streaming

Do you need a security sensitive use case? Defense?

Best Data Annotation for Defense and Security Critical

  1. Diffgram Secure — Run on docker or your Kubernetes cluster
  2. CVAT
  3. Label Studio
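For readers who prefer a lookup to a list of questions, the guide above can be captured as a small table in code. The rankings are copied from this article; the key names and the `recommend` helper are just one possible way to organize them, not part of any tool.

```python
from typing import Dict, List

# Rankings as given in the buying guide above, keyed by the main need.
BUYING_GUIDE: Dict[str, List[str]] = {
    "images": ["Diffgram", "SuperAnnotate", "CVAT"],
    "video": ["Diffgram", "CVAT", "Label Studio"],
    "text": ["Datasaur + Diffgram", "Label Studio", "Labelbox + Diffgram"],
    "team collaboration": [
        "Diffgram (Enterprise Edition)",
        "Heartex (Enterprise Label Studio)",
    ],
    "overall": [
        "Diffgram (Enterprise Edition)",
        "Heartex (Enterprise Label Studio)",
    ],
    "open source": ["Diffgram", "Label Studio", "CVAT"],
    "version control": ["Diffgram Versioning"],
    "speed up": ["Diffgram Userscripts Framework", "CVAT", "Label Studio"],
    "data pipelines": ["Diffgram Streaming"],
    "security": ["Diffgram Secure", "CVAT", "Label Studio"],
}


def recommend(need: str, top: int = 3) -> List[str]:
    """Return up to `top` tools for a need, in the order the guide ranks them."""
    key = need.strip().lower()
    if key not in BUYING_GUIDE:
        known = ", ".join(sorted(BUYING_GUIDE))
        raise ValueError(f"Unknown need {need!r}; expected one of: {known}")
    return BUYING_GUIDE[key][:top]


print(recommend("video"))        # ['Diffgram', 'CVAT', 'Label Studio']
print(recommend("text", top=1))  # ['Datasaur + Diffgram']
```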

Disclaimers

1:

Labelbox is the lowest Glassdoor-rated company in the competitive set, of those that have ratings, at the time of writing.

Folder with proof screenshots.

Of those companies that have ratings, the ratings are as follows:

  1. Diffgram 5
  2. Alegion 4.4
  3. CVAT (Intel) 4.2
  4. Amazon 4.2
  5. Playment 4.1
  6. Clarifai 3.8
  7. Labelbox 3.5
  • Heartex: Not yet rated
  • SuperAnnotate: Not yet rated
  • Datasaur: Not yet rated

2:

According to LinkedIn, Labelbox median tenure is under 1 year.