The 5 Best Data Annotation Platforms & Tools for Machine Learning (2022)

Over 30 options considered for this Mega List. Map of Landscape and Buying Guide Included. For Training Data, Data Annotation, Deep Learning, and Open Source AI/ML.

Introduction — The Best Data Annotation Tools

We all know Artificial Intelligence (AI) is taking the world by storm. The technology still has significant limitations, but it’s maturing rapidly, and it’s critical to stay up to date on the latest trends.

Here I walk through the top Training Data platforms. If you are an executive, a technical person new to the area, or just curious, this article is for you!

Let’s dive in! :)

Map of Training Data Landscape

This map covers the top open source and closed source software. Additional firms are listed in the text.

Top Open Source Data Labeling: Diffgram. LabelStudio (Heartex) has been moved to Closed Source: for practical commercial purposes you must buy their Enterprise offering, so they are no longer really open source.

Overview

There are two main categories to be aware of.

  1. Open Source Software
  2. Closed Source Software

Open source, like Diffgram, means the source code of the software is available for inspection. It also generally means higher quality software because more people test it, review it, and contribute to it. Communities form around open source software — making long term support more reliable.

As the founder of Diffgram I am naturally biased. I encourage those who are curious to inspect the docs and source code, and to try the product to see for themselves.

Diffgram is evergreen software: it’s always improving. Create an issue or join our community.

Best Overall: Diffgram

Best Overall Data Labeling Tools (2022):

  1. Diffgram
  2. Label Studio (Heartex)
  3. CVAT (Intel)
  4. SuperAnnotate
  5. Datasaur

Executive Summary

Diffgram is the overall best solution because:

  1. Diffgram is the only fully Open Source option.
  2. All media types.
  3. Most integrations (HuggingFace, VertexAI, MinIO, GCP, AWS, etc)
  4. Most flexible deploy (K8s, Docker, etc). Diffgram runs anywhere.
  5. Most Scalable. Diffgram imposes the fewest limits and scales out of the box to large multi-user setups and large datasets, with the most flexible integrations.

Diffgram

Vital Stats: Founded in 2018. Based in Santa Clara, CA, USA.

Diffgram has deep support for all popular media types including images, video, 3D, text annotation, geospatial, and audio.

Diffgram is a complete system, from Annotation to Workflow and Feature Store. This includes team collaboration and large scale enterprise data annotation.

Diffgram is open source, making it one of the only “modern” systems to be fully open. This means a community of people are working together to build the best possible software.

Diffgram has the most integrations, with deep connections for AWS, Azure, GCP and more. The team blogs publicly, including sharing lessons learned on architecture and end-to-end testing.

Diffgram has the most comprehensive vision: everything training data. That spans all media types, managing training data ML processes, human workflow, exploring data, and more.

Integrations

HuggingFace, Deepchecks, VertexAI, Sagemaker, AWS, GCP, MinIO, Azure, Custom Integrations (write your own).

Pros

  1. Flexible deploy and many integrations — Diffgram runs anywhere.
  2. Scale every aspect — from volume of data, to number of supervisors, to ML speed up approaches.
  3. Fully featured — ‘batteries included’.

Diffgram is 5 Star Rated on G2 Crowd

Selected Diffgram Reviews

Diffgram Features

Diffgram is Best for Images and Video Annotation

Diffgram supports all the spatial types including: Quadratic Curves, Cuboids, Segmentation, Box, Polygons, Lines, Keypoints, Classification Tags.

Diffgram has the most attribute support including: Radio buttons, Multiple select, Date pickers, Sliders, Conditional logic, Directional Vectors.

Example of annotation types and annotation attributes
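
To make the idea of spatial types plus attributes with conditional logic concrete, here is a minimal sketch in Python. It is not Diffgram's actual schema or API. The class names (`AttributeSpec`, `Instance`), the field names, and the `show_if` rule are all hypothetical, chosen only to illustrate how an attribute can be shown or hidden based on an earlier answer.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class AttributeSpec:
    """Hypothetical attribute definition (not Diffgram's real schema)."""
    name: str
    kind: str  # e.g. "radio", "multi_select", "date", "slider"
    options: List[str] = field(default_factory=list)
    # Conditional logic: show this attribute only when every listed
    # attribute already has the given answer.
    show_if: Optional[Dict[str, str]] = None


@dataclass
class Instance:
    """Hypothetical annotation instance: a label, a spatial type, and answers."""
    label: str
    spatial_type: str  # e.g. "box", "polygon", "cuboid", "keypoints"
    points: List[List[float]]
    attributes: Dict[str, Any] = field(default_factory=dict)


def visible_attributes(specs: List[AttributeSpec], answers: Dict[str, Any]) -> List[str]:
    """Return the names of attributes whose conditions are met by the answers."""
    visible = []
    for spec in specs:
        if spec.show_if is None or all(
            answers.get(key) == value for key, value in spec.show_if.items()
        ):
            visible.append(spec.name)
    return visible


specs = [
    AttributeSpec("occluded", "radio", ["yes", "no"]),
    AttributeSpec("occlusion_percent", "slider", show_if={"occluded": "yes"}),
]
car = Instance("car", "box", [[10, 20], [110, 80]], {"occluded": "yes"})
print(visible_attributes(specs, car.attributes))
```

With `occluded` set to "yes" both attributes are visible. Set it to "no" and only `occluded` remains.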

Diffgram is Best for Semantic Segmentation

With fully integrated auto bordering, easy “draw over”, and integrated auto annotation (for example for people).

Semantic Segmentation — AutoBordering Example

Diffgram is Easiest to integrate with customizable Webhooks

Plus Webhooks, Notifications, Reporting and more.
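
For readers new to webhooks, here is a rough sketch of what the receiving side of a webhook integration often looks like. Everything specific here is an assumption for illustration: the shared secret, the HMAC-SHA256 signature scheme, the `task_completed` event name, and the `file_id` field are hypothetical and are not taken from Diffgram's documentation. Check the docs for the real payload format.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret agreed between sender and receiver.
SECRET = b"example-shared-secret"


def sign(body: bytes, secret: bytes = SECRET) -> str:
    """Compute an HMAC-SHA256 hex digest over the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def handle_webhook(body: bytes, signature: str) -> str:
    """Verify the signature, then act on the (hypothetical) event type."""
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(sign(body), signature):
        return "rejected"
    event = json.loads(body)
    if event.get("event") == "task_completed":
        return f"queued file {event['file_id']} for export"
    return "ignored"


body = json.dumps({"event": "task_completed", "file_id": 42}).encode()
print(handle_webhook(body, sign(body)))
```

In a real deployment this function would sit behind an HTTP endpoint. The point of the sketch is only the verify-then-dispatch pattern.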

Overall Diffgram has one of the most balanced and complete feature sets. In general all features work as expected. There are few “golden paths” — in many parts of the system a user can accomplish their goal in many ways.

Diffgram is Best for Perpetual Data Improvement (Data Streaming)

Technical Example of Data Flow for Single Pipeline

Data Science teams often have the goal of perpetually improving training data. Diffgram is the only vendor to have a patent-pending method to achieve perpetual model improvement. I have been innovating in this area since 2018.

It includes automatically streaming files as they are completed. This means you can continually improve your models and create training data pipelines.
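
Here is a small sketch of the general streaming idea, assuming a feed of file events with a `status` field. The event shape and the function names are hypothetical and this is not the patent-pending method mentioned above. It only shows how completed files can flow into training batches as they arrive, instead of waiting for a full export.

```python
from typing import Any, Dict, Iterable, Iterator, List


def completed_files(events: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield only the files whose annotation is marked complete."""
    for event in events:
        if event["status"] == "complete":
            yield event


def batches(stream: Iterable[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Group a stream into fixed-size batches, flushing any remainder at the end."""
    batch: List[Dict[str, Any]] = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Hypothetical event feed; in practice this would come from the platform.
events = [
    {"file_id": 1, "status": "complete"},
    {"file_id": 2, "status": "in_progress"},
    {"file_id": 3, "status": "complete"},
    {"file_id": 4, "status": "complete"},
]

for batch in batches(completed_files(events), size=2):
    # A training step or export would go here.
    print([f["file_id"] for f in batch])
```

Because both functions are generators, nothing is buffered beyond the current batch, which is what makes the approach suitable for a continuous pipeline.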

Diffgram has the Best Integrated Issue tracking, Resolution and Quality Assurance:

Diffgram Runs Anywhere

Diffgram offers a 2-minute Docker Compose install and a complete production install on Kubernetes (Helm). See an example step by step AWS guide.

Diffgram has the Deepest Support for AWS, GCP, MinIO, and Azure

You can choose any cloud vendor as the storage provider for Diffgram itself, and import and export from any cloud vendor.

Essentially this means you can mix and match as desired, and it works with all of them right out of the box. For example — you can have your system of record on Azure, but export to Google for training. Or have all on one provider.
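
As a toy illustration of the mix-and-match idea, the sketch below models a storage plan with one provider as the system of record and another as the export target. The `StoragePlan` class and the provider keys are hypothetical names for this example, not Diffgram configuration options.

```python
from dataclasses import dataclass

# Providers named in this article; the lowercase keys are an assumption.
SUPPORTED = {"aws", "gcp", "azure", "minio"}


@dataclass(frozen=True)
class StoragePlan:
    """Hypothetical pairing of a system-of-record store and an export target."""
    system_of_record: str
    export_target: str

    def __post_init__(self) -> None:
        # Validate both providers up front so mistakes fail early.
        for provider in (self.system_of_record, self.export_target):
            if provider not in SUPPORTED:
                raise ValueError(f"unsupported provider: {provider}")

    def is_cross_cloud(self) -> bool:
        """True when data is stored on one provider and exported to another."""
        return self.system_of_record != self.export_target


plan = StoragePlan(system_of_record="azure", export_target="gcp")
print(plan.is_cross_cloud())
```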

Diffgram has the most Scalable Speed Approach with Userscripts Framework

With Diffgram Userscripts Framework you get the latest AI advances instantly.

Your own team can customize and improve it using built-in primitives — all in normal JavaScript. Train custom models for speed up (automatically), or use your own pre-trained models. All for free — it runs on the local machine.

Essentially — you get the best of the 10x-100x speed up promised by other vendors — for free. Learn more about why userscripts.

See all Examples

Diffgram’s vision is:

  1. Application: Support all popular media types for raw data; all popular schema, label, and attribute needs; and all annotation assist speed up approaches
  2. Support all popular training data management and organizational needs
  3. Integrate with all popular 3rd party applications and related offerings
  4. Support modification of source code
  5. Run on any hardware, any cloud, and anywhere

Diffgram History

Diffgram went open source in April 2021. Since then the response has been tremendously positive, with absolutely stunning growth as shown by the stars and downloads.

Diffgram Enterprise

Label Studio

Label Studio is based around a configurable interface.

Important things to keep in mind with Label Studio (Heartex)

Diffgram is 100% open source. Label Studio is not.

All Diffgram code is open source. Label Studio removes several important features out of the box, like persistent storage (and all the purple icons). Why would access control and persistent storage, two of the most basic possible things, be closed source?

The breadth of Label Studio’s supported data types is impressive — however the depth is often a little disappointing. The high level demo looks good — but many types of deeper use cases fall short. Essentially there’s a little less polish — actions sometimes behave unexpectedly.

Their vision appears to be to support all types of data and many types of annotations.

It is not as clear what their data management vision is.

Or how that interacts with their open source product vs their paid product.

It’s not clear what integrations they support, or how to get it running on Docker and Kubernetes.

Their speed up approach is unclear — it appears you need to be an expert in the specific area.

Example of Label Studio Interface

Intel CVAT (Computer Vision Annotation Tool)

Vital stats: Intel funded team. Open source.

CVAT is Intel’s attempt to build a moat around their OpenVino offering and sell more hardware. While I’m sure the maintainers are smart people and work hard — that’s a hard shadow to work from. Also this is nothing against Intel — it’s just an odd mix of concerns to have a hardware giant in this space.

CVAT is designed only for full time annotators. In general, subject matter experts who are new to annotation find it difficult, unintuitive and hard to use.

CVAT Caveats

Older UI: CVAT is, naturally, focused on computer vision exclusively (not text, NLP, etc.). In general, CVAT appears more concerned with meeting specific use case needs than with shipping an all-around, new-user-friendly product.

Missing many features: CVAT has the smallest backend, so many data management features are missing. You will likely need another tool to do data management.

Intel favoritism for Speed Up Approaches? CVAT is backed by Intel and appears to overly favor the OpenVino (Intel) approaches. CVAT’s approach to model speed up is fairly heavy and opinionated.

SuperAnnotate

Vital stats: Closed source. Based in Armenia and founded in 2018.

Focus area: Image segmentation

SuperAnnotate is very focused on images (no video support). While video support is in the pipeline, at the moment it’s really an image annotation tool.

This hyper focus has benefits and drawbacks. It means your team will certainly need more than one tool to cover all needed use cases. The main benefit is that they have some very interesting speed up tools.

Closed source:

While part of their annotation studio is open, the secret sauce and the majority of the back end code are closed.

My take on SuperAnnotate:

If you are serious about image segmentation and don’t need video, they are worth a look! Keep in mind you will likely need to integrate with other offerings to get the degree of data management needed. I think Diffgram’s Userscripts Framework is a more scalable approach to faster annotation and custom quality control.

Datasaur

Vital stats: Closed source. Based in the USA and founded in 2018.

Focus area: Text

Datasaur is exclusively a text annotation tool. Their ease of use and modern product take is great. Datasaur has a deep integration with Diffgram and can be used seamlessly for the text annotation part.

Example use cases covered:

  1. Transcribing and classifying medical symptoms and diagnoses from audio recordings of physician encounters
  2. Scanning scientific journals and academic papers for promising new medical treatments
  3. Classifying and labeling medical claims and billing codes

Caveats

Being a newer offering, some features are still in development.

Closed source.

As with other tools mentioned, you will need to integrate it with others for full data management and for other data types.

Vision

I believe they are interested in expanding to other types but it’s not quite clear yet.

Labelbox

Vital stats: Closed source. Based in San Francisco. Founded 2017

Update April 2022:

NEW -> Labelbox vs Diffgram Comparison

Labelbox supports images. They claim to have “…full natural language processing…”, and they have a video interface that has recently come out of beta.

Caveats

Is Downtime a Regular Occurrence With Labelbox? Labelbox appears to have outsourced core technical items, such as login (Auth0).

For example, the entire service was recently down (inaccessible login), something that’s not possible with Diffgram’s fully integrated login.

Labelbox has Low Ratings on Glassdoor

Labelbox is the lowest Glassdoor rated company in the competitive set.¹ According to LinkedIn, the median tenure is under 1 year.²

The primary technical Labelbox founder said “Goodbye” halfway through 2020.

Where is Labelbox’s technical vision? Who is the technical leader at the company?

Labelbox is Closed Source

Labelbox initially claimed the product was open source, but then promptly removed most of the code — tying it back to their closed source offering.

Labelbox Summary

While Labelbox deserves kudos for being early movers in this space, they have been overtaken by stronger open source alternatives (Diffgram, LabelStudio), and new closed source options (SuperAnnotate, Datasaur) that are moving faster.

Labelbox Alternatives

All of these alternatives have similar features and enterprise support levels, and some have many more features that Labelbox doesn’t offer.

  1. Diffgram — Open source
  2. LabelStudio — Half open
  3. Datasaur (Text)

Other options

  1. V7 Labs — Closed Source
  2. BasicAI — Closed Source
  3. SuperbAI — Closed Source

I have also extensively researched and considered (and, for most, personally tried) the following alternatives: AWS Sagemaker GroundTruth, Alegion, ClarifaiAI, Hive, Playment, Scale AI, Appen, Lionbridge, Cloud Factory, SixGill, iMerit, Kili-Technology, MindySupport, IsaHit, LinkedAI, Edgecase.ai, HastyAI, Dataloop, Neurala, Keymakr, Prodigy, Tagtog, Google AI Data Platform, Vott, UDT, Yedda, Lost and a few others. Keep in mind that some of these are outsourcing firms.

Summary

Buying guide below

I have covered Diffgram, LabelStudio, CVAT, SuperAnnotate, and Datasaur as the 5 best annotation tools for 2022. This review was focused on pure software.

While I respectfully realize there are many, many talented people working at all of these companies, I truly believe Diffgram, LabelStudio, CVAT, SuperAnnotate, and Datasaur stand out as having the best overall approach and vision, with both Diffgram and LabelStudio being the best overall choices because of their open source nature.

Annotation Data Tool Buying Guide (2022)

Are you exclusively doing images (not video)? This can include object detection, semantic segmentation and more.

Best Annotation tool for Images:

  1. Diffgram
  2. SuperAnnotate
  3. CVAT

Are you exclusively doing video?

Best Annotation tool for Videos:

  1. Diffgram
  2. CVAT
  3. Labelstudio

Are you using text data?

Best Annotation tool for Text (NLP):

  1. Datasaur
  2. Diffgram (new in April 2022: Diffgram launches Text Annotation)
  3. Labelstudio

Are you working with a large team, or multiple teams?

Best for Team Collaboration:

  1. Diffgram — Enterprise Edition
  2. Heartex (Enterprise Labelstudio)

Are you looking for the best overall tool no matter the cost?

Best Overall Annotation Tools:

  1. Diffgram — Enterprise Edition
  2. Heartex (Enterprise Labelstudio)

Are you looking to avoid vendor lock-in?

Best Open Source Annotation Tools:

  1. Diffgram
  2. Labelstudio
  3. CVAT

Do you need version control?

Best version controlled Annotation Systems:

  1. Diffgram Versioning

Do you need speed up approaches?

Fastest Annotation Tools:

  1. Diffgram Userscripts Framework (Examples)
  2. CVAT
  3. Labelstudio

Do you need streaming data or data pipelines? Do you want to perpetually improve your AI Models?

Best Data Annotation for Defense and Security-Critical Use

  1. Diffgram Secure — Run on docker or your Kubernetes cluster
  2. CVAT
  3. Labelstudio

Disclaimers

1:

Labelbox is the lowest Glassdoor rated company in the competitive set, of those that have ratings, at the time of writing.

Folder with proof screenshots.

Of those companies that have ratings, the ratings are as follows:

  1. Diffgram: 5
  2. Alegion: 4.4
  3. CVAT (Intel): 4.2
  4. Amazon: 4.2
  5. Playment: 4.1
  6. Clarifai: 3.8
  7. Labelbox: 3.5
  • Heartex: not yet rated
  • SuperAnnotate: not yet rated
  • Datasaur: not yet rated

2:

According to LinkedIn, Labelbox median tenure is under 1 year.
