Training Data Platforms for Machine Learning & AI (2024)
Hello, my name is Anthony. I’m the author of Training Data for Machine Learning, published by O’Reilly Media, Inc. I’m also the founder of Diffgram. If you are looking for a modern open source platform, I suggest checking out the new Diffgram License v2 (DLv2) and the Contributor License (CL).
This article is about what to know when comparing Training Data Platforms, also known as AI Datastores or Data Labeling platforms. It covers data annotation, labeling, curation, catalog, and tooling for Deep Learning and Open Source AI/ML.
This article covers popular platforms and alternatives like Label Studio, Labelbox, V7 Labs, SuperAnnotate and Diffgram.
Artificial Intelligence (AI) is taking the world by storm. While the technology has limitations, it is maturing rapidly, and it is critical to keep up with its latest trends. Machine Learning (ML) is a common AI approach. Here I walk through the top Training Data platforms. Let’s dive in!
I will cover popular product areas, opinions on the open, partial and closed source options, new Development System concepts, and a more detailed look at specific popular platforms.
- Product Areas: Annotation, Catalog, Workflow
- Open, Partial and Closed Source
- Development and Config-Only
- Popular Platforms
There are three major product areas to be aware of: Annotation, Catalog, and Workflow.
Annotation is the most basic expected feature set. Most of the providers listed here support annotation across multiple media types (image, video, audio, text, geospatial, 3D, etc.). Some providers have a more limited feature set:
- Diffgram, Label Studio, Labelbox cover multiple media types
- CVAT and V7 Labs Darwin are mostly focused on visual
- Datasaur is mostly focused on text
The idea of Catalog is to explore, curate, discover, and use your unstructured data: to search and visualize all of it in one place.
This area has more variance in feature sets than Annotation. All providers have some sort of tiled list view, but Catalog goes beyond that. Catalog is:
- Querying (think SQL, not just filtering)
- Curation (selecting and sampling data)
- Integration with other products (including search, automations, embedding similarity search etc.)
- Sharing selected slices
- Getting data to other ML programs etc.
Diffgram and Labelbox lead the pack here.
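To make the Catalog ideas above more concrete, here is a minimal sketch in plain Python. It is not the API of Diffgram, Labelbox, or any other platform: the `Item` record, the `query`, `sample`, and `most_similar` helpers, and the tiny two-dimensional embeddings are all hypothetical names and toy data chosen for illustration. It shows querying by an arbitrary condition, sampling a slice for curation, and a simple embedding similarity search.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Item:
    """One hypothetical catalog record for an unstructured file."""
    file_id: int
    media_type: str
    labels: list
    embedding: list


def query(items, predicate):
    """Return items matching an arbitrary predicate (a stand-in for a query language)."""
    return [item for item in items if predicate(item)]


def sample(items, k, seed=0):
    """Curation step: draw a reproducible random sample of at most k items."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(items, target, top_n=2):
    """Rank every other item by embedding similarity to the target."""
    others = [item for item in items if item.file_id != target.file_id]
    others.sort(key=lambda item: cosine(item.embedding, target.embedding), reverse=True)
    return others[:top_n]


# Toy catalog; the embeddings are made up for illustration.
catalog = [
    Item(1, "image", ["car"], [1.0, 0.0]),
    Item(2, "image", ["car", "person"], [0.9, 0.1]),
    Item(3, "video", ["person"], [0.0, 1.0]),
    Item(4, "image", [], [0.1, 0.9]),
]

cars = query(catalog, lambda i: i.media_type == "image" and "car" in i.labels)
unlabeled = query(catalog, lambda i: not i.labels)
review_slice = sample(catalog, 2)
neighbors = most_similar(catalog, catalog[0])
```

In a real platform the predicate would typically be a query string and the embeddings would come from a model, but the shape of the operations (query, sample, find similar, hand off a slice) is the same.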
Workflow is about how Training Data interacts with your ML Programs. Workflow cares about things like:
- Interaction with an Ecosystem of tech
- MLOps, Modeling, and Programs
- Training Data Orchestration
- Human task workflow
- Surfacing your Training Data Processes
Workflow products have some of the most divergent opinions.
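As one illustration of the human task workflow item above, here is a minimal sketch of a task that moves through annotation and review before it is released to downstream ML programs. The status names, the `ALLOWED` transition table, and the `Task` and `export_ready` names are assumptions made for this example, not the workflow model of any specific platform.

```python
from enum import Enum


class TaskStatus(Enum):
    AVAILABLE = "available"
    IN_PROGRESS = "in_progress"
    IN_REVIEW = "in_review"
    COMPLETE = "complete"


# Hypothetical transition table: which status may follow which.
ALLOWED = {
    TaskStatus.AVAILABLE: {TaskStatus.IN_PROGRESS},
    TaskStatus.IN_PROGRESS: {TaskStatus.IN_REVIEW},
    # A reviewer can approve the task or send it back to the annotator.
    TaskStatus.IN_REVIEW: {TaskStatus.COMPLETE, TaskStatus.IN_PROGRESS},
    TaskStatus.COMPLETE: set(),
}


class Task:
    """A single unit of human annotation work tied to one file."""

    def __init__(self, file_id):
        self.file_id = file_id
        self.status = TaskStatus.AVAILABLE
        self.history = [self.status]  # kept so the process can be surfaced later

    def move_to(self, new_status):
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"cannot go from {self.status.value} to {new_status.value}"
            )
        self.status = new_status
        self.history.append(new_status)


def export_ready(tasks):
    """Only completed tasks are handed to downstream ML programs."""
    return [t.file_id for t in tasks if t.status is TaskStatus.COMPLETE]


tasks = [Task(1), Task(2)]
for status in (TaskStatus.IN_PROGRESS, TaskStatus.IN_REVIEW, TaskStatus.COMPLETE):
    tasks[0].move_to(status)
tasks[1].move_to(TaskStatus.IN_PROGRESS)
```

Where platforms tend to differ is in how much of this table you can change, and how the result connects to orchestration and modeling.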
Closed, Partial, and Open Source
Closed, Partial, and Open Source are the three major options. This refers to how much of the platform code is available in the open. All commercial providers have various license terms, so open source does not always mean free. Instead it is more of an indication of the level of flexibility, community, and longevity of the tech.
My opinion is that more open is better, so think of closed, partial, and open source as a “good, better, best” progression. Open source tends to mean higher quality software because more people test it, review it, and contribute to it. Communities form around open source software, making long term support more reliable.
Labelbox, SuperAnnotate, Dataloop, V7 Labs Darwin, and Datasaur.ai are mostly closed source. They may have certain integrations or small samples that are open source, but all of the core platform is closed source.
Label Studio (Heartex) is partially open source. “Label Studio” code is open source, while “Heartex” code is closed source.
Development and Config-Only
There is a growing difference between providers that offer a “Development System” and those whose systems can only be configured within known bounds.
The main driving force here is that, in such a large and expanding area, there is often no single product that can truly meet all of your needs. Some providers handle this by forcing custom consulting, or by expecting you to adapt your process to their specific world view. Config-Only systems often end up becoming a constraint, forcing a team to do something a specific way or use a specific approach.
The Development System concept provides a greater level of control to your team. In general a Development System has these parts:
- A Baseline Platform
- Frameworks and Components
- Ability to Develop Novel Software
Diffgram is leading the way in this area with the Diffgram Development System.
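The difference between Config-Only and a Development System can be sketched in a few lines. This is a generic illustration of the idea, not how the Diffgram Development System (or any other product) is implemented: `KNOWN_EXPORT_FORMATS`, `configure_export`, `EXPORTERS`, and `register_exporter` are hypothetical names invented for this example.

```python
import json

# Config-Only style: you may only choose among options the vendor anticipated.
KNOWN_EXPORT_FORMATS = {"json", "csv"}


def configure_export(fmt):
    """Accept a setting only if it falls within the known bounds."""
    if fmt not in KNOWN_EXPORT_FORMATS:
        raise ValueError(f"unsupported export format: {fmt}")
    return fmt


# Development System style: a baseline plus a way to add novel components.
EXPORTERS = {}


def register_exporter(name):
    """Decorator that adds a new exporter to the registry under the given name."""
    def decorator(fn):
        EXPORTERS[name] = fn
        return fn
    return decorator


@register_exporter("json")
def to_json(rows):
    """Baseline component that would ship with the platform."""
    return json.dumps(rows)


@register_exporter("label_per_line")
def to_label_lines(rows):
    """Team-specific component that no fixed config list would have included."""
    return "\n".join(row["label"] for row in rows)


rows = [{"label": "car"}, {"label": "person"}]
custom_output = EXPORTERS["label_per_line"](rows)
```

With the first approach, a format outside the known set is simply rejected. With the second, the team writes the missing piece and plugs it into the same platform.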
Here I take a look at popular options and call out some unique differences.
Diffgram was founded in 2018 in Santa Clara, CA, USA.
Diffgram is commercial open source. This means a community of people are working together to build the best possible software.
Diffgram is a complete system — Annotation, Catalog and Workflow. Diffgram has great support for most popular media types including images, video, 3D, text annotation, geospatial, and audio. This includes team collaboration and large scale enterprise data annotation.
Diffgram has a comprehensive feature stack, from support for all media types to managing training data ML processes, human workflow, exploring data, and more.
Vision: To be the world’s best Development System for ML programs.
Label Studio supports a breadth of interface types, but most have a minimal level of depth. Label Studio is mostly focused on the annotation interface only, with minimal Catalog and Workflow support.
CVAT is built by an Intel-funded team and is open source.
CVAT is Intel’s attempt to build a moat around their OpenVINO offering and sell more hardware. While I’m sure the maintainers are smart people and work hard, that’s a hard shadow to work from. This is nothing against Intel; it’s just an odd mix of concerns to have a hardware giant in this space.
CVAT is designed only for full-time annotators. In general, subject matter experts who are new to annotation find it unintuitive and hard to use.
CVAT is, naturally, focused exclusively on computer vision (not text, NLP, etc.). In general, CVAT appears more concerned with meeting specific use case needs than with shipping an all-around, new-user-friendly product.
Missing many modern features: CVAT is primarily a “classic” annotation-only tool. For all intents and purposes it is in a different class from the rest of the modern tools listed here.
Intel favoritism for speed-up approaches? CVAT is backed by Intel and appears to overly favor the OpenVINO (Intel) approaches. CVAT’s approach to model speed-up is fairly heavy and opinionated.
Datasaur is closed source, based in the USA, and was founded in 2018. It is exclusively a text annotation tool.
Labelbox vital stats: Closed source. Based in San Francisco. Founded in 2017.
Labelbox is Closed Source: Labelbox initially claimed the product was open source, but then promptly removed most of the code — tying it back to their closed source offering.
- V7 Labs — Closed Source
- BasicAI — Closed Source
- SuperbAI — Closed Source
I have also extensively researched and considered the following alternatives, including (for most) personally trying the software: AWS Sagemaker GroundTruth, Alegion, ClarifaiAI, Hive, Playment, Scale AI, Appen, Lionbridge, Cloud Factory, SixGill, iMerit, Kili-Technology, MindySupport, IsaHit, LinkedAI, Edgecase.ai, HastyAI, Dataloop, Neurala, Keymakr, Prodigy, Tagtog, Google AI Data Platform, Vott, UDT, Yedda, Lost and a few others. Keep in mind that some of these are outsourcing firms.