CISO Training Data Brief

Annotated Data Examples
  • Exposure of raw data (such as images, videos, text, and other multimedia artifacts)
  • Data Access: annotators (sometimes in foreign countries), contractors, data scientists
  • Data Movement: is exported data lying around on people’s local machines? Are there unmanaged copies of the data?
Your Data, Your Crown Jewels
  • High-value data controls: “completed” or “high-value” training data sets (think crown jewels)
  • Threat landscape: as a new area, the threat landscape is poorly defined and still evolving
  • APTs (Advanced Persistent Threats)
  • Red/Blue/Purple team tests
  • DevSecOps: let’s call it DevSecMLOps. Yep, MLOps needs security too!
  • Dependency chains and Software Bill of Materials
  • Ransomware: what would happen if your training data were maliciously encrypted or otherwise made inaccessible?

Raw Data

One of the first big challenges is the raw data itself. It is especially difficult because the data may be subject to hard compliance requirements or contractual obligations.

Where is your Training Data stored? How is it Accessed?

Pass by Reference method

The super short story here is that the data doesn’t move. The Training Data platform (e.g. Diffgram) stores only a reference to the data, for example the blob path, rather than a copy of it. This means all of your existing compliance controls remain in place.
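
To make the pass-by-reference idea concrete, here is a minimal Python sketch. The `BlobReference` type, the `register_by_reference` function, and the bucket and path names are all hypothetical and are not Diffgram’s API. The only point is that the catalog records where the data lives and never reads or copies the bytes.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class BlobReference:
    """Pointer to data that stays in its original bucket (hypothetical type)."""
    storage_provider: str  # assumed values such as "s3", "gcs", "azure"
    bucket: str
    blob_path: str


def register_by_reference(
    catalog: Dict[str, BlobReference], file_id: str, ref: BlobReference
) -> None:
    """Record only the location of the blob; no bytes are read or copied."""
    catalog[file_id] = ref


# Illustrative usage with made-up names.
reference_catalog: Dict[str, BlobReference] = {}
register_by_reference(
    reference_catalog,
    "file-001",
    BlobReference("s3", "example-raw-data", "scans/img_0001.png"),
)
```

Because the entry holds only a path, access to the underlying object is still governed by whatever policies already protect the bucket.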

Dedicated Storage method

Another option with Diffgram is to configure a new, dedicated storage provider of your own.

Data Movement

Another big challenge is that of datasets “lying around” on people’s local machines. Even people with the best of intentions can create unnecessary security risks by doing this.
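
One way to get a first feel for the problem is to scan a machine for files that look like dataset exports. The sketch below is only an illustration: the suffix list is an assumption, the function name `find_local_exports` is hypothetical, and a real audit would need much more than a file-extension check.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List

# Assumed set of export-like file types; adjust to your own environment.
EXPORT_SUFFIXES = {".json", ".csv", ".zip", ".parquet"}


def find_local_exports(
    root: Path, suffixes: Iterable[str] = EXPORT_SUFFIXES
) -> List[Path]:
    """Return files under `root` whose suffix suggests a dataset export."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() in wanted
    )


# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "export.json").write_text("{}")
    (demo_root / "notes.txt").write_text("not an export")
    found_names = [p.name for p in find_local_exports(demo_root)]
```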

Is your data swimming in the Atlantic?

One Set of Controls

By using Diffgram Training Data software, your Data Scientists create a unified schema: a single, shared view of the meaning of the data within your installation of Diffgram.

Catalog Over File Exports

The default in most other systems is to have a “File” level export. This usually means a file ends up on a data scientist’s local computer.
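
The alternative to a file export is to query the catalog and stream matching records straight into the consumer. The following Python sketch shows the general shape of that pattern using an in-memory list. The `stream_records` function and the record fields are hypothetical and do not represent Diffgram’s query interface.

```python
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, object]


def stream_records(
    records: Iterable[Record], predicate: Callable[[Record], bool]
) -> Iterator[Record]:
    """Yield matching records one at a time; nothing is written to disk."""
    for record in records:
        if predicate(record):
            yield record


# Stand-in for a catalog; a real one would sit behind access controls.
demo_catalog = [
    {"id": 1, "label": "car", "blob_path": "images/1.jpg"},
    {"id": 2, "label": "tree", "blob_path": "images/2.jpg"},
    {"id": 3, "label": "car", "blob_path": "images/3.jpg"},
]

car_ids = [r["id"] for r in stream_records(demo_catalog, lambda r: r["label"] == "car")]
```

Since the consumer pulls records on demand, there is no exported file left behind to track down later.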

Data Catalog (Query and Streaming)

Installation and Integrations

Let’s face it. The best security is the security you control. Diffgram installs on your own hardware, your own cluster.

DevSecMLOps meet Workflow

The development of ML models introduces many challenges for maintaining the security of data.

Diffgram Workflow Example


At the time of writing, other options, like LabelBox, don’t have a published ransomware plan.
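
One small ingredient of any ransomware plan is noticing quickly that files have changed. As a rough illustration, not a description of how Diffgram or any other product works, the sketch below builds a SHA-256 manifest of a directory and reports files whose contents differ from the baseline. The function names `build_manifest` and `changed_files` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = digest
    return manifest


def changed_files(baseline: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return baseline files that are now missing or have a different digest."""
    return sorted(name for name in baseline if current.get(name) != baseline[name])


# Demonstration: overwrite one file to simulate malicious encryption.
with tempfile.TemporaryDirectory() as tmp:
    data_root = Path(tmp)
    (data_root / "a.txt").write_bytes(b"label: car")
    (data_root / "b.txt").write_bytes(b"label: tree")
    baseline = build_manifest(data_root)
    (data_root / "b.txt").write_bytes(b"\x00encrypted\x00")
    tampered = changed_files(baseline, build_manifest(data_root))
```

A manifest like this only detects change. Recovery still depends on having backups that the attacker cannot reach.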

Ransomware example


