Discover and catalog sensitive data

The Sensitive Data Discovery collector (DWCC-SDD) uses machine learning to automatically detect and tag sensitive data in the system, such as Personally Identifiable Information (PII), Protected Health Information (PHI), and Payment Card Industry (PCI) data. Once the data is detected and tagged, users can see the tags as they work with the data and be aware of its sensitivity.

The collector can scan any table that can be queried with SQL and has at least one row of data. It is not limited by the number of tables. Note that tables are scanned at the column level.

Note

DWCC-SDD does not scan live connected datasets or virtualized data assets.

DWCC-SDD internally uses Private AI to process and tag the data. The Private AI virtual machine runs in the customer's environment to discover the sensitive data. The collector runs on sample data from the data source; that sample data is never stored in the application and is never shared with Private AI.

Sensitive_Data_Discovery_diagram.png
Key Features

There are four key features of Sensitive Data Discovery:

  • Scan: Scan your different data sources. The tool is pre-trained using machine learning to identify 30+ sensitive data types out-of-the-box.

  • Classify: Differentiate between sensitive data types and define the rules for how to work with each specific data type. Confidential, for example, may have a specific meaning within your organization. Applying the Confidential classification allows you to apply your own business logic to the data.

  • Take action: All information is fully reportable. You can create a report that shows a tabular view of all of your assets and the sensitive data types and classifications that are applied to them.

  • Integrate: You can then export reports to your favorite BI tool to leverage as part of a broader system or initiative.

Scanned entity types

DWCC-SDD scans the following entity types, a subset of the entity types supported by Private AI. For descriptions of these fields, see the list of Supported Entity Types in the Private AI documentation.

  • LOCATION_ADDRESS

  • DATE

  • EMAIL_ADDRESS

  • SSN

  • NAME

  • PASSPORT_NUMBER

  • NUMERICAL_PII

  • ORGANIZATION

  • OCCUPATION

  • ORIGIN

  • PASSWORD

  • PHYSICAL_ATTRIBUTE

  • POLITICAL_AFFILIATION

  • RELIGION

  • TIME

  • URL

  • ZODIAC_SIGN

  • CREDIT_CARD

  • CREDIT_CARD_EXPIRATION

  • CVV

  • BANK_ACCOUNT

  • ROUTING_NUMBER

  • ID_NUMBER

  • IP_ADDRESS

  • USERNAME

  • HEALTHCARE_NUMBER

  • BLOOD_TYPE

  • MEDICAL_CONDITION

  • DRUG

  • INJURY

  • MEDICAL_PROCESS

  • MEDICAL_OTHER

  • MEDICAL_STATISTICS

Supported Sources

DWCC-SDD can be run on the following sources:

  • Amazon Athena

  • Google BigQuery

  • Snowflake

Supported languages for the content processed from the source:

  • The collector can be run on content in various languages. For the complete list of languages supported, see the Private AI documentation.

Important things to note
  • The DWCC-SDD collector does not work on datasets that have already been imported and virtualized into the system. It should be run alongside the regular DWCC collector to tag the data as it comes into the application.

  • DWCC-SDD should be re-run every time new data is brought into the application using the regular DWCC collector.

Cataloging sensitive data with DWCC-SDD

There are three parts to using the DWCC-SDD collector to catalog the sensitive data in your data source:

  1. Set up the required resources.

  2. Run the collectors.

  3. Create the sensitive data catalog file.

The first time you run the SDD collector, you will need to do all of these steps in order. Once you have completed your configuration, however, you can set the collectors to run and upload your sensitive data catalog file automatically as needed.

Setting up the required resources

The first part of running the collector is to gather all the required resources. You will need:

  • To request the sensitive data files from support.

  • To create a ddw-sensitive data dataset.

  • Docker installed on the machine from which you wish to run the collector. The DWCC-SDD collector only runs with Docker. You can find instructions for using Docker with the data.world Collector here. A quick check that Docker is ready is shown after this list.
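
If you want to confirm that Docker is installed and the daemon is running before you begin, a quick sanity check from a terminal (a minimal sketch; any recent Docker version should work) is:

# Confirm the Docker CLI is installed and the daemon is reachable
docker --version
docker info > /dev/null && echo "Docker daemon is running"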

Creating the ddw-sensitive data dataset

While you are waiting for support to send you the sensitive data files, create a dataset in your organization called ddw-sensitive data. It will be used to hold all of the sensitive data configuration files that you will use with your data source(s).

SDD-create_dataset_1.png

If you need help setting up your dataset, detailed instructions can be found here. Create the dataset with your organization as the owner and set it to private:

SDD-create_dataset_2.png

Note

You can name this dataset anything you would like; you can even use an existing dataset like ddw-catalogs. However, we recommend creating a dedicated ddw-sensitive data dataset to keep the workflow clean and separate.

Preparing the files from support

The three files you will get from support are:

  • sdd-shims.ttl - A file that extends the functionality of DWCC-SDD. You will use this file as is with no changes.

  • sdd-metadata-profile.ttl - A special metadata profile that is run with DWCC-SDD. It does NOT replace the metadata-profile.ttl in the ddw-catalogs dataset that you use with your regular Collector.

  • DWCC-SDD - The SDD catalog collector which is run in conjunction with your regular Collector to augment your catalog file with information about sensitive data. It will be in a tar.gz file that you will need to unzip and load into Docker. The filename will be dwcc-sdd-x.y.tar.gz where x.y is the current version of the file.

To prepare the files:

  1. Open the sdd-metadata-profile.ttl and replace "collector warehouse" in lines one and seven of the file with your org name (a command-line sketch for this edit follows these steps):

    sdd-metadata-profile_1.png

    If you need help finding your org name, it's on your organization overview page and also in the URL of any resource accessed from the organization:

    org_name.png
  2. Verify that the dataset where you store your existing metadata profile is ddw-catalogs. If it is not, also change the dataset name in lines one and seven:

    sdd-metadata-profile_3.png
  3. Upload the sdd-metadata-profile.ttl and the sdd-shims.ttl to the ddw-sensitive data dataset.
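
If you prefer to make the edit from step 1 on the command line, a hypothetical one-liner is shown below. It assumes GNU sed, that the placeholder text appears exactly as "collector warehouse", and that my-org-name stands in for your actual org name:

# Replace the placeholder org name on lines one and seven of the file
# (my-org-name is a placeholder for your actual org name)
sed -i '1s/collector warehouse/my-org-name/; 7s/collector warehouse/my-org-name/' sdd-metadata-profile.ttl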

The DWCC-SDD collector from support comes as dwcc-sdd-x.y.tar.gz, where x.y is the current version of the file. To use it:

  1. Upload the dwcc-sdd-x.y.tar.gz file to the directory from which you wish to run the collector.

  2. Open a terminal window in the directory where you uploaded the file, then unzip it and load it into Docker with the command gunzip dwcc-sdd-x.y.tar.gz && docker load -i dwcc-sdd-x.y.tar (a quick verification is shown after these steps).

    Important

    You will need to replace x.y with the correct version in the filename.

    SDD_unzip_for_Docker.png
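
To confirm that the image loaded successfully, you can list your Docker images. This is a minimal check; the exact repository name and tag depend on how support packaged the image:

# The list should include a dwcc-sdd entry
docker image ls | grep -i dwcc-sdd
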
Running the collectors

Each time you catalog your sensitive data, you will run both your regular data.world Collector and the DWCC-SDD collector. The order in which you run them is not important; they just both need to be run to keep everything in sync. If you need help running the data.world Collector for your data source, instructions are available from the links in Currently supported metadata collector sources.

The DWCC-SDD collector can be run in all the same ways as the regular Collector. For the purposes of this article we will be running the command from a CLI. The prerequisites for running the SDD collector are the same as for the regular collector with one exception:

  • Docker needs at least 8 GB of memory and 2 GB of swap space allocated to it. We recommend allocating 16-32 GB of memory if possible.

Insufficient memory will not cause the collector to fail to run; you will just not get the correct output. If you use Docker Desktop, you can check your resource allocation in the dashboard on the Resources tab under Settings:

SDD_Docker_resources.png

More information on Docker and memory usage can be found here.
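
If you are not running Docker Desktop, you can also check the memory available to the Docker daemon from the command line. A minimal sketch (docker info reports the total in bytes):

# Total memory available to the Docker daemon, reported in bytes
docker info --format '{{.MemTotal}}'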

Command examples for Snowflake

The example commands below are for a Snowflake database. See the basic example command for your data source on its data.world Collector page (links to the individual pages are in Currently supported metadata collector sources), and use it as the basis for your DWCC-SDD collector command. The basic command for the Snowflake collector, from the data.world Collector Snowflake page for Docker, looks like this with example data:

docker run -it --rm --mount type=bind,source=/tmp,target=/dwcc-output \
--mount type=bind,source=/tmp,target=/app/log datadotworld/dwcc:2.77 \
catalog-snowflake -A -a ddw-doccorp -d NIGHTOWL_DB -n snowflake -o "/dwcc-output" \
-P mypassword -r PUBLIC -s https://cu12345.snowflakecomputing.com -u Demobob

For the command to work with the DWCC-SDD collector:

  • Change the location and name of the collector - in our example from datadotworld/dwcc:2.77 to dwcc-sdd:latest

  • Catalog a specific schema instead of all schemas - in our example from -A to -S NIGHTOWL_SCHEMA

  • Change the name of the catalog - in our example from -n snowflake to -n snowflake-sdd

    Warning

    If you do not change the name of the catalog and it is written to the same directory as your regular Collector catalog, it will overwrite the existing file.

  • If you used https:// at the beginning of your Snowflake server name, remove it - in our example from -s https://cu12345.snowflakecomputing.com to -s cu12345.snowflakecomputing.com

  • The port parameter is required - in our example we added -p 443

With all of these changes applied, the DWCC-SDD collector command for Snowflake from the example above is:

docker run -it --rm --mount type=bind,source=/tmp,target=/dwcc-output \
--mount type=bind,source=/tmp,target=/app/log dwcc-sdd:latest \
catalog-snowflake -S NIGHTOWL_SCHEMA -a ddw-doccorp -d NIGHTOWL_DB -n snowflake-sdd -o "/dwcc-output" \
-P mypassword -r PUBLIC -s cu12345.snowflakecomputing.com -p 443 -u Demobob
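
Because you must run both collectors each time you catalog your sensitive data, you may find it convenient to wrap the two commands in a small shell script. The sketch below simply reuses the illustrative values from this article (server, credentials, schema, and image versions are all placeholders); adjust them for your environment:

#!/usr/bin/env bash
# Hypothetical wrapper: run the regular Collector and the DWCC-SDD
# collector back to back with shared example values.
set -euo pipefail

OUTPUT_DIR=/tmp                           # host directory for catalog output and logs
SERVER=cu12345.snowflakecomputing.com     # Snowflake server (no https:// for the SDD run)
ORG=ddw-doccorp                           # your data.world org name
DB=NIGHTOWL_DB
SCHEMA=NIGHTOWL_SCHEMA
DW_USER=Demobob
DW_PASSWORD=mypassword

# Regular data.world Collector: all schemas, catalog named "snowflake"
docker run -it --rm \
  --mount type=bind,source="$OUTPUT_DIR",target=/dwcc-output \
  --mount type=bind,source="$OUTPUT_DIR",target=/app/log \
  datadotworld/dwcc:2.77 catalog-snowflake -A -a "$ORG" -d "$DB" \
  -n snowflake -o "/dwcc-output" -P "$DW_PASSWORD" -r PUBLIC \
  -s "https://$SERVER" -u "$DW_USER"

# DWCC-SDD collector: one schema, catalog named "snowflake-sdd", port required
docker run -it --rm \
  --mount type=bind,source="$OUTPUT_DIR",target=/dwcc-output \
  --mount type=bind,source="$OUTPUT_DIR",target=/app/log \
  dwcc-sdd:latest catalog-snowflake -S "$SCHEMA" -a "$ORG" -d "$DB" \
  -n snowflake-sdd -o "/dwcc-output" -P "$DW_PASSWORD" -r PUBLIC \
  -s "$SERVER" -p 443 -u "$DW_USER"
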
After running both collectors

Once you have run both the regular Collector and the DWCC-SDD collector you will have two dwec.ttl files in the directory you specified as your output directory, and they will be named as follows:

  • <database_name>.<catalog_name>.dwec.ttl for the regular Collector output (in our example NIGHTOWL_DB.snowflake.dwec.ttl)

  • <database_name>.<catalog_name>.dwec.ttl for the DWCC-SDD output (in our example NIGHTOWL_DB.snowflake-sdd.dwec.ttl)

Note

If you used the same catalog name for each command you will only have one file for whichever collector you ran last.

SDD_dwec_ttl_files.png
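
With the example values above and /tmp as the output directory, listing the directory is a quick way to confirm both files were produced (a sketch assuming the bind mounts from the earlier commands):

ls /tmp/*.dwec.ttl
# /tmp/NIGHTOWL_DB.snowflake.dwec.ttl
# /tmp/NIGHTOWL_DB.snowflake-sdd.dwec.ttl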

Upload both of these files to your ddw-sensitive data dataset.

Creating a query to build a sensitive data catalog

To create a catalog of the regular and sensitive metadata for your data source you will create a SPARQL query that reads from the four .ttl files that are now in your ddw-sensitive data dataset:

Special.png

To create the query:

  1. Open the ddw-sensitive data dataset

    SDD_explore_dataset.png
  2. Create a new SPARQL query

    SDD-new_query.png
  3. Delete the default prefix from the query:

    SDD_delete_default_prefix.png
  4. Copy and paste the example query below into the query pane:

    PREFIX : <https://ddw-doccorp.linked.data.world/d/ddw-catalogs/>
    PREFIX file: <https://ddw-doccorp.linked.data.world/d/ddw-sensitive-data/file/>
    
    CONSTRUCT { ?s ?p ?o }
    
    FROM file:NIGHTOWL_DB.snowflake.dwec.ttl
    FROM file:NIGHTOWL_DB.snowflake-sdd.dwec.ttl
    FROM file:sdd-shims.ttl
    FROM file:sdd-metadata-profile.ttl
    
    WHERE {
        {
            ?s dwec:hasSensitiveDataClassification ?c .
            {
                ?c dwec:hasSensitiveDataClassificationType ?ct .
                ?ct rdfs:label ?label .
                bind(:sensitiveDataClassificationType as ?p) .
                bind(?label as ?o) .
            } UNION {
                ?c dwec:hasSensitiveDataClassificationMetric ?cm .
                bind(:sensitiveDataClassificationMetric as ?p) .
                bind(?cm as ?o) .
            }
        } UNION {
            SELECT DISTINCT ?s ?p ?o WHERE {
                ?_col dwec:hasSensitiveDataClassification ?c .
                ?_col dct:isPartOf ?s .
                ?s a dwec:DatabaseTable .
                ?c dwec:hasSensitiveDataClassificationType ?ct .
                ?ct rdfs:label ?label .
                bind(:sensitiveDataClassificationType as ?p) .
                bind(?label as ?o) .
            }
        } UNION {
            ?s a dwec:Catalog .
            ?s prov:wasGeneratedBy ?agent .
            ?agent prov:wasAssociatedWith dwcc:AthenaSddCatalogCollector .
            bind(rdf:type as ?p) .
            bind(dwec:HiddenEntity as ?o) .
        } UNION {
            ?s ?p ?o .
        }
    }
  5. Modify the first and second prefixes to match your organization and datasets.

    SDD_edit_query_1.png
  6. Change the .ttl filenames in the query to match the filenames in the dataset:

    SDD_edit_query_2.png
  7. Run the query and save it to the ddw-sensitive data dataset:

    SDD_save_query.png
  8. Save the output of the query as a table by selecting Download and Save to dataset or project:

    SDD_save_to_ddw-catalogs_1.png
  9. On the dialog window, leave Maintain link to saved query selected. That way, when you need to run the collectors again and update your catalog, all you need to do is upload the new dwec.ttl files and run the query for the catalog to update automatically.

    SDD_save_to_ddw-catalogs_2.png
  10. Name the file and save it to your ddw-catalogs dataset. If the dataset name is not displayed when you select the down arrow, begin typing the name and it will appear.

Viewing tagged data

Once the DWCC-SDD collector is run, tags are added to the tables and columns that are identified as containing sensitive information.

To view the tagged data:

  1. In the application, search or browse to any table or column that is scanned and tagged by the DWCC-SDD collector.

  2. On the Overview tab, check the More information section to see if any tags are assigned to it. Along with the tags, you can also view the Sensitive Data Classification Metric score, which indicates the confidence level with which the tag was assigned by the DWCC-SDD collector. All sensitive data classifications are assigned a confidence score on a scale of 0 to 1; the closer to 1, the more confident the model is that it identified the correct sensitive data type and applied the right classification. For example, if a string of numbers appears in a column, the tool analyzes the contents of the column to determine whether that string is more closely associated with Social Security numbers or credit card numbers. If the string is nine digits, it may assign a confidence score of 0.93 for Social Security number and 0.53 for credit card. Given the higher degree of confidence, the data is classified as Social Security numbers.

    sdd_column_tag.png