
Configuring data profiling while running collectors

Which collectors support profiling for metadata?

The data profiling feature is available for the following collectors:

What profiling statistics are collected for metadata?

Table 1.

Object: Columns

Information cataloged:

  • Distinct values

  • Non-null count

  • Integer value (min, max, avg)

  • Decimal value (min, max, avg)

  • String value (min, max)

  • String length (min, max, avg)
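To make the statistics in Table 1 concrete, the following is a minimal Python sketch of how such values could be computed for a single column. It is not the collector's implementation: the function name profile_column is hypothetical, and it assumes that distinct values are counted over non-null entries only and that integer and decimal columns are summarized the same way.

```python
from statistics import mean


def profile_column(values):
    """Compute illustrative profiling statistics for one column.

    Hypothetical helper: the real collector may compute these differently.
    """
    # Nulls are excluded before any statistic is computed (assumption).
    non_null = [v for v in values if v is not None]
    stats = {
        "distinct_values": len(set(non_null)),
        "non_null_count": len(non_null),
    }

    # Integer and decimal values share the same min/max/avg summary here.
    numbers = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numbers:
        stats["value"] = {
            "min": min(numbers),
            "max": max(numbers),
            "avg": mean(numbers),
        }

    # String columns get value bounds (lexicographic) and length statistics.
    strings = [v for v in non_null if isinstance(v, str)]
    if strings:
        lengths = [len(s) for s in strings]
        stats["string_value"] = {"min": min(strings), "max": max(strings)}
        stats["string_length"] = {
            "min": min(lengths),
            "max": max(lengths),
            "avg": mean(lengths),
        }
    return stats


ages = profile_column([34, None, 27, 34, 51])
cities = profile_column(["Austin", "Oslo", None, "Lima"])
```

Here ages describes an integer column with one null, and cities describes a string column, so each result carries only the statistics that apply to its data type.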



A sample view of data profiling information for metadata:

Note

The data distribution shows a maximum of 50 buckets.

[Figure: data_profiling_metadata.png]

Does the system sample data to create profiling statistics?

  • Yes. The collector samples a small amount of data solely to generate statistics. It does not ingest the data or view it in any other way.
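The idea of sampling only to derive statistics can be sketched as follows. This is an illustration under stated assumptions, not the collector's actual sampling method, which this page does not describe: the functions sample_rows and summarize are hypothetical, and reservoir sampling is simply one common way to draw a fixed-size sample from a stream of rows.

```python
import random


def sample_rows(rows, target_sample_size, seed=None):
    """Reservoir-sample up to target_sample_size rows from an iterable.

    Hypothetical helper; the sampling strategy is an assumption.
    """
    rng = random.Random(seed)
    reservoir = []
    for index, row in enumerate(rows):
        if index < target_sample_size:
            # Fill the reservoir with the first rows seen.
            reservoir.append(row)
        else:
            # Replace an existing entry with decreasing probability.
            slot = rng.randint(0, index)
            if slot < target_sample_size:
                reservoir[slot] = row
    return reservoir


def summarize(rows, target_sample_size):
    """Return aggregate statistics only; the sampled rows are discarded."""
    sample = sample_rows(rows, target_sample_size, seed=0)
    non_null = [r for r in sample if r is not None]
    return {
        "non_null_count": len(non_null),
        "distinct_values": len(set(non_null)),
    }


summary = summarize(range(100), target_sample_size=10)
```

Only the aggregate dictionary leaves summarize; the sampled rows themselves go out of scope when the function returns, which mirrors the statement that the data is used for statistics and nothing else.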

Enabling profiling for metadata

To enable profiling for metadata, add the following optional parameters to the command or YAML file of any collector that supports data profiling:

  • --sample-string-values: Enables harvesting of sample values and histograms for columns that contain string data.

  • --enable-column-statistics: Enables harvesting of column statistics.

  • --target-sample-size: Controls the number of rows sampled to compute column statistics and string-value histograms. For example, to sample 1000 rows, set the parameter as: --target-sample-size=1000.
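The following minimal Python sketch shows one way to assemble these parameters when a collector is launched from a script. Only the three flag names come from this page. The function profiling_args is hypothetical, the base command is a placeholder rather than a real collector invocation, and the sketch assumes that the two enabling parameters are passed as bare switches with no value.

```python
def profiling_args(enable_column_statistics=True,
                   sample_string_values=False,
                   target_sample_size=None):
    """Build the optional profiling parameters as a list of CLI arguments.

    Hypothetical helper; assumes the boolean options are bare switches.
    """
    args = []
    if enable_column_statistics:
        args.append("--enable-column-statistics")
    if sample_string_values:
        args.append("--sample-string-values")
    if target_sample_size is not None:
        if target_sample_size <= 0:
            raise ValueError("target_sample_size must be a positive integer")
        args.append(f"--target-sample-size={target_sample_size}")
    return args


# Placeholder only: substitute the actual command for your collector.
base_command = ["<collector-command>"]
command = base_command + profiling_args(
    sample_string_values=True,
    target_sample_size=1000,
)
print(" ".join(command))
```

Keeping the flags in a list rather than a single string makes it straightforward to pass the result to a process launcher without extra quoting.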