Integrations

Examples

The load_dataset() function keeps local copies of datasets on the filesystem. On first use, it downloads the given dataset's datapackage and stores it under ~/.dw/cache. Subsequent calls use the copy stored on disk and work offline, unless load_dataset() is called with force_update=True or auto_update=True. force_update=True overwrites your local copy unconditionally, whereas auto_update=True overwrites it only if a newer version of the dataset is available on data.world.
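The update rules above can be summarized as a small decision function. This is only a sketch of the documented behavior, not the library's implementation: should_download() and its integer version arguments are hypothetical names introduced for illustration.

```python
def should_download(has_local_copy, force_update=False, auto_update=False,
                    local_version=0, remote_version=0):
    """Decide whether a fresh copy should be fetched.

    Hypothetical helper for illustration only; it is not part of
    datadotworld. Versions are modeled as plain integers.
    """
    if not has_local_copy:
        return True    # nothing cached yet, so the first call downloads
    if force_update:
        return True    # overwrite the local copy unconditionally
    if auto_update:
        # overwrite only when a newer version exists remotely
        return remote_version > local_version
    return False       # default: reuse the cached copy (works offline)


print(should_download(has_local_copy=False))
print(should_download(True))
print(should_download(True, force_update=True))
print(should_download(True, auto_update=True, local_version=1, remote_version=2))
print(should_download(True, auto_update=True, local_version=2, remote_version=2))
```

In real use, the two flags are passed to load_dataset() itself, for example dw.load_dataset('jonloyens/an-intro-to-dataworld-dataset', auto_update=True).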

Once loaded, a dataset (data and metadata) can be conveniently accessed via the object returned by load_dataset().

Start by importing the datadotworld module:

import datadotworld as dw

Then, invoke the load_dataset() function to download a dataset and work with it locally. For example:

intro_dataset = dw.load_dataset('jonloyens/an-intro-to-dataworld-dataset')

Dataset objects allow access to data via three different properties: raw_data, tables and dataframes. Each of these properties is a mapping (dict) whose values are of type bytes, list and pandas.DataFrame, respectively. Values are loaded lazily on first access and cached from then on. Their keys are the names of the files contained in the dataset.

For example:

>>> intro_dataset.dataframes
LazyLoadedDict({
    'changelog': LazyLoadedValue(<pandas.DataFrame>),
    'datadotworldbballstats': LazyLoadedValue(<pandas.DataFrame>),
    'datadotworldbballteam': LazyLoadedValue(<pandas.DataFrame>)})
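To make "lazy loaded and cached" concrete, here is a minimal sketch of a mapping that computes each value on first access and reuses it afterwards. LazyDict and load_stats are hypothetical names; this is not the library's LazyLoadedDict, only an illustration of the same idea.

```python
from collections.abc import Mapping


class LazyDict(Mapping):
    """Read-only mapping whose values are computed on first access, then cached.

    Hypothetical illustration; not the class used by datadotworld.
    """

    def __init__(self, loaders):
        self._loaders = dict(loaders)   # key -> zero-argument callable
        self._cache = {}                # key -> already loaded value

    def __getitem__(self, key):
        if key not in self._cache:
            # Raises KeyError for unknown keys, like a regular dict.
            self._cache[key] = self._loaders[key]()
        return self._cache[key]

    def __iter__(self):
        return iter(self._loaders)      # listing keys does not load values

    def __len__(self):
        return len(self._loaders)


calls = []


def load_stats():
    calls.append('datadotworldbballstats')   # record every real load
    return [{'Name': 'Jon'}]


tables = LazyDict({'datadotworldbballstats': load_stats})
keys = list(tables)                          # no load happens here
first = tables['datadotworldbballstats']     # triggers the loader
second = tables['datadotworldbballstats']    # served from the cache
```

The practical consequence is that inspecting the keys of a large dataset is cheap, and only the files you actually index into are parsed.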

IMPORTANT: Not all files in a dataset are tabular; those that are not will be exposed via raw_data only.

Tables are lists of rows, each represented by a mapping (dict) of column names to their respective values.

For example:

>>> stats_table = intro_dataset.tables['datadotworldbballstats']
>>> stats_table[0]
OrderedDict([('Name', 'Jon'),
             ('PointsPerGame', Decimal('20.4')),
             ('AssistsPerGame', Decimal('1.3'))])
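Because a table is a plain list of mappings, ordinary Python comprehensions are enough to work with it. The sketch below uses a stand-in list containing only the single row shown above, so it runs without downloading anything; with a real dataset, stats_table would come from intro_dataset.tables instead.

```python
from collections import OrderedDict
from decimal import Decimal

# Stand-in for intro_dataset.tables['datadotworldbballstats'], holding only
# the one row shown in the example above.
stats_table = [
    OrderedDict([('Name', 'Jon'),
                 ('PointsPerGame', Decimal('20.4')),
                 ('AssistsPerGame', Decimal('1.3'))]),
]

# Pull a single column out of the table.
names = [row['Name'] for row in stats_table]

# Decimal arithmetic is exact, so no binary floating-point error creeps in.
combined = {row['Name']: row['PointsPerGame'] + row['AssistsPerGame']
            for row in stats_table}

# Convert to float only when a downstream library requires it.
points_as_float = [float(row['PointsPerGame']) for row in stats_table]

print(names)
print(combined)
```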

You can also review the metadata associated with a single file or with the entire dataset by calling the describe() method. For example:

>>> intro_dataset.describe()
{'homepage': 'https://data.world/jonloyens/an-intro-to-dataworld-dataset',
 'name': 'jonloyens_an-intro-to-dataworld-dataset',
 'resources': [{'format': 'csv',
   'name': 'changelog',
   'path': 'data/ChangeLog.csv'},
  {'format': 'csv',
   'name': 'datadotworldbballstats',
   'path': 'data/DataDotWorldBBallStats.csv'},
  {'format': 'csv',
   'name': 'datadotworldbballteam',
   'path': 'data/DataDotWorldBBallTeam.csv'}]}
>>> intro_dataset.describe('datadotworldbballstats')
{'format': 'csv',
 'name': 'datadotworldbballstats',
 'path': 'data/DataDotWorldBBallStats.csv',
 'schema': {'fields': [{'name': 'Name', 'title': 'Name', 'type': 'string'},
                       {'name': 'PointsPerGame',
                        'title': 'PointsPerGame',
                        'type': 'number'},
                       {'name': 'AssistsPerGame',
                        'title': 'AssistsPerGame',
                        'type': 'number'}]}}
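The dictionaries returned by describe() can be traversed like any other nested dict. As a small sketch, the snippet below copies the file-level output shown above into a literal (so it runs offline) and derives a column-to-type lookup from its schema; column_types and numeric_columns are names chosen here for illustration.

```python
# Literal copy of the describe('datadotworldbballstats') output shown above.
resource = {
    'format': 'csv',
    'name': 'datadotworldbballstats',
    'path': 'data/DataDotWorldBBallStats.csv',
    'schema': {'fields': [
        {'name': 'Name', 'title': 'Name', 'type': 'string'},
        {'name': 'PointsPerGame', 'title': 'PointsPerGame', 'type': 'number'},
        {'name': 'AssistsPerGame', 'title': 'AssistsPerGame', 'type': 'number'},
    ]},
}

# Map each column name to its declared type.
column_types = {field['name']: field['type']
                for field in resource['schema']['fields']}

# Select the columns declared as numeric, preserving schema order.
numeric_columns = [name for name, kind in column_types.items()
                   if kind == 'number']

print(column_types)
print(numeric_columns)
```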