Examples
The load_dataset() function facilitates maintaining copies of datasets on the local filesystem. It will download a given dataset's datapackage and store it under ~/.dw/cache. When used subsequently, load_dataset() will use the copy stored on disk and will work offline, unless it's called with force_update=True or auto_update=True. force_update=True will overwrite your local copy unconditionally. auto_update=True will only overwrite your local copy if a newer version of the dataset is available on data.world.
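The update rules above can be summarized as a small decision function. This is only an illustrative sketch, not the library's implementation: should_download, has_local_copy, local_version and remote_version are hypothetical names, and it assumes dataset versions can be compared as integers.

```python
def should_download(has_local_copy, local_version=None, remote_version=None,
                    force_update=False, auto_update=False):
    """Decide whether a fresh copy should be fetched (hypothetical helper)."""
    if not has_local_copy:
        # Nothing in the cache yet, so the first load has to download.
        return True
    if force_update:
        # Overwrite the local copy unconditionally.
        return True
    if auto_update:
        # Overwrite only when a newer version exists remotely.
        # Assumption: versions are comparable integers.
        return remote_version > local_version
    # Default: reuse the cached copy, which also works offline.
    return False


# A cached copy at version 3 with version 4 available remotely.
refresh_default = should_download(True, 3, 4)
refresh_auto = should_download(True, 3, 4, auto_update=True)
refresh_forced = should_download(True, 3, 3, force_update=True)
```

With neither flag set, the cached copy is reused even when a newer version exists, which is what makes offline use possible.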
Once loaded, a dataset (data and metadata) can be conveniently accessed via the object returned by load_dataset().
Start by importing the datadotworld module:
import datadotworld as dw
Then, invoke the load_dataset() function to download a dataset and work with it locally. For example:
intro_dataset = dw.load_dataset('jonloyens/an-intro-to-dataworld-dataset')
Dataset objects allow access to data via three different properties: raw_data, tables and dataframes. Each of these properties is a mapping (dict) whose values are of type bytes, list and pandas.DataFrame, respectively. Values are lazy loaded and cached once loaded. Their keys are the names of the files contained in the dataset.
For example:
>>> intro_dataset.dataframes
LazyLoadedDict({
    'changelog': LazyLoadedValue(<pandas.DataFrame>),
    'datadotworldbballstats': LazyLoadedValue(<pandas.DataFrame>),
    'datadotworldbballteam': LazyLoadedValue(<pandas.DataFrame>)})
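To illustrate what "lazy loaded and cached" means for these mappings, here is a minimal sketch of a read-only mapping that calls a loader on first access and reuses the result afterwards. LazyMapping and load_changelog are hypothetical names for illustration only; this is not the library's LazyLoadedDict.

```python
from collections.abc import Mapping


class LazyMapping(Mapping):
    """Read-only mapping whose values are computed on first access."""

    def __init__(self, loaders):
        self._loaders = dict(loaders)  # key -> zero-argument callable
        self._cache = {}               # key -> already loaded value

    def __getitem__(self, key):
        if key not in self._cache:
            # First access: run the loader and remember its result.
            self._cache[key] = self._loaders[key]()
        return self._cache[key]

    def __contains__(self, key):
        # Membership checks should not trigger a load.
        return key in self._loaders

    def __iter__(self):
        return iter(self._loaders)

    def __len__(self):
        return len(self._loaders)


calls = []


def load_changelog():
    # Stand-in for an expensive read from disk.
    calls.append('changelog')
    return b'placeholder bytes'


files = LazyMapping({'changelog': load_changelog})
keys_before_access = list(files)    # listing keys loads nothing
calls_before_access = list(calls)
first = files['changelog']          # loader runs here
second = files['changelog']         # served from the cache
```

Listing the keys is cheap, and the cost of reading a file is paid only once, for the files that are actually used.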
IMPORTANT: Not all files in a dataset are tabular, therefore some will be exposed via raw_data only.
Tables are lists of rows, each represented by a mapping (dict) of column names to their respective values.
For example:
>>> stats_table = intro_dataset.tables['datadotworldbballstats']
>>> stats_table[0]
OrderedDict([('Name', 'Jon'),
             ('PointsPerGame', Decimal('20.4')),
             ('AssistsPerGame', Decimal('1.3'))])
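Because each row is a plain mapping, ordinary Python comprehensions are enough to work with a table. The sketch below rebuilds a stand-in for stats_table from the single row shown above, so that it runs without the library or a network connection; the real table may contain more rows.

```python
from collections import OrderedDict
from decimal import Decimal

# Stand-in for intro_dataset.tables['datadotworldbballstats'].
# Only the first row is known from the example output.
stats_table = [
    OrderedDict([('Name', 'Jon'),
                 ('PointsPerGame', Decimal('20.4')),
                 ('AssistsPerGame', Decimal('1.3'))]),
]

# Pull a single column out of the table.
names = [row['Name'] for row in stats_table]

# Numeric values arrive as Decimal, so arithmetic on them stays exact.
points_plus_assists = {
    row['Name']: row['PointsPerGame'] + row['AssistsPerGame']
    for row in stats_table
}

# Convert explicitly if a float is needed downstream.
points_as_float = float(stats_table[0]['PointsPerGame'])
```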
You can also review the metadata associated with a file or the entire dataset using the describe() method. For example:
>>> intro_dataset.describe()
{'homepage': 'https://data.world/jonloyens/an-intro-to-dataworld-dataset',
 'name': 'jonloyens_an-intro-to-dataworld-dataset',
 'resources': [{'format': 'csv',
                'name': 'changelog',
                'path': 'data/ChangeLog.csv'},
               {'format': 'csv',
                'name': 'datadotworldbballstats',
                'path': 'data/DataDotWorldBBallStats.csv'},
               {'format': 'csv',
                'name': 'datadotworldbballteam',
                'path': 'data/DataDotWorldBBallTeam.csv'}]}
>>> intro_dataset.describe('datadotworldbballstats')
{'format': 'csv',
 'name': 'datadotworldbballstats',
 'path': 'data/DataDotWorldBBallStats.csv',
 'schema': {'fields': [{'name': 'Name', 'title': 'Name', 'type': 'string'},
                       {'name': 'PointsPerGame',
                        'title': 'PointsPerGame',
                        'type': 'number'},
                       {'name': 'AssistsPerGame',
                        'title': 'AssistsPerGame',
                        'type': 'number'}]}}
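The metadata returned for a file is a regular dict, so its schema can be inspected with standard dictionary access. As a small sketch, the snippet below copies the output of describe('datadotworldbballstats') shown above into a literal and lists the column types. It uses .get() on the assumption that some resources may have no 'schema' entry.

```python
# Copied from the describe('datadotworldbballstats') output above.
resource = {
    'format': 'csv',
    'name': 'datadotworldbballstats',
    'path': 'data/DataDotWorldBBallStats.csv',
    'schema': {'fields': [
        {'name': 'Name', 'title': 'Name', 'type': 'string'},
        {'name': 'PointsPerGame', 'title': 'PointsPerGame', 'type': 'number'},
        {'name': 'AssistsPerGame', 'title': 'AssistsPerGame', 'type': 'number'},
    ]},
}

# Assumption: 'schema' may be absent for some resources, so fall back to
# an empty field list instead of raising KeyError.
fields = resource.get('schema', {}).get('fields', [])

# Map each column name to its declared type.
column_types = {field['name']: field['type'] for field in fields}

# Columns declared as numeric.
numeric_columns = [name for name, kind in column_types.items()
                   if kind == 'number']
```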