pycldf.dataset

The core object of the API, bundling most access to CLDF data, is the pycldf.Dataset. In the following we’ll describe its attributes and methods, bundled into thematic groups.

Dataset initialization

class pycldf.dataset.Dataset(tablegroup)[source]

API to access a CLDF dataset.

Parameters

tablegroup (csvw.metadata.TableGroup) –

__init__(tablegroup)[source]

A Dataset is initialized by passing a TableGroup. For convenient ways to obtain such a TableGroup, see the factory methods from_data(), from_metadata() and in_dir().

Parameters

tablegroup (csvw.metadata.TableGroup) –

classmethod from_data(fname)[source]

Initialize a Dataset from a single CLDF data file.

See https://github.com/cldf/cldf#metadata-free-conformance

Return type

Dataset

classmethod from_metadata(fname)[source]

Initialize a Dataset with the metadata found at fname.

Parameters

fname (typing.Union[str, pathlib.Path]) – A URL (str) or a local path (str or pathlib.Path). If fname points to a directory, the default metadata for the respective module will be read.

Return type

Dataset

classmethod in_dir(d, empty_tables=False)[source]

Create a Dataset in a (possibly empty or even non-existing) directory.

The dataset will be initialized with the default metadata for the respective module.

Return type

Dataset

Parameters

d (typing.Union[str, pathlib.Path]) –

Accessing dataset metadata

class pycldf.Dataset(tablegroup)[source]

API to access a CLDF dataset.

Parameters

tablegroup (csvw.metadata.TableGroup) –

property bibname: str
Return type

str

Returns

Filename of the sources BibTeX file.

property bibpath: Union[str, pathlib.Path]
Return type

typing.Union[str, pathlib.Path]

Returns

Location of the sources BibTeX file. Either a URL (str) or a local path (pathlib.Path).

property directory: Union[str, pathlib.Path]
Return type

typing.Union[str, pathlib.Path]

Returns

The location of the metadata file. Either a local directory as pathlib.Path or a URL as str.

property module: str
Return type

str

Returns

The name of the CLDF module of the dataset.

property properties: dict
Return type

dict

Returns

Common properties of the CSVW TableGroup of the dataset.

Accessing schema objects: components, tables, columns, etc.

Similar to capability checks in programming languages that use duck typing, it is often necessary to inspect a dataset’s schema, i.e. its tables and columns, to figure out whether the dataset fits a certain purpose. This is supported via a dict-like interface provided by pycldf.Dataset, where the keys are table specifiers or pairs (table specifier, column specifier). A table specifier can be a table’s component name or its url; a column specifier can be a column name or its propertyUrl.

  • check existence with in:

    >>> if 'ValueTable' in dataset: pass
    >>> if ('ValueTable', 'Language_ID') in dataset: pass
    
  • retrieve a schema object with item access:

    >>> table = dataset['ValueTable']
    >>> column = dataset['ValueTable', 'Language_ID']
    
  • retrieve a schema object or a default with .get:

    >>> table_or_none = dataset.get('ValueTableX')
    >>> column_or_none = dataset.get(('ValueTable', 'Language_ID'))
    
  • remove a schema object with del:

    >>> del dataset['ValueTable', 'Language_ID']
    >>> del dataset['ValueTable']
    

Note: Adding schema objects is not supported via key assignment, but with a set of specialized methods described in Editing metadata and schema.

class pycldf.Dataset(tablegroup)[source]

API to access a CLDF dataset.

Parameters

tablegroup (csvw.metadata.TableGroup) –

__contains__(item)[source]

Check whether a dataset specifies a table or column.

Parameters

item – See __getitem__()

Return type

bool

__getitem__(item)[source]

Access to tables and columns.

If a pair (table-spec, column-spec) is passed as item, a Column will be returned, otherwise item is assumed to be a table-spec.

A table-spec may be

  • a CLDF ontology URI matching the dc:conformsTo property of a table

  • the local name of a CLDF ontology URI, where the complete URI matches the dc:conformsTo property of a table

  • a filename matching the url property of a table

A column-spec may be

  • a CLDF ontology URI matching the propertyUrl of a column

  • the local name of a CLDF ontology URI, where the complete URI matches the propertyUrl of a column

  • the name of a column

Raises

SchemaError – If no matching table or column is found.

Return type

typing.Union[csvw.metadata.Table, csvw.metadata.Column]

property column_names: argparse.Namespace

Indirection layer, mapping ontology terms to local column names (or None).

Note that this property is computed each time it is accessed (because the dataset schema may have changed). So when accessing a dataset for reading only, calling code should use readonly_column_names.

Return type

argparse.Namespace

Returns

an argparse.Namespace object, with attributes <object>s for each component <Object>Table defined in the ontology. Each such attribute evaluates to None if the dataset does not contain the component. Otherwise, it’s an argparse.Namespace object mapping each property defined in the ontology to None - if no such column is specified in the component - and the local column name if it is.
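To illustrate the shape of the returned structure, the sketch below builds an equivalent nested namespace by hand instead of calling pycldf. The attribute names values and languages follow the <object>s pattern described above; the concrete column names and the helper local_name are hypothetical.

```python
import argparse

# Hand-built stand-in for the namespace of a dataset that has a
# ValueTable but no LanguageTable (hypothetical example data).
column_names = argparse.Namespace(
    values=argparse.Namespace(
        id="ID",
        languageReference="Language_ID",
        comment=None,  # property known to the ontology, but no such column
    ),
    languages=None,  # component not present in the dataset
)


def local_name(names, component, prop):
    """Return the local column name for an ontology property, or None."""
    comp = getattr(names, component, None)
    if comp is None:
        return None
    return getattr(comp, prop, None)


language_column = local_name(column_names, "values", "languageReference")
missing_component = local_name(column_names, "languages", "id")
missing_column = local_name(column_names, "values", "comment")
```

Both a missing component and a missing column surface as None, so calling code has to check for None at two levels.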

property components: Dict[str, csvw.metadata.Table]
Return type

typing.Dict[str, csvw.metadata.Table]

Returns

Mapping of component name to table objects as defined in the dataset.

get(item, default=None)[source]

Acts like dict.get.

Parameters

item – See __getitem__()

Return type

typing.Union[csvw.metadata.Table, csvw.metadata.Column, None]

get_foreign_key_reference(table, column)[source]

Retrieve the reference of a foreign key constraint for the specified column.

Parameters
  • table – Source table, specified by filename, component name or as Table instance.

  • column – Source column, specified by column name, CLDF term or as Column instance.

Return type

typing.Optional[typing.Tuple[csvw.metadata.Table, csvw.metadata.Column]]

Returns

A pair (Table, Column) specifying the reference column - or None.

readonly_column_names[source]
Returns

argparse.Namespace with component names as attributes.

property tables: list
Return type

list

Returns

All tables defined in the dataset.

Editing metadata and schema

In many cases, editing the metadata of a dataset is as simple as editing properties(), but for the somewhat complex formatting of provenance data, we provide the shortcut add_provenance().

Likewise, csvw.Table and csvw.Column objects in the dataset’s schema can be edited “in place”, by setting their attributes or adding to/editing their common_props dictionary. Thus, the methods listed below are concerned with adding and removing tables and columns.

class pycldf.Dataset(tablegroup)[source]

API to access a CLDF dataset.

Parameters

tablegroup (csvw.metadata.TableGroup) –

add_columns(table, *cols)[source]

Add columns specified by cols to the table specified by table.

Return type

None

add_component(component, *cols, **kw)[source]

Add a CLDF component to a dataset.

Parameters

component (typing.Union[str, dict]) – A component specified by name or as dict representing the JSON description of the component.

Return type

csvw.metadata.Table

add_foreign_key(foreign_t, foreign_c, primary_t, primary_c=None)[source]

Add a foreign key constraint.

Note: Composite keys are not supported yet.

Parameters
  • foreign_t – Table reference for the linking table.

  • foreign_c – Column reference for the link.

  • primary_t – Table reference for the linked table.

  • primary_c – Column reference for the linked column - or None, in which case the primary key of the linked table is assumed.

add_provenance(**kw)[source]

Add metadata about the dataset’s provenance.

Parameters

kw – Key-value pairs, where keys are local names of properties in the PROV ontology for describing entities (see https://www.w3.org/TR/2013/REC-prov-o-20130430/#Entity).

add_table(url, *cols, **kw)[source]

Add a table description to the Dataset.

Parameters
  • url (str) – The url property of the table.

  • cols – Column specifications; anything accepted by pycldf.dataset.make_column().

  • kw – Recognized keywords: primaryKey, which specifies the column(s) constituting the primary key of the table.

Return type

csvw.metadata.Table

Returns

The new table.

remove_columns(table, *cols)[source]

Remove cols from table’s schema.

Note: Foreign keys pointing to any of the removed columns are removed as well.

remove_table(table)[source]

Removes the table specified by table from the dataset.

Adding data

The main method to persist data as a CLDF dataset is write(), which accepts data for all CLDF data files as input. This does not include sources, though. These must be added using add_sources().

class pycldf.Dataset(tablegroup)[source]

API to access a CLDF dataset.

Parameters

tablegroup (csvw.metadata.TableGroup) –

add_sources(*sources, **kw)[source]

Add sources to the dataset.

Parameters

sources – Anything accepted by pycldf.sources.Sources.add().

Reading data

Reading rows from CLDF data files, honoring the datatypes specified in the schema, is already implemented by csvw. Thus, the simplest way to read data is iterating over the csvw.Table objects. However, this will ignore the semantic layer provided by CLDF. E.g. a CLDF languageReference linking a value to a language will appear in the dict returned for a row under the local column name only. Thus, we provide several more convenient methods to read data.

class pycldf.Dataset(tablegroup)[source]

API to access a CLDF dataset.

Parameters

tablegroup (csvw.metadata.TableGroup) –

get_object(table, id_, cls=None, pk=False)[source]

Get a row of a component as pycldf.orm.Object instance.

Return type

pycldf.orm.Object

get_row(table, id_)[source]

Retrieve a row specified by table and CLDF id.

Raises

ValueError – If no matching row is found.

Return type

dict

get_row_url(table, row)[source]

Get a URL associated with a row. Tables can specify associated row URLs by

  • listing one column with datatype anyURI or

  • specifying a valueUrl property for their ID column.

For rows representing objects in web applications, this may be the object’s URL. For rows representing media files, it may be a URL locating the file on a media server.

Parameters
  • table – Table specified in a way that __getitem__ understands.

  • row – A row specified by ID or as dict as returned when iterating over a table.

Return type

typing.Optional[str]

Returns

a str representing a URL or None.

iter_rows(table, *cols)[source]

Iterate rows in a table, resolving CLDF property names to local column names.

Parameters
  • table – Table name.

  • cols – List of CLDF property terms which must be resolved in the resulting dicts, i.e. the row dicts will be augmented with copies of the values keyed with CLDF property terms.

Return type

typing.Iterator[dict]
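What "augmented with copies" means can be shown without pycldf. The following sketch mimics the described behaviour on plain dicts; the helper augment_rows and the term-to-column mapping are hypothetical and not the library's implementation.

```python
from typing import Dict, Iterable, Iterator


def augment_rows(rows: Iterable[dict], term_to_column: Dict[str, str]) -> Iterator[dict]:
    """Yield copies of rows with extra keys for the given CLDF property terms."""
    for row in rows:
        augmented = dict(row)  # keep the original row untouched
        for term, column in term_to_column.items():
            # The value stays available under the local column name as well.
            augmented[term] = row[column]
        yield augmented


rows = [{"ID": "1", "Language_ID": "abcd1234", "Value": "yes"}]
result = list(augment_rows(rows, {"languageReference": "Language_ID"}))
```

Code consuming such rows can then read row["languageReference"] regardless of whether the dataset calls the column Language_ID or something else.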

objects(table, cls=None)[source]

Read data of a CLDF component as pycldf.orm.Object instances.

Parameters
  • table – table to read, specified as component name.

  • cls – pycldf.orm.Object subclass to instantiate objects with.

Return type

pycldf.util.DictTuple


Writing (meta)data

class pycldf.Dataset(tablegroup)[source]

API to access a CLDF dataset.

Parameters

tablegroup (csvw.metadata.TableGroup) –

write(fname=None, **table_items)[source]

Write metadata, sources and data. Metadata will be written to fname (as interpreted in pycldf.dataset.Dataset.write_metadata()); data files will be written to the file specified by csvw.Table.url of the corresponding table, interpreted as path relative to directory().

Parameters

table_items (typing.List[dict]) – Mapping of table specifications to lists of row dicts.

write_metadata(fname=None)[source]

Write the CLDF metadata to a JSON file.

Parameters

fname – Path of a file to write to, or None to use the default name and write to directory().

write_sources()[source]

Write the sources BibTeX file to bibpath().

Reporting

class pycldf.Dataset(tablegroup)[source]

API to access a CLDF dataset.

Parameters

tablegroup (csvw.metadata.TableGroup) –

stats(exact=False)[source]

Compute summary statistics for the dataset.

Return type

typing.List[typing.Tuple[str, str, int]]

Returns

List of triples (table, type, rowcount).

validate(log=None, validators=None, ontology_path=None)[source]

Validate schema and data of a Dataset:

  • Make sure the schema follows the CLDF specification and

  • make sure the data is consistent with the schema.

Parameters
  • log (typing.Optional[logging.Logger]) – a logging.Logger to write ERRORs and WARNINGs to. If None, an exception will be raised at the first problem.

  • validators (typing.Optional[typing.List[typing.Tuple[str, str, callable]]]) – Custom validation rules, i.e. triples (tablespec, columnspec, attrs validator)

Raises

ValueError – if a validation error is encountered (and log is None).

Return type

bool

Returns

Flag signaling whether schema and data are valid.

Dataset discovery

We provide two functions to make it easier to discover CLDF datasets in the file system. This is useful, e.g., when downloading archived datasets from Zenodo, where it may not be known in advance where in a zip archive the metadata file may reside.

pycldf.sniff(p)[source]

Determine whether a file contains CLDF metadata.

Parameters

p (pathlib.Path) – pathlib.Path object for an existing file.

Return type

bool

Returns

True if the file contains CLDF metadata, False otherwise.

pycldf.iter_datasets(d)[source]

Discover CLDF datasets - by identifying metadata files - in a directory.

Parameters

d (pathlib.Path) – directory

Return type

typing.Iterator[pycldf.dataset.Dataset]

Returns

generator of Dataset instances.

Sources

When constructing sources for a CLDF dataset in Python code, you may pass pycldf.Source instances into pycldf.Dataset.add_sources(), or use pycldf.Reference.__str__() to format a row’s source value properly.

Direct access to pycldf.dataset.Sources is rarely necessary (hence it is not available as import from pycldf directly), because each pycldf.Dataset provides access to an appropriately initialized instance in its sources attribute.

class pycldf.Source(genre, id_, *args, **kw)[source]

A bibliographical record, specifying a source for some data in a CLDF dataset.

classmethod from_entry(key, entry, **_kw)[source]

Create a cls instance from a pybtex entry object.

Parameters
  • key – BibTeX citation key of the entry

  • entry – pybtex.database.Entry instance

  • _kw – Non-bib-metadata keywords to be passed for cls instantiation

Returns

cls instance

class pycldf.Reference(source, desc)[source]

A reference connects a piece of data with a Source, typically adding some citation context, often page numbers or similar.

__str__()[source]

String representation of a reference according to the CLDF specification.

class pycldf.dataset.Sources[source]

A dict-like container for all sources linked to data in a CLDF dataset.

add(*entries, **kw)[source]

Add a source, either specified as BibTeX string or as Source.

Parameters

entries (typing.Union[str, pycldf.sources.Source]) –

expand_refs(refs, **kw)[source]

Turn a list of string references into proper Reference instances, looking up sources in self.

This can be used from a pycldf.Dataset as follows:

>>> for row in dataset.iter_rows('ValueTable', 'source'):
...     for ref in dataset.sources.expand_refs(row['source']):
...         print(ref.source)
Parameters

refs (typing.Iterable[str]) –

Return type

typing.Iterable[pycldf.sources.Reference]

static parse(ref)[source]

Parse the string representation of a reference into source ID and context.

Raises

ValueError – if the reference does not match the expected format.

Parameters

ref (str) –

Return type

typing.Tuple[str, str]