pycldf.dataset
The core object of the API, bundling most access to CLDF data, is
the Dataset. In the following we’ll describe its
attributes and methods, bundled into thematic groups.
Dataset initialization
- class pycldf.dataset.Dataset(tablegroup)[source]
API to access a CLDF dataset.
- Parameters:
tablegroup (
csvw.metadata.TableGroup) –
- __init__(tablegroup)[source]
A Dataset is initialized by passing a TableGroup. The following factory methods obviate the need to instantiate such a TableGroup instance yourself:
- Parameters:
tablegroup (
csvw.metadata.TableGroup) –
- classmethod from_data(fname)[source]
Initialize a Dataset from a single CLDF data file.
See https://github.com/cldf/cldf#metadata-free-conformance
- Return type:
Dataset
- Parameters:
fname (
typing.Union[str,pathlib.Path]) –
- classmethod from_metadata(fname)[source]
Initialize a Dataset with the metadata found at fname.
- Parameters:
fname (
typing.Union[str,pathlib.Path]) – A URL (str) or a local path (str or pathlib.Path). If fname points to a directory, the default metadata for the respective module will be read.
- Return type:
Dataset
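As a rough illustration of the two factory methods, the following sketch picks one of them based on the file name. The helpers factory_name and load_dataset are hypothetical names introduced here, and treating every .json path as a metadata file is an assumption of this example, not a rule enforced by pycldf. Directories and module-specific subclasses are not handled.

```python
import os.path


def factory_name(fname):
    """Guess which Dataset factory fits fname (assumption: metadata is JSON)."""
    suffix = os.path.splitext(str(fname))[1].lower()
    return "from_metadata" if suffix == ".json" else "from_data"


def load_dataset(fname):
    """Load a CLDF dataset from a metadata file or from a single data file."""
    # Imported lazily, so that only this call requires pycldf to be installed.
    from pycldf import Dataset

    return getattr(Dataset, factory_name(fname))(fname)


print(factory_name("cldf/StructureDataset-metadata.json"))  # from_metadata
print(factory_name("values.csv"))  # from_data
```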
Accessing dataset metadata
- property Dataset.directory: str | Path
- Returns:
The location of the metadata file. Either a local directory as pathlib.Path or a URL as str.
- property Dataset.module: str
- Returns:
The name of the CLDF module of the dataset.
- property Dataset.version: str
The CLDF version.
- property Dataset.metadata_dict: dict
The TableGroup instance as dict.
- property Dataset.properties: dict
- Returns:
Common properties of the CSVW TableGroup of the dataset.
- property Dataset.bibpath: str | Path
- Returns:
Location of the sources BibTeX file. Either a URL (str) or a local path (pathlib.Path).
- property Dataset.bibname: str
- Returns:
Filename of the sources BibTeX file.
Accessing schema objects: components, tables, columns, etc.
Similar to capability checks in programming languages that use
duck typing, it is often necessary
to access a dataset’s schema, i.e. its tables and columns, to figure out whether
the dataset fits a certain purpose. This is supported via a
mapping-like interface provided
by Dataset, where the keys are table specifiers or pairs (table specifier, column specifier).
A table specifier can be a table’s component name or its url, a column specifier can be a column
name or its propertyUrl.
check existence with in:
    if 'ValueTable' in dataset: ...
    if ('ValueTable', 'Language_ID') in dataset: ...
retrieve a schema object with item access:
    table = dataset['ValueTable']
    column = dataset['ValueTable', 'Language_ID']
retrieve a schema object or a default with Dataset.get():
    table_or_none = dataset.get('ValueTableX')
    column_or_none = dataset.get(('ValueTable', 'Language_ID'))
remove a schema object with del:
    del dataset['ValueTable', 'Language_ID']
    del dataset['ValueTable']
Note
Adding schema objects is not supported via key assignment, but with a set of specialized methods described in Editing metadata and schema.
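The capability check described above can be wrapped in a small helper. This is a minimal sketch: missing_schema_objects is a hypothetical function, and FakeDataset is a stand-in that only supports the in operator, so the example runs without pycldf. A real Dataset can be passed in its place.

```python
def missing_schema_objects(dataset, requirements):
    """Return the table or (table, column) specifiers that dataset lacks."""
    return [spec for spec in requirements if spec not in dataset]


class FakeDataset:
    """Stand-in for a pycldf Dataset, supporting only the `in` operator."""

    def __init__(self, *specs):
        self._specs = set(specs)

    def __contains__(self, item):
        return item in self._specs


dataset = FakeDataset("ValueTable", ("ValueTable", "Language_ID"))
required = ["ValueTable", ("ValueTable", "Language_ID"), "LanguageTable"]
print(missing_schema_objects(dataset, required))  # ['LanguageTable']
```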
- property Dataset.tables: list[csvw.metadata.Table]
- Returns:
All tables defined in the dataset.
- property Dataset.components: OrderedDict[str, Table]
- Returns:
Mapping of component name to table objects as defined in the dataset.
- Dataset.__getitem__(item)[source]
Access to tables and columns.
If a pair (table-spec, column-spec) is passed as item, a csvw.Column will be returned, otherwise item is assumed to be a table-spec, and a csvw.Table is returned.
A table-spec may be
a CLDF ontology URI matching the dc:conformsTo property of a table
the local name of a CLDF ontology URI, where the complete URI matches the dc:conformsTo property of a table
a filename matching the url property of a table.
A column-spec may be
a CLDF ontology URI matching the propertyUrl of a column
the local name of a CLDF ontology URI, where the complete URI matches the propertyUrl of a column
the name of a column.
- Parameters:
item (
typing.Union[str,csvw.metadata.Link,csvw.metadata.Table,tuple[typing.Union[str,csvw.metadata.Link,csvw.metadata.Table],typing.Union[str,csvw.metadata.Column]]]) – A schema object spec.
- Raises:
SchemaError – If no matching table or column is found.
- Return type:
typing.Union[csvw.metadata.Table,csvw.metadata.Column]
- Dataset.__delitem__(item)[source]
Remove a table or column from the dataset’s schema.
- Parameters:
item (
typing.Union[str,csvw.metadata.Link,csvw.metadata.Table,tuple[typing.Union[str,csvw.metadata.Link,csvw.metadata.Table],typing.Union[str,csvw.metadata.Column]]]) – See __getitem__()
- Dataset.__contains__(item)[source]
Check whether a dataset specifies a table or column.
- Parameters:
item (
typing.Union[str,csvw.metadata.Link,csvw.metadata.Table,tuple[typing.Union[str,csvw.metadata.Link,csvw.metadata.Table],typing.Union[str,csvw.metadata.Column]]]) – See __getitem__()
- Return type:
bool
- Dataset.get(item, default=None)[source]
Acts like dict.get.
- Parameters:
item (
typing.Union[str,csvw.metadata.Link,csvw.metadata.Table,tuple[typing.Union[str,csvw.metadata.Link,csvw.metadata.Table],typing.Union[str,csvw.metadata.Column]]]) – See __getitem__()
- Return type:
typing.Union[csvw.metadata.Table,csvw.metadata.Column,None]
- Dataset.get_foreign_key_reference(table, column)[source]
Retrieve the reference of a foreign key constraint for the specified column.
- Parameters:
table (
typing.Union[str,csvw.metadata.Table]) – Source table, specified by filename, component name or as Table instance.
column (
typing.Union[str,csvw.metadata.Column]) – Source column, specified by column name, CLDF term or as Column instance.
- Return type:
typing.Optional[tuple[csvw.metadata.Table,csvw.metadata.Column]]
- Returns:
A pair (Table, Column) specifying the reference column - or None.
- property Dataset.column_names: SimpleNamespace
Indirection layer, mapping ontology terms to local column names (or None).
Note that this property is computed each time it is accessed (because the dataset schema may have changed). So when accessing a dataset for reading only, calling code should use readonly_column_names.
- Returns:
a types.SimpleNamespace object, with attributes <object>s for each component <Object>Table defined in the ontology. Each such attribute evaluates to None if the dataset does not contain the component. Otherwise, it’s a types.SimpleNamespace object mapping each property defined in the ontology to None - if no such column is specified in the component - and the local column name if it is.
- property Dataset.readonly_column_names: SimpleNamespace
- Returns:
types.SimpleNamespace with component names as attributes.
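Because both levels of the namespace may evaluate to None, lookups are easiest with a small guard. The sketch below assumes the attribute layout described for column_names (values for ValueTable, property names such as languageReference); local_column_name is a hypothetical helper, and the namespace is built by hand so the example runs without a dataset.

```python
from types import SimpleNamespace


def local_column_name(column_names, component, prop):
    """Look up the local name of an ontology property, or None if unavailable."""
    table = getattr(column_names, component, None)
    if table is None:  # the dataset does not contain this component
        return None
    return getattr(table, prop, None)


# Hand-built stand-in, shaped like the documented return value.
names = SimpleNamespace(
    values=SimpleNamespace(languageReference="Language_ID", comment=None),
    languages=None,
)
print(local_column_name(names, "values", "languageReference"))  # Language_ID
print(local_column_name(names, "languages", "id"))  # None
```

With a real dataset that is only read, the first argument would be dataset.readonly_column_names.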
Editing metadata and schema
In many cases, editing the metadata of a dataset is as simple as editing
Dataset.properties(), but for the somewhat complex
formatting of provenance data, we provide the shortcut
Dataset.add_provenance().
Likewise, csvw.Table and csvw.Column objects in the dataset’s schema can
be edited “in place”, by setting their attributes or adding to/editing their
common_props dictionary.
Thus, the methods listed below are concerned with adding and removing tables
and columns.
- Dataset.add_table(url, *cols, **kw)[source]
Add a table description to the Dataset.
- Parameters:
url (
str) – The url property of the table.
cols (
typing.Union[str,dict,csvw.metadata.Column]) – Column specifications; anything accepted by pycldf.dataset.make_column().
kw (
typing.Any) – Recognized keywords:
primaryKey: specify the column(s) constituting the primary key of the table.
description: a description of the table.
- Return type:
csvw.metadata.Table
- Returns:
The new table.
- Dataset.remove_table(table)[source]
Removes the table specified by table from the dataset.
- Parameters:
table (
typing.Union[str,csvw.metadata.Table]) –
- Return type:
None
- Dataset.add_component(component, *cols, **kw)[source]
Add a CLDF component to a dataset.
- Parameters:
component (
typing.Union[str,dict]) – A component specified by name or as dict representing the JSON description of the component.
kw – Recognized keywords:
url: a url property for the table;
description: a description of the table.
cols (
typing.Union[str,dict,csvw.metadata.Column]) –
- Return type:
csvw.metadata.Table
- Dataset.add_columns(table, *cols)[source]
Add columns specified by cols to the table specified by table.
- Parameters:
table (
typing.Union[str,csvw.metadata.Table]) –
cols (
typing.Union[str,dict,csvw.metadata.Column]) –
- Return type:
None
- Dataset.remove_columns(table, *cols)[source]
Remove cols from table’s schema.
Note
Foreign keys pointing to any of the removed columns are removed as well.
- Parameters:
table (
typing.Union[str,csvw.metadata.Table]) –
cols (
typing.Union[str,csvw.metadata.Column]) –
- Return type:
None
- Dataset.rename_column(table, col, name)[source]
Assign a new name to an existing column, cascading this change to foreign keys.
This functionality can be used to change the names of columns added automatically by Dataset.add_component().
- Parameters:
table (
typing.Union[str,csvw.metadata.Table]) –
col (
typing.Union[str,csvw.metadata.Column]) –
name (
str) –
- Return type:
None
- Dataset.add_foreign_key(foreign_t, foreign_c, primary_t, primary_c=None)[source]
Add a foreign key constraint.
Note
Composite keys are not supported yet.
- Parameters:
foreign_t (
typing.Union[str,csvw.metadata.Table]) – Table reference for the linking table.
foreign_c (
typing.Union[str,csvw.metadata.Column]) – Column reference for the link.
primary_t (
typing.Union[str,csvw.metadata.Table]) – Table reference for the linked table.
primary_c (
typing.Union[str,csvw.metadata.Column,None]) – Column reference for the linked column - or None, in which case the primary key of the linked table is assumed.
- Return type:
None
- Dataset.add_provenance(**kw)[source]
Add metadata about the dataset’s provenance.
- Parameters:
kw (
typing.Any) – Key-value pairs, where keys are local names of properties in the PROV ontology for describing entities (see https://www.w3.org/TR/2013/REC-prov-o-20130430/#Entity).
- Return type:
None
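The schema editing methods above are typically combined. The following sketch adds a custom table and links it to an existing ValueTable. The file name contributors.csv and the column Contributor_ID are hypothetical, and RecordingDataset is a stand-in that merely records calls, so the example runs without pycldf. A real Dataset containing a ValueTable could be passed instead.

```python
def add_contributors(dataset):
    """Add a custom table and link ValueTable rows to it."""
    dataset.add_table("contributors.csv", "ID", "Name", primaryKey="ID")
    dataset.add_columns("ValueTable", "Contributor_ID")
    # primary_c is omitted, so the primary key of contributors.csv is assumed.
    dataset.add_foreign_key("ValueTable", "Contributor_ID", "contributors.csv")


class RecordingDataset:
    """Stand-in that records method calls instead of editing a schema."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for names that are not regular attributes.
        def record(*args, **kw):
            self.calls.append((name, args, kw))

        return record


dataset = RecordingDataset()
add_contributors(dataset)
print([name for name, _, _ in dataset.calls])
# ['add_table', 'add_columns', 'add_foreign_key']
```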
Adding data
The main method to persist data as a CLDF dataset is Dataset.write(),
which accepts data for all CLDF data files as input. This does not include
sources, though. These must be added using Dataset.add_sources().
- Dataset.add_sources(*sources, **kw)[source]
Add sources to the dataset.
- Parameters:
sources (
typing.Union[str,pycldf.sources.Source]) – Anything accepted by pycldf.sources.Sources.add().
- Return type:
None
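The division of labour between Dataset.add_sources() and Dataset.write() can be sketched as follows. The BibTeX record and the row are placeholders made up for this illustration, the column names assume the default ValueTable, and StubDataset only mimics the two documented signatures so that the example runs without pycldf.

```python
# Placeholder BibTeX record and row, made up for this illustration.
BIBTEX = "@book{example2020,\n  title = {A placeholder entry},\n  year = {2020}\n}"
ROWS = [{
    "ID": "1",
    "Language_ID": "lang1",
    "Parameter_ID": "param1",
    "Value": "yes",
    "Source": ["example2020[12]"],
}]


def write_values(dataset, rows, bibtex):
    """Register sources first, then write metadata, sources and data files."""
    dataset.add_sources(bibtex)
    return dataset.write(ValueTable=rows)


class StubDataset:
    """Stand-in mimicking the signatures of add_sources() and write()."""

    def __init__(self):
        self.sources, self.written = [], {}

    def add_sources(self, *sources, **kw):
        self.sources.extend(sources)

    def write(self, fname=None, zipped=None, **table_items):
        self.written = table_items
        return fname


stub = StubDataset()
write_values(stub, ROWS, BIBTEX)
print(sorted(stub.written))  # ['ValueTable']
```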
Reading data
Reading rows from CLDF data files, honoring the datatypes specified in the schema,
is already implemented by csvw. Thus, the simplest way to read data is iterating
over the csvw.Table objects. However, this will ignore the semantic layer provided
by CLDF. E.g. a CLDF languageReference linking a value to a language will appear
in the dict returned for a row under the local column name. Thus, we provide several
more convenient methods to read data.
- Dataset.iter_rows(table, *cols)[source]
Iterate rows in a table, resolving CLDF property names to local column names.
- Parameters:
table (
typing.Union[str,csvw.metadata.Table]) – Table name.
cols (
str) – List of CLDF property terms which must be resolved in resulting dicts. I.e. the row dicts will be augmented with copies of the values keyed with CLDF property terms.
- Return type:
collections.abc.Generator[collections.OrderedDict[str,typing.Any],None,None]
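Resolving property terms makes calling code independent of local column names. The sketch below groups values by language using only the terms languageReference and value. values_by_language is a hypothetical helper, and FakeDataset imitates the documented augmentation of row dicts for two made-up rows so that the example runs without pycldf.

```python
from collections import defaultdict


def values_by_language(dataset):
    """Group ValueTable values by language, keyed by CLDF property terms."""
    grouped = defaultdict(list)
    for row in dataset.iter_rows("ValueTable", "languageReference", "value"):
        grouped[row["languageReference"]].append(row["value"])
    return dict(grouped)


class FakeDataset:
    """Stand-in yielding rows augmented the way iter_rows is documented to."""

    def iter_rows(self, table, *cols):
        local = {"languageReference": "Language_ID", "value": "Value"}
        rows = [
            {"Language_ID": "l1", "Value": "a"},
            {"Language_ID": "l1", "Value": "b"},
        ]
        for row in rows:
            # Add copies of the values, keyed with the requested terms.
            row.update({term: row[local[term]] for term in cols})
            yield row


print(values_by_language(FakeDataset()))  # {'l1': ['a', 'b']}
```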
- Dataset.get_row(table, id_)[source]
Retrieve a row specified by table and CLDF id.
- Raises:
ValueError – If no matching row is found.
- Parameters:
table (
typing.Union[str,csvw.metadata.Table]) –
- Return type:
collections.OrderedDict[str,typing.Any]
- Dataset.get_row_url(table, row)[source]
Get a URL associated with a row. Tables can specify associated row URLs by
listing one column with datatype anyURI or
specifying a valueUrl property for their ID column.
For rows representing objects in web applications, this may be the object’s URL. For rows representing media files, it may be a URL locating the file on a media server.
- Parameters:
table (
typing.Union[str,csvw.metadata.Table]) – Table specified in a way that __getitem__ understands.
row (
typing.Union[collections.OrderedDict[str,typing.Any],str]) – A row specified by ID or as dict as returned when iterating over a table.
- Return type:
typing.Optional[str]
- Returns:
a str representing a URL or None.
- Dataset.objects(table, cls=None)[source]
Read data of a CLDF component as pycldf.orm.Object instances.
- Parameters:
table (
str) – table to read, specified as component name.
cls (
typing.Optional[typing.Type]) – pycldf.orm.Object subclass to instantiate objects with.
- Return type:
pycldf.util.DictTuple
- Returns:
- Dataset.get_object(table, id_, cls=None, pk=False)[source]
Get a row of a component as pycldf.orm.Object instance.
- Parameters:
table (
str) –
id_ (
str) –
- Return type:
pycldf.orm.Object
Writing (meta)data
- Dataset.write(fname=None, zipped=None, **table_items)[source]
Write metadata, sources and data. Metadata will be written to fname (as interpreted in pycldf.dataset.Dataset.write_metadata()); data files will be written to the file specified by csvw.Table.url of the corresponding table, interpreted as path relative to directory().
- Parameters:
zipped (
typing.Optional[collections.abc.Iterable]) – Iterable listing keys of table_items for which the table file should be zipped.
table_items (
list[collections.OrderedDict[str,typing.Any]]) – Mapping of table specifications to lists of row dicts.
fname (
typing.Optional[pathlib.Path]) –
- Return type:
pathlib.Path
- Returns:
Path of the CLDF metadata file as written to disk.
- Dataset.write_metadata(fname=None)[source]
Write the CLDF metadata to a JSON file.
- Parameters:
fname (
typing.Union[str,pathlib.Path,None]) – Path of a file to write to, or None to use the default name and write to directory().
- Return type:
pathlib.Path
- Dataset.write_sources(zipped=False)[source]
Write the sources BibTeX file to bibpath()
- Return type:
typing.Optional[pathlib.Path]
- Returns:
None, if no BibTeX file was written (because no source items were added), pathlib.Path of the written BibTeX file otherwise. Note that this path does not need to exist, because the content may have been added to a zip archive.
- Parameters:
zipped (
bool) –
Reporting
- Dataset.validate(log=None, validators=None, ontology_path=None)[source]
Validate schema and data of a Dataset:
Make sure the schema follows the CLDF specification and
make sure the data is consistent with the schema.
- Parameters:
log (
logging.Logger) – a logging.Logger to write ERRORs and WARNINGs to. If None, an exception will be raised at the first problem.
validators (
list[tuple[typing.Optional[str],str,typing.Callable[[pycldf.dataset.Dataset,csvw.metadata.Table,csvw.metadata.Column,collections.OrderedDict[str,typing.Any]],None]]]) – Custom validation rules, i.e. triples (tablespec, columnspec, attrs validator)
ontology_path (
typing.Union[str,pathlib.Path,None]) –
- Raises:
ValueError – if a validation error is encountered (and log is None).
- Return type:
bool
- Returns:
Flag signaling whether schema and data are valid.
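The two validation modes, raising at the first problem or logging all problems, can be wrapped as sketched below. check_dataset and the logger name cldf-validation are hypothetical, and BrokenDataset only mimics the documented behaviour of validate() for an invalid dataset so that the example runs without pycldf.

```python
import logging


def check_dataset(dataset, strict=False):
    """Return True if dataset validates; strict mode stops at the first problem."""
    if strict:
        try:
            # Without a logger, validate() raises at the first problem.
            return dataset.validate()
        except ValueError as err:
            print(f"invalid dataset: {err}")
            return False
    # With a logger, problems are logged and the outcome is returned as a flag.
    return dataset.validate(log=logging.getLogger("cldf-validation"))


class BrokenDataset:
    """Stand-in mimicking the documented behaviour for an invalid dataset."""

    def validate(self, log=None, validators=None, ontology_path=None):
        if log is None:
            raise ValueError("missing required column")
        log.warning("missing required column")
        return False


print(check_dataset(BrokenDataset()))  # False
print(check_dataset(BrokenDataset(), strict=True))  # False
```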
Dataset discovery
We provide two functions to make it easier to discover CLDF datasets in the file system. This is useful, e.g., when downloading archived datasets from Zenodo, where it may not be known in advance where in a zip archive the metadata file may reside.
- pycldf.sniff(p)[source]
Determine whether a file contains CLDF metadata.
- Parameters:
p (
pathlib.Path) – pathlib.Path object for an existing file.
- Return type:
bool
- Returns:
True if the file contains CLDF metadata, False otherwise.
- pycldf.iter_datasets(d)[source]
Discover CLDF datasets - by identifying metadata files - in a directory.
- Parameters:
d (
typing.Union[str,pathlib.Path]) – directory in which to look for CLDF datasets (recursively).
- Return type:
collections.abc.Generator[pycldf.dataset.Dataset,None,None]
- Returns:
generator of Dataset instances.
Sources
When constructing sources for a CLDF dataset in Python code, you may pass
pycldf.Source instances into Dataset.add_sources(),
or use pycldf.Reference.__str__() to format a row’s source value
properly.
Direct access to pycldf.dataset.Sources is rarely necessary (hence
it is not available as import from pycldf directly), because each
pycldf.Dataset provides access to an appropriately initialized instance
in its sources attribute.
- class pycldf.Source(genre, id_, *args, _check_id=True, _lowercase=False, _strip_tex=None, **kw)[source]
A bibliographical record, specifying a source for some data in a CLDF dataset.
- Parameters:
genre (
str) –
id_ (
str) –
_check_id (
bool) –
_lowercase (
bool) –
_strip_tex (
typing.Optional[collections.abc.Iterable[str]]) –
- property entry: Entry
Converts Source to a pybtex Entry.
- classmethod from_entry(key, entry, **_kw)[source]
Create a cls instance from a simplepybtex entry object.
- Parameters:
key (
str) – BibTeX citation key of the entry
entry (
simplepybtex.database.Entry) – simplepybtex.database.Entry instance
_kw – Non-bib-metadata keywords to be passed for cls instantiation
- Returns:
cls instance
- class pycldf.Reference(source, desc)[source]
A reference connects a piece of data with a Source, typically adding some citation context, often page numbers or similar.
- Parameters:
source (
pycldf.sources.Source) –
desc (
typing.Optional[str]) –
- class pycldf.dataset.Sources[source]
A dict-like container for all sources linked to data in a CLDF dataset.
- add(*entries, **kw)[source]
Add a source, either specified as BibTeX string or as Source.
- Parameters:
entries (
typing.Union[str,pycldf.sources.Source]) –
- Return type:
None
- expand_refs(refs, **kw)[source]
Turn a list of string references into proper Reference instances, looking up sources in self.
This can be used from a pycldf.Dataset as follows:
    >>> for row in dataset.iter_rows('ValueTable', 'source'):
    ...     for ref in dataset.sources.expand_refs(row['source']):
    ...         print(ref.source)
- Parameters:
refs (
collections.abc.Iterable[str]) –
- Return type:
collections.abc.Iterable[pycldf.sources.Reference]
- classmethod from_file(fname)[source]
Instantiate an instance from the data in a BibTeX file.
- Parameters:
fname (
typing.Union[str,pathlib.Path]) –
- Return type:
Sources
- static parse(ref)[source]
Parse the string representation of a reference into source ID and context.
- Raises:
ValueError – if the reference does not match the expected format.
- Parameters:
ref (
str) –
- Return type:
tuple[str,str]
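To make the reference format concrete, here is a standalone sketch of such a parser. It is not the implementation used by pycldf: the regular expression assumes references of the form source ID, optionally followed by a context in square brackets, and parse_reference as well as the ID example2020 are hypothetical.

```python
import re

# Assumed pattern: a source ID, optionally followed by "[context]".
REF_PATTERN = re.compile(r"^(?P<sid>[^\[\]]+)(\[(?P<context>.*)\])?$")


def parse_reference(ref):
    """Split a reference string into source ID and context (or None)."""
    match = REF_PATTERN.match(ref.strip())
    if not match:
        raise ValueError(f"not a valid reference: {ref!r}")
    return match.group("sid"), match.group("context")


print(parse_reference("example2020[12-15]"))  # ('example2020', '12-15')
print(parse_reference("example2020"))  # ('example2020', None)
```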
- read(fname, zipped=False, **kw)[source]
Read sources from a BibTeX file (possibly specified via URL).
- Parameters:
fname (
typing.Union[str,pathlib.Path]) –
Subclasses supporting specific CLDF modules
Note
Most functionality provided through properties and methods described below is implemented via
the pycldf.orm module, and thus subject to the limitations listed at ./orm.html
- class pycldf.Generic(tablegroup)[source]
Generic datasets have no primary table.
- Parameters:
tablegroup (
csvw.metadata.TableGroup) –
- property primary_table: None
Returns the primary table for the dataset.
- class pycldf.Wordlist(tablegroup)[source]
Wordlists have support for segment slice notation.
- Parameters:
tablegroup (
csvw.metadata.TableGroup) –
- get_segments(row, table='FormTable')[source]
Retrieve the list of segments of a form.
- Parameters:
row (
collections.OrderedDict[str,typing.Any]) –
- Return type:
list[str]
- get_subsequence(cognate, form=None)[source]
Compute the subsequence of the morphemes of a form which is specified in a partial cognate assignment.
- Parameters:
cognate (
collections.OrderedDict[str,typing.Any]) – A dict holding the data of a row from a CognateTable.
form (
typing.Optional[str]) –
- Return type:
list[str]
- property primary_table: str
Returns the primary table for the dataset.
- class pycldf.StructureDataset(tablegroup)[source]
Parameters in StructureDataset are often called “features”.
- Parameters:
tablegroup (
csvw.metadata.TableGroup) –
- property features
Just an alias for the parameters.
- property primary_table: str
Returns the primary table for the dataset.
- class pycldf.TextCorpus(tablegroup)[source]
In a TextCorpus, contributions and examples have specialized roles:
Contributions are understood as individual texts of the corpus.
Examples are interpreted as the sentences of the corpus.
Alternative translations are provided by linking “light-weight” examples to “full”, main examples.
The order of sentences may be defined using a position property.
    >>> crp = TextCorpus.from_metadata('tests/data/textcorpus/metadata.json')
    >>> crp.texts[0].sentences[0].cldf.primaryText
    'first line'
    >>> crp.texts[0].sentences[0].alternative_translations
    [<pycldf.orm.Example id="e2-alt">]
- Parameters:
tablegroup (
csvw.metadata.TableGroup) –
- get_text(tid)[source]
Retrieve a text by ID.
- Parameters:
tid (
str) –
- Return type:
typing.Optional[pycldf.orm.Object]
- property primary_table: str
Returns the primary table for the dataset.
- property sentences: list[pycldf.orm.Example]
Sentences of the corpus.
- property texts: DictTuple | None
Retrieve texts.