This is an attempt to create and maintain a vocabulary for Scutters, reflecting the need for certain metadata associated with scuttering.

Previous and related work

* eikeon's InformationStore ontology
* mattb's scutter work and ideas.
* Kjetil's Perl RDF::Scutter supports most of this spec.
* LeighDodds has a Scutter called Slug which also supports this spec. See also the associated RDFS schema.
* a vocabulary for HTTP headers

Model

To avoid making statements about the retrieved document sources, the vocabulary features at least one level of indirection, similar to the event approach in eikeon's InformationStore ontology.

All resources created to describe the actions and contents of a scutter and its store should be anonymous resources, i.e. resources without a URI, to help prevent others from making statements that would render the scutter inoperable. Additionally, the descriptions should be kept somewhat separate, possibly by storing them with a specific grouping identifier, such as Redland contexts.

Context

A Context is defined as the "shadow resource" of a source document, i.e. an anonymous resource. All statements will be made about the context instead of about the document itself.

* A context may have a skip relationship with another anonymous resource, a Reason, to signal that the source document should not be retrieved, perhaps because it's a duplicate (e.g. the same URI with and without "www."), because it doesn't contain any information, or because it contains irrelevant or confusing information, e.g. chat logs. The reason resource should be described in enough detail for involved parties to complain and/or rectify any problems, including the identity of the agent making the assertion.
* A context has a source property, the URI of the source document itself. This property is a subproperty of dcterms:references.
* A context has an origin property, usually the URI of another document with an rdfs:seeAlso to the source document, i.e. where the source document was initially discovered. The origin may also be an event triggered by a user, e.g. a manual submission into a store from an HTML form, or a notification event from another scutter system. In this case the origin is an anonymous resource with a foaf:maker and relevant iCal properties. Ideally the origin would be a functional property, but that may lead to unintentional merging of resources when different scutters' stores are merged. This property is a subproperty of dcterms:isReferencedBy.
* A context will eventually have a number of fetch relationships with other anonymous resources, each representing an event - the actual GETs of the source document.
    Can I suggest the addition of a latestFetch property, which points to the most recent Fetch? It's easier to query for this (e.g. to get ETags) than to sort the Fetches by date. -- LeighDodds
    This may be implementation specific, but where the Scutter is maintaining a local cache of the original files, we need a relationship to indicate where that copy can be located. Useful if scuttering and parsing are separate processes. -- LeighDodds
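
The Context model above can be sketched as plain data structures. This is only an illustration in Python, not part of the vocabulary: the class and field names (Context, Fetch, Reason, latest_fetch, local_copy, should_fetch) are assumptions that mirror the properties and suggestions listed above, and a real scutter would store these as anonymous RDF resources rather than objects.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Reason:
    """Anonymous resource explaining why a context is skipped."""
    description: str
    maker: Optional[str] = None  # agent making the assertion


@dataclass
class Fetch:
    """One GET of the source document."""
    date: datetime
    status: Optional[int] = None
    etag: Optional[str] = None


@dataclass
class Context:
    """Shadow resource of a source document; it has no URI of its own."""
    source: str                       # URI of the source document
    origin: Optional[str] = None      # where the source was discovered
    skip: Optional[Reason] = None     # set when the source must not be fetched
    fetches: List[Fetch] = field(default_factory=list)
    local_copy: Optional[str] = None  # hypothetical: location of a cached copy

    def latest_fetch(self) -> Optional[Fetch]:
        """Return the most recent Fetch, as the suggested latestFetch would."""
        return max(self.fetches, key=lambda f: f.date, default=None)

    def should_fetch(self) -> bool:
        """A context with a skip reason is not retrieved."""
        return self.skip is None
```

Keeping latest_fetch as a derived value avoids a second property that could drift out of sync with the list of fetches; a store that cannot sort cheaply might prefer to materialise it, as suggested in the comment above.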

Fetch

A Fetch represents a GET (or equivalent for non-http URIs) event of the source document of a context.

* A fetch has a date/time property, dc:date, giving the date/time of the fetch operation. A fetch is considered short-lived enough to be modelled without temporal extent, but iCal start/end dates/times may be added if convenient.
* A fetch has an interval property, a suggestion for the minimum number of seconds that should pass between fetches of the source document. Together with the dc:date property this lets scutters keep context-specific information about document update schedules, and possibly alter retrieval schedules based on this information. The default interval should be at least one week, since an average document doesn't change that often, but it may be adjusted for special document types, e.g. RSS channels. Experience seems to show that a successful scheduling algorithm adds a fixed amount (usually the minimum interval) to the interval when the document is not updated, and halves the interval when it is updated.
* A fetch (at least for http URIs) results in an HTTP response code, the status property (code only, no message). Various actions must or should be taken based on this; see e.g. Mark Pilgrim's article on Atom aggregator behaviour and his Aggregator HTTP Test Suite. This property could possibly find a better place to live in a general HTTP vocabulary.
* A fetch hopefully also results in an etag property, the contents of the HTTP header ETag. This can and should be used for subsequent retrievals; see BDG to Etags. This property could possibly find a better place to live in a general HTTP vocabulary.
* A fetch should also result in a last_modified property, from the HTTP header Last-Modified. As with ETag, the value of this header can be used with a conditional GET; see HTTP Conditional GET for RSS Hackers. This property could possibly find a better place to live in a general HTTP vocabulary.
* A fetch may have a foaf:sha1 property, containing the SHA1 hash value of the contents. This may be used in cases where etag and last_modified are not present, to prevent processing of documents that haven't changed. Note, however, that foaf:sha1 is currently labeled unstable in the specification.
* A fetch may have a content_type literal property, corresponding to the returned HTTP header Content-Type. This property is included mostly for statistical reasons, as it sadly doesn't represent too much value in the wild.
* A successful fetch and parse results in a number of "raw" triples (not including any inferred statements), reflected in the property raw_triple_count. This property is mostly included for statistical reasons, and should always reflect the number of triples last obtained from the given context, even if the fetch or parse wasn't successful. In these cases, the number of triples from the previous fetch should be carried forward.
* An unsuccessful fetch or subsequent parse must result in an error relationship with an anonymous Reason resource. The reason resource should contain enough information to diagnose and fix the problem, e.g. cite the returned HTTP message or parsing error.
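
The scheduling rule and the conditional GET behaviour described in the list above can be sketched as follows. This is a rough illustration in Python, not a reference implementation: the function names, the one-hour MIN_INTERVAL and the clamping of the halved interval to that minimum are assumptions made for the example. Only the one-week default, the "add a fixed amount, else halve" rule, and the use of etag, last_modified and a SHA1 hash come from the text.

```python
import hashlib
from typing import Dict, Optional

DEFAULT_INTERVAL = 7 * 24 * 60 * 60  # one week in seconds, the suggested default
MIN_INTERVAL = 60 * 60               # hypothetical minimum interval (one hour)


def next_interval(interval: int, updated: bool, minimum: int = MIN_INTERVAL) -> int:
    """Halve the interval after an update, otherwise add a fixed amount."""
    if updated:
        # Clamping to the minimum is an assumption, so the interval never reaches 0.
        return max(minimum, interval // 2)
    return interval + minimum


def conditional_headers(etag: Optional[str],
                        last_modified: Optional[str]) -> Dict[str, str]:
    """Build request headers for a conditional GET from the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def is_updated(status: int, body: bytes, previous_sha1: Optional[str]) -> bool:
    """Treat 304 as unchanged; otherwise fall back to comparing SHA1 hashes."""
    if status == 304:
        return False
    return hashlib.sha1(body).hexdigest() != previous_sha1
```

A scutter would typically call conditional_headers with the values stored on the latest fetch, pass the response to is_updated, and feed that result into next_interval to obtain the interval recorded on the new fetch.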

Reason

A Reason describes the cause of a fetch being in error or a context having a skip property.

* A reason should at least have a dc:description property, holding a textual error message or the rationale behind it, in case it's a skip reason.
* A reason should have a dc:date property, holding the date of the assertion.
* A reason may (for skip reasons) have a foaf:maker property, describing the agent asserting the relationship.
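
To make the shape of a Reason concrete, the relationship between a fetch (or context) and its reason can be written out as triples with blank nodes. The sketch below is illustrative only. Since the OWL terms are still under construction, the namespace form http://purl.org/net/scutter/ (with a trailing slash) is an assumption, as are the helper names bnode and reason_triples. The dc and foaf namespaces are the published ones.

```python
from datetime import datetime, timezone
from itertools import count
from typing import List, Optional, Tuple

SCUTTER = "http://purl.org/net/scutter/"  # assumed namespace form
DC = "http://purl.org/dc/elements/1.1/"
FOAF = "http://xmlns.com/foaf/0.1/"

Triple = Tuple[str, str, str]
_ids = count(1)


def bnode() -> str:
    """Return a fresh blank node label for an anonymous resource."""
    return f"_:b{next(_ids)}"


def reason_triples(subject: str, relation: str, description: str,
                   when: datetime, maker: Optional[str] = None) -> List[Triple]:
    """Attach an anonymous Reason to subject via 'error' or 'skip'."""
    reason = bnode()
    # Minimal literal escaping; a real serializer would handle more cases.
    text = description.replace("\\", "\\\\").replace('"', '\\"')
    triples = [
        (subject, f"<{SCUTTER}{relation}>", reason),
        (reason, f"<{DC}description>", f'"{text}"'),
        (reason, f"<{DC}date>", f'"{when.isoformat()}"'),
    ]
    if maker is not None:
        # Only skip reasons are expected to carry the asserting agent.
        triples.append((reason, f"<{FOAF}maker>", maker))
    return triples
```

For example, reason_triples(bnode(), "error", "404 Not Found", datetime.now(timezone.utc)) yields three triples hanging off a new anonymous fetch resource, while passing relation "skip" together with a maker adds a fourth triple identifying the agent.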

OWL Terms and Definitions

under construction...

Example Resource Descriptions

The following examples should not be considered normative; they exist only for illustrative purposes.

A successfully fetched document

under construction...

A disabled document

under construction...

A document in error

under construction...

Scratchpad

* How should permanent redirects etc. (including Content-Location) be handled? Perhaps by disabling the old context (by use of the skip property) and creating a new one, possibly pointed to from the old context or the reason resource?
* The type of the source document, in case it makes sense, like for an RSS channel, should be saved with the context or fetch. How should this be done without asserting an instance of that type, and without making perhaps artificial statements about the source document?
* How does support for robots.txt fit into this? Perhaps a property off a context to another context describing the corresponding robots.txt? The robot context would then be shared by all URIs from that host, and could even have a property with the contents of the file?
* Terms from the web of trust vocabulary should be integrated/used with this vocabulary, to provide for authentication etc.
* Some terms for publish/subscribe may be necessary, but they likely should live in their own vocabulary, to facilitate general use. Inspiration for such a vocabulary: RNA: RESTful Notification Architecture.
* We need to handle the Expires and Cache-Control max-age headers as well as Last-Modified. But yet again, this is something for a general HTTP header vocabulary.


Namespace URI

The PURL http://purl.org/net/scutter points to this page.


CategorySpecification