Entity Tagging Scenario

goc9000 edited this page Mar 4, 2013 · 4 revisions

Description

This is a basic usage scenario for WikiReg in an "entity tagging" application. Such applications may be used to crowd-source properties of items that are difficult to machine analyze (e.g. objects or themes in an image), usually gamifying the process so that users have an incentive to participate.

As a particular example, we will consider a game inspired by the Galaxy Zoo website, in which the kids are asked to classify galaxies by specifying their shape (elliptical, spiral, irregular, etc.) as well as any additional interesting features (rings, tails, etc.). This is a difficult task for machines, particularly since the images are often low-resolution and noisy, problems that human vision has an amazing ability to overcome. The kids might also learn more about astronomy in the process.

Scenario

The game begins with a supervisor (e.g. the teacher) selecting a small number (10-20) of galaxies for the kids to tag. Her copy of the application will have to issue a query like:

search(po=('rdf:type', 'urn:ers:places:cosmic:Galaxy:1234'), limit=20)

Alternatively, the teacher could search for some famous galaxy by name. This requires a query like:

search(po=(('rdf:type', 'urn:ers:places:cosmic:Galaxy:1234'), ('rdf:label', 'Andromeda')))

Note 1: We need to filter by type because many other things are named Andromeda (a book, a TV show, etc.). In this case, you could use the previous single-pair search variant to search by name only and then filter by type on the client, or do two queries and intersect the results. In the long term, however, I don't think we will be able to get away with not supporting an AND mechanism in the search.
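Until such an AND mechanism exists, the client-side intersection workaround can be sketched as follows. This is purely an illustration: the `search()` signature is taken from this page, but the function body here is a toy in-memory mock, not the real store.

```python
# Toy mock of the WikiReg search() call described on this page.
# Real implementations would hit the data store; this one scans a dict.
DATA = {
    'urn:places:cosmic:Andromeda:1': [
        ('rdf:type', 'urn:ers:places:cosmic:Galaxy:1234'),
        ('rdf:label', 'Andromeda'),
    ],
    'urn:fiction:books:Andromeda:1': [
        ('rdf:type', 'urn:ers:media:Book:1'),
        ('rdf:label', 'Andromeda'),
    ],
}

def search(po, limit=None):
    # Single property-value pair search, as in the page's first example.
    hits = [urn for urn, pairs in DATA.items() if po in pairs]
    return hits[:limit] if limit else hits

def search_and(*pairs):
    # Client-side AND: intersect the URN sets of per-pair searches.
    result = None
    for pair in pairs:
        hits = set(search(po=pair))
        result = hits if result is None else result & hits
    return sorted(result or [])

print(search_and(('rdf:type', 'urn:ers:places:cosmic:Galaxy:1234'),
                 ('rdf:label', 'Andromeda')))
# → only the galaxy, not the book
```

The obvious drawback, and the reason an AND in the store itself is preferable long-term, is that each single-pair search may return a large intermediate result set that the client then throws away.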

Note 2: The URN taxonomy used here is just tentative, of course.

The application now sends to each of the kids' computers the list of galaxies to be tagged, i.e. the list of URNs as found by the search. Each of the kids' computers will get the entities with a series of queries like:

get_annotation('urn:places:cosmic:Andromeda:1')
get_annotation('urn:places:cosmic:NGC4414:1')
get_annotation('urn:places:cosmic:IC883:1')
...

Alternatively, we could support a form that gets multiple entities in a single call:

get_annotation('urn:places:cosmic:Andromeda:1', 'urn:places:cosmic:NGC4414:1', ...)
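A minimal sketch of that multi-entity form, implemented as a thin wrapper over the single-entity call. Both `get_annotation()` and its property → values return shape are mocked assumptions here, so treat this only as an API illustration:

```python
# Toy annotation store: entity URN -> {property: [values]}.
# The return shape is an assumption, not a confirmed WikiReg format.
ANNOTATIONS = {
    'urn:places:cosmic:Andromeda:1': {'rdf:label': ['Andromeda']},
    'urn:places:cosmic:NGC4414:1': {'rdf:label': ['NGC 4414']},
}

def get_annotation(entity):
    # Stand-in for the real single-entity call.
    return ANNOTATIONS.get(entity, {})

def get_annotations(*entities):
    # Batched variant: trivially falls back to one call per entity,
    # but a real store could answer it in a single round trip.
    return {entity: get_annotation(entity) for entity in entities}

result = get_annotations('urn:places:cosmic:Andromeda:1',
                         'urn:places:cosmic:NGC4414:1')
```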

For the kids to see the images, there will be a property like <urn:ers:properties:visual:image:lowres:0> pointing to a downloadable image (hopefully cached on a local server somewhere). Alternatively, since the vast majority of the images are very low-res anyway, you could even get away with encoding the image in BASE64 directly in a string property (would take 2KB or so).

Implementation note: Tricks such as these draw attention to the fact that some property values might be unusually large (in the KBs or tens of KBs; a real-life example is abstracts on DBPedia). We might have to keep this in mind when designing the document format.
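The "2KB or so" estimate is easy to sanity-check: base64 encodes every 3 input bytes as 4 output characters, so a roughly 1.5 KB low-res image inflates to about 2 KB of text when embedded in a string property.

```python
import base64

# Stand-in for a 1536-byte low-res image payload.
raw = bytes(range(256)) * 6

# Embed as text (as a string property value would require) and round-trip.
encoded = base64.b64encode(raw).decode('ascii')
decoded = base64.b64decode(encoded)

print(len(raw), len(encoded))  # → 1536 2048 (the 4/3 inflation)
```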

The kids now start tagging, as fast as they can. Some of the properties they might specify are:

  • galaxyprop:hasShape (links to an entity of type GalaxyShape: elliptical, spiral etc.)
  • galaxyprop:elliptical
  • galaxyprop:spiralTightness
  • galaxyprop:hasRing (true or false)
  • galaxyprop:hasBar
  • galaxyprop:hasTail
  • etc.

(with the prefix galaxyprop: standing in for something like urn:ers:properties:places:cosmic:Galaxy:)

The client will do the tagging using queries like:

update_data('urn:places:cosmic:Andromeda:1', 'galaxyprop:hasTail', False)

Implementation note: The system will create the graph and provenance info behind the scenes.
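A sketch of what the tagging loop might look like on one client. Here `update_data()` is mocked, and both the per-source storage layout and the `SOURCE` identifier are assumptions standing in for the provenance handling the system does behind the scenes:

```python
# Toy store keyed by entity, then by (source, property), so each
# contributor's values stay separate -- a stand-in for real provenance.
STORE = {}
SOURCE = 'xo-laptop-12345'  # hypothetical identity of this client

def update_data(entity, prop, value):
    # Mocked update_data(): records the value under this client's source.
    STORE.setdefault(entity, {})[(SOURCE, prop)] = value

# One kid's answers for a single galaxy.
answers = {'galaxyprop:hasTail': False, 'galaxyprop:hasRing': True}
for prop, value in answers.items():
    update_data('urn:places:cosmic:Andromeda:1', prop, value)

print(STORE['urn:places:cosmic:Andromeda:1'][(SOURCE, 'galaxyprop:hasRing')])
# → True
```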

At the end of the time limit, the kids are rewarded for their performance. Congratulations go to kids such as:

  1. The first to correctly tag property X in galaxy G
  2. The one(s) who tagged the most galaxies for any given property X
  3. The one(s) who tagged the most properties within any given galaxy G

Question 1 requires a query like:

get_values('urn:places:cosmic:Andromeda:1', 'galaxyprop:hasTail')

This gets all values contributed by all of the kids (with some info so that we can tell who contributed what). We can tell which value is 'correct' by using a majority rule, or by using the value from some special trusted contributor (the teacher, an institute, Wikipedia etc.).
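The majority/trusted-contributor rule might look like this. The `(contributor, value)` pair shape of the `get_values()` result and the contributor ids are assumptions made for the sketch:

```python
from collections import Counter

def pick_correct(values, trusted=None):
    # values: list of (contributor, value) pairs, as get_values() might
    # return them (assumed shape). A trusted contributor, if present,
    # overrides the majority vote.
    if trusted is not None:
        for who, val in values:
            if who == trusted:
                return val
    counts = Counter(val for _, val in values)
    return counts.most_common(1)[0][0]

votes = [('xo:1', True), ('xo:2', True), ('xo:3', False)]
print(pick_correct(votes))                    # → True (majority)
print(pick_correct(votes, trusted='xo:3'))    # → False (trusted overrides)
```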

Question 2 might be best answered with a query like:

search(p='galaxyprop:hasTail', source='urn:...:xo:12345')

iterated for all possible source values.
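A toy sketch of that iteration, with `search()` mocked over an in-memory list of (entity, property, value, source) quads; all names here are made up:

```python
# Toy quad store: (entity, property, value, source).
QUADS = [
    ('urn:places:cosmic:Andromeda:1', 'galaxyprop:hasTail', False, 'xo:1'),
    ('urn:places:cosmic:NGC4414:1',   'galaxyprop:hasTail', True,  'xo:1'),
    ('urn:places:cosmic:Andromeda:1', 'galaxyprop:hasTail', True,  'xo:2'),
]

def search(p, source):
    # Mocked search(p=..., source=...): entities where this source
    # contributed a value for the property.
    return sorted({e for e, prop, _, src in QUADS
                   if prop == p and src == source})

def most_galaxies(sources, prop):
    # Iterate over all candidate sources and keep the top scorer(s).
    counts = {src: len(search(p=prop, source=src)) for src in sources}
    best = max(counts.values())
    return [src for src, n in counts.items() if n == best]

print(most_galaxies(['xo:1', 'xo:2'], 'galaxyprop:hasTail'))  # → ['xo:1']
```

Note this needs the list of candidate sources up front; a store-side "group by source" would avoid one query per contributor.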

Question 3 might best be answered with a query like:

get_annotation('urn:places:cosmic:Andromeda:1', filter_by_source='urn:...:xo:12345')

Note that since the judging is deterministic, these queries can be done in a decentralized way, i.e. every kid's laptop performs them and announces any result relevant to its user.
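For question 3, assuming the filtered `get_annotation()` call returns a property → values mapping (an assumed shape), the score is simply the number of distinct properties that contributor set on the galaxy:

```python
def properties_tagged(annotation):
    # annotation: hypothetical property -> values result of
    # get_annotation(entity, filter_by_source=src).
    return len(annotation)

# What one contributor might have set on a single galaxy.
andromeda_by_xo1 = {
    'galaxyprop:hasShape': ['spiral'],
    'galaxyprop:hasRing': [False],
    'galaxyprop:hasTail': [True],
}
print(properties_tagged(andromeda_by_xo1))  # → 3
```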

Summary of used API calls

get_annotation(entity[, named_params])

Gets all the data associated with the specified entity. Unless otherwise specified, data from all provenances is returned.

The named_params are optional named parameters generally used for further filtering the results of a query. A list is given in a subsequent section.

get_annotation(entity1, entity2, ..., entityN[, named_params])

Trivial extension of the above that gets multiple entities in a single go.

get_values(entity, property[, named_params])

Gets all the values set for the given property in the given entity. Unless otherwise specified, data from all provenances is returned.

update_data(entity, property, value)

Sets the value for the given property in the given entity, replacing all the previous ones.

search(criteria[, named_params])

Gets the URNs of the entities that simultaneously satisfy all of the given criteria, specified through named parameters. Supported criteria should include:

  • p=property: Entity contains some values for the given property (from any provenance)
  • po=(property,value): Entity contains the given property-value combination (from any provenance)
  • po=((p1,v1),(p2,v2)...): Entity contains all of the given p-v combinations
  • source=src: In combination with the above, the values should originate from src (i.e. there is a graph whose source is src, that sets the given p-v for that entity)
  • entity=(entity1,entity2,...entityN): Entity is one of entity1..entityN (i.e. search only among these entities)

Supported named parameters

  • limit=max_count (for search): Returns at most max_count results
  • filter_by_source=src (for get_annotation and get_values): Returns only the P-V pairs contributed by the given source, i.e. associated with a graph whose source is that contributor.

Place for comments

Teodor comment:

Cristian, that is great, but I would like to express the same thing a bit more simply. Assume we have a quadruple (4-tuple) (s, p, o, pr), where s is the subject, p the predicate/property, o the property value (or another subject), and pr the provenance.

Following your ideas, we will have the following query types: (s ? ? ?), (s p ? ?), (s ? ? pr), (? p o ?), (? p ? pr).
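All of these patterns can be expressed as one generic match over (s, p, o, pr) quads, with None playing the role of '?'. This is a toy in-memory sketch, not tied to any particular store:

```python
# Toy quad store for illustrating the query patterns above.
QUADS = [
    ('Andromeda', 'hasTail', False, 'xo:1'),
    ('Andromeda', 'hasRing', True,  'xo:1'),
    ('NGC4414',   'hasTail', True,  'xo:2'),
]

def match(s=None, p=None, o=None, pr=None):
    # None means '?'; a quad matches when every bound position agrees.
    pattern = (s, p, o, pr)
    return [q for q in QUADS
            if all(want is None or want == have
                   for want, have in zip(pattern, q))]

print(match(s='Andromeda'))           # (s ? ? ?): both Andromeda quads
print(match(p='hasTail', pr='xo:1'))  # (? p ? pr): one quad
```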

One question: should we relax the (? p o ?) pattern, which is the equivalent of GetEntities(p, o, [limit]), so that we can also allow queries without knowing p? That is, (? ? o ?).

In my CouchDB experience, for each such distinct query we would have to build a view. After each update, all the views must also be updated. These are materialized views, so they would have an impact on our limited storage. Moreover, they are not particularly fast, as the system is fully disk-oriented.

On the other hand, Cassandra is much faster, and CumulusRDF already allows us to run many different queries, like the ones mentioned above. For a better overview, please have a quick look at the CumulusRDF paper.

Bottom line: given this increased complexity (compared to a simple (s ? ?) query), shouldn't we reconsider the capabilities of the data store?

Cristian comment:

The point of this scenario is to flesh out the API that the client/user sees. He or she will not want to have anything to do with the graph or even work directly with triples. Whether these queries translate to quad lookup patterns behind the scenes, and what data store may best be used, are both implementation issues that should not, in principle, affect the API.