Extended attributes and tag file systems

The problem is as classic as they come. You’ve stored your books sorted by author name, and now you want all those written from 1950 to 1960. Or all those with a mostly nautical subject. Librarians used to solve this with differently sorted boxes of cards.

When storing files on your computer, of course, the problem is the same. For example, your photos are probably ordered by date, because that’s the natural thing to do. Each photo set is stored in the order it came in. But now you want all the photos where your aunt Louise appears.

'Tag file systems' try to solve this problem by presenting your files in directories which answer questions. You tag each aunt Louise photo with an 'auntlouise' property, and the file system will present you with a directory of all the relevant photos ('/photos/auntlouise'). This directory could have one subdirectory for each year where photos of aunt Louise were actually taken. There are many ways to design the interface, and many ways to implement the function.

Search tools solve the same problem differently, by indexing the tags or the file data and letting users ask direct questions. But users often prefer to browse a choice tree rather than search, because this avoids having to find the right keywords.

The two approaches are complementary: both can use tagged files.

The confidence issue

A few things are quite certain:

Your recordings or photos are precious because they can’t be done again if they are lost.
Your tag data is precious because it is a lot of work to assign descriptors to each file, and you have thousands of them.

Some of your tag-able data is likely to outlive most software systems. Open-source developers get bored, companies die, marketing departments kill products.

Implementations of common filesystems (DOS FAT, Linux Ext, BSD FFS, CDROM’s ISO9660, etc.) are probably the most enduring software objects. It is normal to see backwards compatibility extending over decades. In fact, it is unlikely that an OS supplier would ever kill compatibility with a widely used storage format and its specific functions. At the very least, they would provide a very good migration path. Also, filesystem developers are serious people, for the very simple reason that they get shot if they lose data (just joking - still, they do have nightmares about flipped bits). Filesystems. Just. Don’t. Fail. (And you have backups anyway.)

Higher-level application developers, on the other hand, are usually a bit less committed to continuity. They get distracted by pretty interfaces (Oh! Shiny!), new paradigms, whatever… Backward compatibility is not always an absolute priority for them. For example:

I just can’t count the number of times that the Digikam program, which I use to manage our photos, lost its central database. Completely. Maybe I was just unlucky some of the times; maybe, because I use the program relatively infrequently, I missed an upgrade window and version N+2 could not convert the old data. I just don’t know. What I do know is that the only reason I never lost anything that could not be automatically rebuilt is that I instruct Digikam to store all the descriptive data inside the image files themselves, in addition to the database storage. The latter is easily and automatically rebuilt by scanning the files.

Digikam has it easy because all photo image file formats have extensive provisions to store metadata internally. No such possibility exists in general: most file formats have only limited or nonexistent metadata storage capabilities. Text files have none, of course. Among the surprising cripples: video files.

Trusting the system: extended file attributes

Back to tag file systems: it is quite easy to derive what we really do not want:

A system that would take over the data storage part, either by using its own organisation over the regular filesystem, or by using some other form of storage, like a database or a just invented new kind of filesystem. Sorry, can’t trust you.
Slightly less terrifying, a system that would store the descriptive data in its own database only. That would be 'digikam' when not storing into the image files, and we already know what will happen one day or another.

A tagging file system done right should:

Work with user files physically stored in whatever way the user chose (maybe organised by date, maybe anything else, maybe a total mess).
Make sure that the tagging data is stored as safely and openly as the files themselves, meaning managed by the filesystem itself.

One way to achieve this is to store the tag data in extended file attributes.

Beyond immediate safety, there are several advantages to storing tag data in extended attributes:

Persistence: The tagging data will remain available if the original tagging software goes away. Extended attributes are just name/value text pairs, and tags are just lists of strings, so there can be no major difficulty in extracting the tags and inserting them into the new shiny system when it comes along.
Your tag data will be as lasting as your files, which is exactly what you want.
Availability: Because the data is stored transparently and is accessible through standard interfaces, it will be easily available to other access methods beyond the tagging software. For example, a full text search engine will be able to index it.
Immunity to name changes: The extended attributes data will follow the file, whatever renaming you choose to do (on the file or its parent directories). Renaming and other aliasing operations are a serious headache for approaches that would keep the attributes in a central index.
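To make the idea concrete, here is a minimal sketch of what storing a tag list in an extended attribute could look like. It assumes Linux (Python only provides 'os.setxattr' and 'os.getxattr' there), a hypothetical attribute name 'user.tags', and a comma-separated encoding that I made up for the illustration; none of this is taken from an existing tagging tool.

```python
import os

# Hypothetical attribute name: Linux requires the "user." prefix for
# user-defined attributes, but "user.tags" itself is only an assumption.
TAG_ATTR = "user.tags"


def encode_tags(tags):
    """Serialise a collection of tags as sorted, comma-separated UTF-8."""
    cleaned = {t.strip() for t in tags if t.strip()}
    return ",".join(sorted(cleaned)).encode("utf-8")


def decode_tags(raw):
    """Inverse of encode_tags: bytes read from the attribute to a tag set."""
    return {t for t in raw.decode("utf-8").split(",") if t}


def write_tags(path, tags):
    """Store the tags in the file's extended attribute (Linux only)."""
    os.setxattr(path, TAG_ATTR, encode_tags(tags))


def read_tags(path):
    """Return the tag set of a file, or an empty set if it has none."""
    try:
        return decode_tags(os.getxattr(path, TAG_ATTR))
    except OSError:
        # No such attribute, or a filesystem without xattr support.
        return set()
```

With this encoding a tag cannot itself contain a comma; a real system would need an escaping rule or a different separator. The point is only that the stored value stays readable by any other tool that can list extended attributes.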

Of course, it cannot be enough to store the descriptive data along with the files, because selection would be too slow (and the current 'find' command can’t even select files according to extended attribute values!).

This means that you need a central index, just like 'digikam' has one, or like an RDBMS creates secondary indexes to speed up accesses. This central index will be rebuildable by scanning the filesystem for extended attributes data, and it will be used to feed the shiny interface.
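As a rough illustration of why the index is disposable, the following sketch rebuilds a tag-to-files map by walking a directory tree. The function name 'rebuild_index' and the 'read_tags' callback are hypothetical; the callback stands for whatever reads the tags of one file (for instance a thin wrapper around 'os.getxattr'), and a real index would live in a database rather than in a dictionary.

```python
import os
from collections import defaultdict


def rebuild_index(root, read_tags):
    """Walk `root` and return a dict mapping each tag to a set of paths.

    `read_tags` is any callable taking a path and returning its tags.
    """
    index = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            for tag in read_tags(path):
                index[tag].add(path)
    return dict(index)
```

Because the files themselves remain the reference copy, losing the index costs one rescan and nothing else.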

One problem with using extended attributes is that read-only files cannot be tagged. Beagle solved this issue by using the database as a fallback in this case (without filesystem backup then).

As far as I know, none of the existing semantic/tagging file system projects currently use extended attributes for tag storage.

However, some of them would be compatible with such an approach. The following is a quick review of a number of projects.

A quick review of existing systems

Features

A number of distinguishing features can characterize the current projects.

Storage mode

There are three possible approaches:

Let the user organize their files inside a primary directory tree with a structure of their choice, and layer the tagging system over this.
Implement the primary file storage using an internal opaque directory structure over the standard filesystem. The storage can only be accessed through the tag interface (the names and paths are mangled).
Implement a full file system above a block storage device.

We’ll leave the third approach aside.

Of the first two, only the first is compatible with the demands made in the first part. It does have certain drawbacks compared to the second one, though, which we may as well list:

When layering a tag structure above a physical hierarchy, there is no obvious way to create a file inside the virtual structure. Where would the physical file go? So the virtual tree can only be used for finding and editing, and it is a bit troubling to be working in one tree and have to perform "Save As" into another (the physical one). It might be possible to overload the "symbolic link" operation for creating files on the command line, but this would not help your word processor.
Because you create files out of the tag structure, forgetting to tag them is easy. Too many untagged files will render the system useless, as these can’t be found in the virtual tree.
'Oyepa' solves this by monitoring the physical tree and popping up a dialog to ask for tags when a new file is created.

Facets

http://en.wikipedia.org/wiki/Faceted_classification

A 'faceted' system uses qualified tags or name/value pairs, e.g. 'author: Dickens'.

When presenting the virtual tree, faceted systems alternate "name" and "value" directories in paths, as in:

Author/
  Dickens/
  Poe/
    Type/
      Novel/
      Short Story/
        The pit and the pendulum.epub

Quite obviously, this approach allows useful distinctions (orange as colour or as fruit), but it also complicates the system. It is invaluable for an online catalog, but may prove overkill for a personal organisation system, where a flat namespace, making everything simpler, may still be sufficient (and has been proven good enough in social media tagging applications).

Faceted systems are also sometimes defined in opposition to a strict hierarchy, with a reference to having more "dimensions", which does not make sense (both systems contain the same richness of information).

For example, you could have a directory hierarchy of artist/album/title or a faceted system with the same qualifiers. What changes is the way you browse the system, not what information is stored or how many "dimensions" it has.

A faceted system is browsable, which supposes that each category has a reasonably limited number of possible values. Other than this, it’s quite equivalent to an RDBMS.

Tag hierarchies

Reasoning (hierarchy/equivalences). A tag set can be structured and enriched to understand that 'France' is in 'Europe' and that 'USA' and 'United States of America' are the same, so that files tagged 'France' can be found in the folder 'Europe'. I am not sure that this is really what the user wants in most cases, actually.
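A minimal sketch of this kind of reasoning, using only the two relations from the example above. The table names 'EQUIVALENT' and 'INCLUDES' and the functions are hypothetical, not taken from any of the reviewed projects.

```python
# Hypothetical relation tables; a real system would let the user edit them.
EQUIVALENT = {"United States of America": "USA"}  # alias -> canonical tag
INCLUDES = {"Europe": {"France"}}                 # tag -> narrower tags


def canonical(tag):
    """Replace an alias by its canonical tag, leave other tags alone."""
    return EQUIVALENT.get(tag, tag)


def expand(tag):
    """Return the tag plus every tag it includes, directly or not."""
    seen = set()
    todo = [canonical(tag)]
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.add(current)
            todo.extend(canonical(t) for t in INCLUDES.get(current, ()))
    return seen


def files_for(tag, file_tags):
    """Files whose tags intersect the expanded query tag."""
    wanted = expand(tag)
    return {path for path, tags in file_tags.items()
            if wanted & {canonical(t) for t in tags}}
```

Browsing 'Europe' then also returns the files tagged 'France', and 'USA' finds the files tagged 'United States of America'.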

Open or closed vocabularies

There is a choice to be made when adding a tag to a given file: can you make it up on the fly, or do you choose it from a predefined (and separately updatable) list? The first approach is apparently easier but leads to errors.

Tmsu (Paul Ruane)

Layered above the primary user storage hierarchy.

Commands can set non-faceted, open vocabulary tags on files and manage them.

Metadata (tags and paths) is stored in a SQLite database.

A FUSE file system gives a view of the file set organised by tag directories. Files in the virtual tree are symlinks to the actual ones. Directories contain both files and tag subdirectories, and tag paths are equivalent to AND queries.

File name collisions are handled by adding numbers to file names. All file names are modified (with a globally unique serial number?).

Tmsu knows which AND queries have no results and will not create subdirectories for tags that cannot be associated with the current path (it keeps track of all existing tag combinations).
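The behaviour described above can be sketched with plain sets. This is not Tmsu's code (Tmsu works from its SQLite database); the function names and the in-memory 'file_tags' mapping are assumptions made for the illustration.

```python
def files_at(path_tags, file_tags):
    """Files carrying every tag of the virtual path (an AND query)."""
    wanted = set(path_tags)
    return {f for f, tags in file_tags.items() if wanted <= tags}


def subdirectories(path_tags, file_tags):
    """Tags that, added to the path, still yield at least one file."""
    wanted = set(path_tags)
    remaining = set()
    for f in files_at(path_tags, file_tags):
        remaining |= file_tags[f] - wanted
    return remaining
```

A tag only shows up as a subdirectory if some file in the current directory carries it, so the user never descends into an empty directory.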

Tagging a directory in the primary tree is recursive.

Renames inside the virtual tree are not implemented (they should change the tag set, I guess).

Oyepa (Manuel Arriaga)

Oyepa stores the tagging data in the file names.

Oyepa has a GUI which monitors specific filesystem directories and asks for tagging information when files are created. When tagging is performed, the file is renamed based on the tags. E.g.: 'business john smith dogfood.doc'

This has the same properties as using extended attributes, except for the jumping through hoops forced by overloading file names with an additional function. It has the advantage that most tools have more respect for file names than for extended attributes, but it shares some of the problems anyway (e.g.: file name length limitations or character set issues).

Tagsistant

http://www.tagsistant.net/ http://en.wikipedia.org/wiki/Tagsistant

Tags are values associated with files, and have no type (they all share the same namespace).

Directories are boolean queries on tags, e.g.: 'london/AND/2010/AND/photo/'
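As a sketch of how such a path maps to a query, the following parser handles only the AND form shown in the example. The function names are hypothetical and this is not Tagsistant's implementation.

```python
def parse_and_path(path):
    """Turn 'london/AND/2010/AND/photo/' into ['london', '2010', 'photo'].

    Only the AND form shown above is handled; anything else raises ValueError.
    """
    parts = [p for p in path.split("/") if p]
    tags = parts[0::2]
    operators = parts[1::2]
    if len(parts) % 2 == 0 or any(op != "AND" for op in operators):
        raise ValueError(f"not a tag/AND/tag path: {path!r}")
    return tags


def query(path, file_tags):
    """Sorted list of the files carrying every tag named in the path."""
    wanted = set(parse_and_path(path))
    return sorted(f for f, tags in file_tags.items() if wanted <= tags)
```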

It is possible to define a tag hierarchy with 'includes' and 'equivalent' relations.

File storage is opaque and managed by Tagsistant using the underlying file system, and tag data is held inside MySQL or SQLite. Of course there is no reasonable access to the data through the regular filesystem.

The tagging part seems well thought-out, but this fails all my criteria: if Tagsistant trashes its data, you’re totally out of luck.

Tagfs/Semfs

Tagfs was later renamed SemFS.

The original paper about tagfs.

Implements the internal, opaque storage method.

Access through WebDAV.

Metadata stored in an RDF store.

The actual scope of SemFS is wider than just file tagging, which is just the basic/initial function.

Tagfs (marook, Markus Pielmeier)

https://github.com/marook/tagfs#readme

Layered above the primary user storage hierarchy.

FUSE-based tagging file system written in Python.

Tags are set per primary directory and held in a hidden file.

Tags are name-value pairs, with possible filtering on values. Tag names and values alternate in the directory hierarchy, e.g.:

genre/
genre/comedy
genre/comedy/itemdir
genre/drama
genre/drama/otheritemdir

You need to umount/remount to see tag changes.

When using the virtual hierarchy, you can copy files into the virtual directories matching actual items, but not elsewhere (neither can you create new item directories).

Recollfs (Piotr Długosz)

This is a bit outside the main subject, but I’ll mention it because it is based on Recoll…

This is a FUSE file system where directories contain Recoll search results.

Directory names are search queries: mkdir "Louise OR Jane"

The directory contents are the Recoll search query results.

The directories do not appear automagically: you have to create them. There is no browsable tree, and there are no facilities to actually set the tags. So this is definitely a far outlier, but still worth mentioning.

dhtfs

Analogous to Tagsistant but less developed.

TaggedFS (lordikc)

http://lordikc.free.fr/wordpress/?p=689

I could not make full sense of the explanations. As far as I could understand, the system uses the regular file system for storage, but with a custom directory structure (not user-chosen). Tags are stored in auxiliary files. This is a non-faceted system, but with defined hierarchical relationships between tags (family = (mom, pop)).

Dbfs (Onne Gorter)

This is a faceted system, and an ambitious one. As far as I understand, it stayed at the prototype stage (it worked with KDE).

I only read the introduction.

Storage is handled by the regular filesystem, with an attempt at automatic location choice (e.g. Word documents go to ~/Documents).

File attributes were automatically deduced (file type, dates) from files, with plans for extraction (author, etc.). I think that there was also a possibility for the user to add explicit tags.

Elyse

http://silkwoodsoftware.com/index.html

Mac and Windows only. AFAIK files stay in their original location in the filesystem with original tags matching their parent directory names. Tags can then be added and organised freely.

Tag2Find

Windows-only

A few other references and notes

Other references on this page

http://en.wikipedia.org/wiki/Virtual_folder Example: Mac smart folders (spotlight queries), emacs-vm virtual folders.