Classification, or (hash)tags

Damian Cugley 8 September 2024

When, some distant day, there are enough post entries or pages on a site that they cannot be conveniently listed on one page, we may want to allow our readers to search and filter posts based on subject. A full-text search requires back-end support, so for mow we will concentrate on filtering by topic.

A quick typology of filters

Classification schemes for filtering come in a variety of shapes. Mostly they involve associating pages with a set of terms. The reader uses some UI to select one or more terms as a filter. Choosing more terms generally means seeing fewer matches in the filtered list.

Some ways schemes may vary:

There may be multiple facets (also called dimensions or axes), each with its own vocabulary of terms. For example, a collection of fashion notes might be separately classified by designer and by colour. The alternative is a single facet (or perhaps we should say no facets): terms for designer and terms for colour would be in the same vocabulary.
The vocabulary of terms might be hierarchical, where some terms are refinements or narrowings of others. For example, the term green might have narrower terms like eau de nil and sage. When a reader filters to green then posts tagged eau de nil should match, even if not explicitly tagged green.
Vocabularies may be closed (terms are determined ahead of time) or open (terms can be coined while classifying).

Faceted and hierarchical systems tend to be much more complicated to set up and use, but the trade-off is they may be more specific and efficient. There are controlled vocabularies for classifying medical papers, for example.

How terms are identified can also vary.

The identifier for a term might be the term itself. These might be formatted as word phrases (like eau de nil), or be tokenized forms using camel case or similar (like eauDeNil). Very common for open systems.
Terms might have codes (like C563.3, say), or be identified by a UUID or database identifier. In this case the label for the term is the human-readable phrase. More common for closed systems.

In any case user input may need conversion to the identifier. This might use a menu, or the user might type text that is matched to existing tags.

Simple tagging

The simplest type of classification system for users is non-faceted, non-hierarchical tags, using the word(s) as their own identifier. This is how hashtags in most social media.

For example, hashtags emerged on a social media site called Twitter back in 2007. They are distinguished from the text of the post by stating with a # and ending at the next space. Since they cannot contain spaces, multiple words can be combined as camel case as #superbOwl, though often people just use all lower case, as #superbowl, even though that can lead to ambiguous readings. When matching, capitalization is ignored.

The particular format of hashtags comes from Twitter’s having a user interface consisting of just a box for typing in to. Flickr and Tumblr, by contrast, have a separate spaces in their forms for entering tags. This means they can allow spaces in multi-word tags. Other sites like Instagram and Mastodon persist in embedding tags in the post text, perhaps for the sake of looking like Twitter.

Negative tags

As social media become less about searching for things and more about passively scrolling through a stream of posts chosen for you, tags have come to be used to exclude rather than include topics. This started with third-party filters for sites, but has become official in Tumblr, for example. Thus one might mute the tag #ai art so as to reduce the amount of AI-generated images in your feed.

Posts on similar topics

tagging (2)