How to implement tag system

AlgorithmSystemTagging

Algorithm Problem Overview


I was wondering what the best way is to implement a tag system, like the one used on SO. I was thinking of this but I can't come up with a good scalable solution.

I was thinking of having a basic 3 table solution: having a tags table, an articles tables and a tag_to_articles table.

Is this the best solution to this problem, or are there alternatives? Using this method the table would get extremely large in time, and for searching this is not too efficient I assume. On the other hand it is not that important that the query executes fast.

Algorithm Solutions


Solution 1 - Algorithm

I believe you'll find interesting this blog post: Tags: Database schemas

> The Problem: You want to have a database schema where you can tag a > bookmark (or a blog post or whatever) with as many tags as you want. > Later then, you want to run queries to constrain the bookmarks to a > union or intersection of tags. You also want to exclude (say: minus) > some tags from the search result.

“MySQLicious” solution

In this solution, the schema has got just one table, it is denormalized. This type is called “MySQLicious solution” because MySQLicious imports del.icio.us data into a table with this structure.

enter image description hereenter image description here

Intersection (AND) Query for “search+webservice+semweb”:

SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
AND tags LIKE "%webservice%"
AND tags LIKE "%semweb%"

Union (OR) Query for “search|webservice|semweb”:

SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
OR tags LIKE "%webservice%"
OR tags LIKE "%semweb%"

Minus Query for “search+webservice-semweb”

SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
AND tags LIKE "%webservice%"
AND tags NOT LIKE "%semweb%"

“Scuttle” solution

Scuttle organizes its data in two tables. That table “scCategories” is the “tag”-table and has got a foreign key to the “bookmark”-table.

enter image description here

Intersection (AND) Query for “bookmark+webservice+semweb”:

SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId
HAVING COUNT( b.bId )=3

First, all bookmark-tag combinations are searched, where the tag is “bookmark”, “webservice” or “semweb” (c.category IN ('bookmark', 'webservice', 'semweb')), then just the bookmarks that have got all three tags searched for are taken into account (HAVING COUNT(b.bId)=3).

Union (OR) Query for “bookmark|webservice|semweb”: Just leave out the HAVING clause and you have union:

SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId

Minus (Exclusion) Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.

SELECT b. *
FROM scBookmarks b, scCategories c
WHERE b.bId = c.bId
AND (c.category IN ('bookmark', 'webservice'))
AND b.bId NOT
IN (SELECT b.bId FROM scBookmarks b, scCategories c WHERE b.bId = c.bId AND c.category = 'semweb')
GROUP BY b.bId
HAVING COUNT( b.bId ) =2

Leaving out the HAVING COUNT leads to the Query for “bookmark|webservice-semweb”.


“Toxi” solution

Toxi came up with a three-table structure. Via the table “tagmap” the bookmarks and the tags are n-to-m related. Each tag can be used together with different bookmarks and vice versa. This DB-schema is also used by wordpress. The queries are quite the same as in the “scuttle” solution.

enter image description here

Intersection (AND) Query for “bookmark+webservice+semweb”

SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id
HAVING COUNT( b.id )=3

Union (OR) Query for “bookmark|webservice|semweb”

SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id

Minus (Exclusion) Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.

SELECT b. *
FROM bookmark b, tagmap bt, tag t
WHERE b.id = bt.bookmark_id
AND bt.tag_id = t.tag_id
AND (t.name IN ('Programming', 'Algorithms'))
AND b.id NOT IN (SELECT b.id FROM bookmark b, tagmap bt, tag t WHERE b.id = bt.bookmark_id AND bt.tag_id = t.tag_id AND t.name = 'Python')
GROUP BY b.id
HAVING COUNT( b.id ) =2

Leaving out the HAVING COUNT leads to the Query for “bookmark|webservice-semweb”.

Solution 2 - Algorithm

Nothing wrong with your three-table solution.

Another option is to limit the number of tags that can be applied to an article (like 5 in SO) and add those directly to your article table.

Normalizing the DB has its benefits and drawbacks, just like hard-wiring things into one table has benefits and drawbacks.

Nothing says you can't do both. It goes against relational DB paradigms to repeat information, but if the goal is performance you may have to break the paradigms.

Solution 3 - Algorithm

Your proposed three table implementation will work for tagging.

Stack overflow uses, however, different implementation. They store tags to varchar column in posts table in plain text and use full text indexing to fetch posts that match the tags. For example posts.tags = "algorithm system tagging best-practices". I am sure that Jeff has mentioned this somewhere but I forget where.

Solution 4 - Algorithm

The proposed solution is the best -if not the only practicable- way I can think of to address the many-to-many relationship between tags and articles. So my vote is for 'yes, it's still the best.' I'd be interested in any alternatives though.

Solution 5 - Algorithm

If your database supports indexable arrays (like PostgreSQL, for example), I would recommend an entirely denormalized solution - store tags as an array of strings on the same table. If not, a secondary table mapping objects to tags is the best solution. If you need to store extra information against tags, you can use a separate tags table, but there's no point in introducing a second join for every tag lookup.

Solution 6 - Algorithm

I would like to suggest optimised MySQLicious for better performance. Before that the drawbacks of Toxi (3 table) solution is

If you have millions of questions, and it has 5 tags in each, then there will be 5 million entries in tagmap table. So first we have to filter out 10 thousand tagmap entries based on tag search then again filter out matching questions of those 10 thousand. So while filtering out if the artical id is simple numeric then it is ok, but if it is kind of UUID (32 varchar) then filtering out needs larger comparison though it is indexed.

My solution:

Whenever new tag is created, have counter++ (base 10), and convert that counter into base64. Now each tag name will have base64 id. and pass this id to UI along with name. This way you will be having maximum of two char id till we have 4095 tags created in our system. Now concatenate these multiple tags into each question table tag column. Add delimiter as well and make it sorted.

So table looks like this

enter image description here

While querying, query on id instead of real tag name. Since it is SORTED, and condition on tag will be more efficient (LIKE '%|a|%|c|%|f|%).

Note that single space delimiter is not enough and we need double delimiter to differentiate tags like sql and mysql because LIKE "%sql%" will return mysql results as well. Should be LIKE "%|sql|%"

I know the search is non indexed but still you might have indexed on other columns related to article like author/dateTime else will lead to full table scan.

Finally with this solution, no inner join required where million records have to be compared with 5 millions records on join condition.

Solution 7 - Algorithm

CREATE TABLE Tags (
    tag VARHAR(...) NOT NULL,
    bid INT ... NOT NULL,
    PRIMARY KEY(tag, bid),
    INDEX(bid, tag)
)

Notes:

  • This is better than TOXI in that it does not go through an extra many:many table which makes optimization difficult.
  • Sure, my approach may be slightly more bulky (than TOXI) due to the redundant tags, but that is a small percentage of the whole database, and the performance improvements may be significant.
  • It is highly scalable.
  • It does not have (because it does not need) a surrogate AUTO_INCREMENT PK. Hence, it is better than Scuttle.
  • MySQLicious sucks because it cannot use an index (LIKE with leading wild card; false hits on substrings)
  • For MySQL, be sure to use ENGINE=InnoDB in order to get 'clustering' effects.

Related discussions (for MySQL):
many:many mapping table optimization
ordered lists

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionSaif BechanView Question on Stackoverflow
Solution 1 - AlgorithmNick DandoulakisView Answer on Stackoverflow
Solution 2 - AlgorithmJohnView Answer on Stackoverflow
Solution 3 - AlgorithmJuha SyrjäläView Answer on Stackoverflow
Solution 4 - AlgorithmDavid ThomasView Answer on Stackoverflow
Solution 5 - AlgorithmNick JohnsonView Answer on Stackoverflow
Solution 6 - AlgorithmKanagavelu SugumarView Answer on Stackoverflow
Solution 7 - AlgorithmRick JamesView Answer on Stackoverflow