How to store articles or other large texts in a database

XmlDatabase

Xml Problem Overview


I am currently in the process of designing myself a database driven website. The main reason is for learning purposes but I wont lie, there is a small amount of vanity included!

While I believe that my database design is pretty good so far, I am still not entirely sure on the best way of storing articles or other large texts. I know most DBMSs have the TEXT datatype or equivalent and can hold a massive amount of text. However, storing a full article as one long string makes for unhappy reading, so formatting is going to be needed.

Do I store the article text along with all of the HTML or BBcode tags - or is it better to simply create the page in either a HTML or XML document and store the path to this file in the DB?

I quite like the idea of storing articles as an XML document as I can easily markup an article with custom tags and use PHP's XML and XSLT functions to transform the XML to HTML [or indeed, any other format]. It also allows the author to dictate when to create line/page breaks. This approach would of course require extra coding [which I am not afraid of] but it does present a problem with making articles searchable.

I know MySQL, for example, has SQL syntax for searching for specific terms/phrases inside strings held in a text field. If I were to store text in separate files, how might I approach making these articles searchable?

There is quite a lot I have written here on such a simple question, so I will break it down:

1: Is there a "best" way of storing large amounts of formatted text directly in a database or
2: is it better to hold paths to that text in the form of HTML/XML/Whatever documents.

If 2, is there an elegant way of making that text searchable?

Thank you for your time :)

Xml Solutions


Solution 1 - Xml

Store everthing in one big text field as Alex suggested. For searching, don't hammer your database, use http://lucene.apache.org/java/docs/">Lucene</a>;, or http://htdig.sourceforge.net/">htdig</a> to create an index of your output. This way searches are very fast. The side effect is you make your searches a little more search engine friendly; you take your keywords field (as backslash suggested) and stick them in the meta-keywords attribute.

Edit

Unless you are only searching keywords, having the db do the searches will be horribly slow (ever searched a forum and it takes FOREVER?). There is no way for the database to index a

  select.. where FULLTEXTFIELD like '%cookies%'.  

It is frustrating looking for an article and the search doesn't return the results your are looking for because they weren't in the keyword field! Htdig allows you to search the full text of the article efficiently. Your searches will come back instantly, and EVERY term in the article is fully searchable. Putting the keywords in the meta tags will make searches on those terms come higher on the results page.

Another benefit is fuzzy matching. If you search for 'activate' htdigg will match pages that have active, activation, activity etc. (configurable). Or if the user misspells a word, it will still be matched. You want your users to have a Google like experience, not an annoying one. :)

You do need a script to create a list of links to all your pages from your database. Have htdig crawl this automatically and you never have to think about it again.

Also htdig will crawl your non database pages as well so your whole site is searchable through the same simple interface.

As for the keyword field , you should have a separate table called keywords with the id of the article and a keyword field (1 keyword per row). But for simplicity, having a single field in the db isn't a terrible idea, it makes updating the keywords pretty easy if you put it in a form.

If you don't want to fuss with all the hassle of that, you can try using http://www.google.com/cse">Google custom search. it is far less work, but you have no guarantee that all your pages will get indexed.

Good luck!

Solution 2 - Xml

The TEXT, BIGTEXT, LONGTEXT and others data types fields were created in order to store large amount of text (64 Kbytes to 4 Gbytes depending of the RDBMS). They just create a binary pointer to locate the text in the database and it is not stored directly in the table. Is almost the same procedure if you store a path in a varchar field to locate the document, but having it in the database makes it easier to maintain because if you delete the row the document disappears with it without the need to delete it in other procedure (as if you stored as a file). Logically this makes your database bigger and sometimes not so easier to backup and transport, but to transport the documents one by one would be tedious and slow.

As you see it depends on the number of documents and rows in the database.

For the searching procedure, I recommend to create a new "keywords" field in order to speed your searches. You can search too into the first n characters of the documents too, casting them as a CHAR or VARCHAR and locate the title and subtitle into these amounts if they don't have already a specific field.

Solution 3 - Xml

Depending on how you have arranged and installed everything, it can be hard to access outside files from remote clients that can access the DB just fine -- so why not save all of the XML into one TEXT field instead? You can refactor things to optimize that later if the DB engine can't handle that load well, but that's the easiest way to get started.

Solution 4 - Xml

Take a quick look at native xml DBs. There are several, and some very good ones are free.

Search eXist, Document xDB, Oracle Berkeley.

If you are persisting, querying and updating semi-structured text and if the structure has any depth at all, you are almost certainly doing it the hard way if you stick with either the RDB of pointers, or stuff-it-in-a-blob techniques -- though there are many exterior reasons that these architectures can be necessary and successful.

Do a little reading on XPath and XQuery before you commit to a design. Here's a good place to start: https://community.emc.com/community/edn/xmltech

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionEtzeitetView Question on Stackoverflow
Solution 1 - XmlByron WhitlockView Answer on Stackoverflow
Solution 2 - Xmlbackslash17View Answer on Stackoverflow
Solution 3 - XmlAlex MartelliView Answer on Stackoverflow
Solution 4 - XmlJohn TurnbullView Answer on Stackoverflow