Text editor theory

Data StructuresEditorText EditorTheory

Data Structures Problem Overview


As I'm always dissatisfied with existing editors, a project I always wanted to start is my own text editor. However doing text editing is serious business.

Besides analyzing the source code of existing text editors, is there any book or other resource (like academic work) about this topic? I'm interested especially in something that teaches how to handle memory and how to manage text insertion (if you have a 100 MB file and want to add a char at x position, you can't just memmove the huge text block...).

Data Structures Solutions


Solution 1 - Data Structures

Take a look at Rob Pike's description of his Sam text editor. Be sure to browse past the high-level overview and command language. He describes parsing, memory management, and data structures later in the paper.

In addition, take a look at Russ Cox's simple regular expression implementation. It's easy to follow and may open some doors outside existing regular expression libraries.

Solution 2 - Data Structures

Over the years I've written quite a number of different text editors. Certainly the simplest way is to manage a long sequence of characters, where you copy everything around to insert any character. Other techniques that I've used include:

  • Represent the text file as a doubly linked list of text lines.

  • Construct a tree-like data structure (sometimes called a "rope") which starts off as a solid string of characters, but has the ability to split, insert, and delete blocks of text without having to move all the rest of the text around.

Many of the old Borland example books used a text editor as a tutorial example. You can occasionally still find copies of these at used bookstores for nearly free.

Solution 3 - Data Structures

There's an excellent tutorial available here that covers a lot of relevant topics in a more modern context:

The other answers to this question cover gap buffer.

Another modern coverage is the descriptions of AvalonEdit

and extra detail from:

and there's a huge amount of detail/content (about SharpDevelop) in the book:

Solution 4 - Data Structures

Promoted to answer by request:

The antique "Software Tools in Pascal" by Kernighan and Plaugher implements the ed editor in a language with neither real strings nor pointers. It contains a great overview of the design considerations that underlie any text editor.

Solution 5 - Data Structures

One old method that still works is called a gap buffer. The basic idea is that you put the text into a buffer, but instead of putting it all in one block, you create a "gap" at the cursor, putting all the text before the cursor at the beginning of the buffer, and all the text after the cursor at the end of the buffer. Most insertions take place at the cursor, which you can do without moving anything (until or unless you overflow the buffer). When the user moves the cursor, you move the appropriate text from one side of the split to the other.

Given typical controls (cursor left, right, up, down, page up, page down), the largest move you typically deal with is a page at a time, which is typically easy to handle quite a bit faster than a keyboard repeats. It can, of course, slow down a bit if you have a truly huge file and a "goto line" command, or something on that order. If you're going to do a lot of that, there are undoubtedly better structures to use...

Solution 6 - Data Structures

The Craft of Text Editing by Craig Finseth, based on his master's thesis, covers these topics. It's free on the web. OTOH it's pretty old and doesn't mention some ideas like ropes that were less practical on the tiny computers of yesteryear.

Solution 7 - Data Structures

This paper compares many data structures that can be used for text editors, including some already mentioned here (e.g., gap buffers) as well as several others (e.g., piece tables). The article is old, but I believe it still covers all the major choices, and it nicely compares them in terms of performance, complexity, and space overhead.

I know Stack Overflow answers aren't just supposed to be links, but this is still the best source of information I've ever found for the information requested, and it's far too long to summarize in an answer here. In case the link goes stale, search for "Data Structures for Text Sequences" by Charles Crowley of the University of New Mexico.

Solution 8 - Data Structures

The Scintilla component uses a split buffer, along the theory explained in a text linked in their Scintilla and SciTE Related Sites page.
The linked page is Data Structures in a Bit-Mapped Text Editor.
The split buffer proved it works well even with megabyte files. Using secondary structures (eg. a list of line starts) can help too.

Solution 9 - Data Structures

Here's how "the pros" at Microsoft do it:

MS Word uses the piece table data structure. An ascending array of character positions is maintained in parallel with an array of larger structures that contain pointers into either the original file (text loaded on file load) or into an append-only buffer of newly added characters.

The EDIT control used everywhere in Windows uses no data structure at all. Notepad simply uses a multi-line EDIT control. Check it out with a heap viewer. The EDIT control maintains only a buffer of line-breaks and tab-stops.

If you're going to create a simple non-formatted text editor in Windows, you could easily subclass the EDIT control to add features.

Solution 10 - Data Structures

> how to manage text insertion (if you have a 100 MB file and want to add a char at x position, you can't just memove the huge text block...).

Make the text file a linked list, where every line is an entry.

Solution 11 - Data Structures

Well if you know that in general people will have relatively few insert points, you could hold an array of pointers into your original text buffer and when the user tries to insert inside it, you "split" the buffer by spawning another pointer to the rest of the buffer, making the first pointer's length so it stops at the insertion point and adding a 3rd pointer for the text to be inserted inbetween, kinda like:

long original text la la la
^                *^
|                 2nd part
1st part

And the star points into a new text buffer where you start adding the text to be inserted.

When you render (or in general parse) your text file, you loop over the array of pointers and then do your work on each buffer. Of course if the buffer is small enough, you skip the adding a new buffer part, but that's just heuristics, try each and get a feel for which works best.

You could also consider splitting the text file on load into multiple buffers say every 1MB or so, because if you load the file in a single buffer you WILL need to make a new buffer for inserted text due to the size. Again, this is a heuristic optimization.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionlornovaView Question on Stackoverflow
Solution 1 - Data StructuresCorbin MarchView Answer on Stackoverflow
Solution 2 - Data StructuresGreg HewgillView Answer on Stackoverflow
Solution 3 - Data StructuresStephenDView Answer on Stackoverflow
Solution 4 - Data StructuresmswView Answer on Stackoverflow
Solution 5 - Data StructuresJerry CoffinView Answer on Stackoverflow
Solution 6 - Data StructuresDarius BaconView Answer on Stackoverflow
Solution 7 - Data StructuresAdrian McCarthyView Answer on Stackoverflow
Solution 8 - Data StructuresPhiLhoView Answer on Stackoverflow
Solution 9 - Data StructuresJamesView Answer on Stackoverflow
Solution 10 - Data StructuresBart van HeukelomView Answer on Stackoverflow
Solution 11 - Data StructuresBlindyView Answer on Stackoverflow