C# Extract text from PDF using PdfSharp

C# Problem Overview

Is it possible to extract plain text from a PDF file with PdfSharp? I don't want to use iTextSharp because of its license.

C# Solutions

Solution 1 - C#

I took Sergio's answer (Solution 2 below) and turned it into extension methods. I also replaced the accumulated list of strings with an iterator, so the text fragments are yielded lazily.

using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;

public static class PdfSharpExtensions
{
    // Yields every string shown by the text operators on a page.
    public static IEnumerable<string> ExtractText(this PdfPage page)
    {
        var content = ContentReader.ReadContent(page);
        return content.ExtractText();
    }

    // Walks the content tree recursively and yields the string operands of Tj/TJ.
    public static IEnumerable<string> ExtractText(this CObject cObject)
    {
        if (cObject is COperator cOperator)
        {
            // Only the Tj and TJ operators are inspected; all others are skipped.
            if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
                cOperator.OpCode.Name == OpCodeName.TJ.ToString())
            {
                foreach (var cOperand in cOperator.Operands)
                    foreach (var txt in ExtractText(cOperand))
                        yield return txt;
            }
        }
        else if (cObject is CSequence cSequence)
        {
            // Sequences (including operand arrays) are searched element by element.
            foreach (var element in cSequence)
                foreach (var txt in ExtractText(element))
                    yield return txt;
        }
        else if (cObject is CString cString)
        {
            yield return cString.Value;
        }
    }
}
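
As a usage sketch, the extension method could be called once per page like this. It assumes the PdfSharpExtensions class above is compiled into the same project; the Program class and the command-line path are placeholders, not part of the original answer.

```csharp
using System;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

class Program
{
    static void Main(string[] args)
    {
        // Hypothetical input: the path to a PDF passed on the command line.
        PdfDocument document = PdfReader.Open(args[0], PdfDocumentOpenMode.ReadOnly);

        foreach (PdfPage page in document.Pages)
        {
            // ExtractText() is the extension method from PdfSharpExtensions above.
            // Each yielded string is one Tj/TJ fragment, not a full line, so the
            // fragments are concatenated without a separator here.
            Console.WriteLine(string.Concat(page.ExtractText()));
        }
    }
}
```

Because the method returns a lazy sequence, nothing is read from a page's content until the result is enumerated, here by string.Concat.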

Solution 2 - C#

I implemented it in much the same way as David did. Here is my code:

    {
        // ....
        // Pages is zero-based, so index 1 selects the second page.
        var page = document.Pages[1];
        CObject content = ContentReader.ReadContent(page);
        var extractedText = ExtractText(content);
        // ...
    }

    // Recursively collects the string operands of every Tj/TJ operator.
    private IEnumerable<string> ExtractText(CObject cObject)
    {
        var textList = new List<string>();
        if (cObject is COperator cOperator)
        {
            if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
                cOperator.OpCode.Name == OpCodeName.TJ.ToString())
            {
                foreach (var cOperand in cOperator.Operands)
                {
                    textList.AddRange(ExtractText(cOperand));
                }
            }
        }
        else if (cObject is CSequence cSequence)
        {
            foreach (var element in cSequence)
            {
                textList.AddRange(ExtractText(element));
            }
        }
        else if (cObject is CString cString)
        {
            textList.Add(cString.Value);
        }
        return textList;
    }

Solution 3 - C#

PdfSharp provides all the tools needed to extract text from a PDF. Use the ContentReader class to access the commands within each page, then extract the strings from the TJ/Tj operators.

I've uploaded a simple implementation to GitHub.
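
To see what the "commands within each page" look like, a small diagnostic along these lines may help. It is only a sketch and not the implementation referred to above: the DumpOperators class and the command-line path are made up for illustration, and it assumes the top-level content sequence consists of operators, which is why anything else is filtered out.

```csharp
using System;
using System.Linq;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;
using PdfSharp.Pdf.IO;

class DumpOperators
{
    static void Main(string[] args)
    {
        // Hypothetical input: the path to a PDF passed on the command line.
        PdfDocument document = PdfReader.Open(args[0], PdfDocumentOpenMode.ReadOnly);

        // Pages is zero-based, so index 0 is the first page.
        CSequence content = ContentReader.ReadContent(document.Pages[0]);

        foreach (var op in content.OfType<COperator>())
        {
            // Mark the two operators whose operands are read for text.
            bool showsText = op.OpCode.Name == OpCodeName.Tj.ToString() ||
                             op.OpCode.Name == OpCodeName.TJ.ToString();

            Console.WriteLine("{0}{1} ({2} operand(s))",
                showsText ? "* " : "  ", op.OpCode.Name, op.Operands.Count);
        }
    }
}
```

The lines marked with an asterisk are the operators whose string operands the solutions above collect.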

Content Type	Original Author	Original Content on Stack Overflow
Question	der_chirurg	View Question on Stack Overflow
Solution 1 - C#	Ronnie Overby	View Answer on Stack Overflow
Solution 2 - C#	Sergio	View Answer on Stack Overflow
Solution 3 - C#	David Schmitt	View Answer on Stack Overflow
