Getting title and meta tags from external website

Tags: Php, Curl, Title, Meta Tags

Php Problem Overview


I'm trying to figure out how to get the following:

<title>A common title</title>
<meta name="keywords" content="Keywords blabla" />
<meta name="description" content="This is the description" />

This should work regardless of the order in which the tags appear. I've heard of the PHP Simple HTML DOM Parser, but I'd rather not use it. Is there a solution that doesn't rely on the PHP Simple HTML DOM Parser?

Will preg_match fail if the HTML is invalid?

Can cURL do something like this together with preg_match?

Facebook does something similar, but it relies on Open Graph tags such as:

<meta property="og:description" content="Description blabla" />

I want something like this so that when someone posts a link, the title and the meta tags are retrieved. If there are no meta tags, they are ignored or the user can set them manually (but I'll do that part later myself).

Php Solutions


Solution 1 - Php

This is the way it should be:

function file_get_contents_curl($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}

$html = file_get_contents_curl("http://example.com/");

// Parsing begins here. The @ suppresses warnings about malformed HTML.
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');

// Get and display what you need. Default to empty strings so that
// missing tags do not trigger "undefined variable" notices.
$title = $nodes->length > 0 ? $nodes->item(0)->nodeValue : '';
$description = '';
$keywords = '';

$metas = $doc->getElementsByTagName('meta');

for ($i = 0; $i < $metas->length; $i++)
{
    $meta = $metas->item($i);
    if ($meta->getAttribute('name') == 'description')
        $description = $meta->getAttribute('content');
    if ($meta->getAttribute('name') == 'keywords')
        $keywords = $meta->getAttribute('content');
}

echo "Title: $title". '<br/><br/>';
echo "Description: $description". '<br/><br/>';
echo "Keywords: $keywords";

Solution 2 - Php

<?php
// Assuming the above tags are at www.example.com
$tags = get_meta_tags('http://www.example.com/');

// Notice how the keys are all lowercase now, and
// how . was replaced by _ in the key.
echo $tags['author'];       // name
echo $tags['keywords'];     // php documentation
echo $tags['description'];  // a php manual
echo $tags['geo_position']; // 49.33;-86.59
?>

Solution 3 - Php

get_meta_tags will help you with all but the title. To get the title just use a regex.

$url = 'http://some.url.com';
preg_match("/<title>(.+)<\/title>/siU", file_get_contents($url), $matches);
$title = isset($matches[1]) ? $matches[1] : null; // null when no <title> was found

Hope that helps.
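
The pattern above only matches a bare <title> tag. As a rough sketch (extract_title_regex is a hypothetical helper name, not part of the answer), a slightly more tolerant pattern can also accept attributes on the tag, upper-case markup, and line breaks inside the title. It is still a regex and will not cope with every malformed page.

```php
<?php
// Hypothetical helper (not from the answer): extract the <title> text with a
// regex that tolerates attributes, mixed case, and line breaks.
function extract_title_regex(string $html): ?string
{
    if (preg_match('/<title\b[^>]*>(.*?)<\/title>/is', $html, $m)) {
        // Decode entities such as &amp; and collapse runs of whitespace.
        $text = html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        return trim(preg_replace('/\s+/', ' ', $text));
    }
    return null; // No <title> element found.
}

var_dump(extract_title_regex("<TITLE lang=\"en\">\n  A common &amp; title\n</TITLE>"));
var_dump(extract_title_regex('<p>no title here</p>'));
```

Returning null instead of reading $matches[1] directly lets the caller decide what to show when a page has no title.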

Solution 4 - Php

get_meta_tags does not return the title.

Only meta tags with name attributes like

<meta name="description" content="the description">

will be parsed.
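
A small sketch of that behaviour, assuming a sample page written to a temporary file (the markup is made up for illustration): get_meta_tags() also accepts a local path, so the check needs no network access.

```php
<?php
// Write a small sample page to a temporary file; get_meta_tags() accepts a
// local path as well as a URL.
$sample = <<<HTML
<html><head>
<title>A common title</title>
<meta name="description" content="the description">
<meta property="og:title" content="OG title">
</head><body></body></html>
HTML;

$path = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($path, $sample);

$tags = get_meta_tags($path);
unlink($path);

// Only the name="description" tag is returned; the <title> element and the
// property="og:title" tag are skipped.
print_r($tags);
```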

Solution 5 - Php

Php's native function: get_meta_tags()

http://php.net/manual/en/function.get-meta-tags.php

Solution 6 - Php

Unfortunately, the built-in PHP function get_meta_tags() requires the name attribute, and certain sites, such as Twitter, leave that off in favor of the property attribute. This function, using a mix of regex and DOMDocument, returns a keyed array of meta tags from a webpage. It checks for the name attribute first, then the property attribute. It has been tested on Instagram, Pinterest and Twitter.

/**
 * Extract metatags from a webpage
 */
function extract_tags_from_url($url) {
  $tags = array();

  $ch = curl_init();
  curl_setopt($ch, CURLOPT_HEADER, 0);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

  $contents = curl_exec($ch);
  curl_close($ch);

  if (empty($contents)) {
    return $tags;
  }
   
  if (preg_match_all('/<meta([^>]+)content="([^>]+)>/', $contents, $matches)) {
    $doc = new DOMDocument();
    // The @ suppresses warnings about malformed markup in the collected tags.
    @$doc->loadHTML('<?xml encoding="utf-8" ?>' . implode($matches[0]));
    $tags = array();
    foreach($doc->getElementsByTagName('meta') as $metaTag) {
      if($metaTag->getAttribute('name') != "") {
        $tags[$metaTag->getAttribute('name')] = $metaTag->getAttribute('content');
      }
      elseif ($metaTag->getAttribute('property') != "") {
        $tags[$metaTag->getAttribute('property')] = $metaTag->getAttribute('content');
      }
    }
  }

  return $tags;
}

Solution 7 - Php

Shouldn't we be using OG?

The chosen answer is good but doesn't work when a site is redirected (very common!), and doesn't return OG tags, which are the new industry standard. Here's a little function which is a bit more usable in 2018. It tries to get OG tags and falls back to meta tags if it can't find them:

function getSiteOG( $url, $specificTags=0 ){
    $doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents($url));
    // Guard against pages without a <title> element.
    $titles = $doc->getElementsByTagName('title');
    $res['title'] = $titles->length ? $titles->item(0)->nodeValue : null;

    foreach ($doc->getElementsByTagName('meta') as $m){
        $tag = $m->getAttribute('name') ?: $m->getAttribute('property');
        if(in_array($tag,['description','keywords']) || strpos($tag,'og:')===0) $res[str_replace('og:','',$tag)] = $m->getAttribute('content');
    }
    return $specificTags? array_intersect_key( $res, array_flip($specificTags) ) : $res;
}

How to use it:

/////////////
//SAMPLE USAGE:
print_r(getSiteOG("http://www.stackoverflow.com")); //note the incorrect url

/////////////
//OUTPUT:
Array
(
    [title] => Stack Overflow - Where Developers Learn, Share, & Build Careers
    [description] => Stack Overflow is the largest, most trusted online community for developers to learn, share their programming knowledge, and build their careers.
    [type] => website
    [url] => https://stackoverflow.com/
    [site_name] => Stack Overflow
    [image] => https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon@2.png?v=73d79a89bded
)

Solution 8 - Php

A simple function that shows how to retrieve og: tags, the title and the description. Adapt it to your needs.

function read_og_tags_as_json($url){


    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $HTML_DOCUMENT = curl_exec($ch);
    curl_close($ch);

    $doc = new DOMDocument();
    @$doc->loadHTML($HTML_DOCUMENT); // @ suppresses warnings about malformed HTML
    
    // fetch <title>
    $res['title'] = $doc->getElementsByTagName('title')->item(0)->nodeValue;

    // fetch og:tags
    foreach( $doc->getElementsByTagName('meta') as $m ){
          
          // if had property
          if( $m->getAttribute('property') ){

              $prop = $m->getAttribute('property');
              
              // here search only og:tags
              if( preg_match("/og:/i", $prop) ){

                  // get results on an array -> nice for templating
                  $res['og_tags'][] =
                  array( 'property' => $m->getAttribute('property'),
                          'content' => $m->getAttribute('content') );
              }

          }
          // end if had property

          // fetch <meta name="description" ... >
          if( $m->getAttribute('name') == 'description' ){

            $res['description'] = $m->getAttribute('content');

          }


    }
    // end foreach

    // render JSON
    echo json_encode($res, JSON_PRETTY_PRINT |
    JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
   
}

Output for this page (the real result may contain more entries):

{
    "title": "php - Getting title and meta tags from external website - Stack Overflow",
    "og_tags": [
        {
            "property": "og:type",
            "content": "website"
        },
        {
            "property": "og:url",
            "content": "https://stackoverflow.com/questions/3711357/getting-title-and-meta-tags-from-external-website"
        },
        {
            "property": "og:site_name",
            "content": "Stack Overflow"
        },
        {
            "property": "og:image",
            "content": "https://cdn.sstatic.net/Sites/stackoverflow/Img/[email protected]?v=73d79a89bded"
        },
        {
            "property": "og:title",
            "content": "Getting title and meta tags from external website"
        },
        {
            "property": "og:description",
            "content": "I want to try figure out how to get the\n\n&lt;title&gt;A common title&lt;/title&gt;\n&lt;meta name=\"keywords\" content=\"Keywords blabla\" /&gt;\n&lt;meta name=\"description\" content=\"This is the descript..."
        }
    ]
}

Solution 9 - Php

Your best bet is to bite the bullet and use the DOM Parser - it's the 'right way' to do it. In the long run it'll save you more time than it takes to learn how. Parsing HTML with regex is known to be unreliable and intolerant of special cases.
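
As a minimal sketch of that approach, assuming PHP's bundled DOM extension rather than the Simple HTML DOM Parser (parse_page_meta is a hypothetical helper name and the sample markup is invented), the following parses an HTML string and collects the title plus every meta tag keyed by its name or property attribute:

```php
<?php
// Hypothetical helper (not from the answers): parse title and meta tags from
// an HTML string with the bundled DOM extension.
function parse_page_meta(string $html): array
{
    $result = ['title' => null, 'meta' => []];
    if (trim($html) === '') {
        return $result; // Nothing to parse.
    }

    // Collect libxml parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $result['title'] = trim($titles->item(0)->textContent);
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Prefer name="...", fall back to property="..." (Open Graph).
        $key = $meta->getAttribute('name') ?: $meta->getAttribute('property');
        if ($key !== '' && $meta->hasAttribute('content')) {
            $result['meta'][strtolower($key)] = $meta->getAttribute('content');
        }
    }
    return $result;
}

// Attribute order, quoting style, and an unclosed <p> do not matter here.
$html = "<html><head><meta content='Keywords blabla' name=keywords>"
      . '<META NAME="description" CONTENT="This is the description">'
      . '<meta property="og:description" content="Description blabla" />'
      . '<title>A common title</title></head><body><p>unclosed</body></html>';
print_r(parse_page_meta($html));
```

Using libxml_use_internal_errors() instead of the @ operator keeps parse warnings out of the output without hiding unrelated errors.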

Solution 10 - Php

We use Apache Tika from PHP via its command-line utility, with -j for JSON output:

http://tika.apache.org/

<?php
    // shell_exec() returns the command's output, so capture it to use the JSON.
    $json = shell_exec( 'java -jar tika-app-1.4.jar -j http://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying' );
    echo $json;
?>

This is sample output for a random Guardian article:

{
   "Content-Encoding":"UTF-8",
   "Content-Length":205599,
   "Content-Type":"text/html; charset\u003dUTF-8",
   "DC.date.issued":"2013-07-21",
   "X-UA-Compatible":"IE\u003dEdge,chrome\u003d1",
   "application-name":"The Guardian",
   "article:author":"http://www.guardian.co.uk/profile/nicholaswatt",
   "article:modified_time":"2013-07-21T22:42:21+01:00",
   "article:published_time":"2013-07-21T22:00:03+01:00",
   "article:section":"Politics",
   "article:tag":[
      "Lynton Crosby",
      "Health policy",
      "NHS",
      "Health",
      "Healthcare industry",
      "Society",
      "Public services policy",
      "Lobbying",
      "Conservatives",
      "David Cameron",
      "Politics",
      "UK news",
      "Business"
   ],
   "content-id":"/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
   "dc:title":"Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",
   "description":"Exclusive: Firm he founded, Crosby Textor, advised private healthcare providers how to exploit NHS \u0027failings\u0027",
   "fb:app_id":180444840287,
   "keywords":"Lynton Crosby,Health policy,NHS,Health,Healthcare industry,Society,Public services policy,Lobbying,Conservatives,David Cameron,Politics,UK news,Business,Politics",
   "msapplication-TileColor":"#004983",
   "msapplication-TileImage":"http://static.guim.co.uk/static/a314d63c616d4a06f5ec28ab4fa878a11a692a2a/common/images/favicons/windows_tile_144_b.png",
   "news_keywords":"Lynton Crosby,Health policy,NHS,Health,Healthcare industry,Society,Public services policy,Lobbying,Conservatives,David Cameron,Politics,UK news,Business,Politics",
   "og:description":"Exclusive: Firm he founded, Crosby Textor, advised private healthcare providers how to exploit NHS \u0027failings\u0027",
   "og:image":"https://static-secure.guim.co.uk/sys-images/Guardian/Pix/pixies/2013/7/21/1374433351329/Lynton-Crosby-008.jpg",
   "og:site_name":"the Guardian",
   "og:title":"Tory strategist Lynton Crosby in new lobbying row",
   "og:type":"article",
   "og:url":"http://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
   "resourceName":"tory-strategist-lynton-crosby-lobbying",
   "title":"Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",
   "twitter:app:id:googleplay":"com.guardian",
   "twitter:app:id:iphone":409128287,
   "twitter:app:name:googleplay":"The Guardian",
   "twitter:app:name:iphone":"The Guardian",
   "twitter:app:url:googleplay":"guardian://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
   "twitter:card":"summary_large_image",
   "twitter:site":"@guardian"
}
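
Once the JSON is captured, it can be decoded into an array and queried with fallbacks. The sketch below assumes a shortened copy of the sample output above as input; pick_meta is a hypothetical helper name, and the key order passed to it is only one possible preference.

```php
<?php
// Hypothetical helper: return the first non-empty value among candidate keys
// from Tika-style JSON metadata.
function pick_meta(array $meta, array $keys): ?string
{
    foreach ($keys as $key) {
        if (!empty($meta[$key])) {
            // Multi-valued fields (such as article:tag) decode to arrays.
            return is_array($meta[$key]) ? implode(', ', $meta[$key]) : (string) $meta[$key];
        }
    }
    return null;
}

// Shortened copy of the sample output; in practice this string would be the
// return value of shell_exec().
$json = '{"og:title":"Tory strategist Lynton Crosby in new lobbying row",'
      . '"title":"Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",'
      . '"article:tag":["Lynton Crosby","NHS"]}';

$meta = json_decode($json, true) ?: [];

echo pick_meta($meta, ['og:title', 'dc:title', 'title']), "\n";
echo pick_meta($meta, ['og:description', 'description']) ?? '(none)', "\n";
echo pick_meta($meta, ['article:tag']), "\n";
```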

Solution 11 - Php

My solution (adapted from parts of cronoklee's and shamittomar's posts), which I can call from anywhere to get a JSON response. The result can easily be parsed into any content.

<?php
header('Content-type: application/json; charset=UTF-8');

if (!empty($_GET['url']))
{
    file_get_contents_curl($_GET['url']);
}
else
{
    echo "No Valid URL Provided.";
}


function file_get_contents_curl($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $data = curl_exec($ch);
    curl_close($ch);

    echo json_encode(getSiteOG($data), JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
}

function getSiteOG( $OGdata){
    $doc = new DOMDocument();
    @$doc->loadHTML($OGdata);
    $res['title'] = $doc->getElementsByTagName('title')->item(0)->nodeValue;

    foreach ($doc->getElementsByTagName('meta') as $m){
        $tag = $m->getAttribute('name') ?: $m->getAttribute('property');
        if(in_array($tag,['description','keywords']) || strpos($tag,'og:')===0) $res[str_replace('og:','',$tag)] = utf8_decode($m->getAttribute('content'));
    }

    return $res;
}
?>

Solution 12 - Php

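Get meta tags from a URL with regular expressions. The function cannot be named get_meta_tags, because PHP already has a built-in function with that name and redeclaring it is a fatal error, so it is called get_page_meta_tags here. load_content() is a helper that is not shown in this answer; judging by how it is used, it returns an array with "content" and "msg" keys.

function get_page_meta_tags ($url){
         $html = load_content ($url,false,"");
         print_r ($html); // debug output
         preg_match_all ("/<title>(.*)<\/title>/", $html["content"], $title);
         preg_match_all ("/<meta name=\"description\" content=\"(.*)\"\/>/i", $html["content"], $description);
         preg_match_all ("/<meta name=\"keywords\" content=\"(.*)\"\/>/i", $html["content"], $keywords);
         $res["content"] = @array("title" => $title[1][0], "description" => $description[1][0], "keywords" =>  $keywords[1][0]);
         $res["msg"] = $html["msg"];
         return $res;
}

Example:

print_r (get_page_meta_tags ("bing.com") );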


Solution 13 - Php

The easy way is PHP's built-in function:

http://php.net/manual/en/function.get-meta-tags.php

Solution 14 - Php

If you're working with PHP, check out the PEAR packages at pear.php.net and see if you find anything useful. I've used the RSS packages effectively, and they save a lot of time, provided you can follow how they implement their code via their examples.

Specifically take a look at Sax 3 and see if it will work for your needs. Sax 3 is no longer updated but it might be sufficient.

Solution 15 - Php

As already mentioned, get_meta_tags can handle the problem. Note that it only returns a title when the page has a <meta name="title"> tag:

$url='http://stackoverflow.com/questions/3711357/get-title-and-meta-tags-of-external-site/4640613';
$meta=get_meta_tags($url);
echo $title=$meta['title'];

//php - Get Title and Meta Tags of External site - Stack Overflow

Solution 16 - Php

<?php 

// ------------------------------------------------------ 

function curl_get_contents($url) {
	
	$timeout = 5; 
	$useragent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0'; 
	
	$ch = curl_init(); 
	curl_setopt($ch, CURLOPT_URL, $url); 
	curl_setopt($ch, CURLOPT_USERAGENT, $useragent); 
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
	curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); 
	$data = curl_exec($ch); 
	curl_close($ch); 
	
	return $data; 
}

// ------------------------------------------------------ 

function fetch_meta_tags($url) { 
	
	$html = curl_get_contents($url); 
	$mdata = array(); 
	
	$doc = new DOMDocument();
	@$doc->loadHTML($html); // @ suppresses warnings about malformed HTML
	
	$titlenode = $doc->getElementsByTagName('title'); 
	$title = $titlenode->length > 0 ? $titlenode->item(0)->nodeValue : null;
	
	$metanodes = $doc->getElementsByTagName('meta'); 
	foreach($metanodes as $node) { 
		$key = $node->getAttribute('name'); 
		$val = $node->getAttribute('content'); 
		if (!empty($key)) { $mdata[$key] = $val; } 
	}

	$res = array($url, $title, $mdata); 
	
	return $res;
}

// ------------------------------------------------------ 

?>

Solution 17 - Php

I made this small composer package based on the top answer: https://github.com/diversen/get-meta-tags

composer require diversen/get-meta-tags

And then:

use diversen\meta;

$m = new meta();

// Simple usage, gets title, description, and keywords by default
$ary = $m->getMeta('https://github.com/diversen/get-meta-tags');
print_r($ary);

// With more params
$ary = $m->getMeta('https://github.com/diversen/get-meta-tags', array ('description' ,'keywords'), $timeout = 10);
print_r($ary);

It requires cURL and DOMDocument, like the top answer, and is built the same way, but it has an option for setting the cURL timeout (and for getting all kinds of meta tags).
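
For readers who prefer not to add a package, the timeout idea can be sketched with plain cURL. This is an illustration, not the package's code: fetch_html is a hypothetical name, and the limits (10 seconds, 5 redirects) and the HTTP(S)-only restriction are assumptions that seem reasonable for user-supplied links.

```php
<?php
// Hypothetical helper (names and limits are illustrative): fetch a page with
// explicit timeouts, a redirect cap, and HTTP(S)-only protocols.
function fetch_html(string $url, int $timeout = 10): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_TIMEOUT        => $timeout,
        // Refuse file://, ftp:// and similar schemes for user-supplied links.
        CURLOPT_PROTOCOLS       => CURLPROTO_HTTP | CURLPROTO_HTTPS,
        CURLOPT_REDIR_PROTOCOLS => CURLPROTO_HTTP | CURLPROTO_HTTPS,
    ]);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Treat transport errors and non-2xx responses as "no content".
    if ($body === false || $status < 200 || $status >= 300) {
        return null;
    }
    return $body;
}

// A non-HTTP scheme is rejected before any request is made.
var_dump(fetch_html('file:///etc/hostname'));
```

Returning null on failure gives the caller one value to check before handing the result to DOMDocument.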

Solution 18 - Php

Nowadays, most sites, such as news or blog sites, add meta tags that provide information about the site or a particular article page.

I have created a Meta API which gives you the required metadata, such as OpenGraph, Schema.Org, etc.

Check it out: https://api.sakiv.com/docs

Solution 19 - Php

I've got this working a different way and thought I'd share it. It uses less code than the other answers (I found it elsewhere). I've added a few things to make it load the meta of the page you are on instead of a fixed page. I wanted it to copy the default page title and description into the og tags automatically.

For some reason, though, whichever script I tried, the page loads very slowly online but instantly on WAMP. I'm not sure why, so I'll probably go with a switch case since the site is not huge.

<?php
$url = 'http://sitename.com'.$_SERVER['REQUEST_URI'];
$fp = fopen($url, 'r');

$content = "";

while(!feof($fp)) {
    $buffer = trim(fgets($fp, 4096));
    $content .= $buffer;
}

$start = '<title>';
$end = '<\/title>';

preg_match("/$start(.*)$end/s", $content, $match);
$title = $match[1];

$metatagarray = get_meta_tags($url);
$description = $metatagarray["description"];

echo "<div><strong>Title:</strong> $title</div>";
echo "<div><strong>Description:</strong> $description</div>";
?>

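And in the HTML head (escape the values so that quotes in them cannot break the attributes):

<meta property="og:title" content="<?php echo htmlspecialchars($title, ENT_QUOTES); ?>" />
<meta property="og:description" content="<?php echo htmlspecialchars($description, ENT_QUOTES); ?>" />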
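
Whatever the cause of the slowness, one possible way to avoid fetching the page on every request is to cache the extracted values. The following is only a sketch under stated assumptions: cached_meta, the one-hour lifetime, and the temp-directory location are all made up for illustration, and the fetcher is a stand-in for the fetch-and-parse code above.

```php
<?php
// Hypothetical helper: cache the extracted metadata as JSON in a file so the
// remote fetch only runs when the cache is missing or older than $ttl seconds.
function cached_meta(string $url, callable $fetch, int $ttl = 3600, ?string $dir = null): array
{
    $dir = $dir ?? sys_get_temp_dir();
    $file = $dir . '/meta_' . md5($url) . '.json';

    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        $cached = json_decode((string) file_get_contents($file), true);
        if (is_array($cached)) {
            return $cached;
        }
    }

    $meta = $fetch($url);
    file_put_contents($file, json_encode($meta), LOCK_EX);
    return $meta;
}

// Stand-in fetcher for the demonstration; a real one would download and
// parse the page.
$calls = 0;
$fetch = function (string $url) use (&$calls): array {
    $calls++;
    return ['title' => 'A common title', 'description' => 'This is the description'];
};

$dir = sys_get_temp_dir() . '/meta_cache_' . uniqid();
mkdir($dir);
$first = cached_meta('http://sitename.com/page', $fetch, 3600, $dir);
$second = cached_meta('http://sitename.com/page', $fetch, 3600, $dir);
echo "fetcher calls: $calls\n"; // 1: the second lookup is served from the cache
```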

Solution 20 - Php

An improved version of @shamittomar's answer above that gets all meta tags (or a single named one) from HTML source.

It can be improved further. The difference from PHP's default get_meta_tags is that it works even when the content contains Unicode strings.

function getMetaTags($html, $name = null)
{
	$doc = new DOMDocument();
	try {
		@$doc->loadHTML($html);
	} catch (Exception $e) {
		
	}
	
	$metas = $doc->getElementsByTagName('meta');
	
	$data = [];
	for ($i = 0; $i < $metas->length; $i++)
	{
		$meta = $metas->item($i);

		if (!empty($meta->getAttribute('name'))) {
			// a repeated meta name overwrites the earlier value !!
			$data[$meta->getAttribute('name')] = $meta->getAttribute('content');
		}
	}
	
	if (!empty($name)) {
		return !empty($data[$name]) ? $data[$name] : false;
	}

	return $data;
}
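
As a sketch of one such improvement (get_all_meta_tags is a hypothetical name, not the answer's code), the variant below keeps every value of a repeated meta tag in a list and hints the parser towards UTF-8 with the same XML-declaration prefix that Solution 6 uses. It assumes the input really is UTF-8 encoded.

```php
<?php
// Hypothetical variant (not the answer's code): keep every value of a
// repeated meta tag and treat the input as UTF-8.
function get_all_meta_tags(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // The XML declaration is a common hint that makes libxml read the markup
    // as UTF-8; it assumes $html really is UTF-8 encoded.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $data = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = $meta->getAttribute('name');
        if ($name !== '') {
            // Append instead of overwrite, so duplicates are preserved.
            $data[$name][] = $meta->getAttribute('content');
        }
    }
    return $data;
}

$html = '<html><head>'
      . '<meta name="description" content="Café, naïve, 日本語">'
      . '<meta name="author" content="First">'
      . '<meta name="author" content="Second">'
      . '</head><body></body></html>';
print_r(get_all_meta_tags($html));
```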

Solution 21 - Php

With the PHP Simple HTML DOM class, a few lines are enough to get the page's meta details.

$html = file_get_html($link);
$meta_description = $html->find('head meta[name=description]', 0)->content;
$meta_keywords = $html->find('head meta[name=keywords]', 0)->content;

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | MacMac | View Question on Stackoverflow
Solution 1 - Php | shamittomar | View Answer on Stackoverflow
Solution 2 - Php | Bob Jeey | View Answer on Stackoverflow
Solution 3 - Php | Lloyd Moore | View Answer on Stackoverflow
Solution 4 - Php | Harald | View Answer on Stackoverflow
Solution 5 - Php | Addo Solutions | View Answer on Stackoverflow
Solution 6 - Php | oknate | View Answer on Stackoverflow
Solution 7 - Php | cronoklee | View Answer on Stackoverflow
Solution 8 - Php | SNS - Web et Informatique | View Answer on Stackoverflow
Solution 9 - Php | Joshua | View Answer on Stackoverflow
Solution 10 - Php | sebilasse | View Answer on Stackoverflow
Solution 11 - Php | kevin walker | View Answer on Stackoverflow
Solution 12 - Php | x3m-bymer | View Answer on Stackoverflow
Solution 13 - Php | Jay Dave | View Answer on Stackoverflow
Solution 14 - Php | Geekster | View Answer on Stackoverflow
Solution 15 - Php | Roger | View Answer on Stackoverflow
Solution 16 - Php | sbmark | View Answer on Stackoverflow
Solution 17 - Php | dennis | View Answer on Stackoverflow
Solution 18 - Php | sakiv | View Answer on Stackoverflow
Solution 19 - Php | e11world | View Answer on Stackoverflow
Solution 20 - Php | dav | View Answer on Stackoverflow
Solution 21 - Php | Khandad Niazi | View Answer on Stackoverflow