Thursday, December 11, 2008

Regular expression to get ALL images in a page

I have been using a RegEx for a while that matches all image tags in an HTML document. Suddenly, one of the applications using the function started ignoring images with a space in the file name.

I had to modify my RegEx a bit... A great tool to use when you need a RegEx editor is RegexBuddy. It always helps me figure out what I'm doing wrong, and it helps me make sense of expressions gathered from the Internet.

Here is my RegEx that matches all images, including those whose src attribute has a space in the file name:

<img[^>]*?src\s*=\s*["']?(?<Filename>[^'">]+\s*)[ '"][^>]*?>


And the C# function that collects the image paths into a list:


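// Requires: using System.Collections.Generic;
//           using System.Text.RegularExpressions;
public static List<String> GetImagesFromContent(String Html)
{
    List<string> UrlList = new List<string>();

    // Verbatim string: "" is an escaped double quote; (?<Filename>...) is a named group.
    string regExPattern = @"<img[^>]*?src\s*=\s*[""']?(?<Filename>[^'"">]+\s*)[ '""][^>]*?>";

    // Combine the two options with the bitwise OR operator.
    Regex r = new Regex(regExPattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    MatchCollection matches = r.Matches(Html);

    foreach (Match m in matches)
    {
        // Collect the value captured by the named group for each <img> tag.
        UrlList.Add(m.Groups["Filename"].Value);
    }

    return UrlList;
}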
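To check the behaviour against a file name containing a space, here is a minimal self-contained sketch. The class name ImageSrcDemo, the method name Extract and the sample HTML are made up for illustration and are not part of the original application. The pattern is the one from this post, with the named group written as (?<Filename>...), which is the .NET syntax for named groups.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names here are illustrative only.
public static class ImageSrcDemo
{
    // Same pattern as in the post, as a C# verbatim string ("" is an escaped quote).
    private static readonly Regex ImgRegex = new Regex(
        @"<img[^>]*?src\s*=\s*[""']?(?<Filename>[^'"">]+\s*)[ '""][^>]*?>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the src value of every <img> tag found in the given HTML.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in ImgRegex.Matches(html))
        {
            urls.Add(m.Groups["Filename"].Value);
        }
        return urls;
    }
}
```

Calling ImageSrcDemo.Extract on the string <p><img src="images/my photo.jpg" alt="x"><IMG SRC='logo.png'/></p> should return two entries: images/my photo.jpg and logo.png. Keep in mind that a RegEx like this is a pragmatic shortcut; for messy real-world markup an HTML parser is likely to be more robust.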
