
How to parse a webpage in C#? Simple text parsing in C# using GetStringInBetween, GetStringBetween

One common requirement, especially when doing screen scraping, is to find strings contained between HTML tags or other strings. Regular expressions provide a powerful way to do so, but they can be intimidating for a beginning programmer. Here is an alternative that finds a string between two other strings using simple string manipulation in C#, without regular expressions:

Usage:

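string myString = "<span>Joe Smith</span>";
// the last two arguments exclude the begin and end strings from the result
string[] result = GetStringInBetween("<span>", "</span>", myString, false, false);
string output = result[0];   // "Joe Smith"
string next = result[1];     // remainder of myString after "</span>" (empty here)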

GetStringInBetween finds the first occurrence of the “begin” and “end” strings and returns the text between them in result[0]. result[1] holds the rest of the source after that match, so you can pass it back in to move down the HTML document and find the next value.

Here is the GetStringInBetween implementation:

public static string[] GetStringInBetween(string strBegin,
    string strEnd, string strSource,
    bool includeBegin, bool includeEnd)
{
    // result[0] = the match, result[1] = the source remaining after the match
    string[] result = { "", "" };
    int iIndexOfBegin = strSource.IndexOf(strBegin);
    if (iIndexOfBegin != -1)
    {
        // include the Begin string if desired
        if (includeBegin)
            iIndexOfBegin -= strBegin.Length;
        strSource = strSource.Substring(iIndexOfBegin
            + strBegin.Length);
        int iEnd = strSource.IndexOf(strEnd);
        if (iEnd != -1)
        {
            // index just past the End string, where the remainder starts
            int iNext = iEnd + strEnd.Length;
            // include the End string if desired
            if (includeEnd)
                iEnd = iNext;
            result[0] = strSource.Substring(0, iEnd);
            // advance beyond this segment (without skipping extra
            // characters when the End string is included)
            if (iNext < strSource.Length)
                result[1] = strSource.Substring(iNext);
        }
    }
    else
        // stay where we are
        result[1] = strSource;
    return result;
}

Notice you can choose to include or exclude the beginning and ending search strings in the result.
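
To collect every value on a page rather than only the first, the "move ahead" step can be wrapped in a loop. The sketch below is only an illustration of that idea: GetAllStringsInBetween and the ScrapeSketch class are hypothetical names that are not part of the utility above, and the helper tracks a position in the source instead of re-slicing the string on every pass. It always excludes the begin and end strings.

```csharp
using System;
using System.Collections.Generic;

public static class ScrapeSketch
{
    // Hypothetical helper: returns every value found between begin and end,
    // scanning the source from left to right.
    public static List<string> GetAllStringsInBetween(string begin, string end, string source)
    {
        if (string.IsNullOrEmpty(begin) || string.IsNullOrEmpty(end))
            throw new ArgumentException("begin and end must be non-empty");

        var values = new List<string>();
        int pos = 0;
        while (true)
        {
            // find the next begin string at or after the current position
            int start = source.IndexOf(begin, pos, StringComparison.Ordinal);
            if (start == -1) break;
            start += begin.Length;

            // find the end string that follows it
            int stop = source.IndexOf(end, start, StringComparison.Ordinal);
            if (stop == -1) break;

            values.Add(source.Substring(start, stop - start));

            // advance beyond this segment
            pos = stop + end.Length;
        }
        return values;
    }

    public static void Main()
    {
        string html = "<span>Joe Smith</span><span>Jane Doe</span>";
        foreach (string name in GetAllStringsInBetween("<span>", "</span>", html))
            Console.WriteLine(name);   // prints "Joe Smith", then "Jane Doe"
    }
}
```

Like the original function, this is plain substring matching, so it assumes the tags appear exactly as written (no attributes, no nesting).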

It's a handy utility that I've been using a lot in some screen scraping projects I've done lately.

Feedback

Posted on 12/29/2011 11:55:03 AM

very nice, very elegant function.
thx ;-)
georges

Posted on 9/22/2011 12:00:46 AM

Dear All,

I need the HTML Script, to extract the text in between two characters in html page, can anyone help me.

Posted on 9/16/2010 11:37:27 AM

thanks ,it work good

Posted on 5/27/2010 8:56:27 PM

Hi,

I am trying to replace text between two specific HTML tags and I used .*, but it did not work. It somehow strips the HTML tags within the text! Here is the RegEx that I use in JavaScript:

var replaceTimeout;
function ReplaceCustomContent(obj, replacement)
{

var HTML = obj.innerHTML;
var Reg = new RegExp("<!\-\-CustomTextToken\-\->.*<!\-\-/CustomTextToken\-\->", 'mig');

var newHTML = HTML.replace(Reg, '<!\-\-CustomTextToken\-\->' + replacement + '<!\-\-/CustomTextToken\-\->');
//var newHTML = HTML.replace(Reg, '<!\-\-CustomTextToken\-\->' + replacement.replace("<","{").replace(">","}") + '<!\-\-/CustomTextToken\-\->');
obj.innerHTML = newHTML;

//alert(HTML == newHTML);
}

function DoCCReplace()
{
ReplaceCustomContent($('divPreviewEdit'), $('textCustom').value);
ReplaceCustomContent($('divPreview'), $('textCustom').value);
}

Thanks for helping me out!

Maybe Korhan could pitch in, please?

Posted on 6/18/2009 10:39:00 AM

just use Regex...

Posted on 5/2/2009 4:42:40 AM

Hey thanks a ton for this function! Solved my problem! Thanks a lot!

Posted on 12/5/2008 2:20:11 PM

Thanks, very post.

Posted on 11/2/2008 7:04:58 AM

Hi Matt,

Thanks for your contribution. Yes, RegEx is the way to go. I didn't know much about how to write a RegEx at the time I wrote this article, but since then I've been using Regular Expressions all the way, similar to the ones you posted.

Thanks again,
Yousef

Posted on 11/2/2008 6:59:54 AM

Hey, I didn't want to do a drive-by criticism of your gigantic block of code without clarifying. What you are doing can be done in just 2 lines of code. Here is an example:

Important!! You need to change the { and } characters into < and > everywhere you see them. Your blog app won't let me post if I include the < and >.

// No 'using' directive to simplify this example
//
// The pattern to match is: (?<={span}).*(?={/span})
//
// Broken down, this means:
//
// Find a {span} tag but exclude it from the match (?<={span})
// Find any number of any characters .*
// Find a {/span} tag but exclude it from the match (?={/span})
//
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex("(?<={span}).*(?={/span})");
System.Text.RegularExpressions.Match match = regex.Match("{span}Joe Smith{/span}");

if (match.Success == true)
System.Windows.Forms.MessageBox.Show(match.Value);
else
System.Windows.Forms.MessageBox.Show("No Match!");

Posted on 11/2/2008 6:42:11 AM

Oh, sheesh. There is no need to write a function to do this. Use regular expressions. This is EXACTLY what they are designed for. They are built into the framework.

Posted on 6/7/2007 2:50:44 AM

Hi,

thanks a lot for this post. this was really helpful.

Thanks,
Hari

Posted on 3/29/2007 3:27:26 PM

regular expressions



Copyright © 2007 Yousef Mannaa. All material on this site is copyrighted.
Do not publish or reproduce any of this material without written permission from the Author