Regular Expressions for malformed title string


#1

As was observed in a previous topic, some of the title strings do not fit the regular expression pattern

@".* :: (.+) :: .*"
For instance, at this moment I see a title string that is:

[quote] 26. UISplitViewController and NSRegularExpression :: Re: Silver Challenge: Swapping the Master …
[/quote]
It looks as if some bit of software truncated it, but I think the problem is really worse than that: I believe I’ve seen title strings that didn’t contain any " :: " parts at all, though none of the current results are like that.

A couple of questions the community might be able to answer…

First, for this particular problem…
Has anyone come up with a regular expression that will match both the regular and “missing author” cases? I’ve tried various versions and have, so far, failed. Rather than flail around indefinitely, I thought I’d ask.

Second, the reference page for the NSRegularExpression class is fine as documentation, but it is not much of a tutorial. The fact that I’ve failed to build a pattern that matches an “optional” author field suggests that I could use some examples of the more complicated aspects. I expected to find a focused guide for this, but can’t see one. Does anybody know of a more tutorial-style document?


#2

I came across this issue as well. I think the cause is that the server truncates RSS item titles that are too long to some maximum length. As a result, the example code’s regular expression does not always find a match in the title. I came up with the following solution.

The title nodes are not always formatted as explained in the book:

<subforum> :: <title> :: <author>

So using the regular expression /.* :: (.*) :: .*/ does not always split the title as expected, and some of the items’ titles in the list view are not correct. The issue is that some of the titles are truncated. There seems to be a character limit on the RSS item title field, and entries that have long titles are formatted like this:

<subforum> :: <title…>

These entries do not return a match from running the regular expression on the title and you end up with the whole original title entry as the RSSItem.title value. To alleviate this I came up with the following solution which I post here in whole as the solution for the bronze challenge:

First off, add a subforum property to the RSSItem header and synthesize it in the implementation file:

@property (nonatomic, strong) NSString *subforum;
...
@synthesize subforum;

Then in the RSSChannel implementation file modify trimItemTitles as follows:

-(void)trimItemTitles {
	// MOST RSS Item titles follow this pattern:
	NSRegularExpression *normalTitleReg = [[NSRegularExpression alloc] initWithPattern:@"(.*) :: (.*) :: .*" options:0 error:nil];
	// Some are truncated and follow this pattern:
	NSRegularExpression *truncatedTitleReg = [[NSRegularExpression alloc] initWithPattern:@"(.*) :: (.*)" options:0 error:nil];
	
	// This will hold the item title, matches, result and range:
	NSString				*itemTitle;
	NSArray					*matches;
	NSTextCheckingResult	*result;
	NSRange					range;
	
	for(RSSItem *i in [self items]) {
		// Grab a copy of the title string:
		itemTitle = [i title];

		// Try to match the normal title expression
		matches = [normalTitleReg matchesInString:itemTitle options:0 range:NSMakeRange(0, [itemTitle length])];
		
		// If there are matches:
		if([matches count] > 0) {
			// Grab the first match:
			result = [matches objectAtIndex:0];
			// There must be 3 ranges, one for the total 
			// pattern and one for each parenthesis group:
			if([result numberOfRanges] == 3) {
				// We want the third match for the title:
				range = [result rangeAtIndex:2];
				[i setTitle:[itemTitle substringWithRange:range]];
				
				// And the second match for the subforum:
				range = [result rangeAtIndex:1];
				[i setSubforum:[itemTitle substringWithRange:range]];
			}
		} else if([matches count] == 0) { // If there were no matches, then we got a truncated title...
			// Try to match the truncated expression: 
			matches = [truncatedTitleReg matchesInString:itemTitle options:0 range:NSMakeRange(0, [itemTitle length])];
			// If there are matches, repeat as above...
			if([matches count] > 0) {
				// Grab the first match:
				result = [matches objectAtIndex:0];
				// There must be 3 ranges, one for the total 
				// pattern and one for each parenthesis group:
				if([result numberOfRanges] == 3) {
					// We want the third match for the title:
					range = [result rangeAtIndex:2];
					[i setTitle:[itemTitle substringWithRange:range]];
					
					// And the second match for the subforum:
					range = [result rangeAtIndex:1];
					[i setSubforum:[itemTitle substringWithRange:range]];
				}
			}
		}
	}
}

Finally, in the implementation file of ListViewController, display the item’s subforum in the cell by modifying tableView:cellForRowAtIndexPath:

-(UITableViewCell*)tableView:(UITableView*)tableView cellForRowAtIndexPath:(NSIndexPath*)indexPath {
	
	UITableViewCell *cell = [tableView dequeueReusableCellWithIdentifier:@"UITableViewCell"];
	
	if(cell == nil) {
		cell = [[UITableViewCell alloc] initWithStyle:UITableViewCellStyleSubtitle reuseIdentifier:@"UITableViewCell"];
	}
	
	RSSItem *item = [[channel items] objectAtIndex:[indexPath row]];
	
	[[cell textLabel] setText:[item title]];
	[[cell detailTextLabel] setText:[item subforum]];
	
	return cell;
}

#3

I used the following Regex, although it truncates the “…” from the trimmed title.

NSRegularExpression* reg = [[NSRegularExpression alloc] initWithPattern:@"(.*) :: (?:Re: )?(.*) (?:\\.\\.\\.|:: .*)" options:0 error:nil];

I tried to use capture groups with alternation (|), as follows, so as not to drop the “…” from the title, but it didn’t work as I expected.

NSRegularExpression* reg = [[NSRegularExpression alloc] initWithPattern:@"(.*) :: (?:Re: )?(?:(.*) :: .*|(.*)\\.\\.\\.)" options:0 error:nil];

The second Regex pattern has the shape @"(A) (?:B) (?:(C)|(D))".
With this pattern, the index of capture group (C) is 2 and that of (D) is 3, and the NSRange for whichever of (C) or (D) did not take part in the match is {location:2147483647, length:0}, that is, its location is NSNotFound. Passing that range to substringWithRange: raises an out-of-range exception.
So this pattern needs some extra, slightly more complicated code.

If someone knows a better way, please share it with us.
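
One possible way to handle the unused alternative is to test each group's range against NSNotFound before extracting it. The sketch below assumes ARC; the helper names FirstParticipatingGroup and TitleFromAlternationPattern are made up for illustration and are not from the book.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the text of the first capture group listed in
// `indexes` that actually took part in the match, or nil if none did.
static NSString *FirstParticipatingGroup(NSTextCheckingResult *result,
                                         NSString *source,
                                         NSArray *indexes)
{
    for (NSNumber *index in indexes) {
        NSUInteger groupIndex = [index unsignedIntegerValue];
        if (groupIndex >= [result numberOfRanges]) {
            continue;
        }
        NSRange range = [result rangeAtIndex:groupIndex];
        // A group that did not participate reports {NSNotFound, 0}.
        if (range.location != NSNotFound) {
            return [source substringWithRange:range];
        }
    }
    return nil;
}

// Hypothetical wrapper around the alternation pattern from this post.
static NSString *TitleFromAlternationPattern(NSString *itemTitle)
{
    NSRegularExpression *reg = [[NSRegularExpression alloc]
        initWithPattern:@"(.*) :: (?:Re: )?(?:(.*) :: .*|(.*)\\.\\.\\.)"
                options:0
                  error:nil];
    NSTextCheckingResult *result =
        [reg firstMatchInString:itemTitle
                        options:0
                          range:NSMakeRange(0, [itemTitle length])];
    if (result == nil) {
        return nil;
    }
    // Group 2 belongs to the "full title" branch, group 3 to the "truncated" branch.
    return FirstParticipatingGroup(result, itemTitle, @[@2, @3]);
}
```

Only one of groups 2 and 3 can participate in a given match, so the helper simply returns whichever one has a valid range.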


#4

I modified the regex pattern as follows to leave “…” and cope with some other title patterns.

@"(.*?) :: (?:Re: )?(.*?)(?:(?: :: .*)|\\z)"

I verified it with the following four texts:
“5. Your Second Activity :: Re: Please clarify these Challenges :: Reply by phillips”
“2. Android and :: Re: Previous Button Challenge Help Needed :: Reply by …”
“2. Android and :: No resource found… :: Author arnjmllr”
“2. Android and :: Re: SPAN_EXCLUSIVE_EXCLUSIVE spans cannot have a zero …”

With that pattern, capture group 2 will be the following:
“Please clarify these Challenges”
“Previous Button Challenge Help Needed”
“No resource found…”
“SPAN_EXCLUSIVE_EXCLUSIVE spans cannot have a zero …”

Hope this helps someone.
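
For anyone who wants to repeat the check, here is a minimal sketch. It assumes the pattern is the concatenation of the pieces explained piece by piece later in this thread, and that ARC is enabled; PostTitleFromFeedTitle is a made-up name, not part of the book's code.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns capture group 2 (the post title), or nil when
// the feed title does not match the pattern at all.
static NSString *PostTitleFromFeedTitle(NSString *feedTitle)
{
    NSRegularExpression *reg = [[NSRegularExpression alloc]
        initWithPattern:@"(.*?) :: (?:Re: )?(.*?)(?:(?: :: .*)|\\z)"
                options:0
                  error:nil];
    NSTextCheckingResult *result =
        [reg firstMatchInString:feedTitle
                        options:0
                          range:NSMakeRange(0, [feedTitle length])];
    if (result == nil || [result numberOfRanges] < 3) {
        return nil;
    }
    NSRange range = [result rangeAtIndex:2];
    if (range.location == NSNotFound) {
        return nil;
    }
    return [feedTitle substringWithRange:range];
}
```

Called on the four texts above, it should give the four strings listed for capture group 2. A title with no " :: " in it does not match, so the helper returns nil and the caller can keep the original title.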


#5

[quote=“QuestionDriven”]I modified the regex pattern as follows to leave “…” and cope with some other title patterns.[/quote]

Very impressive QuestionDriven! Your regex almost made my head implode! :open_mouth:

Just like cprince53, I also found the documentation for the NSRegularExpression extremely difficult for a beginner to understand, and almost offensive :angry: . So, for my solution, I kept my regex pattern very simple, with the slight downside of having a longer piece of code. I also believe my solution covers all different post title scenarios.

My apologies for the excruciatingly long method calls in my example; I just found it easier to understand in this particular case.

- (void)trimItemTitles
{
    // This regex captures all post titles, for real (or at least all the different ones I've found so far)
    // For post titles with only 2 segments ( x :: x ) this results in subForum at range:1 and title at range:2, pretty straightforward
    // For post titles with 3 segments ( x :: x :: x ) this results in subForum AND title at range:1 (x :: x), which I then break down again using the very same regex. Range 2 here would be the author, so I just discard it.

    NSRegularExpression *regex = [[NSRegularExpression alloc] initWithPattern:@"(.*) :: (.*)" options:0 error:nil];

    // Secondary regex to remove the "Re: ". I now understand that we can fold it into a single regex
    NSRegularExpression *re = [[NSRegularExpression alloc] initWithPattern:@"\\bRe: (.*)" options:0 error:nil];
    
    for (RSSItem *i in items) {
        NSString *itemTitle = [i title];
        
        NSLog(@"%@", itemTitle);
        
        NSArray *matches = [regex matchesInString:itemTitle options:0 range:NSMakeRange(0, [itemTitle length])];

        if ([matches count] > 0) {
            NSTextCheckingResult *result = [matches objectAtIndex:0];
            
            if ([result numberOfRanges] == 3) {
                
                NSArray *matches2 = [regex matchesInString:[itemTitle substringWithRange:[result rangeAtIndex:1]] options:0 range:NSMakeRange(0, [[itemTitle substringWithRange:[result rangeAtIndex:1]] length])];
                
                if ([matches2 count] > 0) {
                    NSTextCheckingResult *result2 = [matches2 objectAtIndex:0];
                    
                    NSRange r2 = [result2 range];
                    NSLog(@"Match2 at {%lu, %lu} for %@", (unsigned long)r2.location, (unsigned long)r2.length, [itemTitle substringWithRange:[result rangeAtIndex:1]]);

                    if ([result2 numberOfRanges] == 3) {
                        
                        NSArray *reMatches = [re matchesInString:[[itemTitle substringWithRange:[result rangeAtIndex:1]] substringWithRange:[result2 rangeAtIndex:2]] options:0 range:NSMakeRange(0, [[[itemTitle substringWithRange:[result rangeAtIndex:1]] substringWithRange:[result2 rangeAtIndex:2]] length])];

                        if ([reMatches count] > 0) {
                            NSTextCheckingResult *reResult = [reMatches objectAtIndex:0];

                            NSRange reR = [reResult rangeAtIndex:1];

                            NSLog(@"Re: match at {%lu, %lu} for %@", (unsigned long)reR.location, (unsigned long)reR.length, [[itemTitle substringWithRange:[result rangeAtIndex:1]] substringWithRange:[result2 rangeAtIndex:2]]);

                            [i setSubforum:[[itemTitle substringWithRange:[result rangeAtIndex:1]] substringWithRange:[result2 rangeAtIndex:1]]];
                            [i setTitle:[[[itemTitle substringWithRange:[result rangeAtIndex:1]] substringWithRange:[result2 rangeAtIndex:2]] substringWithRange:reR]];
                        } else {
                            [i setSubforum:[[itemTitle substringWithRange:[result rangeAtIndex:1]] substringWithRange:[result2 rangeAtIndex:1]]];
                            [i setTitle:[[itemTitle substringWithRange:[result rangeAtIndex:1]] substringWithRange:[result2 rangeAtIndex:2]]];
                        }
                    }
                } else {
                    
                    NSArray *reMatches = [re matchesInString:[itemTitle substringWithRange:[result rangeAtIndex:2]] options:0 range:NSMakeRange(0, [[itemTitle substringWithRange:[result rangeAtIndex:2]] length])];
                    
                    if ([reMatches count] > 0) {
                        NSTextCheckingResult *reResult = [reMatches objectAtIndex:0];
                        
                        NSRange reR = [reResult rangeAtIndex:1];
                        
                        NSLog(@"Re: match at {%lu, %lu} for %@", (unsigned long)reR.location, (unsigned long)reR.length, [itemTitle substringWithRange:[result rangeAtIndex:2]]);
                        
                        [i setSubforum:[itemTitle substringWithRange:[result rangeAtIndex:1]]];
                        [i setTitle:[[itemTitle substringWithRange:[result rangeAtIndex:2]] substringWithRange:reR]];
                    } else {
                        [i setSubforum:[itemTitle substringWithRange:[result rangeAtIndex:1]]];
                        [i setTitle:[itemTitle substringWithRange:[result rangeAtIndex:2]]];
                    }
                }
            }
        }
    }
}
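
The same two-pass idea can also be sketched with intermediate variables to keep the method calls short. This is only a sketch under some assumptions: ARC is enabled, the helper names SplitAtLastSeparator and SubforumAndTitle are made up, and a plain hasPrefix: check is swapped in for the secondary "Re: " regex, so it only strips a "Re: " at the very start of the title.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: applies the greedy two-group pattern once and returns
// the two captured strings, or nil when there is no " :: " separator.
static NSArray *SplitAtLastSeparator(NSString *text)
{
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"(.*) :: (.*)"
                                                  options:0
                                                    error:nil];
    NSTextCheckingResult *result =
        [regex firstMatchInString:text
                          options:0
                            range:NSMakeRange(0, [text length])];
    if (result == nil) {
        return nil;
    }
    return @[[text substringWithRange:[result rangeAtIndex:1]],
             [text substringWithRange:[result rangeAtIndex:2]]];
}

// Hypothetical helper: returns @[subforum, title], or nil if nothing matched.
static NSArray *SubforumAndTitle(NSString *feedTitle)
{
    NSArray *outer = SplitAtLastSeparator(feedTitle);
    if (outer == nil) {
        return nil;
    }
    // With three segments the left part still contains " :: ", so split it
    // again and discard the author that ended up on the right.
    NSArray *inner = SplitAtLastSeparator(outer[0]);
    NSArray *parts = (inner != nil) ? inner : outer;

    NSString *title = parts[1];
    if ([title hasPrefix:@"Re: "]) {
        title = [title substringFromIndex:4];
    }
    return @[parts[0], title];
}
```

Inside trimItemTitles, the two returned strings could then be passed to setSubforum: and setTitle:, leaving the item untouched when the result is nil.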

#6

Hi Plastic,

Let me explain the regex pattern one by one.

"(.*?) :: "
    "()"   --> capturing parentheses (1st capture group)
    "."    --> match any character
    "*?"   --> match 0 or more times. match as few times as possible. non-greedy match
    " :: " --> literal. match exactly " :: "

"(?:Re: )?"
    "(?:)" --> non capturing parentheses
    "Re: " --> literal. match exactly "Re: "
    "?"    --> match zero or one times. Prefer one. With this "?", matching "Re: " becomes optional

"(.*?)"    --> same as above. (2nd capture group)

"(?:(?: :: .*)|\\z)"
    "(?:)" --> non capturing parentheses
    
    "(?:)" --> non capturing parentheses
    " :: " --> literal. match exactly " :: "
    "."    --> match any character
    "*"    --> match 0 or more times. match as many times as possible, greedy match

    "|"    --> alternation (logical OR). note that the outcomes of "A|B" and "B|A" might differ.
    
    "\\"   --> the Objective-C string literal "\\" produces a single "\", so the regex engine sees "\z"
    "\z"   --> match if the current position is at the end of input.

The most difficult point for me was working out that the 1st and 2nd capture groups need a non-greedy match.
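
To see the difference between greedy and non-greedy matching in isolation, a small sketch like the following may help. FirstGroup is a made-up helper name, and the code assumes ARC.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the text of capture group 1 for the first
// match of `pattern` in `text`, or nil when there is no match.
static NSString *FirstGroup(NSString *pattern, NSString *text)
{
    NSRegularExpression *reg =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:nil];
    NSTextCheckingResult *result =
        [reg firstMatchInString:text
                        options:0
                          range:NSMakeRange(0, [text length])];
    if (result == nil || [result numberOfRanges] < 2) {
        return nil;
    }
    NSRange range = [result rangeAtIndex:1];
    if (range.location == NSNotFound) {
        return nil;
    }
    return [text substringWithRange:range];
}
```

With the text @"A :: B :: C", FirstGroup(@"(.*) :: ", text) gives "A :: B" because the greedy group runs up to the last separator, while FirstGroup(@"(.*?) :: ", text) gives "A" because the non-greedy group stops at the first one.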

Honestly, I am really a beginner with regular expressions.
My experience with regex was limited to simple user input validation (alphanumeric checks, length, etc.), and I didn’t even know what a capture group was, or what greedy and non-greedy meant.
I started learning it seriously only after finishing this chapter and reading the following entry posted by alberto.
http://forums.bignerdranch.com/viewtopic.php?f=238&t=4414&p=10755#p12006

My learning resources are mainly the following two sites, and I have just ordered the O’Reilly book “Mastering Regular Expressions” on rexegg.com’s recommendation.
http://www.rexegg.com/regex-disambiguation.html
http://www.regular-expressions.info/alternation.html

Hope this will help.
Happy Learning!!

QuestionDriven


#7

This is incredible.

Thank you so much!

Gilmar


#8

I handled the truncated title differently, not necessarily better. Let me know if there is an issue.

-(void)trimItemTitles
{
//Let's first trim the titles so they are just the titles. Maybe pull out the Chapter later
    NSRegularExpression *titleReg = [[NSRegularExpression alloc] initWithPattern:@"(.*) :: (.*) :: .*"
                                                                         options:0
                                                                           error:nil];
    
    for (RSSItem *i in items)
    {
        NSLog(@"Title: %@", [i title]);
    }
    
    for (RSSItem *i in items)
    {
        NSString *itemTitle = [i title];
        NSArray *matches = [titleReg matchesInString:itemTitle options:0 range:NSMakeRange(0,[itemTitle length])];
        if ([matches count] > 0)
        {
            NSTextCheckingResult *result = [matches objectAtIndex:0]; //This is all three groups
            NSString *postTitle = [itemTitle substringWithRange:[result rangeAtIndex:2]];
            [i setTitle:postTitle];
        }
        else
        {
            //If the title is too long the third section is truncated by the server and replaced with ...
            //This fixes that and removes the ...
            //However, the title will not be truly correct, so we'll have to make up for that later
            //The dots are escaped so they match literal periods; an unescaped . matches any character
            NSRegularExpression *titleReg2 = [[NSRegularExpression alloc] initWithPattern:@"(.*) :: (.*) \\.\\.\\."
                                                                                 options:0
                                                                                   error:nil];
            NSArray *match2 = [titleReg2 matchesInString:itemTitle options:0 range:NSMakeRange(0, [itemTitle length])];
            if ([match2 count] > 0)
            {
                NSTextCheckingResult *result = [match2 objectAtIndex:0]; //This is all three groups
                NSString *postTitle = [itemTitle substringWithRange:[result rangeAtIndex:2]];
                [i setTitle:postTitle];
            }
            
        }

    }
    
//Now find those posts without a Re:
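    //Anchored with ^ so that only titles beginning with Re: count as replies; the code below strips the first 4 characters
    NSRegularExpression *reReg = [[NSRegularExpression alloc] initWithPattern:@"^Re: "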
                                                                    options:0
                                                                      error:nil];
    //Create the parentArray to hold each postArray
    NSMutableArray *parentArray = [[NSMutableArray alloc] init];
    //If a post does not have Re: remove from items and add it to a postarray then to parentArray
    for (int i = [items count]; i > 0; i--)
    {
        RSSItem *currentItem = [items objectAtIndex:i-1];
        NSString *currentTitle = [currentItem title];
        NSArray *matches = [reReg matchesInString:currentTitle options:0 range:NSMakeRange(0, [currentTitle length])];
        if ([matches count] == 0) //Re: was not found, thus it is an original post
        {
            NSMutableArray *postArray = [[NSMutableArray alloc] init];
            [postArray addObject:currentItem];
            [parentArray insertObject:postArray atIndex:0];
            [items removeObjectAtIndex:i-1];
            NSLog(@"Item # is %i", i-1);
        }
    }
    
    //Let's verify the only items left have Re: at beginning
    for (RSSItem *i in items)
    {
        NSLog(@"Title: %@", [i title]);
    }
    
//Determine if remaining posts are a part of postArray or should be a new array
//Will temporarily remove the Re: for the comparison
//Get the count before because the count changes
    for (int i = 0; i < [items count]; i++)
    {
        //Create a found boolean
        BOOL itemAdded = NO;
        //Get the current item title without the Re:. All will have Re:
        RSSItem *currentItem = [items objectAtIndex:i];
        NSString *currentItemTitle = [currentItem title];
        NSString *currentItemTitleForComp = [currentItemTitle stringByReplacingCharactersInRange:NSMakeRange(0, 4) withString:@""];
        
        for (NSMutableArray *a in parentArray)
        {
            //just for clarity get the item at loc 0
            RSSItem *parentItem = [a objectAtIndex:0];
            //Must check for Re: at beginning of title as it could have been added through this process
            //If Re: is in the title remove it for the comparison.
            NSString *parentItemTitle = [parentItem title];
            //Use hasPrefix: rather than substringWithRange:, which throws for titles shorter than 4 characters
            NSString *parentItemTitleForComp = parentItemTitle;
            if ([parentItemTitle hasPrefix:@"Re: "])
            {
                parentItemTitleForComp = [parentItemTitle substringFromIndex:4];
            }
            
            if ([currentItemTitleForComp isEqualToString:parentItemTitleForComp])
            {
//                NSLog(@"%@ was equal to %@", currentItemTitleForComp, parentItemTitleForComp);
                
                [a addObject:currentItem];
//                NSLog(@"%@ added to %@", [currentItem title], [[a objectAtIndex:0] title]);
                itemAdded = YES;
                break;
            }
        }
        //If the loop went all the way through parentArray without finding a match b/w item and post in parentArray
        //Add a new postArray to parentArray
        //Must be done within loop so other replies in the same thread can be added.
        if (!itemAdded)
        {
            NSMutableArray *postArray = [[NSMutableArray alloc] initWithObjects:currentItem, nil];
            [parentArray addObject:postArray];
//            NSLog(@"%@ added to its own array", [currentItem title]);
        }
    }

    int parentIndex = 0;
    for (NSArray *postArray in parentArray)
    {
        NSLog(@"Subarray %i", parentIndex);
        for (RSSItem *item in postArray)
        {
            NSString *itemTitle = [item title];
            NSLog(@"Title: %@", itemTitle);
        }
        parentIndex ++;
    }
    [items removeAllObjects];
    [items addObjectsFromArray:parentArray];
}

@end

#9

Just want to point out that it is not necessary to use a regex for this particular problem. The following code:

NSArray *substringArray = [title componentsSeparatedByString:@" :: "];
will break the string into separate strings and discard the separator. And this:

self.title = substringArray[1];
should always be the title, unless there is no separator at all; in that case the array holds a single element, [0], which is the original string, so check the count before reading [1].

Not as fun as using a regex, but much easier to code and more efficient when a string has clearly defined separators.
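
A guarded version of that idea might look like the sketch below. TitleBySplitting is a made-up helper name, and falling back to the whole string when there is no separator is an assumption about the desired behavior.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: picks the post title out of a feed title by splitting
// on the " :: " separator instead of using a regex.
static NSString *TitleBySplitting(NSString *feedTitle)
{
    NSArray *parts = [feedTitle componentsSeparatedByString:@" :: "];
    // With no separator the array holds only the original string at index 0,
    // so reading parts[1] would raise an NSRangeException.
    if ([parts count] < 2) {
        return feedTitle;
    }
    return parts[1];
}
```

This also covers truncated titles, since the second component is the title whether or not a third " :: " segment follows it.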