Invalid XML bug/feature


#1

Hi everybody!

I want to share a bug (feature?!) I discovered today while working through the Nerdfeed app chapter. The app worked yesterday, but today it displayed only one item in the table. The log pointed me where to look:

2013-12-06 21:34:06.710 Nerdfeed[89687:a0b] <RSSChannel: 0xf3342a0> found a item element
2013-12-06 21:34:06.710 Nerdfeed[89687:a0b] <RSSItem: 0xf334630> found a title element
2013-12-06 21:34:06.711 Nerdfeed[89687:a0b] <RSSItem: 0xf334630> finished title element
2013-12-06 21:34:06.711 Nerdfeed[89687:a0b] <RSSItem: 0xf334630> found a link element
2013-12-06 21:34:06.711 Nerdfeed[89687:a0b] <RSSItem: 0xf334630> finished link element
2013-12-06 21:34:06.712 Nerdfeed[89687:a0b] <RSSItem: 0xf334630> found a description element
2013-12-06 21:34:06.714 Nerdfeed[89687:a0b] <RSSChannel: 0xf3342a0>

The code parsed only one item, even though the feed contained 20. To debug, I amended this code:

- (void)connectionDidFinishLoading:(NSURLConnection *)connection {
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:self.xmlData];
    parser.delegate = self;
    // blocking call!
    [parser parse];

    self.xmlData = nil;
    self.connection = nil;

    if (!parser.parserError) {
        NSLog(@"%@\n%@\n%@", self.channel, self.channel.title, self.channel.infoString);
        [self.tableView reloadData];
    } else {
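        // lineNumber and columnNumber are NSInteger, so cast them for a portable format string
        NSLog(@"Failed to parse XML (line %ld, column %ld): %@!",
              (long)parser.lineNumber, (long)parser.columnNumber,
              parser.parserError.localizedDescription);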
        self.channel = nil;
    }
}

Here’s the error: Failed to parse XML (line 23, column 35): The operation couldn’t be completed. (NSXMLParserErrorDomain error 111.)! It seemed to have something to do with encoding, so I saved the feed to a file and ran this:

$ head -23 samples/1.xml | tail -1 | xxd
…
0000590: 626c 6520 7965 7429 2e20 2052 6567 6172  ble yet).  Regar
00005a0: 6473 2c20 2020 416e 6472 1dc3 a95d 5d3e  ds,   Andr...]]>
00005b0: 3c2f 6465 7363 7269 7074 696f 6e3e 0a    </description>.

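The parser stumbled on the bytes 1d c3 a9 just before ]]>. The c3 a9 pair is valid UTF-8 for é; the offender is the 0x1D byte, a control character that XML 1.0 does not allow. In fact, validator.w3.org/check reports an error for this feed as well.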

(For reference, 0x1D is the INFORMATION SEPARATOR THREE control character, also known as the group separator.)
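
For anyone who wants to locate such a byte without xxd, here is a rough C sketch (C compiles unchanged inside an Objective-C project). It relies on the XML 1.0 rule that, below 0x20, only TAB, LF and CR are legal characters. The names IllegalByte, is_illegal_xml_control and find_illegal_xml_byte are made up for this post, and the column is counted in bytes, so it need not match what parser.columnNumber reports.

```c
#include <stddef.h>

/* Where the first illegal control byte was found (hypothetical helper type). */
typedef struct {
    int found;          /* 1 if an illegal byte was found, 0 otherwise */
    size_t offset;      /* 0-based byte offset into the buffer */
    long line;          /* 1-based line number */
    long column;        /* 1-based column, counted in bytes */
    unsigned char byte; /* the offending byte value */
} IllegalByte;

/* XML 1.0 allows only TAB (0x09), LF (0x0A) and CR (0x0D) below 0x20. */
int is_illegal_xml_control(unsigned char c) {
    return c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D;
}

/* Scans the buffer and reports the first illegal control byte, if any. */
IllegalByte find_illegal_xml_byte(const unsigned char *buf, size_t len) {
    IllegalByte result = {0, 0, 1, 1, 0};
    long line = 1;
    long column = 1;
    for (size_t i = 0; i < len; i++) {
        if (is_illegal_xml_control(buf[i])) {
            result.found = 1;
            result.offset = i;
            result.line = line;
            result.column = column;
            result.byte = buf[i];
            return result;
        }
        if (buf[i] == '\n') {
            line++;
            column = 1;
        } else {
            column++;
        }
    }
    return result;
}
```

This only checks the C0 control range; it does not validate UTF-8 or look for other code points that XML forbids.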

Going further, the link to the post is viewtopic.php?f=4&t=7493&p=21298#p21298, and you can check that the original post in HTML has the same character there. I couldn’t figure out a reasonable way to fix this on iOS, and I probably shouldn’t, since the XML is invalid. Most likely, the smartfeed script needs to be patched to strip or replace such characters (XML 1.0 does not allow 0x1D even as a character reference, so escaping alone would not help).
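
If someone still wants a client-side workaround, one option might be to drop the illegal control bytes before handing the data to the parser. This is only a sketch under a few assumptions: the feed is UTF-8 (or plain ASCII), silently losing those bytes is acceptable, and the function name strip_illegal_xml_controls is made up for this post. Filtering byte by byte is safe for UTF-8 because multi-byte sequences only use bytes of 0x80 and above.

```c
#include <stddef.h>

/* Removes, in place, every byte below 0x20 that XML 1.0 forbids
   (everything except TAB, LF and CR) and returns the new length.
   Assumes UTF-8 or ASCII input; not suitable for UTF-16 data. */
size_t strip_illegal_xml_controls(unsigned char *buf, size_t len) {
    size_t out = 0;
    for (size_t in = 0; in < len; in++) {
        unsigned char c = buf[in];
        int allowed = c >= 0x20 || c == 0x09 || c == 0x0A || c == 0x0D;
        if (allowed) {
            buf[out++] = c; /* keep legal bytes, compacting as we go */
        }
    }
    return out;
}
```

In connectionDidFinishLoading: this could be run over the mutableBytes of an NSMutableData copy of self.xmlData, followed by setLength: with the returned value, before creating the NSXMLParser. It hides the problem rather than fixing it, so patching the feed generator remains the better fix.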

Sorry, this post turned out to be somewhat long.