Closed Bug 555496 Opened 14 years ago Closed 14 years ago

XML invalid character error is reported on a completely unrelated location

Categories

(Core :: XML, defect)

All
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.9.3a4

People

(Reporter: Delineif, Assigned: peterv)

References

Details

Attachments

(2 files)

User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.2) Gecko/20100323 Namoroka/3.6.2
Build Identifier: 

The attached test case is an XML document containing: the xml declaration, a big comment of 212 lines, and <root>Á</root> encoded in iso-8859-1 (making it invalid, since the xml declaration announces UTF-8).

However, Firefox 3.6.2 shows an error on a completely different position, showing the character at the wrong reported location with a replacement character.
(And confusing whoever tries to fix it, since the real error is hundreds of lines below)


Reproducible: Always

Actual Results:  
Error reading XML: bad formed

Line 56, column 64
     copyright notice and this paragraph appear in all copies. [\uffff]he
---------------------------------------------------------------^


That's the content of line 56, column 64, but there's a real T there, not a replacement character.

Expected Results:  
Error reading XML, at line 213, column 7.
<root>Á</root>
------^

The reported position is dependant on the number of tags.
So if we change it to <root><child></child>Á</root>, it moves to line 57, col 12:
copyri[\uffff]ht
and <root><child>Á</child></root> to line 57 col 4.
Attached file Invalid XML test case
Interestingly, directly accessing the attachment doesn't trigger the bug. You need to locally save the file and then open it using file:///.
More exactly, on the network it only fails sometimes, as opposed to the file in local, where it fails always.
I expect that if you press the Shift key while refreshing it breaks most if not all of the time as that will clear the cache.

I also think this may be a regression. Are you willing to help determine whether it is and find the date that this functionality changed in Firefox?

Follow the steps here:
http://quality.mozilla.org/documents-home/bugs-docs/bug-triaging-guidelines/finding-regression-windows

You need to find the switch over day for either mozilla-central or
mozilla-1.9.2

i.e. look in <date>-mozilla-central or alternatively <date>-mozilla-1.9.2

either will do, just pick one and then try builds with different <date>s till
you find when it breaks.
(A) for 0x0C in xml file of <?xml version="1.0" encoding="UTF-8"?>

> Expected Results:  
> Error reading XML, at line 213, column 7.
> <root>Á</root>
> ------^

First line of your xml file.
> <?xml version="1.0" encoding="UTF-8"?>
Last line of your xml file in hexa.
> 0x 3C726F6F743E C1 3C2F726F6F743E
>    < r o o t >     < / r o o t >
> (displyed glyph for 0xC1 depends on viewd charset, font) 

Why 0x0C in UTF-8 can be displayed as glyph of Á?
UTF-8(hex) for Á is 0xC3 0x81 (0xC381).
> http://en.wikipedia.org/wiki/Windows-1252
> http://en.wikipedia.org/wiki/ISO/IEC_8859-1
> http://www.fileformat.info/info/unicode/char/00c1/index.htm
You are writing xml file of <?xml version="1.0" encoding="UTF-8"?> using Windows-1252 or iso-8859-1, aren't you?

(B) for [\uffff] in error message

> Line 56, column 64
>      copyright notice and this paragraph appear in all copies. [\uffff]he
> ---------------------------------------------------------------^

Line 56 of your xml file ( [LF] = 0x0A )
>     copyright notice and this paragraph appear in all copies. The[LF]

Really uffff? ufffd(U+FFFD), isn't it?
> http://www.fileformat.info/info/unicode/char/fffd/index.htm

(C) for [\uffff] in error message at different position

> The reported position is dependant on the number of tags.
> So if we change it to <root><child></child>Á</root>, it moves to line 57, col
12:
> copyri[\uffff]ht
> and <root><child>Á</child></root> to line 57 col 4.

It depends on split point of text node? (internally, 4KB block is used to hold text nodes.) Split at mid of 2bytes utf-8 binary or 3bytes utf-8 binary can happen.

(D) difference between local file and test case attached to this bug. 

(In reply to comment #2)
> Interestingly, directly accessing the attachment doesn't trigger the bug. You
> need to locally save the file and then open it using file:///.

Following is HTTP headers returned by bugzilla.mozilla.org. application/xml is used as Content-Type(it seems b.m.o returns without charset if .xml) if test case is obtained from B.M.O. If local file, it depends on relation between file extention and mime-type which is registrated to system(MS Win's regitry if MS Win). Is problem affected by local mime-type for .xml file?
> https://bug555496.bugzilla.mozilla.org/attachment.cgi?id=435428
> GET /attachment.cgi?id=435428 HTTP/1.1
> Host: bug555496.bugzilla.mozilla.org
>(snip)
>
> HTTP/1.1 200 OK
> Date: Mon, 29 Mar 2010 10:46:45 GMT
> Server: Apache
> Content-Disposition: inline; filename="content3.xml"
> Content-Length: 11105
> X-Backend-Server: pm-app-bugs05
> Keep-Alive: timeout=300, max=1000
> Connection: Keep-Alive
> Content-Type: application/xml; name="content3.xml"
A) WADA, I *know* it is invalid. The problem is that it misreports the location.
It should probably show a replacement character instead of Á in expected result.

B) Good catch. It really is a \uffff Firefox places a box with two lines of ff, not the ? inside a rotated rectangle.

C) I don't think it has to be at a 4kb boundary (plus, the invalid Á is stored in one byte). The original file got many changes before creating that testcase.
There's some arithmetic wrong there, it might not be taking into account the comment, or counting backwards.

D) As I said, most times it shows the correct answer (even with Ctrl+F5), but it sometimes fail (could be related with fetching from cache).
Receive buffer or TCP packet size could be hiding it. It's mostly needed to take into account when checking if it's still present.



Robert, nightlies are 32-bit so I can't directly use them. I was able to try win32 ones on wine and interestingly they don't seem to trigger that behavior. Instead they show <root>[fffd][003c]/root> (0x3c is <, but it's being displayed as an unknown character box)
I can confirm this when loading via file://.  This looks bad (and in particular we seem to be putting U+FFFF in the middle of the comment text!).  Peter, can you take a look, please?
Group: core-security
Status: UNCONFIRMED → NEW
Ever confirmed: true
Blocks: 552509
I looked at this with replay-debugging. The error is in nsScanner::AppendToBuffer. It treats aErrorPos as an offset starting from mCurrentPosition, but it is an offset in the buffer being appended. Regression from bug 174351, patch coming up.

I think that this doesn't cause other harm than just reporting the wrong position in the error message. Since aErrorPos is inside the buffer being appended, the buffer length should always be bigger than aErrorPos.
Assignee: nobody → peterv
Status: NEW → ASSIGNED
Depends on: 174351
Target Milestone: --- → mozilla1.9.3a4
Attached patch v1Splinter Review
I don't think there's an easy way to test this. Reftest won't work I think, since the error page is invalid XML too when serialized and so the reference page would show a different error.
Attachment #435868 - Flags: review?(bzbarsky)
Comment on attachment 435868 [details] [diff] [review]
v1

r=bzbarsky
Attachment #435868 - Flags: review?(bzbarsky) → review+
Since this doesn't sound like a security bug, should we just open it up?
Opening bug up, since this is not a security bug.
Group: core-security
http://hg.mozilla.org/mozilla-central/rev/b974b6dca6ce
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: