Add optional byte offsets to nodes #25

rgrove · 2023-01-17T00:11:52Z

This adds a new parser option named includeOffsets, which defaults to false. When includeOffsets is true, the starting and ending byte offsets of each node in the input string will be made available via start and end properties on the node.

Closes #24

This adds a new parser option named `includeOffsets`, which defaults to `false`. When `includeOffsets` is `true`, the byte offset of each node in the input string will be made available via an `offset` property on the node. Note that this offset doesn't take into account any carriage return (`\r`) characters in the input string because carriage returns are removed during a normalization step before parsing begins. Closes #24

wooorm · 2023-01-17T10:43:33Z

Awesome!! Looks like a good start for my needs! But, I would need both start and end offsets.

I think more folks that would use this feature might want it too: it’s pretty common in acorn/babel/swc/postcss/and the stuff I work on to have both.

Looking at the code, every call to addNode already stashes the start index before doing work, so it should be able to access and end index after that work, right at the addNode call, too.
(Only exception I think is that addText can merge text nodes, in which case it should update the end index of the previous sibling).

The CR+LF support (and likely BOM), I would quite like. But it’s not essential for me to start using this. It’s probably hard to add. I think I might be able to solve that myself with vfile-location!

rgrove · 2023-01-17T17:51:11Z

Awesome!! Looks like a good start for my needs! But, I would need both start and end offsets.

Ah, you did say that in the original ticket, but I overlooked it! Shouldn't be hard to add.

The CR+LF support (and likely BOM), I would quite like. But it’s not essential for me to start using this. It’s probably hard to add. I think I might be able to solve that myself with vfile-location!

I got partway into implementing CR support when I remembered that the reason for normalizing line endings before parsing is that the XML spec seems to require it:

To simplify the tasks of applications, the XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.

— https://www.w3.org/TR/2008/REC-xml-20081126/#sec-line-ends

The presence of #xD in the above production is maintained purely for backward compatibility with the First Edition. As explained in 2.11 End-of-Line Handling, all #xD characters literally present in an XML document are either removed or replaced by #xA characters before any other processing is done.

— https://www.w3.org/TR/2008/REC-xml-20081126/#NT-S

It's possible I'm being too strict in my interpretation though, because there are references to CR characters elsewhere in the spec that wouldn't make sense if they were all removed before parsing. I think I'll need to do some comparisons with other widely used XML parsers to see how they handle things.

wooorm · 2023-01-17T18:13:15Z

Yeah I saw that. And I think it’s a fine position. Though, the word “behave” here can be interpreted as that CR/LF/CRLF all have to behave the same as LF. Not specifically that they’re removed.

You know more about parsing XML than I do, but I’m guessing that there’s no real difference in one LF and two LFs?

Which then is also interesting, because the note says that CRLF can be swapped with either LFLF, or LF. So what then to do with positional info?

rgrove · 2023-01-17T18:19:06Z

My understanding (and how parse-xml currently works) is that a CRLF sequence should be normalized to LF (not LFLF), while a CR that isn't followed by a LF should also be normalized to LF.

So the sequence \r\n\r becomes \n\n, not \n\n\n. The end result is exactly as many line endings as before, but they're all LF line endings and not CRLF or CR. While the number of line endings will be the same, the number of characters won't be.

wooorm · 2023-01-17T18:25:44Z

Yeah, I agree. That’s the only interpretation that makes sense. But how I read the second one seems to allow swapping all CRs for LFs

This will allow us to accurate report byte offsets in input strings that contain carriage return characters. While the additional calls to `normalizeLineBreaks()` may seem like a potential performance problem, rewriting the normalization to avoid a regex has actually resulted in a very small performance improvement overall.

rgrove · 2023-01-22T02:00:10Z

Updated to include both start and end offsets. Additionally, line endings are now normalized during parsing instead of before parsing, so the reported offsets are correct even when the input string contains carriage returns.

wooorm

Changes look good, maybe:

docs
use this new functionality in errors?

wooorm · 2023-01-22T17:04:41Z

Small weird thing is that ends are added in front of starts on the objects, so the ordering is a bit funky:

<ref *1> XmlDocument {
  parent: null,
  end: 137,
  start: 0,
  children: [
    XmlElement {
      parent: [Circular *1],
      end: 136,
      start: 0,
      ...
      attributes: [Object: null prototype],
      children: [Array]
    }
  ]
}

wooorm · 2023-01-22T17:33:50Z

Might be nice to have a type of the errors too btw, that includes types for line/column/pos/excerpt, I don’t think I see that?

wooorm · 2023-01-22T18:22:06Z

OK, parser looks really good! :) I have a couple ideas / bugs / questions to report:

Idea: it would be really nice if different errors have some .code or so attached, to represent their “kind”, so that they can be programatically mapped or translated to other messages
Bug: (whitespace?) text around the root element is not present in the tree, e.g., a final \n or one before the root
Bug: initial/final whitespace seems weird in comments/cdata. Take for example:

<root>
  <!--
  This
  comment
  has
  multiple
  lines
  -->
</root>

The text I get back for that comment is 'This\n comment\n has\n multiple\n lines', but I would expect initial and final whitespace. Similar for CDATA:

<root>
  <script type="text/javascript">
    // <![CDATA[
    alert(1)
    // ]]>
  </script>
</root>

I get the value '\n alert(1)\n // \n ' back, but the final '\n ' is not in the cdata (I think these should be in a separate text?)

Question: should lowercase doctype crash?
Question: should documents without a root element crash?
Question: should documents that have CDATA before the root element crash? (Maybe makes sense 🤔)

rgrove · 2023-01-23T01:37:00Z

docs

API docs are included with these changes and will be built by TypeDoc when this goes to main. Was there something more you were looking for?

use this new functionality in errors?

Good idea.

Small weird thing is that ends are added in front of starts on the objects, so the ordering is a bit funky:

Alphabetical order strikes again! Will fix.

Might be nice to have a type of the errors too btw, that includes types for line/column/pos/excerpt, I don’t think I see that?

Good call. This is something I overlooked in the conversion to TypeScript for 4.0. Tracked in #27.

Idea: it would be really nice if different errors have some .code or so attached, to represent their “kind”, so that they can be programatically mapped or translated to other messages

Someone else also suggested this the other day in #26. Sounds like it's worth considering.

Bug: (whitespace?) text around the root element is not present in the tree, e.g., a final \n or one before the root

I think this is the correct behavior according to the spec. It also matches what libxml2 and DOMParser do. Section 2.1 of XML 1.0 5th ed. defines a document as:

document ::= prolog element Misc*

This allows whitespace, comments, and a few other things to exist before or after the root element, but whitespace in a Misc construct is ignored rather than treated as text content.

Bug: initial/final whitespace seems weird in comments/cdata.

parse-xml currently trims comment content. The spec neither requires nor forbids this, but I can see how it would be surprising. libxml2 doesn't appear to trim comment content, so I suppose we could consider this a bug. Filed as #28.

I get the value '\n alert(1)\n // \n ' back, but the final '\n ' is not in the cdata (I think these should be in a separate text?)

By default parse-xml generates XmlText nodes for CDATA content. Since consecutive text nodes are compacted, this CDATA content ends up being merged with the preceding and following text content into a single XmlText node. If you set the preserveCdata parser option to true, then CDATA content will be represented by XmlCdata nodes and you'll get what you expect here.

Question: should lowercase doctype crash?

Yep, and currently that appears to be the case. Are you seeing something different?

Question: should documents without a root element crash?

Yep, a document must have one or more elements, so the root element must exist.

Question: should documents that have CDATA before the root element crash? (Maybe makes sense 🤔)

Yep, the spec doesn't allow CDATA before the root element:

You're not by any chance trying to parse HTML, are you? 😬

#25 (comment)

rgrove · 2023-01-23T01:57:51Z

use this new functionality in errors?

On second thought, after looking at the error() method again, I'm not seeing anything obvious that could be improved by using the new offset stuff. When an error occurs there's no guarantee that a node with relevant offsets has been created, so it doesn't look like we can use that for anything. Did you have any particular improvements in mind?

wooorm · 2023-01-23T08:58:37Z

API docs are included with these changes and will be built by TypeDoc when this goes to main. Was there something more you were looking for?

I didn’t see a readme.md change in this PR, assuming there would be some options there, but there aren’t any there so that’s fine!

Someone else also suggested this the other day […]. Sounds like it's worth considering.

✨

If you set the preserveCdata parser option to true, then CDATA content will be represented by XmlCdata nodes and you'll get what you expect here.

I have this on but I still see it. I think it’s this:

parse-xml/src/lib/Parser.ts

Line 73 in 18e5e98

if (prevNode instanceof XmlText) {
parse-xml/src/lib/XmlCdata.ts

Line 7 in 18e5e98

export class XmlCdata extends XmlText {

CDATA is an instance of text!

This allows whitespace, comments, […]

parse-xml currently trims comment content […]

You're not by any chance trying to parse HTML, are you?

I have some tests that contain crappy XML, which sax did parse, so I’m checking whether these are bugs or fixes!

But some of these questions are also because I am trying to roundtrip XML. A user reads a file, some small things are changed in the syntax tree, user writes a file, and they should more or less be equivalent.

So it would be nice to keep text in comments the same. ✅
And it would also be quite nice to have some support for <?xml… and <!DOCTYPE… (similar to preserveCdata options).
Stuff like whitespace around the root element I get though. HTML also requires for it to be stripped by a parser.

But this might in some ways, more or less, be at odds with the ideas behind this project!

rgrove · 2023-01-23T21:37:41Z

I have this on but I still see it. I think it’s this:

* https://github.com/rgrove/parse-xml/blob/18e5e98c8f1ab800267c5581493523e6bd32caa4/src/lib/Parser.ts#L73

* https://github.com/rgrove/parse-xml/blob/18e5e98c8f1ab800267c5581493523e6bd32caa4/src/lib/XmlCdata.ts#L7
  
  CDATA is an instance of text!

You're right! Definitely a bug. #29

I have some tests that contain crappy XML, which sax did parse, so I’m checking whether these are bugs or fixes!

Ah, okay. sax is very forgiving and allows many things the spec forbids, so that explains it.

I once experimented with adding a "loose" mode to parse-xml that would be more forgiving at the expense of jettisoning spec compliance, but it ended up involving some pretty significant changes and created a lot of new testing surface, so I ended up abandoning it.

But some of these questions are also because I am trying to roundtrip XML. A user reads a file, some small things are changed in the syntax tree, user writes a file, and they should more or less be equivalent.

Got it.

I'm open to adding XML declarations and doctype declarations to the DOM, which (along with fixing comment content trimming) should give you enough info to round-trip to an equivalent XML string that would parse to the same result. #30

If you want to go further than that and round-trip to an equal string that preserves non-meaningful whitespace and other stuff not represented in the DOM, then parse-xml may not be the best tool. We could get there, but as you said it's not really in line with the goals of the project.

wooorm · 2023-01-24T11:03:45Z

I once experimented with adding a "loose" mode to parse-xml […] I ended up abandoning it.

I’d do the same — I’m fine with these differences from sax, I prefer the spec compliance!

If you want to go further than that and round-trip to an equal string

I don’t, luckily ;)

GH-30

This is enough for me!

I’ve checked out this PR locally and used it instead of sax in xast-util-from-xml, and it works well. With GH-25 (this PR) and GH-30 (instructions/declarations) I can make the switch!

This PR looks good to me! ✅

rgrove · 2023-02-05T00:32:03Z

I've published version 4.1.0 with most of the changes discussed here. Thanks again for the great suggestions!

rgrove mentioned this pull request Jan 17, 2023

Positional info #24

Closed

Update comment

3a02750

rgrove added 2 commits January 21, 2023 17:09

Include both start and end offsets when includeOffsets is true

1b4ca38

wooorm reviewed Jan 22, 2023

View reviewed changes

Swap the order of the start and end properties

87b6213

#25 (comment)

rgrove changed the base branch from main to next January 28, 2023 20:39

rgrove merged commit dd7c8b1 into next Jan 28, 2023

rgrove deleted the offsets branch January 29, 2023 21:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add optional byte offsets to nodes #25

Add optional byte offsets to nodes #25

rgrove commented Jan 17, 2023 •

edited

Loading

wooorm commented Jan 17, 2023

rgrove commented Jan 17, 2023

wooorm commented Jan 17, 2023

rgrove commented Jan 17, 2023

wooorm commented Jan 17, 2023

rgrove commented Jan 22, 2023

wooorm left a comment

wooorm commented Jan 22, 2023

wooorm commented Jan 22, 2023

wooorm commented Jan 22, 2023

rgrove commented Jan 23, 2023

rgrove commented Jan 23, 2023

wooorm commented Jan 23, 2023

rgrove commented Jan 23, 2023 •

edited

Loading

wooorm commented Jan 24, 2023 •

edited

Loading

rgrove commented Feb 5, 2023

Add optional byte offsets to nodes #25

Add optional byte offsets to nodes #25

Conversation

rgrove commented Jan 17, 2023 • edited Loading

wooorm commented Jan 17, 2023

rgrove commented Jan 17, 2023

wooorm commented Jan 17, 2023

rgrove commented Jan 17, 2023

wooorm commented Jan 17, 2023

rgrove commented Jan 22, 2023

wooorm left a comment

Choose a reason for hiding this comment

wooorm commented Jan 22, 2023

wooorm commented Jan 22, 2023

wooorm commented Jan 22, 2023

rgrove commented Jan 23, 2023

rgrove commented Jan 23, 2023

wooorm commented Jan 23, 2023

rgrove commented Jan 23, 2023 • edited Loading

wooorm commented Jan 24, 2023 • edited Loading

rgrove commented Feb 5, 2023

rgrove commented Jan 17, 2023 •

edited

Loading

rgrove commented Jan 23, 2023 •

edited

Loading

wooorm commented Jan 24, 2023 •

edited

Loading