Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optional byte offsets to nodes #25

Merged
merged 5 commits into from
Jan 28, 2023
Merged

Add optional byte offsets to nodes #25

merged 5 commits into from
Jan 28, 2023

Conversation

rgrove
Copy link
Owner

@rgrove rgrove commented Jan 17, 2023

This adds a new parser option named includeOffsets, which defaults to false. When includeOffsets is true, the starting and ending byte offsets of each node in the input string will be made available via start and end properties on the node.

Closes #24

This adds a new parser option named `includeOffsets`, which defaults to
`false`. When `includeOffsets` is `true`, the byte offset of each node
in the input string will be made available via an `offset` property on
the node.

Note that this offset doesn't take into account any carriage return
(`\r`) characters in the input string because carriage returns are
removed during a normalization step before parsing begins.

Closes #24
@rgrove rgrove mentioned this pull request Jan 17, 2023
@wooorm
Copy link

wooorm commented Jan 17, 2023

Awesome!! Looks like a good start for my needs! But, I would need both start and end offsets.

I think more folks that would use this feature might want it too: it’s pretty common in acorn/babel/swc/postcss/and the stuff I work on to have both.

Looking at the code, every call to addNode already stashes the start index before doing work, so it should be able to access and end index after that work, right at the addNode call, too.
(Only exception I think is that addText can merge text nodes, in which case it should update the end index of the previous sibling).


The CR+LF support (and likely BOM), I would quite like. But it’s not essential for me to start using this. It’s probably hard to add. I think I might be able to solve that myself with vfile-location!

@rgrove
Copy link
Owner Author

rgrove commented Jan 17, 2023

Awesome!! Looks like a good start for my needs! But, I would need both start and end offsets.

Ah, you did say that in the original ticket, but I overlooked it! Shouldn't be hard to add.

The CR+LF support (and likely BOM), I would quite like. But it’s not essential for me to start using this. It’s probably hard to add. I think I might be able to solve that myself with vfile-location!

I got partway into implementing CR support when I remembered that the reason for normalizing line endings before parsing is that the XML spec seems to require it:

To simplify the tasks of applications, the XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.

https://www.w3.org/TR/2008/REC-xml-20081126/#sec-line-ends

The presence of #xD in the above production is maintained purely for backward compatibility with the First Edition. As explained in 2.11 End-of-Line Handling, all #xD characters literally present in an XML document are either removed or replaced by #xA characters before any other processing is done.

https://www.w3.org/TR/2008/REC-xml-20081126/#NT-S

It's possible I'm being too strict in my interpretation though, because there are references to CR characters elsewhere in the spec that wouldn't make sense if they were all removed before parsing. I think I'll need to do some comparisons with other widely used XML parsers to see how they handle things.

@wooorm
Copy link

wooorm commented Jan 17, 2023

Yeah I saw that. And I think it’s a fine position. Though, the word “behave” here can be interpreted as that CR/LF/CRLF all have to behave the same as LF. Not specifically that they’re removed.

You know more about parsing XML than I do, but I’m guessing that there’s no real difference in one LF and two LFs?

Which then is also interesting, because the note says that CRLF can be swapped with either LFLF, or LF. So what then to do with positional info?

@rgrove
Copy link
Owner Author

rgrove commented Jan 17, 2023

My understanding (and how parse-xml currently works) is that a CRLF sequence should be normalized to LF (not LFLF), while a CR that isn't followed by a LF should also be normalized to LF.

So the sequence \r\n\r becomes \n\n, not \n\n\n. The end result is exactly as many line endings as before, but they're all LF line endings and not CRLF or CR. While the number of line endings will be the same, the number of characters won't be.

@wooorm
Copy link

wooorm commented Jan 17, 2023

Yeah, I agree. That’s the only interpretation that makes sense. But how I read the second one seems to allow swapping all CRs for LFs

This will allow us to accurate report byte offsets in input strings that
contain carriage return characters.

While the additional calls to `normalizeLineBreaks()` may seem like a
potential performance problem, rewriting the normalization to avoid a
regex has actually resulted in a very small performance improvement
overall.
@rgrove
Copy link
Owner Author

rgrove commented Jan 22, 2023

Updated to include both start and end offsets. Additionally, line endings are now normalized during parsing instead of before parsing, so the reported offsets are correct even when the input string contains carriage returns.

Copy link

@wooorm wooorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good, maybe:

  • docs
  • use this new functionality in errors?

@wooorm
Copy link

wooorm commented Jan 22, 2023

Small weird thing is that ends are added in front of starts on the objects, so the ordering is a bit funky:

<ref *1> XmlDocument {
  parent: null,
  end: 137,
  start: 0,
  children: [
    XmlElement {
      parent: [Circular *1],
      end: 136,
      start: 0,
      ...
      attributes: [Object: null prototype],
      children: [Array]
    }
  ]
}

@wooorm
Copy link

wooorm commented Jan 22, 2023

Might be nice to have a type of the errors too btw, that includes types for line/column/pos/excerpt, I don’t think I see that?

@wooorm
Copy link

wooorm commented Jan 22, 2023

OK, parser looks really good! :) I have a couple ideas / bugs / questions to report:

  • Idea: it would be really nice if different errors have some .code or so attached, to represent their “kind”, so that they can be programatically mapped or translated to other messages
  • Bug: (whitespace?) text around the root element is not present in the tree, e.g., a final \n or one before the root
  • Bug: initial/final whitespace seems weird in comments/cdata. Take for example:
<root>
  <!--
  This
  comment
  has
  multiple
  lines
  -->
</root>

The text I get back for that comment is 'This\n comment\n has\n multiple\n lines', but I would expect initial and final whitespace. Similar for CDATA:

<root>
  <script type="text/javascript">
    // <![CDATA[
    alert(1)
    // ]]>
  </script>
</root>

I get the value '\n alert(1)\n // \n ' back, but the final '\n ' is not in the cdata (I think these should be in a separate text?)

  • Question: should lowercase doctype crash?
  • Question: should documents without a root element crash?
  • Question: should documents that have CDATA before the root element crash? (Maybe makes sense 🤔)

@rgrove
Copy link
Owner Author

rgrove commented Jan 23, 2023

  • docs

API docs are included with these changes and will be built by TypeDoc when this goes to main. Was there something more you were looking for?

  • use this new functionality in errors?

Good idea.

Small weird thing is that ends are added in front of starts on the objects, so the ordering is a bit funky:

Alphabetical order strikes again! Will fix.

Might be nice to have a type of the errors too btw, that includes types for line/column/pos/excerpt, I don’t think I see that?

Good call. This is something I overlooked in the conversion to TypeScript for 4.0. Tracked in #27.

Idea: it would be really nice if different errors have some .code or so attached, to represent their “kind”, so that they can be programatically mapped or translated to other messages

Someone else also suggested this the other day in #26. Sounds like it's worth considering.

Bug: (whitespace?) text around the root element is not present in the tree, e.g., a final \n or one before the root

I think this is the correct behavior according to the spec. It also matches what libxml2 and DOMParser do. Section 2.1 of XML 1.0 5th ed. defines a document as:

document   ::=   prolog element Misc*

This allows whitespace, comments, and a few other things to exist before or after the root element, but whitespace in a Misc construct is ignored rather than treated as text content.

Bug: initial/final whitespace seems weird in comments/cdata.

parse-xml currently trims comment content. The spec neither requires nor forbids this, but I can see how it would be surprising. libxml2 doesn't appear to trim comment content, so I suppose we could consider this a bug. Filed as #28.

I get the value '\n alert(1)\n // \n ' back, but the final '\n ' is not in the cdata (I think these should be in a separate text?)

By default parse-xml generates XmlText nodes for CDATA content. Since consecutive text nodes are compacted, this CDATA content ends up being merged with the preceding and following text content into a single XmlText node. If you set the preserveCdata parser option to true, then CDATA content will be represented by XmlCdata nodes and you'll get what you expect here.

Question: should lowercase doctype crash?

Yep, and currently that appears to be the case. Are you seeing something different?

Question: should documents without a root element crash?

Yep, a document must have one or more elements, so the root element must exist.

Question: should documents that have CDATA before the root element crash? (Maybe makes sense 🤔)

Yep, the spec doesn't allow CDATA before the root element:

You're not by any chance trying to parse HTML, are you? 😬

@rgrove
Copy link
Owner Author

rgrove commented Jan 23, 2023

  • use this new functionality in errors?

On second thought, after looking at the error() method again, I'm not seeing anything obvious that could be improved by using the new offset stuff. When an error occurs there's no guarantee that a node with relevant offsets has been created, so it doesn't look like we can use that for anything. Did you have any particular improvements in mind?

@wooorm
Copy link

wooorm commented Jan 23, 2023

API docs are included with these changes and will be built by TypeDoc when this goes to main. Was there something more you were looking for?

I didn’t see a readme.md change in this PR, assuming there would be some options there, but there aren’t any there so that’s fine!

Someone else also suggested this the other day […]. Sounds like it's worth considering.

If you set the preserveCdata parser option to true, then CDATA content will be represented by XmlCdata nodes and you'll get what you expect here.

I have this on but I still see it. I think it’s this:

This allows whitespace, comments, […]

parse-xml currently trims comment content […]

You're not by any chance trying to parse HTML, are you?

I have some tests that contain crappy XML, which sax did parse, so I’m checking whether these are bugs or fixes!

But some of these questions are also because I am trying to roundtrip XML. A user reads a file, some small things are changed in the syntax tree, user writes a file, and they should more or less be equivalent.

So it would be nice to keep text in comments the same. ✅
And it would also be quite nice to have some support for <?xml… and <!DOCTYPE… (similar to preserveCdata options).
Stuff like whitespace around the root element I get though. HTML also requires for it to be stripped by a parser.

But this might in some ways, more or less, be at odds with the ideas behind this project!

@rgrove
Copy link
Owner Author

rgrove commented Jan 23, 2023

I have this on but I still see it. I think it’s this:

* https://github.com/rgrove/parse-xml/blob/18e5e98c8f1ab800267c5581493523e6bd32caa4/src/lib/Parser.ts#L73

* https://github.com/rgrove/parse-xml/blob/18e5e98c8f1ab800267c5581493523e6bd32caa4/src/lib/XmlCdata.ts#L7
  
  CDATA is an instance of text!

You're right! Definitely a bug. #29

I have some tests that contain crappy XML, which sax did parse, so I’m checking whether these are bugs or fixes!

Ah, okay. sax is very forgiving and allows many things the spec forbids, so that explains it.

I once experimented with adding a "loose" mode to parse-xml that would be more forgiving at the expense of jettisoning spec compliance, but it ended up involving some pretty significant changes and created a lot of new testing surface, so I ended up abandoning it.

But some of these questions are also because I am trying to roundtrip XML. A user reads a file, some small things are changed in the syntax tree, user writes a file, and they should more or less be equivalent.

Got it.

I'm open to adding XML declarations and doctype declarations to the DOM, which (along with fixing comment content trimming) should give you enough info to round-trip to an equivalent XML string that would parse to the same result. #30

If you want to go further than that and round-trip to an equal string that preserves non-meaningful whitespace and other stuff not represented in the DOM, then parse-xml may not be the best tool. We could get there, but as you said it's not really in line with the goals of the project.

@wooorm
Copy link

wooorm commented Jan 24, 2023

I once experimented with adding a "loose" mode to parse-xml […] I ended up abandoning it.

I’d do the same — I’m fine with these differences from sax, I prefer the spec compliance!

If you want to go further than that and round-trip to an equal string

I don’t, luckily ;)

GH-30

This is enough for me!

I’ve checked out this PR locally and used it instead of sax in xast-util-from-xml, and it works well. With GH-25 (this PR) and GH-30 (instructions/declarations) I can make the switch!


This PR looks good to me! ✅

@rgrove rgrove changed the base branch from main to next January 28, 2023 20:39
@rgrove rgrove merged commit dd7c8b1 into next Jan 28, 2023
@rgrove rgrove deleted the offsets branch January 29, 2023 21:56
@rgrove
Copy link
Owner Author

rgrove commented Feb 5, 2023

I've published version 4.1.0 with most of the changes discussed here. Thanks again for the great suggestions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Positional info
2 participants