-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add optional byte offsets to nodes #25
Conversation
This adds a new parser option named `includeOffsets`, which defaults to `false`. When `includeOffsets` is `true`, the byte offset of each node in the input string will be made available via an `offset` property on the node. Note that this offset doesn't take into account any carriage return (`\r`) characters in the input string because carriage returns are removed during a normalization step before parsing begins. Closes #24
Awesome!! Looks like a good start for my needs! But, I would need both start and end offsets. I think more folks that would use this feature might want it too: it’s pretty common in acorn/babel/swc/postcss/and the stuff I work on to have both. Looking at the code, every call to The CR+LF support (and likely BOM), I would quite like. But it’s not essential for me to start using this. It’s probably hard to add. I think I might be able to solve that myself with |
Ah, you did say that in the original ticket, but I overlooked it! Shouldn't be hard to add.
I got partway into implementing CR support when I remembered that the reason for normalizing line endings before parsing is that the XML spec seems to require it:
— https://www.w3.org/TR/2008/REC-xml-20081126/#sec-line-ends
— https://www.w3.org/TR/2008/REC-xml-20081126/#NT-S It's possible I'm being too strict in my interpretation though, because there are references to CR characters elsewhere in the spec that wouldn't make sense if they were all removed before parsing. I think I'll need to do some comparisons with other widely used XML parsers to see how they handle things. |
Yeah I saw that. And I think it’s a fine position. Though, the word “behave” here can be interpreted as that CR/LF/CRLF all have to behave the same as LF. Not specifically that they’re removed. You know more about parsing XML than I do, but I’m guessing that there’s no real difference in one LF and two LFs? Which then is also interesting, because the note says that CRLF can be swapped with either LFLF, or LF. So what then to do with positional info? |
My understanding (and how parse-xml currently works) is that a CRLF sequence should be normalized to LF (not LFLF), while a CR that isn't followed by a LF should also be normalized to LF. So the sequence |
Yeah, I agree. That’s the only interpretation that makes sense. But how I read the second one seems to allow swapping all CRs for LFs |
This will allow us to accurate report byte offsets in input strings that contain carriage return characters. While the additional calls to `normalizeLineBreaks()` may seem like a potential performance problem, rewriting the normalization to avoid a regex has actually resulted in a very small performance improvement overall.
Updated to include both |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good, maybe:
- docs
- use this new functionality in errors?
Small weird thing is that <ref *1> XmlDocument {
parent: null,
end: 137,
start: 0,
children: [
XmlElement {
parent: [Circular *1],
end: 136,
start: 0,
...
attributes: [Object: null prototype],
children: [Array]
}
]
} |
Might be nice to have a type of the errors too btw, that includes types for line/column/pos/excerpt, I don’t think I see that? |
OK, parser looks really good! :) I have a couple ideas / bugs / questions to report:
<root>
<!--
This
comment
has
multiple
lines
-->
</root> The text I get back for that comment is <root>
<script type="text/javascript">
// <![CDATA[
alert(1)
// ]]>
</script>
</root> I get the value
|
API docs are included with these changes and will be built by TypeDoc when this goes to
Good idea.
Alphabetical order strikes again! Will fix.
Good call. This is something I overlooked in the conversion to TypeScript for 4.0. Tracked in #27.
Someone else also suggested this the other day in #26. Sounds like it's worth considering.
I think this is the correct behavior according to the spec. It also matches what libxml2 and DOMParser do. Section 2.1 of XML 1.0 5th ed. defines a document as: This allows whitespace, comments, and a few other things to exist before or after the root element, but whitespace in a
parse-xml currently trims comment content. The spec neither requires nor forbids this, but I can see how it would be surprising. libxml2 doesn't appear to trim comment content, so I suppose we could consider this a bug. Filed as #28.
By default parse-xml generates
Yep, and currently that appears to be the case. Are you seeing something different?
Yep, a document must have one or more elements, so the root element must exist.
Yep, the spec doesn't allow CDATA before the root element:
You're not by any chance trying to parse HTML, are you? 😬 |
On second thought, after looking at the |
I didn’t see a
✨
I have this on but I still see it. I think it’s this:
I have some tests that contain crappy XML, which But some of these questions are also because I am trying to roundtrip XML. A user reads a file, some small things are changed in the syntax tree, user writes a file, and they should more or less be equivalent. So it would be nice to keep text in comments the same. ✅ But this might in some ways, more or less, be at odds with the ideas behind this project! |
You're right! Definitely a bug. #29
Ah, okay. I once experimented with adding a "loose" mode to parse-xml that would be more forgiving at the expense of jettisoning spec compliance, but it ended up involving some pretty significant changes and created a lot of new testing surface, so I ended up abandoning it.
Got it. I'm open to adding XML declarations and doctype declarations to the DOM, which (along with fixing comment content trimming) should give you enough info to round-trip to an equivalent XML string that would parse to the same result. #30 If you want to go further than that and round-trip to an equal string that preserves non-meaningful whitespace and other stuff not represented in the DOM, then parse-xml may not be the best tool. We could get there, but as you said it's not really in line with the goals of the project. |
I’d do the same — I’m fine with these differences from
I don’t, luckily ;) This is enough for me! I’ve checked out this PR locally and used it instead of This PR looks good to me! ✅ |
I've published version 4.1.0 with most of the changes discussed here. Thanks again for the great suggestions! |
This adds a new parser option named
includeOffsets
, which defaults tofalse
. WhenincludeOffsets
istrue
, the starting and ending byte offsets of each node in the input string will be made available viastart
andend
properties on the node.Closes #24