-
-
Notifications
You must be signed in to change notification settings - Fork 552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More conditional sourcepos #180
base: master
Are you sure you want to change the base?
More conditional sourcepos #180
Conversation
ef34ee3
to
b18c103
Compare
That is a timeout, pretty sure the code is fine, @jgm if you feel like triggering a new run :) |
b18c103
to
14819a2
Compare
Rebased and pushed, no timeout this time :) |
Is it possible to break this into smaller, logical chunks? This is currently such a big ball of code, touching so many parts of the code base, that it's hard to survey. For example, I see changes in buffer.h like this:
I wonder: Is that a mistake, going from That's just one example. Can we have a commit, for example, that just adds reference nodes? After all, that's a single logical change that might make sense completely indepenedently of the "extent" stuff. As for the demo script, I wasn't sure from your description what it did. But I think it would be better to develop it outside of cmark itself, so we don't get cmark too cluttered up. It certainly shouldn't go in the |
@jgm, the first change you mention is from the commit from @nwellnhof that you reverted, which you had squashed together from a bunch of his commits .. |
My initial branch was also 4 commits that you ended up squashing as one, I had re-extracted the python stuff from it as it clearly needed to be separate, but squashing things together, merging them, reverting them then asking me to re-split them up is a bit frustrating I have to say |
As for putting the formatting script in src/, I put it there because the cmark executable is also located in there, even though it's not part of the library, just wanted to be consistent. |
+++ Mathieu Duponchelle [Jan 05 17 12:36 ]:
***@***.***, the first change you mention is from the commit from
***@***.*** that you reverted, which you had squashed together from a
bunch of his commits ..
I shouldn't have done that - GitHub now makes squashing the
default, and I sometimes forget that.
Looking back at his commit, and looking at the code in more
detail, I see that this isn't a change in functionality;
it's just that he's operating with the delta rather than the
total size (as in the original).
Still, could you divide this into some logically separated
changes? At the very least we should have the introduction
of reference nodes as a separate commit, and nwellnhof's
changes a separate commit; the extent changes should not
be mixed with these. And as I suggested the demo program
should not be in this PR.
|
+++ Mathieu Duponchelle [Jan 05 17 12:40 ]:
My initial branch was also 4 commits that you ended up squashing as
one, I had re-extracted the python stuff from it as it clearly needed
to be separate, but squashing things together, merging them, reverting
them then asking me to re-split them up is a bit frustrating I have to
say
I understand that, the frustration is legitimate and I take
the blame. Still, I think going forward, for maintainability
and for code review, we need smaller, more atomic, commits.
(And this has generally what has made it hard for me to
process your PRs in the past.)
It's unfair to ask you to re-split this, so I won't.
But I'm also reluctant to merge the ball-of-twine without
some splitting, and I don't know if I'll get around to doing
it any time soon, because this isn't a high-priority feature
for me.
|
+++ Mathieu Duponchelle [Jan 05 17 12:42 ]:
As for putting the formatting script in src/, I put it there because
the cmark executable is also located in there, even though it's not
part of the library, just wanted to be consistent.
Yes, that's true. But I think a python script belongs
somewhere else. (And really, probably not in this
repository...it makes more sense to me for you to have
your own repository for that, where you can develop it
as you see fit and rapidly.)
|
Really ?? My other unmerged PR is split into 20 commits, the source pos stuff was initially four commits! |
+++ Mathieu Duponchelle [Jan 04 17 15:30 ]:
3. The actual fancy stuff, a reindenting script, tested on every test
of the spec with 1 to 80 columns, the test checks that the new AST
is identical to the old one, minus softbreaks. The actual test I
propose only checks for reindenting on one column, because 80 * 610
starts to take a little time I'm afraid. Note that the script could
break in a few more places than it currently does, and certainly be
massively optimized, but I think it's already in a state worth
merging because, as far as I can tell, it is now provably reliable.
What would help me is to hear a bit more about how this
differs from what you can do already with cmark's commonmark
-> commonmark transitions. I assume the answer is that your
script won't make adventitious changes to things like list
indentation, list number delimiter, type of header, etc.?
|
Yes, but far from only that, the end goal with this is to have a tool similar to clang-format for commonmark, allowing to control the style in large projects with multiple contributors. |
Let me try out the formatting script and see if it is
something that would make sense to keep in the repo as
a demo.
|
Example use case: Let's say you're the maintainer of "progit", and want to enforce a certain set of rules (no soft breaks, lists have to use The solution I'd like to adopt as such a maintainer, is to have a pre-commit hook that runs cmark-format (which could figure out the style from a well-known-named configuration file for example), and be reasonably certain that any commits submitted for inclusion have gone through and be validated by the formatting script. On the contributor side, you could use the tool to automatically refactor your changes in order to make them comply with the required style, which would thus eliminate tedious and error-prone manual changes. |
+++ Mathieu Duponchelle [Jan 05 17 12:56 ]:
Yes, but far from only that, the end goal with this is to have a tool
similar to clang-format for commonmark, allowing to control the style
in large projects with multiple contributors.
Yes, this would be useful and it's certainly something people have wanted.
|
@jgm, well that's what this "demo" script aims at becoming, it only handles width for now because that was the only difficult thing to do, the rest can easily be added by anyone with minimum python's working knowledge, which is part of the reason why I submitted it now, the other being indeed that I wanted to demo the script to you as I figured you would see the point :) |
Another important thing with that script is also that it does not do unnecessary escaping, which is really what bugged me with -t commonmark, ie given:
--width 80 will give you:
whereas --width 1 will correctly give you:
|
If splitting is too difficult, could you at least amend the
commit message to include a more detailed accounting of the
changes made and their motivations? The changelog will need
to contain something like that, in any case, and I usually
construct that from commit messages.
All API changes, especially, need to be very carefully
documented in the changelog.
In this case that would include the addition of reference
nodes and all the new extent-related API.
|
Sure, I didn't do that because I figured it would all get squashed anyway, I usually tend to be quite specific about these things, but I understand now that you got tricked by github :) I'll see what I can do for splitting up the commit anyway. |
14819a2
to
064df74
Compare
Added cmark_node_get_label(node). Added cmark_node_set_label(node). API change.
* Improve strbuf guarantees Introduce BUFSIZE_MAX macro and make sure that the strbuf implementation can handle strings up to this size. * Abort early if document size exceeds internal limit * Make parser return NULL on internal index overflow Make S_parser_feed set an error and ignore subsequent chunks if the total input document size exceeds an internal limit. Make cmark_parser_finish return NULL if an error was encountered. Add public API functions to retrieve error code and error message. strbuf overflow in renderers and OOM in parser or renderers still cause an abort.
Its generation is conditioned to the setting of the OPT_SOURCEPOS flag. Add cmark_parser_get_first_source_extent(parser). Add cmark_source_extent_get_start(extent). Add cmark_source_extent_get_stop(extent). Add cmark_source_extent_get_next(extent). Add cmark_source_extent_get_previous(extent). Add cmark_source_extent_get_node(extent). Add cmark_source_extent_get_type(extent. Add cmark_source_extent_get_type_string(extent). API change.
Add cmark_default_mem_free(pointer). API change.
The only implemented feature is reindenting to arbitrary width, but that was also the actually complicated one so there's that.
064df74
to
014b781
Compare
@jgm, better like that ? |
Much better, thanks! This looks good, but can you remove the cmark-format part and put it in a separate repository? Two more questions occur to me:
|
Of course, but I would really have liked to have it live alongside the library in order to ensure its test suite ran upon new commits / spec changes :) I would not mind at all moving the code outside of src/ , and I'd stick around for maintaining it, do you have any particular reason not to merge it under these conditions? As I explained in earlier comments, I think it would be a valuable tool to offer. By the way, you said you'd give it a try, did you find any issues that made you want to hold back on it?
My initial revision did not include it in the xml output, but my opinion is that this format should expose as faithfully as possible the ast, I am mainly using it for ease of development and debugging, and I like to have as much information as possible in that case. If you agree with this, I will update the schema.
We will want to update that, however I'm not sure in which manner. Attributes in the html output are definitely helpful for things like side-by-side autoscrolling, syntax highlighting etc, and we want some form of this. A few questions:
my expectation of a column API would be that utf8 is accounted for, however that is not the case. I would rework this to simply expose byte offsets.
I think that's actually an interesting level of details, and can complement the more detailed new API by letting blocks overlap, this could be trivially implemented in the add_extent source map method, which would, whenever a new extent is added, update the start offset of the associated node, and recurse up its ancestors to update their stop offset. |
@jgm, can you answer that last comment ? |
I did try it and one thing I remember noticing were lots of line breaks in bad places, e.g. between a word and Question: could some of its aims be achieved by modifying cmark's existing commonmark renderer, or a modification of it? E.g. you could pass in the extents + the original source, and the renderer could peek in the source to see if a backslash escape was used, check indentations, bullet types, and so on, and try to preserve these in the output. One thing the existing renderer does very well I think is line breaks.
Yes, I think this makes sense.
Right. That makes more sense, I agree.
Wasn't sure what you meant here, can you explain? |
Weird, can you please give specific examples, cause that is just not the case here, example:
|
@jgm, maybe you're thinking about links, but IMHO my tool's behaviour is way better than cmark -t commonmark in that case:
ie, my tool tries its best to stick to the requested width, and will break on punctuation, cmark will not |
Well, the thing is I needed some sort of hash table for example, strict C89 with no dependencies makes things a bit awkward. As this is a developer tool, I considered that speed was not an absolute necessity. By the way, the glib now builds with no issues in Visual Studio ;) |
I disagree that this is better behavior. These are cases where both tools are going above the requested width (and such cases are going to be inevitable). cmark avoids the awkward punctuation breaks. I personally think we should never break after the Here's another example where I think cmark does the right thing: cmark:
your tool:
Surely in this case the line break after the opening
I don't think anyone will think that's a desirable place to break. Of course, all these things are easily fixable. |
I'm still not completely decided about whether it makes sense to keep cmark-format in the cmark repository, but at the very least I think it should be in a separate directory with a separate CMakeLists.txt. |
Hey folks, out of curiosity, have there been any updates on this? It looks like the code is pretty much ready to go, and I’d love to be able to use source positions in a project I’m working on pretty soon. Any ideas about when it might land on master, and/or make it into a “released” version? Thanks! |
The outstanding issues concern cmark-format and whether it should go in the repository. |
Sorry, as usual, work comes in unexpectedly :) @jgm, I have no problem with keeping the script for later, it's in a separate commit so it shouldn't be too difficult to filter it out for now :) |
Any updates on when this might make it into a release? I'm trying to get it all building correctly locally now, but just for simplicity's sake, we like to update our cmark version only to numbered tags in the Git repo (and splatting this pull request on top of 0.27.1 isn't yielding ideal results 😞). |
There you go, three commits:
The source map code, which generation is now conditional to the SOURCEPOS option, when not enabled make bench results are identical to the results without the patch. Note that I had a bunch of follow-up commits on my end which I ended up fixing up with that patch, mostly fixes except for one patch that defines reference nodes. I also squashed up @nwellnhof's commits in there, please tell me if you mind.
The wrapper code, also updated but nothing major at all
The actual fancy stuff, a reindenting script, tested on every test of the spec with 1 to 80 columns, the test checks that the new AST is identical to the old one, minus softbreaks. The actual test I propose only checks for reindenting on one column, because 80 * 610 starts to take a little time I'm afraid. Note that the script could break in a few more places than it currently does, and certainly be massively optimized, but I think it's already in a state worth merging because, as far as I can tell, it is now provably reliable.