Fix escaping of names in IR#8880
Conversation
The binary reader (but not the text parser) would previous escape names read from the names section and use those escaped names in the IR. But there is no reason to escape names in the IR; we can use the unmodified byte sequences instead. Remove this unnecessary escaping and always store names unescaped in the IR. The binary writer was further escaping names when writing them back to the names section. Remove this escaping so that we simply preserve whatever names were present in the input, modulo any deduplication we might have done. Asyncify and wasm-split read function names as input from the command line and from manifest files and would previously escape these names on reading so they would match up with the escaped names in the IR. Now that the IR does not store escaped names, fix Asyncify and wasm-split to no longer escape these names. As a follow-on change we might consider _unescaping_ these names instead, which would allow users to more easily pass names containing unprintable characters. As a drive-by, fix the printing in MinimizeImportsAndExports to properly use JSON escaping.
|
@aheejin, it looks like you were right earlier about the IR storing escaped names. I was confused because that does not happen when the input is text. This will make everything more consistent and punts on the question of how and whether we should accept escaped names in the wasm-split manifests. |
aheejin
left a comment
There was a problem hiding this comment.
Nice! This makes things cleaner.
Can we also remove escape ane unescape in https://github.com/aheejin/binaryen/blob/main/src/ir/names.cpp and https://github.com/aheejin/binaryen/blob/main/src/ir/names.h I added in #8868?
| String::printEscapedJSON(std::cout, key.first.view()) << ", "; | ||
| String::printEscapedJSON(std::cout, key.second.view()) << ", "; | ||
| String::printEscapedJSON(std::cout, new_.view()) << "]"; |
There was a problem hiding this comment.
Why should this be printEscapedJSON and not printEscaped? And what are the differences?
There was a problem hiding this comment.
This pass is printing JSON, so it should be printing JSON escape sequences rather than WebAssembly text format escape sequences. An example of the difference is that JSON uses "\uABCD" for unicode character U+ABCD, whereas the WebAssembly text format would use "\u{ABCD}".
However, I've reverted this part of the change. printEscapedJSON expects the representation of its string argument to be WTF-16, but that is not the case here.
| void WasmBinaryWriter::writeEscapedName(std::string_view name) { | ||
| if (name.find('\\') == std::string_view::npos) { | ||
| writeInlineString(name); | ||
| return; | ||
| } | ||
| // decode escaped by escapeName (see below) function names | ||
| std::string unescaped; | ||
| for (size_t i = 0; i < name.size();) { | ||
| char ch = name[i++]; | ||
| // support only `\xx` escapes; ignore invalid or unsupported escapes | ||
| if (ch != '\\' || i + 1 >= name.size() || !isHexDigit(name[i]) || | ||
| !isHexDigit(name[i + 1])) { | ||
| unescaped.push_back(ch); | ||
| continue; | ||
| } | ||
| unescaped.push_back( | ||
| char((decodeHexNibble(name[i]) << 4) | decodeHexNibble(name[i + 1]))); | ||
| i += 2; | ||
| } | ||
| writeInlineString({unescaped.data(), unescaped.size()}); | ||
| } |
There was a problem hiding this comment.
The binary writer was further escaping names when writing them back to the names section.
Isn't this unescaping names despite the name tells the opposite? Anyway, if the IR has unescaped names now and the name section should have the same, removing this looks correct.
There was a problem hiding this comment.
Oh yes, you're right. I'll update the description.
| Name escape(Name name); | ||
| // Unescapes a WebAssembly identifier back into its original human-readable | ||
| // string. | ||
| std::string unescape(Name name); |
There was a problem hiding this comment.
Can't we remove unescape?
It looks it's used in wasm-split --print-profile --unescape. Do we still need this given that IR has unescaped names already?
The binary reader (but not the text parser) would previous escape names read from the names section and use those escaped names in the IR. But there is no reason to escape names in the IR; we can use the unmodified byte sequences instead. Remove this unnecessary escaping and always store names unescaped in the IR.
The binary writer was partially unescaping names when writing them back to the names section, but this is no longer necessary. After removing this, we simply preserve whatever names were present in the input, modulo any changes we might have made to avoid duplicate names.
Asyncify and wasm-split read function names as input from the command line and from manifest files and would previously escape these names on reading so they would match up with the escaped names in the IR. Now that the IR does not store escaped names, fix Asyncify and wasm-split to no longer escape these names. As a follow-on change we might consider unescaping these names instead, which would allow users to more easily pass names containing unprintable characters.