eloparco added inline comments.
================ Comment at: lldb/tools/lldb-vscode/lldb-vscode.cpp:2177 + const auto bytes_offset = -instruction_offset * bytes_per_instruction; + auto start_addr = base_addr - bytes_offset; + const auto disassemble_bytes = instruction_count * bytes_per_instruction; ---------------- eloparco wrote: > clayborg wrote: > > eloparco wrote: > > > clayborg wrote: > > > > This will work if the min and max opcode byte size are the same, like > > > > for arm64, the min and max are 4. This won't work for x86 or arm32 in > > > > thumb mode. So when backing up, we need to do an address lookup on the > > > > address we think we want to go to, and then adjust the starting address > > > > accordingly. Something like: > > > > > > > > ``` > > > > SBAddress start_sbaddr = (base_addr - bytes_offset, g_vsc.target); > > > > ``` > > > > now we have a section offset address that can tell us more about what > > > > it is. We can find the SBFunction or SBSymbol for this address and use > > > > those to find the right instructions. This will allow us to correctly > > > > disassemble code bytes. > > > > > > > > We can also look at the section that the memory comes from and see what > > > > the section contains. If the section is data, then emit something like: > > > > ``` > > > > 0x00001000 .byte 0x23 > > > > 0x00001001 .byte 0x34 > > > > ... > > > > ``` > > > > To find the section type we can do: > > > > ``` > > > > SBSection section = start_sbaddr.GetSection(); > > > > if (section.IsValid() && section.GetSectionType() == > > > > lldb::eSectionTypeCode) { > > > > // Disassemble from a valid boundary > > > > } else { > > > > // Emit a byte or long at a time with ".byte 0xXX" or other ASM > > > > directive for binary data > > > > } > > > > ``` > > > > We need to ensure we start disassembling on the correct instruction > > > > boundary as well as our math for "start_addr" might be in between > > > > actual opcode boundaries. If we are in a lldb::eSectionTypeCode, then > > > > we know we have instructions, and if we are not, then we can emit > > > > ".byte" or other binary data directives. So if we do have > > > > lldb::eSectionTypeCode as our section type, then we should have a > > > > function or symbol, and we can get instructions from those objects > > > > easily: > > > > > > > > ``` > > > > if (section.IsValid() && section.GetSectionType() == > > > > lldb::eSectionTypeCode) { > > > > lldb::SBInstructionList instructions; > > > > lldb::SBFunction function = start_sbaddr.GetFunction(); > > > > if (function.IsValid()) { > > > > instructions = function.GetInstructions(g_vsc.target); > > > > } else { > > > > symbol = start_sbaddr.GetSymbol(); > > > > if (symbol.IsValid()) > > > > instructions = symbol.GetInstructions(g_vsc.target); > > > > } > > > > const size_t num_instrs = instructions.GetSize(); > > > > if (num_instrs > 0) { > > > > // we found instructions from a function or symbol and we need to > > > > // find the matching instruction that we want to start from by > > > > iterating > > > > // over the instructions and finding the right address > > > > size_t matching_idx = num_instrs; // Invalid index > > > > for (size_t i=0; i<num_instrs; ++i) { > > > > lldb::SBInstruction inst = instructions.GetInstructionAtIndex(i); > > > > if (inst.GetAddress().GetLoadAddress(g_vsc.target) >= start_addr) { > > > > matching_idx = i; > > > > break; > > > > } > > > > } > > > > if (matching_idx < num_instrs) { > > > > // Now we can print the instructions from [matching_idx, num_instrs) > > > > // then we need to repeat the search for the next function or > > > > symbol. > > > > // note there may be bytes between functions or symbols which we > > > > can disassemble > > > > // by calling _get_instructions_from_memory(...) but we must find > > > > the next > > > > // symbol or function boundary and get back on track > > > > } > > > > > > > > ``` > > > Sorry, I should have provided a proper explanation. > > > > > > I use the maximum instruction size as the "worst case". Basically, I need > > > to read a portion of memory but I do not know the start address and the > > > size. For the start address, if I want to read N instructions before > > > `base_addr` I need to read at least starting from `base_addr - N * > > > max_instruction_size`: if all instructions are of size > > > `max_instruction_size` I will read exactly N instructions; otherwise I > > > will read more than N instructions and prune the additional ones > > > afterwards. Same for applies for the size. > > > > > > Since `start_addr` is based on a "worst case", it may be an address in > > > the middle of an instruction. In that case, that first instruction will > > > be misinterpreted, but I think that is negligible. > > > > > > The logic is similar to what is implemented in other VS Code extensions, > > > like > > > https://github.com/vadimcn/vscode-lldb/blob/master/adapter/src/debug_session.rs#L1134. > > > > > > Does it make sense? > > The issue is, you might end up backing up by N bytes, and you might not end > > up on an opcode boundary. Lets say you have x86 disassembly like: > > > > ``` > > 0x100002e37 <+183>: 48 8b 4d f8 movq -0x8(%rbp), %rcx > > 0x100002e3b <+187>: 48 39 c8 cmpq %rcx, %rax > > 0x100002e3e <+190>: 0f 85 09 00 00 00 jne 0x100002e4d > > ; <+205> at main.cpp > > 0x100002e44 <+196>: 8b 45 94 movl -0x6c(%rbp), %eax > > 0x100002e47 <+199>: 48 83 c4 70 addq $0x70, %rsp > > 0x100002e4b <+203>: 5d popq %rbp > > 0x100002e4c <+204>: c3 retq > > 0x100002e4d <+205>: e8 7e 0f 00 00 callq 0x100003dd0 > > ; symbol stub for: __stack_chk_fail > > 0x100002e52 <+210>: 0f 0b ud2 > > ``` > > > > Let's say you started with the address 0x100002e4c, and backed up by the > > max opcode size of 15 for x86_64, that would be 0x100002e3d. You would > > start disassembling on a non opcode boundary as this is the last byte of > > the 3 byte opcode at 0x100002e3b (0xc8). And this would disassembly > > incorrectly. So we need to make use of the functions or symbol boundaries > > to ensure we are disassembling correctly. If we have no function or symbol, > > we can do our best. But as you can see we would get things completely wrong > > in this case and we need to fix this as detailed. > Actually, I was mislead by the fact that so far I tried on both: > - my ARM machine, with instructions of fixed size (4) > - with [WAMR](https://github.com/bytecodealliance/wasm-micro-runtime) > disassembling to WASM, where, when the start address is in the middle of an > instruction, only that first instruction is misinterpreted > > Without going into sections and symbols as you were proposing, the solution > can be as easy as done in this VS Code Extension: > https://github.com/vadimcn/vscode-lldb/commit/28ae4f4bf3bd29a0c84dd586cd5360836210ab51. > > I'll update the code and put some additional comments to clarify the logic. > Let me know what you think. > Actually, I was mislead by the fact that so far I tried on both: > - my ARM machine, with instructions of fixed size (4) > - with [WAMR](https://github.com/bytecodealliance/wasm-micro-runtime) > disassembling to WASM, where, when the start address is in the middle of an > instruction, only that first instruction is misinterpreted > > Without going into sections and symbols as you were proposing, the solution > can be as easy as done in this VS Code Extension: > https://github.com/vadimcn/vscode-lldb/commit/28ae4f4bf3bd29a0c84dd586cd5360836210ab51. > > I'll update the code and put some additional comments to clarify the logic. > Let me know what you think. ================ Comment at: lldb/tools/lldb-vscode/lldb-vscode.cpp:2177 + const auto bytes_offset = -instruction_offset * bytes_per_instruction; + auto start_addr = base_addr - bytes_offset; + const auto disassemble_bytes = instruction_count * bytes_per_instruction; ---------------- eloparco wrote: > eloparco wrote: > > clayborg wrote: > > > eloparco wrote: > > > > clayborg wrote: > > > > > This will work if the min and max opcode byte size are the same, like > > > > > for arm64, the min and max are 4. This won't work for x86 or arm32 in > > > > > thumb mode. So when backing up, we need to do an address lookup on > > > > > the address we think we want to go to, and then adjust the starting > > > > > address accordingly. Something like: > > > > > > > > > > ``` > > > > > SBAddress start_sbaddr = (base_addr - bytes_offset, g_vsc.target); > > > > > ``` > > > > > now we have a section offset address that can tell us more about what > > > > > it is. We can find the SBFunction or SBSymbol for this address and > > > > > use those to find the right instructions. This will allow us to > > > > > correctly disassemble code bytes. > > > > > > > > > > We can also look at the section that the memory comes from and see > > > > > what the section contains. If the section is data, then emit > > > > > something like: > > > > > ``` > > > > > 0x00001000 .byte 0x23 > > > > > 0x00001001 .byte 0x34 > > > > > ... > > > > > ``` > > > > > To find the section type we can do: > > > > > ``` > > > > > SBSection section = start_sbaddr.GetSection(); > > > > > if (section.IsValid() && section.GetSectionType() == > > > > > lldb::eSectionTypeCode) { > > > > > // Disassemble from a valid boundary > > > > > } else { > > > > > // Emit a byte or long at a time with ".byte 0xXX" or other ASM > > > > > directive for binary data > > > > > } > > > > > ``` > > > > > We need to ensure we start disassembling on the correct instruction > > > > > boundary as well as our math for "start_addr" might be in between > > > > > actual opcode boundaries. If we are in a lldb::eSectionTypeCode, then > > > > > we know we have instructions, and if we are not, then we can emit > > > > > ".byte" or other binary data directives. So if we do have > > > > > lldb::eSectionTypeCode as our section type, then we should have a > > > > > function or symbol, and we can get instructions from those objects > > > > > easily: > > > > > > > > > > ``` > > > > > if (section.IsValid() && section.GetSectionType() == > > > > > lldb::eSectionTypeCode) { > > > > > lldb::SBInstructionList instructions; > > > > > lldb::SBFunction function = start_sbaddr.GetFunction(); > > > > > if (function.IsValid()) { > > > > > instructions = function.GetInstructions(g_vsc.target); > > > > > } else { > > > > > symbol = start_sbaddr.GetSymbol(); > > > > > if (symbol.IsValid()) > > > > > instructions = symbol.GetInstructions(g_vsc.target); > > > > > } > > > > > const size_t num_instrs = instructions.GetSize(); > > > > > if (num_instrs > 0) { > > > > > // we found instructions from a function or symbol and we need to > > > > > // find the matching instruction that we want to start from by > > > > > iterating > > > > > // over the instructions and finding the right address > > > > > size_t matching_idx = num_instrs; // Invalid index > > > > > for (size_t i=0; i<num_instrs; ++i) { > > > > > lldb::SBInstruction inst = instructions.GetInstructionAtIndex(i); > > > > > if (inst.GetAddress().GetLoadAddress(g_vsc.target) >= start_addr) > > > > > { > > > > > matching_idx = i; > > > > > break; > > > > > } > > > > > } > > > > > if (matching_idx < num_instrs) { > > > > > // Now we can print the instructions from [matching_idx, > > > > > num_instrs) > > > > > // then we need to repeat the search for the next function or > > > > > symbol. > > > > > // note there may be bytes between functions or symbols which we > > > > > can disassemble > > > > > // by calling _get_instructions_from_memory(...) but we must find > > > > > the next > > > > > // symbol or function boundary and get back on track > > > > > } > > > > > > > > > > ``` > > > > Sorry, I should have provided a proper explanation. > > > > > > > > I use the maximum instruction size as the "worst case". Basically, I > > > > need to read a portion of memory but I do not know the start address > > > > and the size. For the start address, if I want to read N instructions > > > > before `base_addr` I need to read at least starting from `base_addr - N > > > > * max_instruction_size`: if all instructions are of size > > > > `max_instruction_size` I will read exactly N instructions; otherwise I > > > > will read more than N instructions and prune the additional ones > > > > afterwards. Same for applies for the size. > > > > > > > > Since `start_addr` is based on a "worst case", it may be an address in > > > > the middle of an instruction. In that case, that first instruction will > > > > be misinterpreted, but I think that is negligible. > > > > > > > > The logic is similar to what is implemented in other VS Code > > > > extensions, like > > > > https://github.com/vadimcn/vscode-lldb/blob/master/adapter/src/debug_session.rs#L1134. > > > > > > > > Does it make sense? > > > The issue is, you might end up backing up by N bytes, and you might not > > > end up on an opcode boundary. Lets say you have x86 disassembly like: > > > > > > ``` > > > 0x100002e37 <+183>: 48 8b 4d f8 movq -0x8(%rbp), %rcx > > > 0x100002e3b <+187>: 48 39 c8 cmpq %rcx, %rax > > > 0x100002e3e <+190>: 0f 85 09 00 00 00 jne 0x100002e4d > > > ; <+205> at main.cpp > > > 0x100002e44 <+196>: 8b 45 94 movl -0x6c(%rbp), %eax > > > 0x100002e47 <+199>: 48 83 c4 70 addq $0x70, %rsp > > > 0x100002e4b <+203>: 5d popq %rbp > > > 0x100002e4c <+204>: c3 retq > > > 0x100002e4d <+205>: e8 7e 0f 00 00 callq 0x100003dd0 > > > ; symbol stub for: __stack_chk_fail > > > 0x100002e52 <+210>: 0f 0b ud2 > > > ``` > > > > > > Let's say you started with the address 0x100002e4c, and backed up by the > > > max opcode size of 15 for x86_64, that would be 0x100002e3d. You would > > > start disassembling on a non opcode boundary as this is the last byte of > > > the 3 byte opcode at 0x100002e3b (0xc8). And this would disassembly > > > incorrectly. So we need to make use of the functions or symbol > > > boundaries to ensure we are disassembling correctly. If we have no > > > function or symbol, we can do our best. But as you can see we would get > > > things completely wrong in this case and we need to fix this as detailed. > > Actually, I was mislead by the fact that so far I tried on both: > > - my ARM machine, with instructions of fixed size (4) > > - with [WAMR](https://github.com/bytecodealliance/wasm-micro-runtime) > > disassembling to WASM, where, when the start address is in the middle of an > > instruction, only that first instruction is misinterpreted > > > > Without going into sections and symbols as you were proposing, the solution > > can be as easy as done in this VS Code Extension: > > https://github.com/vadimcn/vscode-lldb/commit/28ae4f4bf3bd29a0c84dd586cd5360836210ab51. > > > > I'll update the code and put some additional comments to clarify the logic. > > Let me know what you think. > > Actually, I was mislead by the fact that so far I tried on both: > > - my ARM machine, with instructions of fixed size (4) > > - with [WAMR](https://github.com/bytecodealliance/wasm-micro-runtime) > > disassembling to WASM, where, when the start address is in the middle of an > > instruction, only that first instruction is misinterpreted > > > > Without going into sections and symbols as you were proposing, the solution > > can be as easy as done in this VS Code Extension: > > https://github.com/vadimcn/vscode-lldb/commit/28ae4f4bf3bd29a0c84dd586cd5360836210ab51. > > > > I'll update the code and put some additional comments to clarify the logic. > > Let me know what you think. > > @clayborg what I wrote in the comment before this one is what I implemented (Diff 6), do you think that is not enough? I'll try it on a x86 linux machine to make sure it works there Repository: rG LLVM Github Monorepo CHANGES SINCE LAST ACTION https://reviews.llvm.org/D140358/new/ https://reviews.llvm.org/D140358 _______________________________________________ lldb-commits mailing list lldb-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits