eloparco added inline comments.
================ Comment at: lldb/tools/lldb-vscode/lldb-vscode.cpp:2177 + const auto max_instruction_size = g_vsc.target.GetMaximumOpcodeByteSize(); + const auto bytes_offset = -instruction_offset * max_instruction_size; + auto start_addr = base_addr - bytes_offset; ---------------- clayborg wrote: > eloparco wrote: > > clayborg wrote: > > > Just checked out your changes, and you are still just subtracting a value > > > from the start address and attempting to disassemble from memory which is > > > the problem. We need to take that subtracted address, and look it up as > > > suggested in previous code examples I posted. If you find a function to > > > symbol, ask those objects for their instructions. and then try to use > > > those. > > > > > > But basically for _any_ disassembly this is what I would recommend doing: > > > - first resolve the "start_address" (no matter how you come up the > > > address) that want to disassemble into a SBAddress > > > - check its section. If the section is valid and contains instructions, > > > call a function that will disassemble the address range for the section > > > that starts at "start_address" and ends at the end of the section. We can > > > call this "disassemble_code" as a function. More details on this below > > > - If the section does not contain instructions, just read the bytes and > > > emit a lines like: > > > ``` > > > 0x1000 .byte 0x12 > > > 0x1000 .byte 0x34 > > > ... > > > ``` > > > > > > Now for the disassemble_code function. We know the address range for this > > > is in code. We then need to resolve the address passed to > > > "disassemble_code" into a SBAddress and ask that address for a SBFunction > > > or SBSymbol as I mentioned. Then we ask the SBFunction or SBSymbol for > > > all instructions that they contain, and then use any instructions that > > > fall into the range we have. If there is no SBFunction or SBSymbol, then > > > disassemble an instruction at a time and then see if the new address will > > > resolve to a function or symbol. > > Tried my changes on a linux x86 machine and the loop `for (unsigned i = 0; > > i < max_instruction_size; i++) {` (L2190) takes care of the `start_address` > > possibly being in the middle of an instruction, so that's not a problem. > > The problem I faced is that it tries to read too far from `base_addr` and > > the `ReadMemory()` operation returns few instructions (without reaching > > `base_addr`). That was not happening on my macOS M1 (arm) machine. > > > > To solve, I changed the loop at L2190 to > > ``` > > for (unsigned i = 0; i < bytes_offset; i++) { > > auto sb_instructions = > > _get_instructions_from_memory(start_addr + i, disassemble_bytes); > > ``` > > and if `start_addr` is in `sb_instructions` we're done and can exit the > > loop. That worked. > > > > Another similar thing that can be done is to start from `start_sbaddr` as > > you were saying, increment the address until a valid section is found. Then > > call `_get_instructions_from_memory()` passing the section start. > > What do you think? Delegating the disassembling to `ReadMemory()` + > > `GetInstructions()` looks simpler to me than to manually iterate over > > sections and get instructions from symbols and functions. > > Is there any shortcoming I'm not seeing? > so your > ``` > for (unsigned i = 0; i < max_instruction_size; i++) { > ``` > disassembles and tries to make sure you make it back to the original base > address from the original disassemble packet? That can work but could a a bit > time consuming? > > The main issue, as you saw on x86, is you don't know what is in memory. You > could have unreadable areas of memory when trying to disassemble. Also if you > do have good memory that does contain instructions, there can be padding in > between functions or even data between functions that the function accesses > that can't be correctly disassembled and could throw things off again. > > The memory regions are the safest way to traverse memory to see what you have > and would help you deal with holes in the memory. You can ask about a memory > region with: > ``` > lldb::SBError SBProcess::GetMemoryRegionInfo(lldb::addr_t load_addr, > lldb::SBMemoryRegionInfo ®ion_info); > ``` > If you ask about an invalid address, like zero on most platforms, it will > return a valid "region_info" with the bounds of the unreadable address range > filled in no read/write/execute permissions: > ``` > (lldb) script > Python Interactive Interpreter. To exit, type 'quit()', 'exit()' or Ctrl-D. > >>> region = lldb.SBMemoryRegionInfo() > >>> err = lldb.process.GetMemoryRegionInfo(0, region) > [0x0000000000000000-0x0000000100000000 ---] > ``` > So you could use the memory region info to your advantage here. If you have > execute permissions, disassemble as instructions, and if you don't emit > ".byte 0xXX" for each byte. If there are no permissions, you can emit some > other string like "0x00000000: <invalid memory>". > > That being said, even when you do find an executable section of memory, there > can be different stuff there even if a section _is_ executable. For instance, > if we ask about the next memory region on a mac M1: > ``` > >>> err = lldb.process.GetMemoryRegionInfo(0x0000000100000000, region) > >>> print(region) > [0x0000000100000000-0x0000000100004000 R-X] > ``` > Notice that this is read + execute (you can access these via: > ``` > region.IsReadable() > region.IsWritable() > region.IsExecutable() > ``` > But at this address, this is the mach-o header which doesn't make sense to > try and disassemble: > ``` > (lldb) memory read 0x0000000100000000 > 0x100000000: cf fa ed fe 0c 00 00 01 00 00 00 00 02 00 00 00 ................ > 0x100000010: 11 00 00 00 18 03 00 00 85 00 20 00 00 00 00 00 .......... ..... > (lldb) memory read -fx -s4 -c 4 0x0000000100000000 > 0x100000000: 0xfeedfacf 0x0100000c 0x00000000 0x00000002 > ``` > 0xfeedfacf is the mach-o magic bytes for little endian 64 bit mach-o files. > ``` > (lldb) disassemble --start-address 0x0000000100000000 > a.out`_mh_execute_header: > 0x100000000 <+0>: .long 0xfeedfacf ; unknown opcode > 0x100000004 <+4>: .long 0x0100000c ; unknown opcode > 0x100000008 <+8>: udf #0x0 > 0x10000000c <+12>: udf #0x2 > 0x100000010 <+16>: udf #0x11 > 0x100000014 <+20>: udf #0x318 > 0x100000018 <+24>: .long 0x00200085 ; unknown opcode > 0x10000001c <+28>: udf #0x0 > ``` > > So this is the main reason why I would suggest just disassembling using .byte > or .long when we aren't in a function or symbol. > understood, so in that case we also have to do the same for `_handle_disassemble_positive_offset`, because even there, when we use `ReadInstructions()`, we could have some unreadable areas in the middle and `ReadInstructions()` would stop prematurely when running into one of them, resulting in less instructions returned than expected Repository: rG LLVM Github Monorepo CHANGES SINCE LAST ACTION https://reviews.llvm.org/D140358/new/ https://reviews.llvm.org/D140358 _______________________________________________ lldb-commits mailing list lldb-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits