Thanks for the responses; I think they are valuable. I have embedded my opinions inline with yours and will leave the dispatch context discussion to @kevinthesun.
Also cc @tqchen and @icemelon9 for their inputs.

> > If we design this resume logic in a general way, we can also extend it to tophub.
>
> Does it make sense to generalise here? As far as I can tell, TopHub doesn't store tuning history, just optimal configs, so there's no way to 'resume' a TopHub tuning session. In some way we have to determine whether the existing 'tuning effort' to produce a particular config is sufficient, and the number of trials is the only obvious way I can think of characterising this. I'd be happy to look at any alternative implementation idea though.

I agree with you that TopHub serves a different purpose if we bring the trial number into the resume logic, but the two can still share the same implementation and history format in the way I suggested. My concern with using the trial number is that it restricts this RFC to the single use case of resuming interrupted tuning, and rules out others such as handing the tuning process over to someone else, or reusing the configs from a 2,000-trial random search to launch a new grid search.

Alternatively, we could decouple the history from any specific tuning process. Specifically, we would not add any process-specific information to the config library; instead, the tuner decides, whenever it is about to measure a config, whether it can reuse a result already in the library. For example, if a tuning process was interrupted at the 50th trial, we have 50 configs in the library. When resuming, the tuner still starts from scratch, but it can skip the measurement of those 50 configs as long as it follows the same tuning trajectory (see the config-reuse sketch at the end of this reply). One advantage is that this also works across different tuners, or even different models that share the same task. One drawback of my alternative compared to yours is that if the tuning process is non-deterministic (e.g., random search), we might spend time tuning different configs. I think this can be worked around either by exposing an optional random seed argument on the tuner (similar to `random_state` in `sklearn`), or by letting the user reduce the trial number when resuming.

> > > We can try to retrieve the target device info using system call and add it to every record when dumping to file/database.
>
> This would be a good start, but I think this needs to also be something a user can fully specify. For instance, we might be interested in driver versions, memory clock speeds or even physical parameters such as board cooling. Which system calls were you considering using to determine the platform? Perhaps have a default method that relies on these calls with the ability to pass additional arbitrary info to `config_library.start_job()`?

I actually have the same question. This part is still relatively vague and probably needs input from others (the metadata sketch at the end of this reply shows one possible shape for it).

> > In my personal opinion, we also need to invalidate the history/records when TVM has been updated
>
> I agree with this, but maybe it can be included as part of the previous point on board configuration? In a general sense we need an idea of whether a particular config is 'compatible' with our current platform and I think it's reasonable to include TVM version as a part of this.

Your response reminded me that the current config history format already includes a version field, although it is always 0.1. I am not sure whether we can make use of it and save some effort.
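To make the config-reuse alternative a bit more concrete, here is a minimal Python sketch. All names in it (`ConfigLibrary`, `lookup`, `save`, `measure_with_reuse`, and the simplified `tuner.update` call) are made up for illustration and are not existing AutoTVM APIs; the only point is that reuse happens at measurement time, so the library stays independent of any particular tuning process.

```python
# Hypothetical sketch: none of these names are part of the current AutoTVM API.

class ConfigLibrary:
    """Maps a (task, config) key to a previously measured result."""

    def __init__(self):
        self._records = {}  # {(task_name, config_index): measure_result}

    def lookup(self, task, config):
        return self._records.get((task.name, config.index))

    def save(self, task, config, result):
        self._records[(task.name, config.index)] = result


def measure_with_reuse(tuner, task, configs, library, measure_fn):
    """Measure a batch of configs, skipping any the library already holds.

    The tuner itself still starts from scratch; only the expensive hardware
    measurements are reused, so the same library also helps a different
    tuner (or a different model) that visits the same configs of this task.
    """
    results = []
    for config in configs:
        cached = library.lookup(task, config)
        if cached is not None:
            results.append(cached)           # reuse: skip hardware measurement
        else:
            result = measure_fn(config)      # normal measurement path
            library.save(task, config, result)
            results.append(result)
    # Feed results back so the tuner's search strategy is unchanged;
    # the real AutoTVM update() takes MeasureInput/MeasureResult pairs.
    tuner.update(configs, results)
    return results
```

For a random-search tuner the reuse rate depends on hitting the same configs again, which is where the optional seed (or a reduced trial count on resume) mentioned above would come in.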
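On the device-info and TVM-version questions, the following is only a rough metadata sketch of one possible shape, assuming a hypothetical `collect_platform_info` helper whose output would be stamped onto every dumped record; the `extra_info` argument stands in for the arbitrary user-supplied data that could flow through the proposed `config_library.start_job()`. It uses only standard-library calls plus `tvm.__version__`.

```python
import os
import platform


def collect_platform_info(extra_info=None):
    """Gather basic host facts to attach to every record when dumping."""
    info = {
        "os": platform.system(),        # e.g. "Linux"
        "os_release": platform.release(),
        "arch": platform.machine(),     # e.g. "x86_64", "aarch64"
        "cpu_count": os.cpu_count(),
    }
    try:
        import tvm
        info["tvm_version"] = tvm.__version__
    except ImportError:
        info["tvm_version"] = None
    # Things we cannot detect reliably (driver version, memory clock,
    # board cooling, ...) must come from the user and simply extend/override.
    if extra_info:
        info.update(extra_info)
    return info


def is_record_valid(record_info, current_info, keys=("arch", "tvm_version")):
    """Decide whether an old record is still 'compatible' with this setup.

    Which keys matter, and how strictly to compare them, is exactly the open
    question in this thread; equality on a few keys is only a starting point.
    """
    return all(record_info.get(k) == current_info.get(k) for k in keys)
```

Whether the existing version field in the config history (currently always 0.1) can double as the TVM version here, or should stay a separate format version, is the open point above.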