Github user dschneider-pivotal commented on a diff in the pull request:

    https://github.com/apache/geode/pull/559#discussion_r120224399

    --- Diff: geode-docs/managing/troubleshooting/system_failure_and_recovery.html.md.erb ---
    @@ -276,8 +276,83 @@ find the reason.
     Description:
    -The process discovered that it was not in the distributed system and cannot determine why it was removed. The membership coordinator removed the member after it failed to respond to an internal are you alive message.
    +The process discovered that it was not in the distributed system and cannot determine why it was
    +removed. The membership coordinator removed the member after it failed to respond to an internal
    +are-you-alive message.
     Response:
     The operator should examine the locator processes and logs.
    +
    +## <a id="restart-failure-persistent-lru" class="no-quick-link"></a> Restart Fails Due To Out-of-Memory Error
    +
    +This section describes a restart failure that can occur when the stopped system is one that was configured with persistent regions. Specifically:
    +
    +- Some of the regions of the recovering system, when running, were configured as PERSISTENT regions, which means that they save their data to disk.
    +- At least one of the persistent regions was configured to evict least recently used (LRU) data by overflowing values to disk.
    +
    +### How Data is Recovered From Persistent Regions
    +
    +Data recovery, upon restart, always recovers keys. You can configure whether and how the system
    +recovers the values associated with those keys to populate the system cache.
    +
    +**Value Recovery**
    +
    +- Recovering all values immediately during startup slows the startup time but results in consistent
    +read performance after the startup on a "hot" cache.
    +
    +- Recovering no values means quicker startup but a "cold" cache, so the first retrieval of each value will read from disk.
    +
    +- Retrieving values asynchronously in a background thread allows a relatively quick startup on a "warm" cache
    +that will eventually recover every value.
    +
    +**Retrieve or Ignore LRU values**
    +
    +When a system with persistent LRU regions shuts down, the system does not record which of the values
    +were recently used. On subsequent startup, if values are recovered into an LRU region they may be
    +the least recently used instead of the most recently used. Also, if LRU values are recovered on a
    +heap or an off-heap LRU region, it is possible that the LRU memory limit will be exceeded, resulting
    +in an `OutOfMemoryException` during recovery. For these reasons, LRU value recovery can be treated
    +differently than non-LRU values.
    +
    +## Default Recovery Behavior for Persistent Regions
    +
    +The default behavior is for the system to recover all keys, then asynchronously recover all data
    +values that were resident, leaving LRU values unrecovered. This default strategy is best for
    +most applications, because it strikes a balance between recovery speed and cache completeness.
    +
    +### Configuring Recovery of Persistent Regions
    +
    +Three Java system parameters allow the developer to control the recovery behavior for persistent regions:
    +
    +- `gemfire.disk.recoverValues`
    +
    + Default = `true`, recover values. If `false`, recover only keys, do not recover values.
    +
    + *How used:* When `true`, recovery of the values "warms up" the cache so data retrievals will find
    + their values in the cache, without causing time consuming disk accesses. When `false`, shortens
    + recovery time so the system becomes available for use sooner, but the first retrieval on each key
    + will require a disk read.
    +
    +- `gemfire.disk.recoverLruValues`
    +
    + Default = `false`, do not recover LRU values. If `true`, recover LRU values. If
    + `gemfire.disk.recoverValues` is `false`, then `gemfire.disk.recoverLruValues` is ignored, since
    + no values are recovered.
    +
    + *How used:* When `false`, shortens recovery time by ignoring LRU values. When `true`, restores
    + more data values to the cache. Recovery of the LRU values increases heap memory usage and
    + could cause an out-of-memory error, preventing the system from restarting.
    +
    +- `gemfire.disk.recoverValuesSync`
    +
    + Default = `false`, recover values by an asynchronous background process. If `true`, values are
    + recovered synchronously, and recovery is not complete until all values have been retrieved. If
    + `gemfire.disk.recoverValues` is `false`, then `gemfire.disk.recoverValuesSync` is ignored since
    + no values are recovered.
    +
    + *How used:* When `false`, allows the system to become available sooner, but some time must elapse
    + before the entire cache is refreshed. Some key retrievals will require disk access, and some will not.
    --- End diff --

    change "the entire cache is refreshed" to "all values have been read from disk into cache memory"
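For anyone checking how the three properties in the diff interact, here is a minimal Java sketch of the rules as the doc text states them. It is illustrative only and is not Geode's recovery code: the class `RecoverySettingsSketch` and its methods are hypothetical names, and only the three property names and their defaults come from the diff.

```java
import java.util.Properties;

// Hypothetical helper: models the rules described in the doc text above.
// It does not call into Geode and is not how Geode implements recovery.
final class RecoverySettingsSketch {

    // Reads a boolean property, falling back to the default stated in the doc.
    static boolean flag(Properties props, String name, boolean defaultValue) {
        return Boolean.parseBoolean(props.getProperty(name, String.valueOf(defaultValue)));
    }

    // Summarizes the effective recovery behavior for a set of properties.
    static String describe(Properties props) {
        boolean recoverValues = flag(props, "gemfire.disk.recoverValues", true);
        if (!recoverValues) {
            // Per the doc, the other two properties are ignored in this case.
            return "keys only";
        }
        boolean lru = flag(props, "gemfire.disk.recoverLruValues", false);
        boolean sync = flag(props, "gemfire.disk.recoverValuesSync", false);
        return "keys + " + (lru ? "all values" : "non-LRU values")
                + (sync ? ", synchronously" : ", asynchronously");
    }

    public static void main(String[] args) {
        // System.getProperties() includes any -D flags passed to the JVM.
        System.out.println(describe(System.getProperties()));
    }
}
```

Running it with a standard JVM flag such as `-Dgemfire.disk.recoverValues=false` reports the keys-only case, whatever the other two properties are set to.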