Author: brett
Date: Wed Jul 26 15:06:23 2006
New Revision: 425872

URL: http://svn.apache.org/viewvc?rev=425872&view=rev
Log:
[MRM-127] update intended design

Modified:
    maven/repository-manager/trunk/maven-repository-indexer/src/site/apt/design.apt

Modified: maven/repository-manager/trunk/maven-repository-indexer/src/site/apt/design.apt
URL: http://svn.apache.org/viewvc/maven/repository-manager/trunk/maven-repository-indexer/src/site/apt/design.apt?rev=425872&r1=425871&r2=425872&view=diff
==============================================================================
--- maven/repository-manager/trunk/maven-repository-indexer/src/site/apt/design.apt (original)
+++ maven/repository-manager/trunk/maven-repository-indexer/src/site/apt/design.apt Wed Jul 26 15:06:23 2006
@@ -11,16 +11,18 @@
   <<Note: The current indexer design is under review. This document will grow into what it should be, and the code and
   tests refactored to match>>

+  ~~TODO: separate API design from Lucene implementation design
+
 * Standard Artifact Index

   We currently want to index these elements from the repository:

-    * for each artifact file: the artifact ID, version, group ID, classifier, type (extension), filename,
-      checksums (md5, sha1) and size
+    * for each artifact file: the artifact ID, version, group ID, classifier, type (extension), filename (including path
+      from the repository base), checksums (md5, sha1) and size

     * for each artifact POM: the packaging, licenses, dependencies, build plugins, reporting plugins

-    * plugin prefix from the repository metadata (in the future, more may be indexed)
+    * plugin prefix

     * Java classes and packages within a JAR artifact (delimited by \n)

@@ -32,23 +34,42 @@
   record may need to be updated when different files that are related to the same artifact are discovered (ie, the POM,
   or for plugins the metadata that contains their prefix).

-  Records in the index are generally keyed by their dependency conflict ID (ie, a combination of group, artifact,
-  version, type and classifier). The exception to this rule is the POM: if an entry already exists with a different
-  type but the same group, artifact, version and no classifier, then a POM entry is not added and the model fields are
-  applied to the existing entry. Conversely, if a POM is added first and an artifact with the same group, artifact,
-  version and no classifier is later added then it overwrites the record of the POM.
-
-  The above process, especially with regard to the handling of the POM, should be much simpler if the discoverer is
-  able to associate a POM to the artifact instead of feeding them in separately as it does at present.
-
-  While some of the information stored is specific to a particular type of file, it is all maintained in a single index
-  for simplicity. In the future, if the content of the various documents diverges greatly, it may be split into separate
-  indexes. In that case, we may consider using Lucene's
-  {{{http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-b11296f9e7b2a5e7496d67118d0a5898f2fd9823} multiple index
-  searching capabilities}}.
+  To simplify this, the process for discovery is as follows:

-  Currently, the discoverer returns POMs as separate artifact entries to the actual artifact, and any derived artifacts
-  in the repository. To accommodate this, when indexed
+    * Discovered artifacts will read the related POM and metadata from the repository to index, rather than relying on
+      it being discovered. This ensures that partial discovery still yields correct results in all cases, and it is
+      possible to construct the entire record without having to read back from the index.
+
+    * POMs that do not have a packaging of POM are not sent to the indexer.
+
+  The result of this process is that updates to a POM or repository metadata and not the corresponding artifact(s) will
+  not update the index. As POMs should not be modified, this will not be a major concern. Likewise, updates to metadata
+  will only accompany updates to the artifact itself, so will not cause a problem.
+
+  The above case may have a problem if the discovery happens during the middle of a deployment outside of the
+  repository manager (where the artifact is present, but the metadata or POM is not). To avoid such cases, the
+  discoverer should only detect changes more than a minute old (this blackout should be configurable).
+
+  Other techniques were considered:
+
+    * Processing each artifact file individually, updating each record as needed. This would result in having to read
+      back each index record before writing. This is quite costly in Lucene as it would be "read, delete, add". You
+      must have a reader and writer open for that process, and it greatly complicates the code.
+
+    * Have three indices, one for each. This would complicate searching (and may affect ranking of results, though this
+      was not analysed). While Lucene is
+      {{{http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-b11296f9e7b2a5e7496d67118d0a5898f2fd9823} capable of
+      searching multiple indices}}, it is expected that the results would be in the form of a list of separate records
+      rather than the "table join" this effectively is. A similar derivative of this technique would be to store
+      everything in one index, using a field (previously, doctype) to identify each record.
+
+  Records in the index are keyed by their path from the repository root. While this is longer than using the
+  dependency conflict ID, Lucene cannot delete by a combination of terms, so would require storing an additional
+  field in the index where the file already exists.
+
+  The plugin prefix can be found either from inside the plugin JAR (<<<META-INF/maven/plugin.xml>>>), or from the
+  repository metadata for the plugin's group. For simplicity, the first approach will be used. This means at present
+  there is no need to index the repository metadata, however that may be considered in future.

   Note that archetypes currently don't have a packaging associated with them in Maven, so it is not recorded in the POM.
   However, to be able to search by this type, the indexer will look for a <<<META-INF/maven/archetype.xml>>> file, and
@@ -83,7 +104,7 @@

     * <<<m>>>: md5 checksum of the JAR

-  Only JARs are indexed at present.
+  Only JARs are indexed at present. The JAR filename is used as the key for later deleting entries.

 * Searching

@@ -92,9 +113,13 @@

   Some features that will be available:

-    * <Search by a particular field (exact match)>: This would be needed for search by checksum
+    * <Search through most fields for a particular keyword>: the general case described above.
+
+    * <Search by a particular field (exact match)>: This would be needed for search by checksum.

-    * <Search in a range of field values>: This would be needed for searching based on update time
+    * <Search in a range of field values>: This would be needed for searching based on update time. Note that in
+      Lucene it may be better to search by other fields (or return all), and then filter the results by dates rather
+      than making dates part of a search query.

     * <Limit search to particular fields>: It will be useful to only search Java classes and packages, for example

@@ -102,10 +127,3 @@
   reasons. It should not have to read any metadata files or properties of files such as size and checksum from the
   disk. This enables searching a repository remotely without having the physical repository available, which is
   useful for IDE integration among other things.
-
-* Limitations
-
-  Currently, because the POM and artifacts are fed in separately, there is no way to associate an artifact with a
-  classifier to its POM, meaning there is less information about it in the index. It may be best that this occurs by
-  design - it seems that while it is desirable to search by classifier you only want to find the main artifact for
-  browsing and see the derived artifact listed under that. How this evolves should be carefully considered.
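
Illustrative note (not part of the commit): the revised design says the discoverer should only detect changes more than a minute old, with a configurable blackout. A minimal sketch of that rule follows, assuming the discoverer can obtain a last-modified timestamp for each repository path. The class BlackoutFilter, its methods and the example paths are hypothetical names invented for this sketch; they are not taken from the repository manager code.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (an assumption, not indexer code): hides files changed within the blackout window. */
class BlackoutFilter {
    private final Duration blackout;

    /** The blackout is passed in so that it stays configurable, as the design asks. */
    BlackoutFilter(Duration blackout) {
        this.blackout = blackout;
    }

    /** True only when the change is strictly older than the blackout window. */
    public boolean isStable(Instant lastModified, Instant now) {
        return lastModified.plus(blackout).isBefore(now);
    }

    /** Returns, in the map's iteration order, the paths whose last change falls outside the window. */
    public List<String> stablePaths(Map<String, Instant> lastModifiedByPath, Instant now) {
        List<String> stable = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : lastModifiedByPath.entrySet()) {
            if (isStable(entry.getValue(), now)) {
                stable.add(entry.getKey());
            }
        }
        return stable;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        // Made-up paths: one artifact deployed five minutes ago, one five seconds ago.
        Map<String, Instant> seen = new LinkedHashMap<>();
        seen.put("org/example/demo/1.0/demo-1.0.jar", now.minusSeconds(300));
        seen.put("org/example/demo/1.1/demo-1.1.jar", now.minusSeconds(5));

        BlackoutFilter filter = new BlackoutFilter(Duration.ofMinutes(1));
        System.out.println(filter.stablePaths(seen, now));
    }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the filter, keeps the rule easy to test. A file skipped on one pass would simply be picked up on a later pass once it has aged past the window.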