Hi,

I've just created 2 pull requests:
https://github.com/apache/maven-indexer/pull/10 (indexer-cli not working as expected)
https://github.com/apache/maven-indexer/pull/11 (what I mentioned about the ArtifactInfo's repositoryID being null)
Hope you have the time to have a look. I'm particularly interested in #11.
If it is accepted, I might continue exploring option 1), as detailed in my
previous post, since I am currently blocked by the problem described in 1.1).

Thanks,
Eduard

On Mon, Dec 8, 2014 at 10:26 PM, Tamas Cservenak <[email protected]> wrote:

> Hi Eduard,
>
> for additional information see:
> http://jira.codehaus.org/browse/MINDEXER-81
>
> Currently, the ArtifactInfo is hardwired and not extensible.
>
> Re the available index for Central, it’s not “minimal usable” that is
> the decision driver, but the SIZE of the index download instead.
> We were experimenting with different creators, but the bandwidth
> it took (compared to artifact downloads) was really huge.
>
> Almost everyone uses MRMs, and they tend to “improve” the
> basic GAV index Central publishes (i.e. once Nexus caches a
> JAR file, it will “improve” the index with the class names in the
> JAR too, something Central does not publish).
>
> artifactInfo#repoId should not return null, if asked via context.
> If it does, there is a bug lurking somewhere.
>
> Currently the “extra info” path is viable, but that would create a lot
> of cruft around indexer classes.
>
> --
> Thanks,
> ~t~
>
> On 8 Dec 2014 at 17:15:08, Eduard Moraru ([email protected]) wrote:
>
> Hi,
>
> I have a new challenge for your maven-indexer expertise :)
>
> What about adding additional information to the local index? I see the
> default indexers (min, etc.) produce really minimal information. The
> problem is that everybody is using these default indexers and all the
> available indexes (Maven Central, etc.) offer very little information
> that you can actually use to make the index useful in an application
> beyond really basic name, description, group, artifact, etc. queries.
>
> For instance, if I wanted to add author information (to query by
> author) or dependency information (to perform compatibility checks
> against an installation/group of installed artifacts) or anything else
> for that matter, what would be the recommended approach?
>
> From what I have researched so far, I see 2 options:
>
> 1) Have a custom IndexCreator that uses the updateDocument(ArtifactInfo
> artifactInfo, Document document) method to fetch (HTTP GET) the pom.xml
> by using information from the artifactInfo object (repository, groupId,
> artifactId, classifier, version, etc.) so that the resulting document
> contains the extra information. It seems that IndexCreators are used a
> lot more than they are advertised in the descriptions, not only for
> indexing new items, but also when converting between ArtifactInfo
> objects and Lucene Documents.
>
> 1.1) I had initially started going down this path, but then I realized
> that the artifactInfo that I receive in this method does not provide
> basic information (i.e. artifactInfo.getRepository() always returns
> null ;-( ). It would be awesome if information like context and/or
> repository would be added to the artifactInfo object (maybe in
> IndexUtils.constructArtifactInfo( Document doc, IndexingContext context )?),
> the same way the ArtifactInfo.UINFO and ArtifactInfo.LAST_MODIFIED
> fields are handled specially and explicitly added to a new Document
> that is passed to the IndexCreators.
>
> 2) Handle this separately from maven-indexer's work, and do it right
> after index/update operations, i.e. let maven-indexer update the local
> index with information from the remote index and then start
> manipulating the underlying Lucene index by adding information
> retrieved from the network (HTTP GET) from the remote repository's
> POM files.
> In rough pseudocode, something like:
>
> indexer.update(repoX);
> indexer.getAllIndexedArtifacts().forEach(artifact -> {
>     var extraData = getExtraData(repoX, artifact);
>     indexer.getLuceneIndex().add(artifact, extraData);
> });
>
> 3) Any other suggestions?
>
> My ultimate goal is (besides basic name/description queries) to be able
> to perform compatibility queries on artifacts coming from multiple
> repositories, so I need to find a solution to add this missing
> information (artifact dependencies, and maybe more).
>
> As before, your help and suggestions are most welcome.
>
> Thanks,
> Eduard
>
> On Wed, Nov 26, 2014 at 1:22 PM, Eduard Moraru <[email protected]>
> wrote:
>
> > On Tue, Nov 25, 2014 at 12:22 PM, Tamas Cservenak <[email protected]>
> > wrote:
> >
> >> Hi there,
> >>
> >> 1) yes, indexing context retains the artefact “origin” (i.e. repo),
> >> so you need a context per origin. Sadly, 1 index per context is a
> >> current limitation of maven indexer, but this problem is known.
> >> Created http://jira.codehaus.org/browse/MINDEXER-93
> >>
> >> 2) Yes, merged context is basically delegating to member contexts.
> >> Under the hood, it uses Lucene’s MultiReader to actually perform the
> >> search.
> >>
> >
> > I have solved the search problem for now by using the SearchEngine
> > component and issuing an IteratorSearchRequest on a list of
> > IndexingContexts to get paginated results. Will have to see how that
> > works in the long run.
> >
> > Thanks,
> > Eduard
> >
> >> Re ranging, there are already issues (or problem spread across
> >> multiple issues), most notably this one
> >> http://jira.codehaus.org/browse/MINDEXER-8
> >>
> >> 3) I think yes. Currently, indexer is being transitioned from Plexus
> >> to JSR330, and as you see in examples, it should work with any
> >> container supporting it.
> >> Re “manually wiring”, in the latest releases you might be able
> >> to do it, but in older ones probably not, as Plexus supported field
> >> injection only, and some of those members were not exposed via
> >> getter/setter. See
> >> http://jira.codehaus.org/browse/MINDEXER-80
> >>
> >> --
> >> Thanks,
> >> ~t~
> >>
> >> On 21 Nov 2014 at 18:08:26, Eduard Moraru ([email protected]) wrote:
> >>
> >> Hi,
> >>
> >> I have recently started playing with the maven indexer [1], following
> >> the examples [2], and I have some questions (since AFAIS,
> >> documentation is practically nonexistent on the matter):
> >>
> >> 1) From what I can understand, you need an IndexingContext for each
> >> repository you plan to index. This makes you end up with n Lucene
> >> indexes, one for each repository. Is there any way that I could have
> >> just 1 Lucene index, with all my repositories indexed in the same
> >> place? If the main purpose is searching, why scatter the indexed
> >> information across n indexes and make the whole process difficult?
> >> Maybe I'm missing something.
> >>
> >> 2) Along the same lines as the first question, when it comes to
> >> searching, it seems that I can use a MergedIndexingContext to perform
> >> a search on multiple (all) indexed repositories (IndexingContexts).
> >> How does this merge the search results? I assume it takes each Lucene
> >> index and queries it individually, but this probably means that the
> >> Lucene scores of these merged results are completely messed up and
> >> unreliable, right?
> >> Any suggestions on how to properly perform search over multiple
> >> indexed repositories?
> >>
> >> 3) About the Plexus Container: Am I forced to initialize and use one,
> >> or can/should I manually instantiate the default implementations and
> >> use them instead?
> >>
> >> I'll probably come up with more questions along the way, hope someone
> >> will find the time to guide me on the right path.
> >>
> >> Thanks,
> >> Eduard
> >>
> >> ----------
> >> [1] https://github.com/apache/maven-indexer/
> >> [2] https://github.com/apache/maven-indexer/tree/master/indexer-examples/indexer-examples-basic
