Re: Using the Maven Indexer

Eduard Moraru Mon, 12 Jan 2015 09:57:04 -0800

Hi,

I've just created 2 pull requests:
https://github.com/apache/maven-indexer/pull/10 (indexer-cli not working as
expected)
https://github.com/apache/maven-indexer/pull/11 (what I mentioned about the
ArtifactInfo's repositoryID being null)


Hope you have the time to have a look.

I'm particularly interested in #11. If it is accepted, I might continue
exploring option 1) from my previous post, since I am currently blocked
by the null repository problem described in 1.1).

Thanks,
Eduard

On Mon, Dec 8, 2014 at 10:26 PM, Tamas Cservenak <[email protected]>
wrote:

> Hi Eduard,
>
> for additional information see:
> http://jira.codehaus.org/browse/MINDEXER-81
>
> Currently, the ArtifactInfo is hardwired and not extensible.
>
> Re the available index for Central: the decision driver is not “minimal
> usable” information but the SIZE of the index download. We experimented
> with different creators, but the bandwidth they consumed (compared to
> artifact downloads) was really huge.
>
> Almost everyone uses MRMs, and they tend to “improve” the basic GAV
> index Central publishes (i.e. once Nexus caches a JAR file, it will
> “improve” the index with the class names in the JAR too, something
> Central does not publish).
>
> artifactInfo#repoId should not return null, if asked via context.
> If it does, there is a bug lurking somewhere.
>
> Currently the “extra info” path is viable, but that would create a lot
> of cruft around the indexer classes.
>
>
>
>
>
> --
> Thanks,
> ~t~
>
> On 8 Dec 2014 at 17:15:08, Eduard Moraru ([email protected]) wrote:
>
> Hi,
>
> I have a new challenge for your maven-indexer expertise :)
>
> What about adding additional information to the local index? I see the
> default indexers (min, etc.) produce really minimal information. The
> problem is that everybody uses these default indexers, so all the
> available indexes (Maven Central, etc.) offer very little information that
> you can actually use to make the index useful in an application, beyond
> really basic name, description, group, artifact, etc. queries.
>
> For instance, if I wanted to add author information (to query by
> author) or dependency information (to perform compatibility checks against
> an installation/group of installed artifacts) or anything else for that
> matter, what would be the recommended approach?
>
> From what I have currently researched, I see 2 options:
>
> 1) Have a custom IndexCreator that uses the updateDocument(ArtifactInfo
> artifactInfo, Document document) method to fetch (HTTP GET) the pom.xml,
> using information from the artifactInfo object (repository, groupId,
> artifactId, classifier, version, etc.), so that the resulting document
> contains the extra information. It seems that IndexCreators are used a lot
> more than their descriptions advertise: not only for indexing new items,
> but also when converting between ArtifactInfo objects and Lucene
> Documents.
>
> 1.1) I initially started down this path, but then I realized that the
> artifactInfo I receive in this method does not provide basic information
> (i.e. artifactInfo.getRepository() always returns null ;-( ). It would be
> awesome if information like context and/or repository were added to the
> artifactInfo object (maybe in
> IndexUtils.constructArtifactInfo( Document doc, IndexingContext context )?),
> the same way the ArtifactInfo.UINFO and ArtifactInfo.LAST_MODIFIED fields
> are handled specially and explicitly added to a new Document that is
> passed to the IndexCreators.
>
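
To make the "fetch (HTTP GET) the pom.xml" step of option 1) more concrete: the fetch first needs a URL derived from the artifact's coordinates. Below is a minimal sketch, assuming the standard Maven 2 repository layout and a repository base URL that is already known (which is exactly the piece missing in 1.1). PomLocator, pomUri and the example coordinates are names made up for this sketch, not maven-indexer API. Classifiers and timestamped SNAPSHOT file names are ignored.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not part of maven-indexer.
class PomLocator {

    /**
     * Builds the POM location for a released artifact, assuming the standard
     * Maven 2 layout: group/path/artifactId/version/artifactId-version.pom
     */
    static URI pomUri(String repositoryBaseUrl, String groupId, String artifactId, String version) {
        // Normalize the base URL so that exactly one slash separates it from the path.
        String base = repositoryBaseUrl.endsWith("/") ? repositoryBaseUrl : repositoryBaseUrl + "/";
        // Dots in the groupId become directory separators in the repository layout.
        String path = groupId.replace('.', '/') + "/" + artifactId + "/" + version + "/"
                + artifactId + "-" + version + ".pom";
        return URI.create(base + path);
    }

    public static void main(String[] args) {
        // Example values only; the repository URL and coordinates are placeholders.
        System.out.println(pomUri("https://repo.example.org/maven2", "com.example", "demo", "1.0"));
        // prints https://repo.example.org/maven2/com/example/demo/1.0/demo-1.0.pom
    }
}
```

Inside a custom IndexCreator, the coordinates would come from the artifactInfo object and the base URL from the repository/context, once that information is actually available there.
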
> 2) Handle this separately from maven-indexer's work, right after
> index/update operations: let maven-indexer update the local index with
> information from the remote index, then manipulate the underlying Lucene
> index by adding information retrieved over the network (HTTP GET) from the
> remote repository's POM files. In rough pseudocode, something like:
>
> indexer.update(repoX);
> indexer.getAllIndexedArtifacts().forEach(artifact -> {
>     // getExtraData stands for the HTTP GET of the artifact's POM
>     var extraData = getExtraData(repoX, artifact);
>     indexer.getLuceneIndex().add(artifact, extraData);
> });
>
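
As a rough idea of what the getExtraData step in the pseudocode could extract once the POM text has been downloaded, here is a small sketch that pulls the direct dependencies out of a pom.xml using only the JDK's XML parser. PomDependencies and its methods are hypothetical names, not maven-indexer API. The sketch makes simplifying assumptions: it reads only project/dependencies (not dependencyManagement or plugin dependencies), does not resolve parent POMs or ${property} placeholders, and returns an empty version when the POM omits it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of maven-indexer.
class PomDependencies {

    /** Returns "groupId:artifactId:version" for each direct dependency declared in the POM text. */
    static List<String> directDependencies(String pomXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // POMs fetched from the network are untrusted input, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Element project = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(pomXml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            List<String> result = new ArrayList<>();
            // Look only at <dependencies> directly under <project>.
            for (Element deps : children(project, "dependencies")) {
                for (Element dep : children(deps, "dependency")) {
                    result.add(text(dep, "groupId") + ":" + text(dep, "artifactId") + ":" + text(dep, "version"));
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse POM", e);
        }
    }

    /** Direct child elements of parent with the given tag name. */
    static List<Element> children(Element parent, String name) {
        List<Element> found = new ArrayList<>();
        NodeList nodes = parent.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            if (node instanceof Element && name.equals(node.getNodeName())) {
                found.add((Element) node);
            }
        }
        return found;
    }

    /** Trimmed text of the first direct child with the given tag name, or "" if absent. */
    static String text(Element parent, String name) {
        List<Element> matches = children(parent, name);
        return matches.isEmpty() ? "" : matches.get(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Made-up POM content for demonstration.
        String pom = "<project><dependencies><dependency><groupId>com.example</groupId>"
                + "<artifactId>lib</artifactId><version>1.2</version></dependency></dependencies></project>";
        System.out.println(directDependencies(pom));
    }
}
```

The resulting strings could then be stored as an extra field next to the artifact, whichever of the two options ends up writing to the Lucene index.
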
> 3) Any other suggestions?
>
> My ultimate goal is (besides basic name/description queries) to be able to
> perform compatibility queries on artifacts coming from multiple
> repositories, so I need to find a solution to add this missing information
> (artifact dependencies, and maybe more).
>
> As before, your help and suggestions are most welcome.
>
> Thanks,
> Eduard
>
> On Wed, Nov 26, 2014 at 1:22 PM, Eduard Moraru <[email protected]>
> wrote:
>
> >
> >
> > On Tue, Nov 25, 2014 at 12:22 PM, Tamas Cservenak <[email protected]>
> > wrote:
> >
> >> Hi there,
> >>
> >> 1) Yes, the indexing context retains the artifact “origin” (i.e. repo),
> >> so you need a context per origin. Sadly, 1 index per context is a
> >> current limitation of maven indexer, but this problem is known. Created
> >> http://jira.codehaus.org/browse/MINDEXER-93
> >>
> >> 2) Yes, the merged context basically delegates to its member contexts.
> >> Under the hood, it uses Lucene’s MultiReader to actually perform the
> >> search.
> >>
> >
> > I have solved the search problem for now by using the SearchEngine
> > component and issuing an IteratorSearchRequest on a list of
> > IndexingContexts to get paginated results. I will have to see how that
> > works in the long run.
> >
> > Thanks,
> > Eduard
> >
> >
> >> Re ranging, there are already issues (or rather, the problem is spread
> >> across multiple issues), most notably this one:
> >> http://jira.codehaus.org/browse/MINDEXER-8
> >>
> >> 3) I think yes. Currently, the indexer is being transitioned from Plexus
> >> to JSR330, and as you see in the examples, it should work with any
> >> container supporting it. Re “manually wiring”: in the latest releases you
> >> might be able to do it, but in older ones probably not, as Plexus
> >> supported field injection only, and some of those members were not
> >> exposed via getter/setter. See
> >> http://jira.codehaus.org/browse/MINDEXER-80
> >>
> >>
> >> --
> >> Thanks,
> >> ~t~
> >>
> >> On 21 Nov 2014 at 18:08:26, Eduard Moraru ([email protected]) wrote:
> >>
> >> Hi,
> >>
> >> I have recently started playing with the maven indexer [1], following
> >> the examples [2], and I have some questions (since AFAIS, documentation
> >> on the matter is practically nonexistent):
> >>
> >> 1) From what I can understand, you need an IndexingContext for each
> >> repository you plan to index. This leaves you with n Lucene indexes,
> >> one for each repository. Is there any way that I could have just 1
> >> Lucene index, with all my repositories indexed in the same place? If the
> >> main purpose is searching, why scatter the indexed information across n
> >> indexes and make the whole process difficult? Maybe I'm missing
> >> something.
> >>
> >> 2) Along the same lines as the first question, when it comes to
> >> searching, it seems that I can use a MergedIndexingContext to perform a
> >> search on multiple (all) indexed repositories (IndexingContexts). How
> >> does this merge the search results? I assume it takes each Lucene index
> >> and queries it individually, but this probably means that the Lucene
> >> scores of these merged results are completely messed up and unreliable,
> >> right? Any suggestions on how to properly perform a search over multiple
> >> indexed repositories?
> >>
> >> 3) About the Plexus Container: am I forced to initialize and use one, or
> >> can/should I manually instantiate the default implementations and use
> >> them instead?
> >>
> >> I'll probably come up with more questions along the way; I hope someone
> >> will find the time to guide me on the right path.
> >>
> >> Thanks,
> >> Eduard
> >>
> >> ----------
> >> [1] https://github.com/apache/maven-indexer/
> >> [2] https://github.com/apache/maven-indexer/tree/master/indexer-examples/indexer-examples-basic
> >>
> >
> >
>
>
