Re: [Collections] Suppliers, Iterables, and Producers

2024-05-03 Thread Claude Warren
Gary and Alex,

Any thoughts on this?

Claude

On Wed, May 1, 2024 at 7:55 AM Claude Warren wrote:

> Good suggestions.
>
>> short-circuit. We could make this distinction by including it in the name:
>> forEachUntil(Predicate ...), forEachUnless, ...
>
>
> We need the unit name in the method name.  All Bloom filters implement
> IndexProducer and BitmapProducer and since they use Predicate method
> parameters they will conflict.
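
A minimal sketch of why the unit name matters, under stated assumptions: the method signatures below (forEachIndex taking an IntPredicate, forEachBitMap taking a LongPredicate) and the TinyFilter class are hypothetical illustrations, not the actual Commons Collections API. If both interfaces declared the same method name, a class implementing both would expose two overloads, and a call with an implicitly typed lambda would be ambiguous to the compiler. Unit-named methods avoid that:

```java
import java.util.function.IntPredicate;
import java.util.function.LongPredicate;

public class UnitNamedMethodsSketch {

    /** Assumed shape: visits each enabled bit index until the predicate returns false. */
    public interface IndexProducer {
        boolean forEachIndex(IntPredicate predicate);
    }

    /** Assumed shape: visits each 64-bit bitmap word until the predicate returns false. */
    public interface BitmapProducer {
        boolean forEachBitMap(LongPredicate predicate);
    }

    /** Hypothetical toy filter backed by a single 64-bit word. */
    public static final class TinyFilter implements IndexProducer, BitmapProducer {
        private final long bits;

        public TinyFilter(long bits) {
            this.bits = bits;
        }

        @Override
        public boolean forEachIndex(IntPredicate predicate) {
            for (int i = 0; i < Long.SIZE; i++) {
                if ((bits & (1L << i)) != 0 && !predicate.test(i)) {
                    return false; // predicate asked to stop
                }
            }
            return true;
        }

        @Override
        public boolean forEachBitMap(LongPredicate predicate) {
            return predicate.test(bits);
        }
    }

    public static void main(String[] args) {
        TinyFilter filter = new TinyFilter(0b1010L);
        // Had both methods been named forEachUntil, the call
        //   filter.forEachUntil(x -> true)
        // would not compile: the lambda fits IntPredicate and LongPredicate equally well.
        filter.forEachIndex(i -> {
            System.out.println("index " + i);
            return true;
        });
        filter.forEachBitMap(word -> {
            System.out.println("bitmap " + Long.toBinaryString(word));
            return true;
        });
    }
}
```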
>
>
> I have opened a ticket [1] with the list of tasks, which I think is now:
>
> - Be clear that producers are like interruptible iterators with
>   predicate tests acting as a switch to short-circuit the iteration.
> - Rename classes:
>    - CellConsumer to CellPredicate (?)
>    - BitMap to BitMaps.
> - Rename methods:
>    - Producer forEachX() to forEachUntil()
> - The semantic nomenclature:
>    - Bitmaps are arrays of bits not a BitMaps object.
>    - Indexes are ints and not an instance of a Collection object.
>    - Cells are pairs of ints representing an index and a value.  They
>      are not Pair<> objects.
>    - Producers iterate over collections of the object (Bitmap, Index,
>      Cell) applying a predicate to do work and stop the iteration early if
>      necessary.  They are carriers/transporters of Bloom filter enabled
>      bits.  They allow us to query the contents of the Bloom filter in an
>      implementation agnostic way.
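
The nomenclature above can be sketched as follows. This is only an illustration under assumptions: the two-int CellPredicate signature, the forEachCell method name, and the ArrayCellProducer class are hypothetical, chosen to show a cell passed as a pair of ints (no Pair<> object) and a predicate that acts as the switch to stop the iteration early.

```java
public class CellProducerSketch {

    /** Hypothetical two-int predicate: a cell arrives as (index, count), not as a Pair object. */
    @FunctionalInterface
    public interface CellPredicate {
        boolean test(int index, int count);
    }

    /** Hypothetical producer over an array of counts; zero counts are not reported. */
    public static final class ArrayCellProducer {
        private final int[] counts;

        public ArrayCellProducer(int... counts) {
            this.counts = counts.clone();
        }

        /**
         * Visits each non-zero cell in index order.
         * Returns true if every cell was visited, false if the predicate stopped the iteration.
         */
        public boolean forEachCell(CellPredicate predicate) {
            for (int i = 0; i < counts.length; i++) {
                if (counts[i] != 0 && !predicate.test(i, counts[i])) {
                    return false;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        ArrayCellProducer producer = new ArrayCellProducer(0, 2, 0, 5, 1);
        int[] visited = {0};
        boolean completed = producer.forEachCell((index, count) -> {
            visited[0]++;
            return count < 5; // returning false short-circuits the iteration
        });
        System.out.println("completed=" + completed + ", visited=" + visited[0]);
    }
}
```

The boolean return value is what distinguishes this from a plain forEach: the caller can tell whether the walk finished or was interrupted.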
>
>
> In thinking about the term Producer, other terms could be used:
> Interrogator (sounds like you can add a query), or Extractor, which might
> work.  But it has also come to mind that there is a "compute" series of
> methods in the ConcurrentMap class.  Perhaps the term we want is not
> "forEach", but "process".  The current form of usage is something like:
>
> IndexProducer ip = 
> ip.forEachIndex(idx -> someIntPredicate)
>
> We could change the name from XProducer to XProcessor, or XExtractor; and
> the method to processXs.  So the above code would look like:
>
> IndexExtractor ix = 
> ix.processIndexs(idx -> someIntPredicate)
>
> Another example:
>
> BitMapExtractor bx = .
> bx.processBitMaps(bitmap -> someBitMapPredicate)
>
> Claude
>
> [1] https://issues.apache.org/jira/browse/COLLECTIONS-854
>
>
> On Tue, Apr 30, 2024 at 4:51 PM Gary D. Gregory wrote:
>
>>
>>
>> On 2024/04/30 14:33:47 Alex Herbert wrote:
>> > On Tue, 30 Apr 2024 at 14:45, Gary D. Gregory wrote:
>> >
>> > > Hi Claude,
>> > >
>> > > Thank you for the detailed reply :-) A few comments below.
>> > >
>> > > On 2024/04/30 06:29:38 Claude Warren wrote:
>> > > > I will see if I can clarify the javadocs and make things clearer.
>> > > >
>> > > > What I think I specifically heard is:
>> > > >
>> > > > - Be clear that producers are fast fail iterators with predicate
>> > > >   tests.
>> > > > - Rename CellConsumer to CellPredicate (?)
>> > >
>> > > Agreed (as suggested by Albert)
>> > >
>> > > > - The semantic nomenclature:
>> > > >    - Bitmaps are arrays of bits not a BitMap object.
>> > > >    - Indexes are ints and not an instance of a Collection object.
>> > > >    - Cells are pairs of ints representing an index and a value.
>> > > >      They are not Pair<> objects.
>> > > >    - Producers iterate over collections of the object (Bitmap,
>> > > >      Index, Cell) applying a predicate to do work and stop the
>> > > >      iteration early if necessary.  They are carriers/transporters
>> > > >      of Bloom filter enabled bits.  They allow us to query the
>> > > >      contents of the Bloom filter in an implementation agnostic way.
>> > >
>> > > As you say naming is hard. The above is a great example and a good
>> > > exercise I've gone through at work and in other FOSS projects:
>> > > "Producers iterate over collections of the object...". In general
>> > > when I see or write a Javadoc of the form "Foo bars" or "Runners
>> > > walk" or "Walkers run", you get the idea ;-) I know that either the
>> > > class (or method) name is bad or the Javadoc/documentation is bad;
>> > > not _wrong_, just bad in the sense that it's confusing (to me).
>> > >
>> > > I am not advocating for a specific change ATM but I want to discuss
>> > > the option because it is possible the current name is not as good as
>> > > it could be. It could end up as an acceptable compromise if we cannot
>> > > use more Java friendly terms though.
>> > >
>> > > Whenever I see a class that implements a "forEach"-kind of method, I
>> > > think "Iterable".
>> > >
>> >
>> > Here we should think "Collection", or generally more than 1. In the Java
>> > sense an Iterable is something you can walk through to the end, possibly
>> > removing elements as you go using the Iterator interface. We would not
>> > require supporting removal, and we want to control a short-circuit. We
>> > could make this distinction by including it in the name:
>> > forEachUntil(Predicate ...), forEachUnle

Re: [Collections] Suppliers, Iterables, and Producers

2024-05-03 Thread Gary Gregory
LGTM. Maybe the current PR (also LGTM) should be merged first. Alex, how
does that PR look to you?

Gary


Re: Modularization of components

2024-05-03 Thread Elric V

>> Apache Commons VFS is already broken up into a multi-module project,
>> so I don't know what you're talking about; see
>> https://search.maven.org/search?q=g:org.apache.commons%20AND%20a:commons-vfs2*
>> The next release will be further modularized; see git master,


It's a multi-module project, sure, but the modules are split along 
technical boundaries rather than functional. I didn't explain this well 
enough in my original message, so let me try that again.


I thought VFS was an appropriate example because it contains *a lot* of 
functionality. This is by design, of course, and it's a useful thing. 
But most people who use VFS don't use all of the file system types 
(called Providers in VFS). There's FTP, SFTP, HTTP, and a bunch of others.


My hypothetical suggestion was that if each of those providers were 
their own module, the dependency footprint would go down for many 
projects which use some but not all VFS Providers.


IMO this would be a good thing for a variety of reasons.

I don't know whether VFS is an appropriate example from a 
technical/feasibility perspective, and sure, backwards compatibility is 
a concern. But this was intended as an example to start a discussion 
about modularization within commons.



>> (1) It's painful to build Apache Commons releases with Maven
>> multi-module projects. It's NOT just building a jar file or set of
>> jars. In comparison, building a mono-module is "simple".


Is this a fundamental Maven issue which is hard to solve? I haven't had
too many issues with multi-module Maven projects in the past, but I
admit that my builds are a lot less complex than Commons projects.



>> (2) Always, always, always keep compatibility in mind
>
> How is this related?
> Any set of functionalities should be amenable to a modular design,
> unless there are cyclic dependencies (that signal bad design).


I imagine that some (many?) projects aren't designed with modularization 
or pluginification in mind, and they end up doing something like 
Providers.register(FTP.class, HTTP.class, SFTP.class) to register all 
known implementations. Inverting that relationship isn't always easy to 
do after the fact. So I understand that this isn't necessarily a quick 
and easy project.
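
To illustrate what inverting that relationship could look like, here is a minimal sketch. All names in it (FileProvider, Registry, FtpProvider) are hypothetical and are not the VFS API. The assumption is that the core module owns only an interface and a registry, and each provider module announces itself, for example through the JDK's java.util.ServiceLoader and a META-INF/services entry, so the core never references FTP.class or HTTP.class directly.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.ServiceLoader;
import java.util.Set;

public class ProviderRegistrySketch {

    /** Hypothetical provider contract that would live in the core module. */
    public interface FileProvider {
        String scheme();
    }

    /** Core registry: depends on the interface only, never on concrete provider classes. */
    public static final class Registry {
        private final Map<String, FileProvider> providers = new LinkedHashMap<>();

        public void register(FileProvider provider) {
            providers.put(provider.scheme(), provider);
        }

        /** Registers whatever provider modules declare themselves on the classpath. */
        public void discover() {
            for (FileProvider provider : ServiceLoader.load(FileProvider.class)) {
                register(provider);
            }
        }

        public Set<String> schemes() {
            return providers.keySet();
        }
    }

    /** Would live in its own optional module, pulled in only by projects that need FTP. */
    public static final class FtpProvider implements FileProvider {
        @Override
        public String scheme() {
            return "ftp";
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        // No service files ship with this sketch, so discovery finds nothing here.
        registry.discover();
        // Registered by hand only to keep the demo self-contained.
        registry.register(new FtpProvider());
        System.out.println(registry.schemes());
    }
}
```

With this shape, a project that only needs SFTP would depend on the core artifact plus one provider artifact, and the other providers' transitive dependencies would never reach its classpath.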



> Supporting JPMS is orthogonal to a modular (Maven) project (see
> [RNG], for example).


True. I think in the long term both are desirable: one to reduce size,
dependencies, and build times; the other to better isolate components and
implementation details. But if I had to choose one or the other, Maven
modularization would certainly be first on the list.



> In summary, IMO modularization should be a feature (and a default
> goal) of any new major release.
> I know that it is a lot of work (of course, cf. [Math] history), but we
> should encourage contributions towards that goal.


Thanks for the +1 on that, Gilles. I'm certainly not expecting any 
overnight changes on this. My goal was merely to start a discussion and 
see whether there's any interest for this in the community.


Commons components are used incredibly widely, which is obviously a
great thing. But I see WARs getting fatter and fatter with transitive
dependencies, and lots of classes remaining unused at runtime. In the
age of continuous deployments and fast container startup times, making
it easier to keep things slim seems like a useful goal.


Best,

Elric

