About the API: I would not recommend exposing bucketId in the API, as it is internal, and there are other internal/external APIs that rely on bucket-id calculations which could be compromised here.
Instead of adding new APIs, looking at minimizing/reducing the time spent may be a good start.

BucketRegion.waitUntilLocked - a putAll thread could spend time here when there are multiple threads acting upon the same bucket; one way to reduce this is by tuning the putAll size. Can you try changing your putAll size (say, start with 100)?

I am wondering about the time spent in hashCode(); is it custom code?

If you want to create the buckets upfront, you can try calling the method PartitionRegionHelper.assignBucketsToPartitions().

-Anil

On Wed, Apr 15, 2020 at 8:37 AM steve mathew <steve.mathe...@gmail.com> wrote:

> Thanks Dan, Anil and Udo for your inputs. Extremely sorry for the late reply, as I took a bit of time to explore and understand Geode internals.
>
> It seems the BucketRegion/Bucket terminology is not exposed to the user, but I am still trying to achieve something that is uncommon and for which no client API is exposed.
>
> *Details about the use-case/client*
> - Multi-threaded client - each task performs data ingestion into a specific bucket. Each task knows the bucket number to ingest data into; in short, the client knows the task --> bucket mapping.
> - Each task iteratively ingests data in batches (configurable) of 1000 records into the bucket assigned to it.
> - Parallelism is achieved by running multiple tasks concurrently.
>
> *When I tried the existing Region.putAll() API, I observed slow performance; the related observations are*
> - A few tasks take quite a long time (a thread dump shows the thread WAITING in BucketRegion.waitUntilLocked), hence the overall client takes longer.
> - Code profiling shows a good amount of time spent in hash-code calculation. It seems key.hashCode() gets calculated on both client and server, which is not required for my use-case as the task --> bucket mapping is known beforehand.
> - The putAll() client implementation takes care of parallelism (using a PR-metadata-enabled thread pool and reshuffling the keys internally), but in my case that is handled by multiple tasks, one per bucket, within my client.
>
> *I have forked the Geode codebase and am trying to extend it by providing a client API like:*
>
> // Region.java
> /**
>  * putAll records in the specified bucket
>  */
> *public void putAll(int bucketId, map)*
>
> I have already added the client-side message and related code (similar to putAllOp and its impl), and I am adding the server-side code/BaseCommand, similar to the putAll code path (cmdExecute()/virtualPut() etc.). *Is there any (internal) API that provides a bucket-specific putAll and also takes care of redundancy - secondary bucket ingestion - that I can use/hook directly?*
>
> It seems that if I isolate bucket creation from the actual put flow (create the buckets prior to the putAll call) it may work better in my scenario, hence:
> *Are there any recommendations for creating buckets explicitly prior to the actual PUT, rather than lazily within the putAll flow on the actual PUT? Is there any internal API available for this that can be used, or other means like FE etc.?*
>
> *Data processing/retrieval:* I am not going to use the get/getAll APIs, but will process the data using FE and the querying mechanism once I achieve bucket-specific ingestion.
>
> *Overall thoughts on this API impl.?*
>
> Looking forward to your inputs.
> Thanks in advance.
>
> *Steve M*
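A rough, untested sketch of the two suggestions above (pre-creating the buckets with PartitionRegionHelper and shrinking the putAll batches); the region, its key/value types and the batch size are placeholders, not part of any existing API:

import java.util.HashMap;
import java.util.Map;

import org.apache.geode.cache.Region;
import org.apache.geode.cache.partition.PartitionRegionHelper;

public class IngestSketch {

  // Run once on a server member that hosts the partitioned region (e.g. from a
  // deployed function), not against the client proxy region; it forces all buckets
  // to be created up front instead of lazily on the first put into each bucket.
  static void precreateBuckets(Region<?, ?> partitionedRegion) {
    PartitionRegionHelper.assignBucketsToPartitions(partitionedRegion);
  }

  // Client side: push the records in smaller putAll chunks (e.g. 100 instead of
  // 1000) so each chunk holds the bucket lock for a shorter time.
  static void ingest(Region<String, String> region, Map<String, String> records, int batchSize) {
    Map<String, String> chunk = new HashMap<>();
    for (Map.Entry<String, String> entry : records.entrySet()) {
      chunk.put(entry.getKey(), entry.getValue());
      if (chunk.size() == batchSize) {
        region.putAll(chunk);
        chunk.clear();
      }
    }
    if (!chunk.isEmpty()) {
      region.putAll(chunk);
    }
  }
}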
> On Sat, Apr 11, 2020 at 7:12 PM Udo Kohlmeyer <u...@vmware.com.invalid> wrote:
>
> > Hi there Steve,
> >
> > Firstly, you are correct, the pattern you are describing is not recommended and possibly not even correctly supported. I've seen many implementations of Geode systems and none of them ever needed to do what you are intending to do. It seems like you are willing to go through A LOT of effort for a benefit I don't immediately see.
> >
> > Also, I'm confused about what part of the "hashing" you are trying to avoid. You will ALWAYS have the hashing overhead. At the very least the key will have to be hashed for put() and later on get().
> > As for the "file-per-bucket" request, there will always be some form of bucket resolution that needs to happen, be it a custom PartitionResolver or the default partition bucket resolver.
> >
> > In the code that Dan provided, you now have to manage the bucket number explicitly in the client. When you insert data, you have to provide the correct bucket number, and if you retrieve the data, you have to provide the correct bucket number, otherwise you will get "null" back. So this means your client has to manage the bucket numbers, because every subsequent put/get that does not provide the bucket number will possibly result in some failure. In short, EVERY key operation (put/get) will require a bucketNumber to function correctly, as the PartitionResolver is used.
> >
> > Maybe we can aid you better in finding a suitable solution by understanding WHAT you are trying to achieve and WHAT you are trying to avoid.
> >
> > So in short, you will NOT avoid hashing, as a Region will always hash the key, regardless of how you load your data. Think of a Region as a big distributed HashMap. Hashing is in its DNA and inner workings. The only step you'd avoid is the bucket allocation calculation, which, tbh, is lightweight:
> >
> > `bucketNumber = (hashcode % totalNumberBuckets) + 1`
> >
> > --Udo
> >
> > On 4/10/20, 3:52 PM, "steve mathew" <steve.mathe...@gmail.com> wrote:
> >
> > Thanks Dan for your quick response.
> >
> > Though this may not be a recommended pattern, here I am targeting a bucket-specific putAll and want to exclude hashing, as it turns out to be an overhead in my scenario.
> > Is this achievable...? How should I define a PartitionResolver that works generically and returns the respective bucket for a specific file?
> > What will be impacted if I opt for this route (fixed partitioning per file)? I can think of horizontal scalability, since the buckets are made fixed... thoughts?
> >
> > -Steve M.
> >
> > On Sat, Apr 11, 2020, 1:54 AM Dan Smith <dsm...@pivotal.io> wrote:
> >
> > > Hi Steve,
> > >
> > > The bucket that data goes into is generally determined by the key. So for example, if your data in File-0 is all for customer X, you can include Customer X in your region key and implement a PartitionResolver that extracts the customer from your region key and returns it. Geode will then group all of the data for Customer X into a single bucket.
> > >
> > > You generally shouldn't have to target a specific bucket number (e.g. bucket 0). But technically you can, just by returning an integer from your PartitionResolver. If you return the integer 0, your data will go into bucket 0. Usually it's just better to return your partition key (e.g. "Customer X") and let Geode hash that to some bucket number.
> > >
> > > -Dan
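A minimal sketch of the resolver Dan describes, assuming region keys shaped like "<customerId>|<orderId>" (the key format and class name are made up for illustration):

import org.apache.geode.cache.EntryOperation;
import org.apache.geode.cache.PartitionResolver;

// Every key that starts with the same customer id yields the same routing object,
// so Geode hashes all of that customer's entries into the same bucket.
public class CustomerPartitionResolver implements PartitionResolver<String, Object> {

  @Override
  public Object getRoutingObject(EntryOperation<String, Object> opDetails) {
    String key = opDetails.getKey();
    return key.substring(0, key.indexOf('|'));   // e.g. "CustomerX"
  }

  @Override
  public String getName() {
    return "CustomerPartitionResolver";
  }

  @Override
  public void close() {
    // no resources to release
  }
}

On the servers the resolver would be attached when the partitioned region is created, for example via PartitionAttributesFactory.setPartitionResolver(...) or the partition-resolver element in cache.xml.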
> > > On Fri, Apr 10, 2020 at 11:04 AM steve mathew <steve.mathe...@gmail.com> wrote:
> > >
> > > > Hello Geode devs and users,
> > > >
> > > > I have a set of files populated with data, fairly distributed. I want to put each file's data into a specific bucket,
> > > > like: PutAll File-0 data into Geode bucket B0
> > > >       PutAll File-1 data into Geode bucket B1
> > > > and so on...
> > > >
> > > > How can I achieve this using the Geode client...?
> > > >
> > > > Can I achieve this using a PartitionResolver or some other means...?
> > > >
> > > > Thanks in advance
> > > >
> > > > -Steve M.
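For the original file-per-bucket question, one illustrative variant (the class name and the "<fileNo>|<rowKey>" key shape are invented for this sketch) returns the file number itself as the routing object; per Dan's note above, returning the integer 0 routes data into bucket 0, and in general an integer maps to that bucket number modulo the total bucket count:

import org.apache.geode.cache.EntryOperation;
import org.apache.geode.cache.PartitionResolver;

// Routes every entry from File-N into one bucket by returning N as the routing object.
public class FileIndexResolver implements PartitionResolver<String, Object> {

  @Override
  public Object getRoutingObject(EntryOperation<String, Object> opDetails) {
    String key = opDetails.getKey();
    return Integer.valueOf(key.substring(0, key.indexOf('|')));  // file index -> bucket
  }

  @Override
  public String getName() {
    return "FileIndexResolver";
  }

  @Override
  public void close() {
  }
}

The trade-off, as Udo points out above, is that every later operation on such a key must carry the same file number, and horizontal scalability is pinned to the fixed file-to-bucket layout.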