Re: Data ingestion with predefined buckets

steve mathew Wed, 15 Apr 2020 08:37:28 -0700

Thanks Dan, Anil and Udo for your inputs. Extremely sorry for the late
reply; I took a bit of time to explore and understand Geode internals.


It seems the BucketRegion/Bucket terminology is not exposed to the user, but
I am still trying to achieve something that is uncommon and for which no
client API is exposed.

*Details about Use-case/Client*
- Multi-threaded client: each task performs data ingestion on a specific
bucket. Each task knows the bucket number to ingest into; in short, the
client knows the task-->bucket mapping.
- Each task iteratively ingests data into the bucket assigned to it, in
batches (configurable) of 1000 records.
- Parallelism is achieved by running multiple tasks concurrently.
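
A minimal sketch of the client structure described above, assuming one task per bucket and a fixed batch size. The `BucketSink` interface is a hypothetical stand-in for the proposed bucket-specific putAll; it is not an existing Geode API, and the key/value strings are placeholders.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BucketIngestSketch {
    /** Hypothetical stand-in for a bucket-specific putAll; not a Geode API. */
    interface BucketSink {
        void putAll(int bucketId, Map<String, String> batch);
    }

    /** Runs one task per bucket; each task flushes batches of at most batchSize records. */
    static void ingest(int bucketCount, int recordsPerBucket, int batchSize, BucketSink sink) {
        ExecutorService pool = Executors.newFixedThreadPool(bucketCount);
        try {
            List<Future<?>> tasks = new ArrayList<>();
            for (int b = 0; b < bucketCount; b++) {
                final int bucketId = b; // the task-->bucket mapping is known up front
                tasks.add(pool.submit(() -> {
                    Map<String, String> batch = new HashMap<>();
                    for (int i = 0; i < recordsPerBucket; i++) {
                        batch.put("file-" + bucketId + "-rec-" + i, "value-" + i);
                        if (batch.size() == batchSize) {
                            sink.putAll(bucketId, batch);
                            batch = new HashMap<>();
                        }
                    }
                    if (!batch.isEmpty()) {
                        sink.putAll(bucketId, batch); // flush the final partial batch
                    }
                }));
            }
            for (Future<?> task : tasks) {
                task.get(); // wait for completion and surface any task failure
            }
        } catch (Exception e) {
            throw new RuntimeException("ingestion failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With 2500 records per bucket and a batch size of 1000, each task would call the sink three times (1000, 1000 and 500 records).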


*When I tried the existing Region.putAll() API, I observed slow performance.
Related observations are:*
 - A few tasks take much longer than the rest (a thread dump shows the
thread WAITING in BucketRegion.waitUntilLocked), hence the overall client
run takes longer.
 - Code profiling shows a good amount of time spent in hash-code
calculation. It seems key.hashCode() gets calculated on both client and
server, which is not required for my use-case as the task-->bucket mapping
is known beforehand.
 - The putAll() client implementation takes care of parallelism (using a
PRMetadata-enabled thread pool and reshuffling the keys internally), but in
my case that is already handled by multiple tasks, one per bucket, within
my client.
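
To illustrate the hashing point, here is a rough sketch of a key that carries its bucket number and caches its hash code, so hashCode() is not recomputed on every call within one JVM. The class name `BucketKey` and the helper `assumedBucketFor` are hypothetical. The modulo mapping is only an assumption about how a routing object is turned into a bucket number and should be checked against the Geode source. Whether the cached hash helps on the server depends on how the key is serialized.

```java
import java.util.Objects;

final class BucketKey {
    private final int bucketId;    // known up front from the task-->bucket mapping
    private final String recordId;
    private final int hash;        // computed once, returned by every hashCode() call

    BucketKey(int bucketId, String recordId) {
        this.bucketId = bucketId;
        this.recordId = Objects.requireNonNull(recordId);
        this.hash = 31 * bucketId + recordId.hashCode();
    }

    /** Value a PartitionResolver could return as the routing object for this key. */
    Integer routingObject() {
        return bucketId;
    }

    /** Assumed modulo-style mapping to a bucket; an assumption, not copied from Geode. */
    static int assumedBucketFor(Object routingObject, int totalBuckets) {
        return Math.abs(routingObject.hashCode() % totalBuckets);
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof BucketKey)) return false;
        BucketKey other = (BucketKey) o;
        return bucketId == other.bucketId && recordId.equals(other.recordId);
    }
}
```

Because Integer.hashCode() is the integer value itself, a routing object equal to the bucket number maps straight back to that bucket under the assumed mapping, as long as it is smaller than the total bucket count.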

*I have forked the Geode codebase and am trying to extend it by providing a
client API like:*
//Region.java
/**
 * putAll records into the specified bucket
 */
*public void putAll(int bucketId, Map<? extends K, ? extends V> map)*

I have already added the client-side message and related code (similar to
putAllOp and its impl), and I am now adding the server-side code/BaseCommand,
similar to the putAll code path (cmdExecute()/virtualPut() etc.). *Is there
any internal API that provides bucket-specific putAll and also takes care of
redundancy (secondary bucket ingestion), which I can use/hook into
directly?*

It seems that if I separate bucket creation from the actual put flow (create
the buckets prior to the putAll call), it may work better in my scenario.
Hence:
*Is there any recommended way to create buckets explicitly prior to the
actual put, rather than lazily within the putAll flow? Is there any internal
API available for this, or other means such as FE (function execution)
etc.?*

*Data processing/retrieval:* I am not going to use the get/getAll API.
Once bucket-specific ingestion is achieved, I will process the data using
FE and the querying mechanism.


*Overall thoughts on this API impl. ..?*

Looking forward to the inputs..
Thanks in advance.

*Steve M*





On Sat, Apr 11, 2020 at 7:12 PM Udo Kohlmeyer <[email protected]>
wrote:

> Hi there Steve,
>
> Firstly, you are correct, the pattern you are describing is not
> recommended and possibly not even correctly supported. I've seen many
> implementations of Geode systems and none of them ever needed to do what
> you are intending to do. Seems like you are willing to go through A LOT of
> effort for a benefit I don't immediately realize.
>
> Also, I'm confused on what part of the "hashing" you are trying to avoid.
> You will ALWAYS  have the hashing overhead. At the very least the key will
> have to be hashed for put() and later on get().
> As for the "file-per-bucket" request, there will always be some form of
> bucket resolution that needs to happen. Be it a custom PartitionResolver
> or default partition bucket resolver.
>
> In the code that Dan provided, you now have to manage the bucket number
> explicitly in the client. When you insert data, you have to provide the
> correct bucket number and if you retrieve the data, you have to provide the
> correct bucket number, otherwise you will get "null" back. So this means
> your client has to manage the bucket numbers. Because every subsequent
> put/get that does not provide the bucket number, will possibly result in
> some failure. In short, EVERY key operation (put/get) will require a
> bucketNumber to function correctly, as the PartitionResolver is used.
>
> Maybe we can aid you better in suitable solution by understanding WHAT you
> are trying to achieve and WHAT you are trying to avoid.
>
> So in short, you will NOT avoid hashing, as a Region will always hash the
> key, regardless of how you load your data. Think of a Region as a big
> distributed HashMap. Hashing is in its DNA and inner workings. The only
> step you'd avoid is the bucket allocation calculation, which tbh, is
> lightweight
>
> `bucketNumber = (hashcode % totalNumberBuckets) + 1`
>
> --Udo
>
> On 4/10/20, 3:52 PM, "steve mathew" <[email protected]> wrote:
>
>     Thanks Dan for your quick response.
>
>     Though this may not be a recommended pattern, I am targeting a
>     bucket-specific putAll and want to exclude hashing, as it turns out to
>     be an overhead in my scenario.
>     Is this achievable...? How should I define a PartitionResolver that
>     works generically and returns the respective bucket for a specific file?
>     What will be impacted if I take this route (fixed partitioning per
>     file)? I can think of horizontal scalability, as the buckets are made
>     fixed .. thoughts?
>
>
>     -Steve M.
>
>
>     On Sat, Apr 11, 2020, 1:54 AM Dan Smith <[email protected]> wrote:
>
>     > Hi Steve,
>     >
>     > The bucket that data goes into is generally determined by the key. So
>     > for example if your data in File-0 is all for customer X, you can
>     > include Customer X in your region key and implement a
>     > PartitionResolver that extracts the customer from your region key and
>     > returns it. Geode will then group all of the data for Customer X into
>     > a single bucket.
>     >
>     > You generally shouldn't have to target a specific bucket number (eg
>     > bucket 0). But technically you can just by returning an integer from
>     > your PartitionResolver. If you return the integer 0, your data will
>     > go into bucket 0. Usually it's just better to return your partition
>     > key (eg "Customer X") and let geode hash that to some bucket number.
>     >
>     > -Dan
>     >
>     > On Fri, Apr 10, 2020 at 11:04 AM steve mathew <[email protected]>
>     > wrote:
>     >
>     > > Hello Geode devs and users,
>     > >
>     > > I have a set of files populated with data, fairly distributed, I
>     > > want to put each file's data in a specific bucket,
>     > > like PutAll File-0 data into Geode bucket B0
>     > >       PutAll File-1 data into Geode bucket B1
>     > >
>     > >       and so on...
>     > >
>     > > How can I achieve this using the Geode client...?
>     > >
>     > > Can I achieve this using a PartitionResolver or some other means...?
>     > >
>     > > Thanks in advance
>     > >
>     > > -Steve M.
>     > >
>     >
>
>
>
