Re: LRU cache

2023-02-18 Thread Albert-Jan Roskam
   I sometimes use this trick, which I learnt from a book by Martelli.
   Instead of try/except, membership testing with "in" (__contains__) might
   be faster. Probably "depends". Matter of measuring.
   def somefunc(arg, _cache={}):
       if len(_cache) > 10 ** 5:
            _cache.popitem()  # dict.pop() needs a key; this evicts the last-inserted entry
       try:
           return _cache[arg]
       except KeyError:
           result = expensivefunc(arg)
           _cache[arg] = result
           return result
   Albert-Jan
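   For comparison, a minimal sketch using the stdlib's functools.lru_cache
   (available since Python 3.2), which gives true least-recently-used
   eviction instead of popping an arbitrary entry:

       from functools import lru_cache

       @lru_cache(maxsize=10**5)
       def somefunc(arg):
           return expensivefunc(arg)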
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing caching strategies

2023-02-18 Thread Peter J. Holzer
On 2023-02-17 18:08:09 -0500, [email protected] wrote:
> Analogies I am sharing are mainly for me to wrap my head around an idea by
> seeing if it matches any existing ideas or templates and is not meant to be
> exact. Fair enough?

Yeah. But if you are venting your musings into a public space you
shouldn't be surprised if people react to them. And we can only react to
what you write, not what you think.

> But in this case, from my reading, the analogy is rather reasonable.

Although that confused me a bit. You are clearly responding to something
I thought about but which you didn't quote below. Did I just think about
it and not write it, but you responded anyway because you're a mind
reader? Nope, it just turns out you (accidentally) deleted that sentence.


> The implementation of Roaring Bitmaps seems to logically break up the
> space of all possible values it looks up into multiple "zones" that
> are somewhat analogous to individual files,

That part is correct. But you presented that in the form of a
performance/space tradeoff, writing about "trying multiple methods" to
find the smallest, and that that makes compression slower. That may be
the case for pkzip, but it's not what RB does: Instead it uses a very
simple heuristic: If fewer than 4096 of the 2**16 possible values in a
zone are present (about 6%), it uses a sorted array of offsets,
otherwise a plain bitmap. (According to their 2016 paper, which I just
skimmed through again - maybe the implementation is a bit more complex
now.) So I think your description
would lead the reader to anticipate problems which aren't there and
probably miss ones which are there. So I'll stay with my "not completely
wrong but not very helpful" assessment.
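A toy sketch of that heuristic, purely illustrative and not the real
implementation (the 4096-entry threshold is the one given later in this
thread):

    def store_zone(values):
        # values: set of ints in range(2**16), i.e. one 16-bit zone
        if len(values) <= 4096:
            return ("array", sorted(values))   # sparse: sorted offsets
        bitmap = bytearray(8192)               # dense: 65536 bits = 8 KiB
        for v in values:
            bitmap[v >> 3] |= 1 << (v & 7)
        return ("bitmap", bitmap)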


> I did not raise the issue and thus have no interest in promoting this
> technology nor knocking it down. Just wondering what it was under the hood
> and whether I might even have a need for it. I am not saying Social Security
> numbers are a fit, simply that some form of ID number might fit.

Yeah, that's the point: Any form of ID which is a small-ish integer
number fits.

And maybe it's just because I work with databases a lot, but
representing things with numeric ids (in relational databases we call
them "surrogate keys") is such a basic operation that it doesn't warrant
more than a sentence or two.
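A minimal sketch of surrogate-key assignment (names are illustrative):

    ids = {}  # arbitrary hashable key -> small sequential integer

    def surrogate(key):
        # len(ids) is evaluated before any insertion, so new keys get
        # 0, 1, 2, ... in order of first appearance
        return ids.setdefault(key, len(ids))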


> If a company has a unique ID number for each client and uses it
> consistently, then an implementation that holds a set stored this way
> of people using product A, such as house insurance, and those using
> product B, such as car insurance, and perhaps product C is an umbrella
> policy, might easily handle some queries such as who uses two or all
> three (intersections of sets) or who might merit a letter telling them
> how much they can save if they subscribed to two or all three as a way
> to get more business. Again, just  a made-up example I can think
> about. A company which has a million customers over the years will
> have fairly large sets as described. 

A much better example. This is indeed how you would use roaring bitmaps.
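For example, with plain Python sets and made-up IDs (roaring bitmaps
offer the same set operations, just over a compressed representation):

    house = {101, 102, 205, 731}        # product A customers
    car = {102, 440, 731}               # product B customers
    umbrella = {731, 999}               # product C customers

    all_three = house & car & umbrella              # {731}
    upsell = (house | car | umbrella) - all_three   # not yet on all three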


> What is helpful to me in thinking about something will naturally often not
> be helpful to you or others but nothing you wrote makes me feel my first
> take was in any serious way wrong. It still makes sense to me.
> 
> And FYI, the largest integer in signed 32 bits is 2_147_483_647

I know. It's been some time since I could do hexadecimal arithmetic
in my head but the limits of common data types are still burned into
my brain ;-).

> which is 10 digits. A Social Security number looks like xxx-xx-xxxx at
> this time, which is only 9 digits.

These are US social security numbers. Other countries have other
schemes. For example, Austrian SSNs have 10 digits, so you would need 34
bits to represent them exactly. However, they (obviously) contain some
redundancy (one of the digits is a checksum, and there aren't 99
days in a month) so it could algorithmically be compressed down to 26
bits. But you probably wouldn't do that because in almost any real
application you wouldn't use the SSN as a primary key (some people don't
have one, and there have been mixups resulting in two people getting the
same SSN).
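A back-of-the-envelope check of those numbers, assuming the structure is
a 3-digit serial, 1 check digit, and a 6-digit birth date:

    >>> import math
    >>> math.ceil(math.log2(10**10))        # 10 raw decimal digits
    34
    >>> math.ceil(math.log2(999 * 36525))   # ~999 serials x ~36525 dates/century
    26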


> As for my other EXAMPLE, I fail to see why I need to provide a specific need
> for an application. I don't care what they need it for. The thought was
> about whether something that does not start as an integer can be uniquely
> mapped into and out of integers of size 32 bits.

That's what confused me. You seemed to concentrate on the "map things to
integers" part which has been solved for decades and is absolutely
incidental to roaring bitmaps and completely ignored what you would be
using them for.

So I thought I was missing something, but it seems I wasn't.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | [email protected] |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"



Re: Precision Tail-off?

2023-02-18 Thread Peter J. Holzer
On 2023-02-18 03:52:51 +, Oscar Benjamin wrote:
> On Sat, 18 Feb 2023 at 01:47, Chris Angelico  wrote:
> > On Sat, 18 Feb 2023 at 12:41, Greg Ewing via Python-list
> > > To avoid it you would need to use an algorithm that computes nth
> > > roots directly rather than raising to the power 1/n.
> > >
> >
> > It's somewhat curious that we don't really have that. We have many
> > other inverse operations - addition and subtraction (not just "negate
> > and add"), multiplication and division, log and exp - but we have
> > exponentiation without an arbitrary-root operation. For square roots,
> > that's not a problem, since we can precisely express the concept
> > "raise to the 0.5th power", but for anything else, we have to raise to
> > a fractional power that might be imprecise.
> 
> Various libraries can do this. Both SymPy and NumPy have cbrt for cube roots:

Yes, but that's a special case. Chris was talking about arbitrary
(integer) roots. My calculator has a button labelled [x√y], but my
processor doesn't have an equivalent operation. Come to think of it, it
doesn't even have a y**x operation - just some simpler operations
which can be used to implement it. GCC doesn't inline pow(y, x) on
x86/64 - it just calls the library function.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | [email protected] |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Precision Tail-off?

2023-02-18 Thread Oscar Benjamin
On Sat, 18 Feb 2023 at 11:19, Peter J. Holzer  wrote:
>
> On 2023-02-18 03:52:51 +, Oscar Benjamin wrote:
> > On Sat, 18 Feb 2023 at 01:47, Chris Angelico  wrote:
> > > On Sat, 18 Feb 2023 at 12:41, Greg Ewing via Python-list
> > > > To avoid it you would need to use an algorithm that computes nth
> > > > roots directly rather than raising to the power 1/n.
> > > >
> > >
> > > It's somewhat curious that we don't really have that. We have many
> > > other inverse operations - addition and subtraction (not just "negate
> > > and add"), multiplication and division, log and exp - but we have
> > > exponentiation without an arbitrary-root operation. For square roots,
> > > that's not a problem, since we can precisely express the concept
> > > "raise to the 0.5th power", but for anything else, we have to raise to
> > > a fractional power that might be imprecise.
> >
> > Various libraries can do this. Both SymPy and NumPy have cbrt for cube 
> > roots:
>
> Yes, but that's a special case. Chris was talking about arbitrary
> (integer) roots. My calculator has a button labelled [x√y], but my
> processor doesn't have an equivalent operation.

All three of SymPy, mpmath and gmpy2 can do this as accurately as
desired for any integer root:

  >>> n = 123456789 * 10**54

  >>> sympy.root(n, 6)
  10**9*13717421**(1/6)*3**(1/3)
  >>> sympy.root(n, 6).evalf(50)
  22314431635.562095902499928269233656421704825692573

  >>> mpmath.root(n, 6)
  mpf('22314431635.562096')
  >>> mpmath.mp.dps = 50
  >>> mpmath.root(n, 6)
  mpf('22314431635.562095902499928269233656421704825692572746')

  >>> gmpy2.root(n, 6)
  mpfr('22314431635.562096')
  >>> gmpy2.get_context().precision = 100
  >>> gmpy2.root(n, 6)
  mpfr('22314431635.56209590249992826924',100)

There are also specific integer only root routines like
sympy.integer_nthroot or gmpy2.iroot.

  >>> gmpy2.iroot(n, 6)
  (mpz(22314431635), False)
  >>> sympy.integer_nthroot(n, 6)
  (22314431635, False)

Other libraries like the stdlib math module and numpy define some
specific examples like cbrt or isqrt but not a full root or iroot.
What is lacking is a plain 64-bit floating point routine like:

  def root(x: float, n: int) -> float:
      return x ** (1/n)  # except more accurate than this

It could be a good candidate for numpy and/or the math module. I just
noticed from the docs that the math module has a cbrt function (new in
3.11) that I didn't know about, which suggests that a root function
might also be considered a reasonable addition in future. Similarly,
isqrt was new in 3.8 and it is not a big leap from there to see
someone adding iroot.
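A minimal sketch of one way such a root() could beat the bare power for
positive x: start from x ** (1/n) and polish it with a single Newton
step on r**n = x (illustrative, not a library function):

  def root(x: float, n: int) -> float:
      r = x ** (1.0 / n)   # initial estimate, possibly off by an ulp or two
      if r:                # one Newton iteration on f(r) = r**n - x:
          r -= (r - x / r ** (n - 1)) / n
      return r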

--
Oscar
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: LRU cache

2023-02-18 Thread Thomas Passin

On 2/18/2023 5:38 AM, Albert-Jan Roskam wrote:

I sometimes use this trick, which I learnt from a book by Martelli.
Instead of try/except, membership testing with "in" (__contains__) might
be faster. Probably "depends". Matter of measuring.
def somefunc(arg, _cache={}):
    if len(_cache) > 10 ** 5:
        _cache.popitem()
    try:
        return _cache[arg]
    except KeyError:
        result = expensivefunc(arg)
        _cache[arg] = result
        return result
Albert-Jan


_cache.get(arg) should be a little faster and use slightly fewer 
resources than the try/except.
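A sketch of that variant; note it assumes expensivefunc never returns
None, which is the caveat raised in the next two messages:

    def somefunc(arg, _cache={}):
        if len(_cache) > 10 ** 5:
            _cache.popitem()
        result = _cache.get(arg)
        if result is None:   # only safe if None is never a real result
            result = _cache[arg] = expensivefunc(arg)
        return result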


--
https://mail.python.org/mailman/listinfo/python-list


Re: LRU cache

2023-02-18 Thread Rob Cliffe via Python-list



On 18/02/2023 15:29, Thomas Passin wrote:

On 2/18/2023 5:38 AM, Albert-Jan Roskam wrote:

    I sometimes use this trick, which I learnt from a book by Martelli.
    Instead of try/except, membership testing with "in" 
(__contains__) might

    be faster. Probably "depends". Matter of measuring.
    def somefunc(arg, _cache={}):
        if len(_cache) > 10 ** 5:
            _cache.popitem()
        try:
            return _cache[arg]
        except KeyError:
            result = expensivefunc(arg)
            _cache[arg] = result
            return result
    Albert-Jan


_cache.get(arg) should be a little faster and use slightly fewer 
resources than the try/except.


Provided that you can supply a default value to get() which will never 
be a genuine "result".


--
https://mail.python.org/mailman/listinfo/python-list


Re: LRU cache

2023-02-18 Thread Albert-Jan Roskam
   On Feb 18, 2023 17:28, Rob Cliffe via Python-list 
   wrote:

 On 18/02/2023 15:29, Thomas Passin wrote:
 > On 2/18/2023 5:38 AM, Albert-Jan Roskam wrote:
 >>     I sometimes use this trick, which I learnt from a book by
 Martelli.
 >>     Instead of try/except, membership testing with "in"
 >> (__contains__) might
 >>     be faster. Probably "depends". Matter of measuring.
 >>     def somefunc(arg, _cache={}):
 >>         if len(_cache) > 10 ** 5:
 >>             _cache.popitem()
 >>         try:
 >>             return _cache[arg]
 >>         except KeyError:
 >>             result = expensivefunc(arg)
 >>             _cache[arg] = result
 >>             return result
 >>     Albert-Jan
 >
 > _cache.get(arg) should be a little faster and use slightly fewer
 > resources than the try/except.
 >
 Provided that you can provide a default value to get() which will never
 be a genuine "result".

   This might be better than None:
   _cache.get(arg, Ellipsis)
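   A unique sentinel avoids the problem entirely, since a fresh object()
   cannot compare equal to any genuine result; a minimal sketch:

       _MISSING = object()  # private sentinel, never a real value

       def somefunc(arg, _cache={}):
           result = _cache.get(arg, _MISSING)
           if result is _MISSING:
               result = _cache[arg] = expensivefunc(arg)
           return result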
-- 
https://mail.python.org/mailman/listinfo/python-list


RE: Comparing caching strategies

2023-02-18 Thread avi.e.gross
David,

This conversation strikes me as getting antagonistic and as such, I will not
continue it here after this message.

I can nitpick at least as well as you but have no interest. It is also
wandering away from the original point.

The analogy I gave remains VALID no matter if you do not accept it as being
precise. Analogies are not equalities. I did not say this does at all what
programs like pkzip do in entirety or even anything similarly. I simply said
that pkzip (as it no doubt evolves) has a series of compression methods to
choose from and each has a corresponding method to undo and get back the
original. As it happens, you can write any number of algorithms that
determine which method to use and get back the original. It depends on many
relative costs including not just the size of the compressed object but how
often it will be accessed and how much effort each takes. In some cases you
will compress everything once and extract it many times and in many places,
so it may be worth the effort to try various compression techniques and
measure size and even the effort to extract from that and decide which to
use.

I do not know the internals of any Roaring Bitmap implementation so all I
did gather was that once the problem is broken into accessing individual
things I chose to call zones for want of a more specific name, then each
zone is stored in one of an unknown number of ways depending on some logic.
You say an implementation chose two ways and that is fine. But in theory, it
may make sense to choose other ways as well and the basic outline of the
algorithm remains similar. I can imagine if a region/zone is all the even
numbers, then a function that checks if you are searching for an odd number
may be cheaper. That is an example, not something I expect to see or that is
useful enough. But the concept of storing a function as the determiner for a
region is general enough that it can support many more realistic ideas.

From what you wrote, the algorithm chosen is fairly simple BUT I have to ask
if these bitmaps are static or can be changed at run time? I mean if you
have a region that is sparse and then you keep adding, does the software
pause and rewrite that region as a bitmap if it is a list of offsets? Or, if
a bitmap loses enough, ...

On to your other points. Again, understand I am talking way more abstractly
than you and thus it really does not matter what the length of a particular
ID in some country is for the main discussion. The assumption is that if you
are using something with limits, like a Roaring Bitmap, that you do things
within the limits. When I lived in Austria, I did not bother getting an
Austrian Sozialversicherungsnummer, so I had no idea it is ten digits long.
In any case, many things grow over time such as the length of telephone
numbers. 

The same applies to things like airport codes. They can get longer for many
reasons and may well exceed 4 characters, and certainly Unicode or other
such representations may exceed four bytes now if you allow non-ASCII
characters that map into multiple bytes. My point was to think about how
useful a Roaring bitmap is if it takes only 32-bit integers and one trivial
mapping was to use any four bytes to represent a unique integer. But clearly
you could map longer strings easily enough if you restrict yourself to 26
upper case letters and perhaps a few other symbols that can be encoded in 5
bits. I am not saying it is worth the effort, but that means 6 characters
can fit in 32 bits.
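A sketch of that 5-bits-per-character packing, assuming fixed-length
codes drawn from a 32-symbol alphabet (the alphabet here is illustrative):

    ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345"  # 32 symbols -> 5 bits each

    def pack(code):            # up to 6 chars fit in 30 bits
        n = 0
        for ch in code:
            n = (n << 5) | ALPHABET.index(ch)
        return n

    def unpack(n, length):
        chars = []
        for _ in range(length):
            chars.append(ALPHABET[n & 31])
            n >>= 5
        return "".join(reversed(chars))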

I do wonder if the basic idea has to be limited to 32 bits or if it can
expand to say 64 or use additional but fast methods of storing the data
beyond the two mentioned.

There are variants of other ideas I can think of like sparse arrays or
matrices such as you find in the scipy module in Python. If they hold a
Boolean value, they sound like they are a similar idea where you simply keep
track of the ones marked True, or if it makes sense, the ones considered
False.
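A minimal sketch with scipy's dok_matrix (which supports item
assignment); multiply() is elementwise, so on booleans it behaves like
set intersection:

    from scipy import sparse

    a = sparse.dok_matrix((1, 10**6), dtype=bool)  # membership "bitmap" A
    b = sparse.dok_matrix((1, 10**6), dtype=bool)  # membership "bitmap" B
    a[0, 99] = a[0, 12345] = True
    b[0, 12345] = True
    both = a.multiply(b)   # elementwise AND: members of both sets
    print(both.nnz)        # -> 1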

-Original Message-
From: Python-list  On
Behalf Of Peter J. Holzer
Sent: Saturday, February 18, 2023 5:40 AM
To: [email protected]
Subject: Re: Comparing caching strategies

On 2023-02-17 18:08:09 -0500, [email protected] wrote:
> Analogies I am sharing are mainly for me to wrap my head around an 
> idea by seeing if it matches any existing ideas or templates and is 
> not meant to be exact. Fair enough?

Yeah. But if you are venting your musings into a public space you shouldn't
be surprised if people react to them. And we can only react to what you
write, not what you think.

> But in this case, from my reading, the analogy is rather reasonable.

Although that confused me a bit. You are clearly responding to something I
thought about but which you didn't quote below. Did I just think about it
and not write it, but you responded anyway because you're a mind reader?
Nope, it just turns out you (accidentally) deleted that sentence.


> The implementation of Roaring Bitmaps seems to logically break up the
> space of all possible values it looks up into multiple "zones" that
> are somewhat analogous to individual files,

Re: Comparing caching strategies

2023-02-18 Thread MRAB

On 2023-02-18 19:59, [email protected] wrote:
[snip]

I do not know the internals of any Roaring Bitmap implementation so all I
did gather was that once the problem is broken into accessing individual
things I chose to call zones for want of a more specific name, then each
zone is stored in one of an unknown number of ways depending on some logic.
You say an implementation chose two ways and that is fine. But in theory, it
may make sense to choose other ways as well and the basic outline of the
algorithm remains similar. I can imagine if a region/zone is all the even
numbers, then a function that checks if you are searching for an odd number
may be cheaper. That is an example, not something I expect to see or that is
useful enough. But the concept of storing a function as the determiner for a
region is general enough that it can support many more realistic ideas.

  From what you wrote, the algorithm chosen is fairly simple BUT I have to ask
if these bitmaps are static or can be changed at run time? I mean if you
have a region that is sparse and then you keep adding, does the software
pause and rewrite that region as a bitmap if it is a list of offsets? Or, if
a bitmap loses enough, ...

A region is either "sparse" (<= 4096 entries) or "dense" (> 4096 
entries). It'll be converted to the other type as and when necessary.
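A toy sketch of that conversion-on-insert, purely to illustrate the idea
(the real containers are more refined, and removal would mirror this in
the other direction):

    THRESHOLD = 4096

    def add(container, value):
        kind, data = container
        if kind == "array":
            data.add(value)               # data: set of 16-bit offsets
            if len(data) > THRESHOLD:     # grew dense: switch to a bitmap
                bitmap = bytearray(8192)
                for v in data:
                    bitmap[v >> 3] |= 1 << (v & 7)
                return ("bitmap", bitmap)
            return container
        data[value >> 3] |= 1 << (value & 7)  # data: bytearray bitmap
        return ("bitmap", data)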



On to your other points. Again, understand I am talking way more abstractly
than you and thus it really does not matter what the length of a particular
ID in some country is for the main discussion. The assumption is that if you
are using something with limits, like a Roaring Bitmap, that you do things
within the limits. When I lived in Austria, I did not bother getting an
Austrian Sozialversicherungsnummer so I have no idea it is ten digits long.
In any case, many things grow over time such as the length of telephone
numbers.

The same applies to things like airport codes. They can get longer for many
reasons and may well exceed 4 characters, and certainly UNICODE or other
such representations may exceed four bytes now if you allow non-ASCII
characters that map into multiple bytes. My point was to think about how
useful a Roaring bitmap is if it takes only 32 bit integers and one trivial
mapping was to use any four bytes to represent a unique integer. But clearly
you could map longer strings easily enough if you restrict yourself to 26
upper case letters and perhaps a few other symbols that can be encoded in 5
bits. I am not saying it is worth the effort, but that means 6 characters
can fit in 32 bits.

The Unicode Consortium have said that Unicode will be limited to 
U+0000..U+10FFFF (21 bits).



I do wonder if the basic idea has to be limited to 32 bits or if it can
expand to say 64 or use additional but fast methods of storing the data
beyond the two mentioned.

There are variants of other ideas I can think of like sparse arrays or
matrices such as you find in the scipy module in Python. If they hold a
Boolean value, they sound like they are a similar idea where you simply keep
track of the ones marked True, or if it makes sense, the ones considered
False.


[snip]

--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing caching strategies

2023-02-18 Thread Thomas Passin

On 2/18/2023 2:59 PM, [email protected] wrote:

I do not know the internals of any Roaring Bitmap implementation so all I
did gather was that once the problem is broken into accessing individual
things I chose to call zones for want of a more specific name, then each
zone is stored in one of an unknown number of ways depending on some logic.


Somewhat tangential, but back quite a while ago there was a C library 
called IIRC "julia list".  It implemented lists in several different 
ways, some quite sophisticated, depending on the size and usage pattern. 
 I remembered it as soon as I took a look at Roaring Bitmap and saw 
that the latter can use different representations depending on size and 
access patterns.


--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing caching strategies

2023-02-18 Thread Peter J. Holzer
On 2023-02-18 15:59:32 -0500, Thomas Passin wrote:
> On 2/18/2023 2:59 PM, [email protected] wrote:
> > I do not know the internals of any Roaring Bitmap implementation so all I
> > did gather was that once the problem is broken into accessing individual
> > things I chose to call zones for want of a more specific name, then each
> > zone is stored in one of an unknown number of ways depending on some logic.
> 
> Somewhat tangential, but back quite a while ago there was a C library called
> IIRC "julia list".

ITYM Judy arrays. They were mentioned here already.

> It implemented lists in several different ways, some quite
> sophisticated, depending on the size and usage pattern.  I remembered
> it as soon as I took a look at Roaring Bitmap and saw that the latter
> can use different representations depending on size and access
> patterns.

Yes, Roaring Bitmaps are somewhat similar. Judy arrays are more
versatile (they support more data types) and in many ways more
sophisticated, despite being 10+ years older. OTOH Roaring Bitmaps are a
lot simpler which may have contributed to their popularity.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | [email protected] |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


-- 
https://mail.python.org/mailman/listinfo/python-list


RE: Comparing caching strategies

2023-02-18 Thread avi.e.gross
It is not an unusual pattern, Thomas, to do something selective to some object 
rather than do all parts just one way.

The history of computing has often been one where you had to deal with scarcity 
of expensive resources.

Consider the Python "list" as a rather wasteful container that is best used for 
diverse contents. As soon as you make all the contents of the same basic type, 
you start wondering if you are better off with a restricted type that holds 
only one kind of item, often as something like a numpy array. Or, you start 
wondering if maybe a good way to store the contents is in a dictionary instead, 
as searching it is often way faster.

But fundamentally, there is nothing WRONG with uses of a Python list as it 
handles almost anything and for small sample sizes, it is often not worth 
spending time redoing it. Still, if you implement a 3-D matrix as a list of 
lists of lists, and use slicing and other techniques to do things like matrix 
multiplication, it gets unwieldy enough that arguably it is the wrong tool.

If you look at the history of Python, they deliberately left out many of the 
containers other programming languages used and stressed lists and tuples. No 
vectors or arrays were the focus. Later stuff had to be added as people noted 
that generality has costs. If you look at a language like JavaScript, it is 
beyond weird how they decided to use attributes of an object to make an array. 
So an object might have attributes like "name" and "length" alongside some like 
"1" and "2" and "666" and when you wanted to treat the object like an array, it 
might look at the attributes and ignore the non-numerical ones and look like it 
has a sparsely populated array/vector of items indexed from 1 to 666. You can 
do all kinds of arithmetic operations and add missing indices or remove some 
and it still works like a typical array but with weird costs such as sometimes 
having to reindex lots of items if you want to insert a new item as the 3rd or 
1st element so any above need renumbering the hard way! It works but strikes me 
as a kludge.

If you look within Python  at numpy and pandas and similar utilities, they are 
well aware of the idea of abstracting out concepts with different 
implementations and use lots of Python tools that access different storage 
methods as needed often by using objects that implement various "protocols" to 
the point where manipulating them similarly seems straightforward. In 
principle, you could create something like a pandas data.frame and then call a 
function that examines the columns, and perhaps rows, and returns a modified 
(or new) data.frame where the contents have been adjusted so the storage is 
smaller or can be accessed more efficiently, based on any other preferences and 
hints you supply, such as if it can be immutable. A column which is all 1's or 
Trues can obviously be done other ways, and one that has all values under 256, 
again, can use less storage. 

Of course this would need to be done very carefully so that changes that 
violate assumptions will result in a refactoring to a version that handles the 
changes, or results in an error. And a long-running system could be set to keep 
track of how an object is used and perhaps make adjustments. As one example, 
after some number of accesses, you might change a function "live" to begin 
caching, or have an object reconfigure to be faster to search even if occupying 
more room.

Back to Python lists, a typical change is simply to convert them to a tuple 
which makes them easier to store and often faster to search. And, if the keys 
are unique, now that dictionaries maintain order of insertion, sometimes you 
may want to convert the list to a dict. 

But I hasten to note some "improvements" may not really improve. In a language 
like R, many operations such as changing one vector in a data.frame are done 
lazily and the new copy is still sharing mostly the same storage as the old. 
Making some changes can result in negative effects. A common problem people 
have is that trying to store some objects in an existing vector can work, 
except that when done, the entire vector has been transformed into one that can carry any 
of the contents. A vector of integers may become a vector of doubles or even a 
vector of characters that now have entries like "777" as a character string. 
The flexibility comes with lots of programming errors!

Note how this can cause problems with the original idea here of caching 
strategies. Imagine a function that checks the environment as to what encoding 
or human language and so on to produce text in. If you cache it so it produces 
results that are stored in something like a dictionary with a key, and later 
the user changes the environment as it continues running, the cache may now 
contain invalid results. You might need to keep track of the environment and 
empty the cache if things change, or start a new cache and switch to it.  An 
example would be the way I use Google Translate. I sometimes am studying or 
using a language and want to look up a word or phrase or entire sentence. If 
Google Translate keeps track, it may notice repeated requests like "Valentines 
Day" and cache it for re-use. But I often click to switch languages and see if 
the other one uses a similar or different way to describe what it means or 
something similar but spelled another way. German does the latter as in 
Valentinstag which is a fairly literal translation as does Dutch (Valentijnsdag) 
and Hungarian (Valentin nap).

But Hebrew calls it the holiday of love, sort of (חג האהבה). Portuguese is 
similar but includes day as well as love (Dia dos Namorados).

Esperanto tosses in more about sainthood (Sankta Valentín) and in a sense 
Spanish does both ways with day and saint (Día de San Valentín).
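One sketch of guarding against that: make the environment part of the 
cache key, so a language switch can never serve a stale entry (the names 
here, including call_translation_service, are hypothetical):

    _cache = {}

    def translate(phrase, target_lang):
        key = (phrase, target_lang)   # environment baked into the key
        if key not in _cache:
            _cache[key] = call_translation_service(phrase, target_lang)
        return _cache[key]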

Re: Comparing caching strategies

2023-02-18 Thread MRAB

On 2023-02-18 23:04, [email protected] wrote:
[snip]


Note how this can cause problems with the original idea here of caching strategies. 
Imagine a function that checks the environment as to what encoding or human language and 
so on to produce text in. If you cache it so it produces results that are stored in 
something like a dictionary with a key, and later the user changes the environment as it 
continues running, the cache may now contain invalid results. You might need to keep 
track of the environment and empty the cache if things change, or start a new cache and 
switch to it.  An example would be the way I use Google Translate. I sometimes am 
studying or using a language and want to look up a word or phrase or entire sentence. If 
Google Translate keeps track, it may notice repeated requests like "Valentines 
Day" and cache it for re-use. But I often click to switch languages and see if the 
other one uses a similar or different way to describe what it means or something similar 
but spelled another way. German does the latter as in Valentinstag which is a fairly 
literal translation as does Dutch (Valentijnsdag ) and  Hungarian (Valentin nap) .

But Hebrew calls it the holiday of love, sort of (חג האהבה). Portuguese is 
similar but includes day as well as love (Dia dos Namorados)

Esperanto tosses in more about sainthood (Sankta Valentín) and in a sense 
Spanish does both ways with day and saint (Día de San Valentín).

The Esperanto didn't look right to me; it's "Valentena tago" or 
"Sankt-Valentena tago".


[snip]

--
https://mail.python.org/mailman/listinfo/python-list


Re: Comparing caching strategies

2023-02-18 Thread Thomas Passin

On 2/18/2023 5:55 PM, Peter J. Holzer wrote:

On 2023-02-18 15:59:32 -0500, Thomas Passin wrote:

On 2/18/2023 2:59 PM, [email protected] wrote:

I do not know the internals of any Roaring Bitmap implementation so all I
did gather was that once the problem is broken into accessing individual
things I chose to call zones for want of a more specific name, then each
zone is stored in one of an unknown number of ways depending on some logic.


Somewhat tangential, but back quite a while ago there was a C library called
IIRC "julia list".


ITYM Judy arrays. They were mentioned here already.


Ha! Fading memory, I guess.  Maybe that's why I couldn't find them with 
an internet search.



It implemented lists in several different ways, some quite
sophisticated, depending on the size and usage pattern.  I remembered
it as soon as I took a look at Roaring Bitmap and saw that the latter
can use different representations depending on size and access
patterns.


Yes, Roaring Bitmaps are somewhat similar. Judy arrays are more
versatile (they support more data types) and in many ways more
sophisticated, despite being 10+ years older. OTOH Roaring Bitmaps are a
lot simpler which may have contributed to their popularity.

 hp




--
https://mail.python.org/mailman/listinfo/python-list


RE: Comparing caching strategies

2023-02-18 Thread avi.e.gross
MRAB,

I made it very clear I was using the translation provided by Google Translate. 
I copied exactly what it said and as I speak the languages involved, they 
seemed reasonable.  I often find it provides somewhat different translations 
than I expect and sometimes I need to supply a longer sentence to get it on 
track and sometimes it is just plain wrong. Getting it to provide 
female-specific sentences can be an exercise in frustration, for example.

What it produces is not important to the main point. 

It is hard to have conversations when people keep niggling on details and 
especially details that make absolutely no difference. 

My point was you can have functions whose results cannot be cached in just any 
trivial way. The EXAMPLE I gave suggests that if you had a Python program that did 
something along these lines between multiple languages, the RESULTS will depend 
on not just the phrase used. But they can be cached well in several ways if you 
want.

Let me point out another related use. When you type a message on your phone, 
you may have a sort of prediction algorithm running that keeps offering to 
auto-complete your words or fix spelling errors. It does this dynamically and 
learns new words and eventually may start remembering patterns. I have totally 
mucked up my ENGLISH keyboard because it now remembers snippets of my typing in 
languages like Spanish and German and lately Esperanto and it makes suggestions 
from a very confused composite state including lots of funny accented 
characters. I now switch keyboards when I remember but much of the damage is 
done! LOL!

That too is an example of some hidden data the program uses which is loaded 
with past data of what I have typed and of short phrases so it can make a guess 
that word A is often followed by word B and after seeing both A and B is 
extremely likely to be followed by C. Now obviously this would not be a trivial 
use of caching as part of guessing but could be part of a method that 
increments some keys like "A B" or "A B C" with a count of how often they have 
seen that combo. Yes, this is not direct caching, but also a side effect you 
might want to add to a function.
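A sketch of that counting idea with collections.Counter (illustrative only):

    from collections import Counter

    counts = Counter()

    def observe(words):
        # bump counts for adjacent pairs ("A B") and triples ("A B C")
        for i in range(len(words) - 1):
            counts[tuple(words[i:i + 2])] += 1
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1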

As for Esperanto, I am still learning it and as a created language, the snippet 
I used can likely be translated multiple ways, and as someone who couldn't care 
less about Valentines Day, I don't ever intend to use any of those ways in 
conversation. Most languages have alternate ways of saying things and it would 
not shock me if there are sentences like "The Holiday used in Capitalist 
Western Countries to commemorate a day of love based on a person some religions 
consider of merit and have given a religious title" or whatever.

So back on topic, an original question here was how to cache things, perhaps 
using a LRU algorithm with a data structure using some maximum.

My comment was that using a function decorator that caches any result may not 
be adequate in many cases. I presented several examples and my point is that in 
the above example, it may make more sense to have multiple caches that exist 
perhaps outside any one function, or a more complex cache that stores using a 
more complex key.

-Original Message-
From: Python-list  On 
Behalf Of MRAB
Sent: Saturday, February 18, 2023 7:04 PM
To: [email protected]
Subject: Re: Comparing caching strategies

On 2023-02-18 23:04, [email protected] wrote:
[snip]
> 
> Note how this can cause problems with the original idea here of caching 
> strategies. Imagine a function that checks the environment as to what 
> encoding or human language and so on to produce text in. If you cache it so 
> it produces results that are stored in something like a dictionary with a 
> key, and later the user changes the environment as it continues running, the 
> cache may now contain invalid results. You might need to keep track of the 
> environment and empty the cache if things change, or start a new cache and 
> switch to it.  An example would be the way I use Google Translate. I 
> sometimes am studying or using a language and want to look up a word or 
> phrase or entire sentence. If Google Translate keeps track, it may notice 
> repeated requests like "Valentines Day" and cache it for re-use. But I often 
> click to switch languages and see if the other one uses a similar or 
> different way to describe what it means or something similar but spelled 
> another way. German does the latter as in Valentinstag which is a fairly 
> literal translation as does Dutch (Valentijnsdag ) and  Hungarian (Valentin 
> nap) .
> 
> But Hebrew calls it the holiday of love, sort of (חג האהבה). 
> Portuguese is similar but includes day as well as love (Dia dos 
> Namorados)
> 
> Esperanto tosses in more about sainthood (Sankta Valentín) and in a sense 
> Spanish does both ways with day and saint (Día de San Valentín).
> 
The Esperanto didn't look right to me; it's "Valentena tago" or 
"Sankt-Valentena tago".

[snip]

--
https://mail.python.org/mailman/listinfo/python-list