SOCKS5 + python = pain in you-know-where ?
Hi all,

I've just googled both the web and the groups. Who would believe it: nothing simple, helpful and working concerning SOCKS5 support in Python. Has anyone had success here?

Regards,
Valery
--
http://mail.python.org/mailman/listinfo/python-list
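(For the record, the closest thing to "simple and working" that can be sketched relies on the third-party SocksiPy/PySocks module -- everything below assumes that module is installed, it is not in the standard library, and the proxy address is made up:)

import socket
import socks  # third-party SocksiPy / PySocks module -- assumed installed

# route every newly created socket through a SOCKS5 proxy (address is illustrative)
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 1080)
socket.socket = socks.socksocket

import urllib2
print urllib2.urlopen("http://www.python.org/").read(200)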
a huge shared read-only data in parallel accesses -- How? multithreading? multiprocessing?
Hi all,

Q: how to organize parallel access to a huge, common, read-only Python data structure?

Details: I have a huge data structure that takes >50% of RAM. My goal is to have many computational threads (or processes) with efficient read access to this huge and complex data structure. "Efficient" in particular means "without serialization" and "without unneeded locking on read-only data".

As far as I can see, there are the following strategies:

1. multi-processing
=> a. child processes get their own *copies* of the huge data structure -- bad, and not possible at all in my case;
=> b. child processes often communicate with the parent process via some IPC -- bad (serialization);
=> c. child processes access the huge structure via some shared-memory approach -- feasible without serialization?! (copy-on-write does not work well here in CPython/Linux!!);

2. multi-threading
=> d. CPython is said to have problems here because of the GIL -- any comments?
=> e. GIL-less implementations have their own issues -- any hot recommendations?

I am a big fan of the parallel map() approach -- either multiprocessing.Pool.map or, even better, pprocess.pmap. However, this no longer works straightforwardly when "huge data" means >50% of RAM ;-)

Comments and ideas are highly welcome!!

Here is the workbench example of my case:

##
import time
from multiprocessing import Pool

def f(_):
    time.sleep(5)  # just to emulate the time used by my computation
    res = sum(parent_x)  # my sophisticated formula goes here
    return res

if __name__ == '__main__':
    parent_x = [1./i for i in xrange(1, 1000)]  # my huge read-only data :o)
    p = Pool(7)
    res = list(p.map(f, xrange(10)))  # switch to ps and see how fast your free memory is getting wasted...
    print res
##

Kind regards
Valery
--
http://mail.python.org/mailman/listinfo/python-list
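(For what it's worth, a minimal sketch of strategy (c) for purely numeric data: multiprocessing.Array places the values in shared memory, so children read the parent's pages without copying. This assumes the structure can be flattened into an array of C doubles and relies on fork() inheritance on Linux:)

import time
from multiprocessing import Pool, Array

def f(_):
    time.sleep(5)            # emulate the real computation
    return sum(shared_x)     # reads shared memory, no per-child copy

if __name__ == '__main__':
    # 'd' means C doubles; lock=False is fine for purely read-only access
    shared_x = Array('d', [1. / i for i in xrange(1, 1000)], lock=False)
    p = Pool(7)
    print p.map(f, xrange(10))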
Re: a huge shared read-only data in parallel accesses -- How? multithreading? multiprocessing?
Hi Klauss,

> How's the layout of your data, in terms # of objects vs. bytes used?

A dict (or list) of 10K-100K objects, where the objects are themselves lists or dicts. The whole structure eats up to 2+ GB of RAM.

> Just to have an idea of the overhead involved in refcount
> externalization (you know, what I mentioned
> here: http://groups.google.com/group/unladen-swallow/browse_thread/thread/9...
> )

Yes, I've understood the idea you explained there.

regards,
Valery
--
http://mail.python.org/mailman/listinfo/python-list
Re: a huge shared read-only data in parallel accesses -- How? multithreading? multiprocessing?
Hi Antoine,

On Dec 11, 3:00 pm, Antoine Pitrou wrote:
> I was going to suggest memcached but it probably serializes non-atomic
> types. It doesn't mean it will be slow, though. Serialization implemented
> in C may well be faster than any "smart" non-serializing scheme
> implemented in Python.

No serializing could be faster than NO serializing at all :) If a child process could directly read the parent's RAM -- what could be better?

> What do you call "problems because of the GIL"? It is quite a vague
> statement, and an answer would depend on your OS, the number of threads
> you're willing to run, and whether you want to extract throughput from
> multiple threads or are just concerned about latency.

It seems to be a known fact that only one CPython interpreter thread will be running at a time, because a thread acquires the GIL during execution and the other threads within the same process are then just waiting for the GIL to be released.

> In any case, you have to do some homework and compare the various
> approaches on your own data, and decide whether the numbers are
> satisfying to you.

Well, the least evil is to pack and unpack things into array.array and/or, similarly, NumPy. I do hope that Klauss' patch will be accepted, because it will let me forget a lot of that unneeded packing and unpacking.

> > I am a big fan of parallel map() approach
>
> I don't see what map() has to do with accessing data. map() is for
> *processing* of data. In other words, whether or not you use a map()-like
> primitive does not say anything about how the underlying data should be
> accessed.

Right. However, saying "a big fan" had another focus here: if you write your code based on maps, it takes only a tiny effort to convert it into a MULTIprocessing one :) Just that.

Kind regards.
Valery
--
http://mail.python.org/mailman/listinfo/python-list
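(As an aside, the kind of trivial conversion meant here can be illustrated with a toy example -- work() below is just a stand-in for the real per-item computation:)

from multiprocessing import Pool

def work(item):
    return item * item   # stand-in for the real per-item computation

if __name__ == '__main__':
    items = range(100)
    serial_res = map(work, items)          # single-process version
    pool = Pool(4)
    parallel_res = pool.map(work, items)   # multi-process version: only the map() call changes
    assert serial_res == parallel_res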
multiprocessing + console + windows = challenge?
Hi all.

So, the doc is pitiless: "Note: Functionality within this package requires that the __main__ method be importable by the children. This is covered in Programming guidelines, however it is worth pointing out here. This means that some examples, such as the multiprocessing.Pool examples, will not work in the interactive interpreter."

My question: did anyone manage to get the multiprocessing module working in the interactive Python console? The Windows case would be especially interesting :)

The pprocess library works in the console on Linux, but it doesn't on Windows :-/

regards
Valery
--
http://mail.python.org/mailman/listinfo/python-list
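(For reference, the workaround that is usually suggested is to keep the worker function in a small importable module -- here a hypothetical workers.py -- so that the children can import it even when the Pool is driven from the interactive console:)

# workers.py -- hypothetical helper module, must be on the import path
def square(x):
    return x * x

# then, in the interactive console:
>>> from multiprocessing import Pool
>>> from workers import square
>>> p = Pool(2)
>>> p.map(square, range(5))
[0, 1, 4, 9, 16]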
"from logging import *" causes an error under Ubuntu Karmic
Hi all,

is it a pure Ubuntu Karmic (beta) issue?..

$ python
Python 2.6.3 (r263:75183, Oct 3 2009, 11:20:50)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from logging import *
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'NullHandler'

$ uname -a
Linux vaktop 2.6.31-11-generic #38-Ubuntu SMP Fri Oct 2 11:55:55 UTC 2009 i686 GNU/Linux

--
regards
Valery
--
http://mail.python.org/mailman/listinfo/python-list
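(In case someone hits the same thing: this particular AttributeError means the star import found 'NullHandler' listed in logging.__all__ while the module itself does not define it -- NullHandler only appeared in Python 2.7/3.1. A defensive import instead of the star import works around it; the "mylib" logger name below is purely illustrative:)

import logging

try:
    from logging import NullHandler
except ImportError:
    # fallback for logging builds that lack NullHandler (added in 2.7/3.1)
    class NullHandler(logging.Handler):
        def emit(self, record):
            pass

logging.getLogger("mylib").addHandler(NullHandler())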
Re: "from logging import *" causes an error under Ubuntu Karmic
OK, I've filed a bug, since Python 2.5 works fine here.
--
Valery
--
http://mail.python.org/mailman/listinfo/python-list
Re: How to initialize each multithreading Pool worker with an individual value?
On Dec 1, 3:24 am, James Mills wrote:
> I assume you are talking about multiprocessing
> despite you mentioning "multithreading" in the mix.

yes, sorry.

> Have a look at the source code for multiprocessing.pool
> and how the Pool object works and what it does
> with the initializer argument. I'm not entirely sure it
> does what you expect and yes documentation on this
> is lacking...

I see. I found my way "to seed" each member of the Pool with its own data. I do it right after initialization:

from multiprocessing import Pool

port = None

def port_seeder(port_val):
    from time import sleep
    sleep(1)  # or less...
    global port
    port = port_val

if __name__ == '__main__':
    pool = Pool(3)
    pool.map(port_seeder, range(3), chunksize=1)
    # now child processes are initialized with individual values

Another (a bit heavier) approach would be via a shared resource.

P.S. sorry, I found your answer only now.

regards
--
Valery
--
http://mail.python.org/mailman/listinfo/python-list
Anything better than asyncio.as_completed() and asyncio.wait() to manage execution of large amount of tasks?
Hi,

both asyncio.as_completed() and asyncio.wait() work with lists only; no generators are accepted. Is there anything similar to those functions that pulls Tasks/Futures/coroutines one by one and processes them in a limited task pool? I have a gazillion of Tasks and do not want to instantiate them all at once, but rather to instantiate and process them one by one as the running tasks complete.

best regards
--
Valery
--
https://mail.python.org/mailman/listinfo/python-list
Re: Anything better than asyncio.as_completed() and asyncio.wait() to manage execution of large amount of tasks?
Hi Maxime,
many thanks for your great solution. It would be so great to have it in stock asyncio and use it out of the box...

I've made 4 fixes to it, all of a rather "cosmetic" nature. Here is the final version:
import asyncio
from concurrent import futures

def as_completed_with_max_workers(tasks, *, loop=None, max_workers=5, timeout=None):
    loop = loop if loop is not None else asyncio.get_event_loop()
    workers = []
    pending = set()
    done = asyncio.Queue(maxsize=max_workers, loop=loop)  # Valery: respect the "loop" parameter
    exhausted = False
    timeout_handle = None  # Valery: added to see if we indeed have to call timeout_handle.cancel()

    @asyncio.coroutine
    def _worker():
        nonlocal exhausted
        while not exhausted:
            try:
                t = next(tasks)
                pending.add(t)
                yield from t
                yield from done.put(t)
                pending.remove(t)
            except StopIteration:
                exhausted = True

    def _on_timeout():
        for f in workers:
            f.cancel()
        workers.clear()
        # Wake up _wait_for_one()
        done.put_nowait(None)

    @asyncio.coroutine
    def _wait_for_one():
        f = yield from done.get()
        if f is None:
            raise futures.TimeoutError()
        return f.result()

    workers = [asyncio.async(_worker(), loop=loop) for _ in range(max_workers)]  # Valery: respect the "loop" parameter
    if workers and timeout is not None:
        timeout_handle = loop.call_later(timeout, _on_timeout)

    while not exhausted or pending or not done.empty():
        yield _wait_for_one()

    if timeout_handle:  # Valery: call timeout_handle.cancel() only if it is needed
        timeout_handle.cancel()
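(A hypothetical usage sketch -- fetch() and the numbers below are made up -- could look like this:)

@asyncio.coroutine
def fetch(i):
    yield from asyncio.sleep(0.01)   # stand-in for real I/O work
    return i * i

def lazy_tasks(loop):
    # tasks are created lazily, only as the workers pull them from the generator
    for i in range(100):
        yield asyncio.async(fetch(i), loop=loop)

@asyncio.coroutine
def consume(loop):
    results = []
    for f in as_completed_with_max_workers(lazy_tasks(loop), loop=loop, max_workers=5):
        results.append((yield from f))
    return results

loop = asyncio.get_event_loop()
print(loop.run_until_complete(consume(loop)))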
best regards
--
Valery A.Khamenya
--
https://mail.python.org/mailman/listinfo/python-list
asyncio with map&reduce flavor and without flooding the event loop
Hi all,

I am trying to use asyncio in real applications and it doesn't go that easily; the help of asyncio gurus is needed badly.

Consider a task like crawling the web starting from some web sites. Each site leads to the generation of new downloading tasks in exponential(!) progression. However, we want neither to flood the event loop nor to overload our network. We'd like to control the task flow. This is what I achieve well with a modification of Maxime's nice solution proposed here:
https://mail.python.org/pipermail/python-list/2014-July/675048.html

But I'd also need a very natural thing, a kind of map() & reduce(), or functools.reduce() if we are on Python 3 already. That is, I'd need to call a "summarizing" function for all the downloading tasks completed on links from a page. This is where I fail :(

I'd propose an oversimplified but still nice test to model the use case: let's use a Fibonacci function in its inefficient recursive form. That is, let coro_sum() be our reduce() function and coro_fib() be our map(). Something like this:

@asyncio.coroutine
def coro_sum(x):
    return sum(x)

@asyncio.coroutine
def coro_fib(x):
    if x < 2:
        return 1
    res_coro = executor_pool.spawn_task_when_arg_list_of_coros_ready(
        coro=coro_sum,
        arg_coro_list=[coro_fib(x - 1), coro_fib(x - 2)])
    return res_coro

So that we could run the following tests.

Test #1, on one worker:

executor_pool = ExecutorPool(workers=1)
executor_pool.as_completed( coro_fib(x) for x in range(20) )

Test #2, on two workers:

executor_pool = ExecutorPool(workers=2)
executor_pool.as_completed( coro_fib(x) for x in range(20) )

It would be very important that both the coro_fib() and the coro_sum() invocations are each run as a Task on some worker, not just spawned implicitly and left unmanaged!

It would be cool to find asyncio gurus interested in this very natural goal. Your help and ideas would be very much appreciated.

best regards
--
Valery
--
https://mail.python.org/mailman/listinfo/python-list
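(For comparison only -- a much weaker sketch in the same pre-async/await style, which merely bounds the number of concurrently running top-level jobs with an asyncio.Semaphore and leaves the recursive sub-calls unmanaged inside their parent task, i.e. not the per-subtask worker pooling asked for above:)

import asyncio

@asyncio.coroutine
def coro_fib(x):
    # naive recursive Fibonacci; sub-calls are awaited directly,
    # NOT managed as separate pool tasks
    if x < 2:
        return 1
    a = yield from coro_fib(x - 1)
    b = yield from coro_fib(x - 2)
    return a + b

@asyncio.coroutine
def bounded(coro, sem):
    # hold one semaphore slot while the whole top-level job runs
    yield from sem.acquire()
    try:
        return (yield from coro)
    finally:
        sem.release()

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    sem = asyncio.Semaphore(2, loop=loop)   # at most 2 top-level jobs at once
    jobs = [bounded(coro_fib(x), sem) for x in range(20)]
    print(loop.run_until_complete(asyncio.gather(*jobs, loop=loop)))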
Problem: neither urllib2.quote nor urllib.quote encodes unicode string arguments
Hi all,

things like

    urllib.quote(u"пиво Müller ")

fail with the error message:

    KeyError: u'\u043f'

The same happens with urllib2. Has anyone got a hint?? I need it to form a URI containing non-ASCII chars.

thanks in advance,
best regards
--
Valery
--
http://mail.python.org/mailman/listinfo/python-list
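(For reference, the usual remedy is to UTF-8 encode the unicode string before quoting it -- a minimal sketch:)

# -*- coding: utf-8 -*-
import urllib

s = u"пиво Müller"
quoted = urllib.quote(s.encode('utf-8'))   # quote() wants a byte string, not unicode
print quoted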
How to initialize each multithreading Pool worker with an individual value?
Hi,

multiprocessing.Pool has a promising initializer argument in its constructor. However, it doesn't look possible to use it to initialize each of the Pool's workers with its own individual value (I'd be glad to be wrong here).

So, how does one initialize each multiprocessing Pool worker with an individual value? A typical use case might be a connection pool of, say, 3 workers, where each of the 3 workers has its own TCP/IP port.

from multiprocessing.pool import Pool

def port_initializer(_port):
    global port
    port = _port

def use_connection(some_packet):
    global port
    print "sending data over port # %s" % port

if __name__ == "__main__":
    ports = ((4001, 4002, 4003), )
    p = Pool(3, port_initializer, ports)  # oops... :-)
    some_data_to_send = range(20)
    p.map(use_connection, some_data_to_send)

best regards
--
Valery A.Khamenya
--
http://mail.python.org/mailman/listinfo/python-list
Re: How to initialize each multithreading Pool worker with an individual value?
Hi Dan,

> If you create in the parent a queue in shared memory (multiprocessing
> facilitates this nicely), and fill that queue with the values in your
> ports tuple, then you could have each child in the worker pool extract
> a single value from this queue so each worker can have its own, unique
> port value.

This port number is supposed to be used only once, namely during initialization. As is quite usual with connections, it is a bit expensive to initiate one each time it is about to be used, so a connection is often initialized once and dropped only when all communication is done. In contrast, your case looks to me as if you propose to initiate the connection each time a new job comes from the queue for execution.

Regards,
Valery
--
http://mail.python.org/mailman/listinfo/python-list
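(For concreteness, a minimal sketch of a queue-based variant where each worker pulls its port exactly once, in the initializer rather than per job -- names and port numbers are illustrative:)

from multiprocessing import Pool, Queue

port = None   # per-worker global, set exactly once by the initializer

def port_initializer(port_queue):
    global port
    port = port_queue.get()   # runs once per worker at start-up

def use_connection(packet):
    print "sending %r over port %s" % (packet, port)

if __name__ == "__main__":
    q = Queue()
    for p in (4001, 4002, 4003):
        q.put(p)
    pool = Pool(3, initializer=port_initializer, initargs=(q,))
    pool.map(use_connection, range(20))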
