I'm using Python to parse metrics out of logfiles.
The logfiles are fairly large (multiple GBs), so I'm keen to do this in a
reasonably performant way.
The metrics are being sent to an InfluxDB database - so it's better if I can
batch multiple metrics together, rather than sending them individually.
Currently, I'm using the grouper() recipe from the itertools documentation to
process multiple lines in "chunks" - I then send the collected points to the
database:
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

with open(args.input_file, 'r') as f:
    line_counter = 0
    for chunk in grouper(f, args.batch_size):
        json_points = []
        for line in chunk:
            line_counter += 1
            # Do some processing
            json_points.append(some_metrics)
        if json_points:
            write_points(logger, client, json_points, line_counter)
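(write_points() above is my own helper - roughly, it just hands the
accumulated batch to the InfluxDB client and logs progress, something along
these lines:)

def write_points(logger, client, json_points, line_counter):
    # Rough sketch only: client is an influxdb.InfluxDBClient, whose
    # write_points() method accepts a whole list of point dicts in one call
    client.write_points(json_points)
    logger.info("Wrote %d points (processed %d lines so far)",
                len(json_points), line_counter)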
However, not every line will produce metrics - so I'm batching on the number of
input lines, rather than on the items I actually send to the database.
My question is: would it make sense to simply have a json_points list that
accumulates metrics, check its size on each iteration, and send the points off
once it reaches a certain size? E.g.:
BATCH_SIZE = 1000

with open(args.input_file, 'r') as f:
    json_points = []
    for line_number, line in enumerate(f):
        # Do some processing
        json_points.append(some_metrics)
        if len(json_points) >= BATCH_SIZE:
            write_points(logger, client, json_points, line_number)
            json_points = []
    if json_points:
        # Flush whatever is left over after the last full batch
        write_points(logger, client, json_points, line_number)
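Or, to make the batching reusable, I suppose I could pull the per-line
processing out into a generator and slice the stream of emitted points with
itertools.islice - a rough sketch, where parse_line() is a stand-in for my
actual processing and returns None for lines that don't produce a metric:

from itertools import islice

BATCH_SIZE = 1000

def parse_points(f):
    # Yield one point per line that actually produces a metric
    for line in f:
        point = parse_line(line)   # stand-in for my splitting / date parsing
        if point is not None:
            yield point

with open(args.input_file, 'r') as f:
    points = parse_points(f)
    while True:
        batch = list(islice(points, BATCH_SIZE))  # next BATCH_SIZE points, not lines
        if not batch:
            break
        client.write_points(batch)  # calling the client directly here, since
                                    # this sketch doesn't track line numbers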
Also, I originally used grouper because I thought it better to process lines in
batches, rather than individually. However, is there actually any throughput
advantage to doing it this way in Python? Or is there a better way to improve
throughput?
We can assume for now that the CPU load of the processing is fairly light
(mainly string splitting and date parsing).
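For what it's worth, the per-line work is roughly of this shape - parse_line()
from the sketch above might look something like this (illustrative only, the
real field layout differs):

from datetime import datetime

def parse_line(line):
    # Illustrative only: split the line and parse the timestamp
    fields = line.split()
    if len(fields) < 3:          # not a metric line
        return None
    timestamp = datetime.strptime(fields[0] + " " + fields[1],
                                  "%Y-%m-%d %H:%M:%S")
    return {"measurement": "logfile_metric",   # made-up names, just for shape
            "time": timestamp.isoformat(),
            "fields": {"value": float(fields[2])}}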