On Mon, Nov 4, 2013 at 9:41 AM, Amal Thomas <amalthomas...@gmail.com> wrote:
> @Steven: Thank you... My input data is basically AUGC and newlines... I
> would like to know about the bytearray technique. Please suggest me some
> links or references. I will go through the profiler and check whether the
> code maintains linearity with the input files.

Hi Amal,

I suspect that what's been missing throughout this thread is more concrete information about the problem's background. I would strongly suggest we make sure we understand the problem before making more assumptions.

1. What is the nature of the operation you are performing on your data? Can you briefly describe it? Does it involve random access, or is it purely sequential? Are the operations on each line independent, or is there some kind of dependency across lines? Does it involve pattern matching, or...? Are you maintaining some in-memory data structure as you walk through the file?

The reason we need to know this is that it affects your file access patterns. It may hint at whether you can avoid loading the whole file into memory at all. It may even affect whether you can distribute the work among several computers.

Here is also why it's important to talk more about the problem you're trying to solve. Your question assumes that the dominant factor in your program's runtime is access to your data, and that loading the entire file into memory will improve performance. But I see no evidence to support that assumption yet. Why shouldn't I believe that the time is actually being spent paging in virtual memory, for example, due to something else your program is doing? If that's the case, trying to load the file entirely into memory will be counterproductive.

2. What is the format of your input data? You mention it is AUGC and newlines, but more details would be really helpful. Why is it line-oriented, for example? I mean that as a serious question: is the line structure significant? Is it a FASTA file? Is it some kind of homebrewed format? Please be as specific as you can here: you may be duplicating effort that folks who have spent _years_ on sequence-reading libraries have already done for you.

Specifically, you might be able to reuse Biopython's library for sequence I/O (there is a small sketch at the end of this message):

    http://biopython.org/wiki/SeqIO

By trying to write the file parsing yourself, you may be making a mistake. For example, there might be issues in Python 3 due to Unicode encodings:

    http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

which might contribute to an unexpected increase in the size of a string's memory representation. Hard to say, since it depends on a host of factors. But other folks have probably encountered and solved this problem already. Concretely, I'm fairly sure Biopython's SeqIO does the Right Thing: it reads files in binary mode, treats line contents as bytes rather than regular strings, and represents the sequence in a memory-efficient way. At the very least, I know the Biopython developers think about these kinds of problems a lot:

    http://web.archiveorange.com/archive/v/5dAwXDMfufikePQqtPgx

Probably a lot more than us. :P

So if it's at all possible, try to leverage what's already out there. You should almost certainly not be writing your own sequence-reading code.
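Just so this isn't totally abstract, here is a rough sketch of what reading with SeqIO could look like. I'm assuming here that the data really is FASTA and that Biopython is installed; "reads.fasta" is a made-up filename and process() is just a stand-in for whatever analysis you are doing:

    from Bio import SeqIO

    # SeqIO.parse() returns an iterator, so records are read lazily,
    # one at a time; the whole file never has to sit in memory at once.
    with open("reads.fasta") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            # record.id is the identifier from the header line,
            # record.seq is the sequence itself.
            process(record.seq)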
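And purely to illustrate the bytes-versus-text point above (not as a recommendation to hand-roll your own parser): reading in binary mode keeps each base at a single byte instead of paying the overhead of decoding to a Python 3 str. Again, only a sketch, and "input.txt" is a placeholder name:

    # Read raw bytes and strip newlines without ever decoding to str.
    with open("input.txt", "rb") as f:
        data = bytearray()
        for line in f:
            data.extend(line.rstrip(b"\r\n"))
    # data is now one mutable bytearray of A/U/G/C byte values.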