Once again I apologize for neglecting my email -- I'm spending most of
my time finding and downloading data from the web.  The good news is 
that there is a huge ocean of available data out there, much more than 
I expected.  The bad news is that the people who gathered it have used 
a wide range of data collection strategies and file formats, so it 
takes a lot more work to do anything with it than I expected.  Even 
just verifying that there is useful data in the downloaded files can 
be quite difficult!

Along with freely available and redistributable simulation software I 
think I'll try to come up with some non-proprietary file format standards
and software to convert existing codebook and dataset files.

I'm downloading a lot of educational and social survey data as well as
purely economic stuff.  As I see it, the economy depends on many 
factors that are usually ignored by economists, and I don't want to
ignore any data that might help make the simulation accurate.

If you have a favourite social or economic factor, please let me know,
and I'll try to find some data for it.  I've already noted Jay Hanson's
concern for the energy-cost of energy, Tom Walker's concern about
investments at compound interest, and Eva Durant's interest in other
forms of ownership.  Whether I agree with these concerns is not
important; I will still try to find some data about them.

This simulation should be self-validating in two ways:

1.  Complete cross-validation capabilities will be provided -- that
    is to say, the whole simulation minus one individual datum can
    be used to estimate that single number, and the result compared
    with the actual number.  If enough processing power is available,
    this can be done for ALL numbers, which will not only verify the
    simulation as a whole but also help find erroneous data.

2.  As time goes by, new data like this month's unemployment rate will
    be added to tables that previously included only unemployment
    rates for previous months.  As these are added, they can be
    compared with predictions made last month, the month before that,
    and so on.  The difference between prediction and reality is often 
    called a residual, since it is what is left over after subtracting
    the prediction.  The smaller the residuals, the better the model.

In general, more data is better than less, as long as it doesn't
contain too many errors, so I'll keep collecting data as I find it.

I should point out that like Eva, Tom, and Jay, I do have an axe to 
grind, my own pet theory -- the combinatorial stuff I've bored you 
with before.  You may well worry about my own theories being treated
too benevolently in this simulation, but I will be making all 
documentation, source code, and data public, which will let you snoop
around looking for bias to your heart's content.

I'm posting right after this a copy of a message I just sent to the
socialtechnology mailing list, which notes that survey data files
collected for this simulation will also be useful for my other 
project, an attempt to do social matching and optimization.  In fact
all my projects tend to blur into one another, and both the simulation
and matching projects will use software written earlier for semantic
network analysis (a species of linguistics).

I'll be putting some of that code up on the web as soon as I find 
myself a new web host that supports FTP and CGI as well as ordinary
web pages -- your recommendations or warnings about web hosts would
be appreciated.

      dpw

Douglas P. Wilson     [EMAIL PROTECTED]
http://www.island.net/~dpwilson/index.html
