Re: FOSS tool to do general stats from text indata

Emanuel Berg Sat, 24 Jun 2023 23:28:44 -0700

tomas wrote:

>>>> I mean a general tool, but with options to tweak the
>>>> report included, of course.
>>>
>>> If you can bear some tweaking, R is it.
>> 
>> Sure! Let's run R on this e-mail. Does it work and if so, what
>> does it say?
>
> To a generic question -- a generic answer


R is a programming language; I'm looking for a tool that
produces stats from text. If such a tool uses R, or any other
programming language or stats engine, to produce the outcome,
then for me as a potential user that is entirely up to those
who write it.

> I don't even know what you mean by "general stats"

Some examples from doing stats on text are: average word
length, most commonly used words, the longest paragraphs ...
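
To make those three examples concrete, here is a minimal
Python sketch. It is not an existing tool; the function name
`text_stats`, the `top_n` parameter, and the definitions of
"word" (a run of letters, digits, or apostrophes) and
"paragraph" (a block separated by blank lines) are all
assumptions made for illustration.

```python
import re
from collections import Counter


def text_stats(text, top_n=3):
    """Return a few basic statistics for a plain-text string.

    Hypothetical helper: the word and paragraph definitions
    below are simplifying assumptions, not a standard.
    """
    # Words: runs of letters, digits, or apostrophes, lowercased.
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    # Paragraphs: blocks separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    average = sum(len(w) for w in words) / len(words) if words else 0.0
    # "Longest" is measured in whitespace-separated tokens.
    longest = max(paragraphs, key=lambda p: len(p.split()), default="")

    return {
        "word_count": len(words),
        "average_word_length": average,
        "most_common_words": Counter(words).most_common(top_n),
        "longest_paragraph": longest,
    }
```

Reading a saved mail from a file and passing its contents to
`text_stats` would give the "step 1" numbers for that mail;
the result could then be printed or fed to a plotting program.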

Those are simple examples. The next step gets more
interesting, as it could show what is statistically unusual:
fun/exotic stats that a human user would probably not spot.

E.g., parsing this mail, it could say "Emanuel Berg is almost
always calm and collected, entirely professional in his
approach, but here in the 4th paragraph of his mail he gets
VISIBLY UPSET using CAPS ONLY, possibly expressing FRUSTRATION
about NOT BEING UNDERSTOOD."
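
The CAPS part of that example, at least, is easy to quantify.
The following is only a sketch under stated assumptions: the
names `caps_ratio` and `shouting_paragraphs` and the 0.2
threshold are made up here, and single-letter words such as
"I" are ignored so they do not count as shouting.

```python
import re


def caps_ratio(paragraph):
    """Fraction of words (2+ letters) written entirely in capitals."""
    words = re.findall(r"[A-Za-z]{2,}", paragraph)
    if not words:
        return 0.0
    return sum(w.isupper() for w in words) / len(words)


def shouting_paragraphs(text, threshold=0.2):
    """Return (1-based paragraph number, ratio) above the threshold.

    The threshold is an arbitrary illustrative value.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    flagged = []
    for number, paragraph in enumerate(paragraphs, start=1):
        ratio = caps_ratio(paragraph)
        if ratio > threshold:
            flagged.append((number, ratio))
    return flagged
```

This says nothing about mood; it only reports where the share
of all-caps words is high. Acronyms would be counted too, so a
real tool would need a smarter rule.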

> the sports example you put in the other mail suggests that
> you want statistics gathered about a subject from written
> text

In the sports world they input the stats manually and that
data is then crunched by computers to produce lists and neat
graphics for their broadcasts. This is the first step
described above. This isn't unlike, for example, Emacs
`count-words-region' in combination with gnuplot. Indeed, it
is almost exactly the same: these chars I type now are
produced manually, then Emacs could count and gnuplot
could show.

This first step would be neat depending on how much stuff is
quantified, the more the better obviously.

The second step however, that would be those "fun facts" the
commentators say, these are more advanced, like, and now
I just make something up, "Here is an amazing figure.
Player X has the worst stats on face-offs in his team, except
when the team plays on its home field and is down by two or
more goals, then he is 2nd best".

Having that second step for text would of course be even
more exciting.

I don't know if those crazy stats are discovered by a bunch of
fanatic hockey nerds just using the "step 1 stats" in creative
combinations - maybe using some sort of relational algebra
approach? - _or_ if they have some stats engine that crunches
the stats further to the meta-stats level, if you will, to
have the weird facts pop up automatically?

But yeah, if we don't even have a proper "step 1 stats" tool
for text - which is impossible to believe BTW - well,
obviously one can only dream of a "step 2 stats", a stats tool
on the meta level ...

> involves "understanding texts written in human languages",
> another big can of worms (which has become somewhat
> fashionable as of late).

It is not about understanding, it is about finding patterns
and meta-patterns, finding statistics that are themselves
statistically uncommon, which is why they are interesting.
Think exceptions and unexpected interrelations. Again, the
best example is probably a combination of the different stats
available at the "step 1 stats" level.
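
One simple way to look for values that are "statistically
uncommon" is to compute a step 1 number per paragraph (word
count, average word length, share of CAPS words, ...) and
flag the ones far from the mean. The sketch below uses a
plain z-score for that; the function name `unusual` and the
cutoff of 2.0 are assumptions, and this is just one possible
definition of "unusual", not how any particular tool does it.

```python
from statistics import mean, pstdev


def unusual(values, z_threshold=2.0):
    """Return (index, value, z_score) for values far from the mean.

    Uses the population standard deviation; the threshold is an
    illustrative choice, not a recommended constant.
    """
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: nothing stands out.
        return []
    flagged = []
    for index, value in enumerate(values):
        z = (value - mu) / sigma
        if abs(z) >= z_threshold:
            flagged.append((index, value, z))
    return flagged
```

Given a list of per-paragraph word counts, for instance, this
would point at the one paragraph that is much longer than the
rest. Combining several such per-paragraph numbers is where
the "unexpected interrelations" could start to show up.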

> If it's text statistics, good statistics packages have lots
> of resources. R is a good statistics package

Yeah, maybe I should ask them, but as Debian is such a huge
system one would think someone here could show us how it, or
similar software, can be used on a bunch of text, for
example on a mail like this.

It is already a bunch of data, surely you are not saying there
isn't a tool to tell us something of that data?

-- 
underground experts united
https://dataswamp.org/~incal
