weaselman
February 23rd, 2011 at 12:56:05 PM
Quote: P90

More, but not 50,000 times more, if you are at least trying to limit the storage space used.


No, not 50,000 times. About 1000 times or so.


Quote:

If you only keep a reasonable amount of plaintext data, say, 100-200 megabit* - that's around what it takes for Jeopardy,


What are you basing your assertions on?

Quote:

*I'm using megabits, because storing text for computer search in 8-bit format should border on criminal negligence. There are different ways, it can be 5-bit encoding, it can be dictionary encoding, etc.


Oh. I feel more and more like talking to a betting system inventor. Do you think that all the people from the beginning of Computer Science to the modern day were stupid? Did something this obvious never occur to them? Or is it more likely that you are missing something?
First of all, 5 bits isn't enough. You can store (English) letters, but not (all of the) digits, and not punctuation. Perhaps you could fit it into 6 bits, but do you really want to? You can't read or write less than a bit from memory, you can't put it into a register, you can't move it between registers, you can't perform operations on it, can't put it on the stack or on the bus, can't address it. That means that to read every character, you would have to perform multiple bitwise operations, move the same stuff to and from memory multiple times, mask and unmask it. The processing cost of all this bit equilibristics would be truly enormous. On a larger scale, there are alignment problems (memory addresses need to be aligned at double-word boundaries, pages on disk are always a whole number of kilobytes, etc.), which will create tons of overhead biting into your meager savings of two bits per character. In short, you don't want to do that.
If you are talking about disk storage, it's all gzipped anyway, so it doesn't matter (gzip does a better job on plain text than you can ever do with your 6-bit packing), but disk storage has a lot of latency. You don't want that either for a real-time application.
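
To make that concrete, here is a rough, self-contained sketch (Python; the 6-bit alphabet and the sample text are invented for illustration - this is obviously not Watson's code or anybody's production encoder) comparing naive 6-bit packing against plain DEFLATE, the algorithm behind gzip:

import zlib

# Invented 6-bit alphabet: lower-case letters, digits and a little punctuation (48 symbols < 64).
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,'\"?!-:;()"

def pack_6bit(text):
    """Pack each character into 6 bits; anything unknown maps to code 63."""
    bits, nbits, out = 0, 0, bytearray()
    for ch in text.lower():
        code = ALPHABET.find(ch)
        if code < 0:
            code = 63
        bits = (bits << 6) | code
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
            bits &= (1 << nbits) - 1
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)

sample = ("this is an example of ordinary english prose, repeated only to give "
          "the compressor a reasonable amount of input to chew on. " * 200)

packed = pack_6bit(sample)                    # always exactly 6/8 of the original size
deflated = zlib.compress(sample.encode(), 9)  # repetition flatters it here, but genuine
                                              # English text also lands well under 6 bits/char
print(len(sample), len(packed), len(deflated))

The packed version is always exactly 25% smaller; DEFLATE on real prose is typically 60-70% smaller, and it comes for free with decades of optimization behind it.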


Quote:

There can be. In the two Watson episodes, most questions were quite straightforward, though.


Not really, they weren't. The "US cities" category was a rather odd exception.


Quote:

Then let me, just as well, reiterate what was posted below in the previous post: you only need to hit the center of the curve.


The question, though, is how do you find the center of the curve, and how far from the center you want your algorithm to remain functional. You are assuming the center is "US cities", and that the area you need to cover is quite narrow. I see no basis for this assumption whatsoever. But even if we could agree on that, you still fail to suggest a method of selecting the data according to your criterion. I asked you about the categories, you named two. You keep saying "look at the past episodes" ... I gave you a few examples, you shrugged them off. I have to ask again: what are you basing your assertions on?


Quote:

On questions you get right, you win by having a quicker trigger finger.


Only if you get them right fast enough.

Quote:

You can even afford a lower accuracy than humans, because when you do hit, you win, while humans depend on their button pressing skills.


If accuracy is bad, you lose when you "hit" wrong.


Quote:

Watson didn't just win the game, it aced it.
Obliterated, walked over, Terminator going through a kindergarten. It was like shooting two unarmed humans with a 14.5x114mm machinegun.
There is a term for that: overkill.


Obviously, Watson's purpose in life isn't winning Jeopardy. Would you settle for a question about your health being answered with 90% accuracy? It's not overkill, you are just looking at it wrong.


Quote:

And, what's also to be noted, the times Watson won, it won with a large margin of confidence (as could be seen in that episode where they showed its three answers). The times it didn't win, it lost with an even larger margin. This indicates that increasing its processing power and storage capacity, say, 1,000-fold would have negligible if any at all effect on the results, programming being the same.


Not at all. I mean, I don't know what effect it would have, but what you said definitely does not mean it would be negligible. The low level of confidence simply means that it did not know the answer for whatever reason - maybe, it is missing from the database, or maybe it failed to analyze the question properly, or maybe certain tags in the cloud were missing ... Some of these could be solved by adding data, some by a change to the program, but most likely both are required. Even if the problem is in the software, it is unlikely to be what you'd call "a bug" that just needs to be fixed. More probably, it was a conscious decision of the programmer to "cut some corners", a compromise due to limitations on the resources that could be used. Kinda what you are talking about. You can't index everything on everything, which means that you have to target the majority of possibilities with the most effectiveness, inevitably leaving something out. If more resources were available, fewer compromises would have to be made.
Much like with chess computers - they did not play very well when computers were small and slow, not because the people who programmed them were stupid, but because there were not enough resources to do better.



Quote:


Can't resist but call your attention to AV-8, Yak-38 and F-35B.


I knew you would. :) But don't tell me you really did not get the analogy

Quote:

Aside from a hundred thousand or so other persons who have it, of course.


Where are they? I have been studying and working in the field for decades, and never met one, never even heard of one.
Can you possibly introduce me to one of these brilliant people? Are they as secretive as you are about their ways, or do some of them actually offer real solutions to real problems?

Quote:

Human-level AI.


Well, that doesn't have much to do with neural networks, beyond the name :)


Quote:

Practical, not. Possible to implement with less computer resources, yes.


No, not possible in practice.


Quote:

As I did. And outlined the general principles.


No, you did not. You just picked a couple of sample questions and showed that those particular questions could be answered. That is not "principles", and not at all "general". In fact, generality is exactly what your description lacks.
Can you describe your method without using examples? That's what "general principles" would amount to.


Quote:

I'm not going to make the leap to actually dedicating the remainder of my life to attempting to implement it.


I know, you aren't. But the question is, again - what is it then that makes you think it is possible? What are you basing your assertions on?

Quote:

I could give example solutions to a simplified subset of the problem,


A simplified subset of the problem was never in dispute. Of course, you can solve simple problems on a simple computer, you don't need Watson for that.

Quote:

but if you want a complete solution without any simplifications


I want that, or, at least, the "general principles", the methodology and the algorithms you insist exist.

Quote:

, no luck here - Watson's software wasn't created on a lunch break either.


My point exactly



Quote:

Of course there was. For a short while in the 1970s, it was the winning approach, you brute-searched as deep as you had the time to.


You are mistaken. Perhaps, some papers out there call what was done "brute-force approach", but, if they do, they are not using that term in the same sense you are. Searching the whole game tree was never the approach.
Back in the 70s you could not solve a 4x4 tic-tac-toe with that approach.



Quote:

No. No and no and no.


I don't know what this is in regards to, but, something tells me, it's not worth going back and finding out :) I can just respond with Yes. Yes and yes and yes.



Quote:

Virtual reality? Eternal life? Cubic time? Faster than light travel? Vulcan death grip? Nuclear pumped X-ray lasers? Unlimited penis length? Parallel universes? Operating thetans in garlic cheese sauce?


No, I don't think any of these would be possible with your smaller and cheaper Watsons. Some of these are impossible, others just have nothing to do with it. Perhaps you are trying to be sarcastic, but I fail to see what it is you are trying to convey with your sarcasm.



Quote:

Yes, certainly. Watson's uptime is worth, I think, at least a few grand an hour.


You are wrong. Watson has about 3000 cores, google has millions. Watson has 300T of memory, google uses Petabytes. I am not sure what their operating expenses are, but they are certainly way larger than Watson's.


Quote:

Since google permits me to use its services for the lousy cost of a couple text ad displays,


"Lousy cost"??? Do you have any idea at all what they make on those text ads?

Quote:

I suspect its costs per query are much lower than Watson's, even considering that Watson does its queries in sub-second times as well.


It probably does cost less per query, because google handles more queries per time unit. So (hopefully) will Watson some day.
"When two people always agree one of them is unnecessary"
P90
February 23rd, 2011 at 8:37:15 PM
Quote: weaselman

Oh. I feel more and more like talking to a betting system inventor. Do you think that all the people from the beginning of Computer Science to the modern day were stupid? Did something this obvious never occur to them?


It did - that's why 6-bit format has been used for a long while.
Only later did it finally get superseded by 7-bit, and ultimately by 8-bit. And now 16-bit Unicode. Space is growing cheaper.

Quote:

Perhaps you could fit it into 6 bits, but do you really want to? You can't read or write less than a bit from memory, you can't put it into a register, you can't move it between registers, you can't perform operations on it, can't put it on the stack or on the bus, can't address it.


What does the problem of fractional bits have to do with anything here? 6 bits does not involve doing anything with "less than a bit".

5 bits (using a control prefix character for numbers, because they would be stored as binary, not decimal anyway) might indeed be relatively extreme. As to compression, it takes extra CPU cycles, and it's RAM space you want to save, not HDD. But 6 is as much as you need. Certain human constructs such as capital letters are not only useless for the purpose - they are counterproductive, because your search has to be case-insensitive anyway.

If you meant fractional bytes, that's nothing to worry about, because the purpose-designed machine (remember, ownership of all unaligned nations at stake - think of all the oil) would be optimized and designed from scratch for your byte size. It can be 36-bit, 48-bit, 40-bit, any size you like.


Quote:

The question, though, is how do you find the center of the curve, and how far from the center you want your algorithm to remain functional. You are assuming the center is "US cities", and that the area you need to cover is quite narrow. I see no basis for this assumption whatsoever.


Well, some other categories I recall were Decades, Theater, Celebrity Subtitles. Now, the last one does start to complicate things, indeed. But then, "Celebrity Subtitles" and, for instance, "Famous Words" would really be the same category.
A separate set of categories is those that include hints to the word's first letters in the category name. It may look complex, but it is really quite easy to recognize and to use the same principle (search only with these letters).

Quote:

If accuracy is bad, you lose when you "hit" wrong.


As did Watson.
The approach of using customized algorithms would most likely deliver a very good hit percentage when it buzzes, just a lower percentage of cases where it buzzes.


Quote:

Obviously, Watson's purpose in life isn't winning Jeopardy. Would you settle for a question about your health being answered with 90% accuracy? It's not overkill, you are just looking at it wrong.


About health? Well, I don't know, do human doctors even manage 90% now? But that's not the issue at hand. The issue at hand is winning a wacky game show.

Watson's purpose in life is not winning Jeopardy. But let's get back to where we started. It was Watson winning Jeopardy that some people took as a sign that strong AI is finally almost here. To which my response was that winning a game show wasn't some big step for humanity, that the task solved wasn't much different from what a web search engine does, and finally that winning the game show with a computer could probably have been done a long time ago if it had been a task of life-changing importance.


And back to reality, where it's not, the reason Watson has so many resources (has, not necessarily uses) is that it isn't even a specialized Jeopardy-playing computer. It's a test platform for developing a set of general-purpose algorithms, algorithms that are in their very infancy now. Its database isn't custom-tailored to the game, and neither is its programming; the game was just, well, a game for it. Some ten years down the line, when these algorithms are developed further, chances are good that it will take just one of the current Watson's nodes, or perhaps a desktop computer, running these improved algorithms, to win Jeopardy. Without 100TB of data.



Quote: weaselman

Not at all. I mean, I don't know what effect it would have, but what you said definitely does not mean it would be negligible. The low level of confidence simply means that it did not know the answer for whatever reason - maybe, it is missing from the database, or maybe it failed to analyze the question properly, or maybe certain tags in the cloud were missing ...


...Or maybe the method used just wasn't as perfect as it seems. You could have a million times more resources, and Watson would still get these questions wrong. When it was off, it was way off.

It's not all about resources. How well would a tag cloud work for solving mathematical equations? "Ein Volk, ein Reich, eine Schlagwortwolke" ("one people, one empire, one tag cloud") - it doesn't work for everything.

Quote:

Where are they? I have been studying and working in the field for decades, and never met one, never even heard of one.


You didn't bother to then, maybe? For instance, when you search for a particular user's posts on a forum, most software doesn't run a full-text search for his name through all the posts - it only searches the post author field for his binary ID.

Quote:

Can you describe your method without using examples? That's what "general principles" would amount to.


But I have. If you insist that I deliver it in working form, I can't. Generally, the principle is to parse the question and reduce it to a standardized query that is convenient to work with. Find a record with the prescribed set of attributes. I only used examples to illustrate.
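
To illustrate the shape of it - and only the shape; the record format, the toy knowledge base and the scoring below are invented for this post, it is not Watson's pipeline or a working system (Python):

import re

# Invented toy knowledge base: a handful of records with typed attributes.
RECORDS = [
    {"name": "Chicago", "type": "us_city",
     "facts": "largest airport named for a world war ii hero o'hare midway battle"},
    {"name": "Toronto", "type": "city",
     "facts": "largest canadian city pearson airport named for a prime minister"},
]

def to_query(category, clue):
    """Reduce (category, clue) to a standardized query: constraints plus keywords."""
    constraints = {}
    if "us cities" in category.lower() or "u.s. cities" in category.lower():
        constraints["type"] = "us_city"
    return {"constraints": constraints,
            "keywords": set(re.findall(r"[a-z']+", clue.lower()))}

def answer(query):
    """Return the record matching the constraints with the best keyword overlap."""
    def ok(rec):
        return all(rec.get(k) == v for k, v in query["constraints"].items())
    scored = [(len(query["keywords"] & set(r["facts"].split())), r["name"])
              for r in RECORDS if ok(r)]
    return max(scored)[1] if scored else None

print(answer(to_query("US Cities",
                      "Its largest airport is named for a World War II hero")))   # Chicago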

Quote:

I know, you aren't. But the question is, again - what is it then that makes you think it is possible? What are you basing your assertions on?


The existence of computer software that can translate from one human language to another. The resemblance of some programming languages to some human languages. The existence of grammar, which is a formal set of rules for encoding ideas into phrases and decoding phrases into ideas.


Quote: weaselman

You are wrong. Watson has about 3000 cores, google has millions. Watson has 300T of memory, google uses Petabytes. I am not sure what their operating expenses are, but they are certainly way larger than Watson's.
"Lousy cost"??? Do you have any idea at all what they make on those text ads?


The cost of internet ads is usually about 0.3 cents per view. On search, maybe up to a cent.
Google has to handle I'm not even sure how many queries per second; it's not just searches these days. Watson was handling just one. So yes, the cost of running my google search query is much lower than running a question on Watson.




Quote: weaselman

Much like with chess computers - they did not play very well when computers were small and slow, not because the people who programmed them were stupid, but because there were not enough resources to do better.


And because their programming was poor by modern chess engine standards.

It was not poor because people who programmed them were stupid, but because they were facing a challenging task and it took decades of algorithm development to get where we are now. But it was weak and made poor use of computer resources, compared to modern chess engines. Even given all the hardware of Deep Blue, with that period's programming it would still lose to humans.


Quote: weaselman

You are mistaken. Perhaps, some papers out there call what was done "brute-force approach", but, if they do, they are not using that term in the same sense you are. Searching the whole game tree was never the approach.


The approach was searching as much of the game tree as you could.

The first variant is explicitly called "brute force" in regards to chess. That engine would analyze every possible move and its consequences to a fixed depth, then calculate its total piece point count and subtract opponent's piece point count. The move reaching the highest point count would be followed. It's no less brute-force than cracking encryption by trying all possible keys starting with 0.

The second approach is a smart engine. That engine works more like a human player - it analyzes possible moves and decides which are worth pursuing, by evaluating the position and not just point count. It goes through branches selecting better moves and how deep should each branch be pursued. All modern chess engines are of this type.

But early engines of this type were so bad that they lost to the brute force algorithm run on the same computer. Because their programming was not good enough, not because they lacked resources.
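
To make option 1 concrete, its barest textbook form is something like this (Python; legal_moves, apply_move and material are placeholder hooks standing in for a real move generator and evaluator that I am not writing here - a sketch of the idea, not any historical engine):

# Sketch of a fixed-depth, full-width ("brute force") game-tree search.
# legal_moves(pos), apply_move(pos, move) and material(pos) are hypothetical hooks
# that a real engine would supply; the search itself is the point of the example.

def negamax(pos, depth, legal_moves, apply_move, material):
    """Examine EVERY move to a fixed depth and return (score, best_move)."""
    if depth == 0:
        return material(pos), None           # piece points from the side to move's view
    best_score, best_move = float("-inf"), None
    for move in legal_moves(pos):            # no pruning, no selectivity
        score, _ = negamax(apply_move(pos, move), depth - 1,
                           legal_moves, apply_move, material)
        score = -score                       # what is good for the opponent is bad for us
        if score > best_score:
            best_score, best_move = score, move
    return best_score, best_move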


Quote: weaselman

I don't know what this is in regards to, but, something tells me, it's not worth going back and finding out :) I can just respond with Yes. Yes and yes and yes.


The exchange went as this:

Quote: weaselman

"This by no means a different approach than what it was twenty years ago. The approach is the same, but more resources are available to throw at it. At the beginning, the computers were slow, and the amount of storage was small, so they could only consider few variants in reasonable time, and also lacked the knowledge base to do the trimming intelligently. Now, they are many orders of magnitude faster, so there are many more branches of the tree they can consider, and they also have massive databases, containing every game ever played, that allows them to do trimming a lot more efficiently."

Quote: P90

" No. No and no and no.
Chess engines have evolved massively, particularly in terms of decreasing computing requirements for given performance. Rybka is to original Computer Chess what the 0.953L engine in Kawasaki Z1000 is to the 0.953L engine in Benz Patent Motorwagen. "



Houdini or Rybka could make mincemeat even of Deep Blue, while running on a 2006 CPU, with a fraction of the resources Deep Blue used. That's for a bit over a decade of difference. Go back to the 1970s, give them a modern machine with twelve-core Opterons, and their chess engines would be squashed by human grandmasters. Resources are a factor, but chess software has improved as impressively as hardware.

Modern engines don't consider a lot more branches or store enormous databases of every game ever played. In fact, Houdini takes just a megabyte. That's it, no enormous databases with 50,000 indexes. Just good programming, perfected over decades. They consider fewer branches, not more.
Will we come to the point of beating Kasparov with an 80386? No. But to the point of beating Kasparov on a 2003 CPU while fitting in 256 or 64 kilobytes? Yes, most likely so.

Well, unless your approach of just throwing more resources at everything indeed prevails even there, and yet another part of the human body goes the way of the coccyx.
Resist ANFO Boston PRISM Stormfront IRA Freedom CIA Obama
weaselman
February 24th, 2011 at 5:52:24 AM
Quote: P90

It did - that's why 6-bit format has been used for a long while.
Only later did it finally get superseded by 7-bit, and ultimately by 8-bit. And now 16-bit Unicode. Space is growing cheaper.


Are you talking about BCD? Lol. That was used for about a couple of years (not exactly "a long while"), by IBM only, in the early sixties, and was then replaced by 8-bit EBCDIC - not because memory suddenly became cheaper, but because it did not have enough characters. More importantly, characters encoded as 6 bits still used a whole byte, much like the current ASCII codes, which are actually 7 bits long (not 8, as you incorrectly state), still do.
For all practical purposes, when it comes to hardware, a byte is the "atom" of memory: you can't have a thing in memory less than a byte in size. And you never could, even when they used BCD.


Quote:

What does the problem of fractional bits have to do with anything here? 6 bits does not involve doing anything with "less than a bit".


Oh, I meant "less than a byte" of course. Sorry.


Quote:

5 bits (using a control prefix character for numbers, because they would be stored as binary, not decimal anyway)


Ah, now you are inventing Unicode? Get ready to increase your processing times ten-fold if you go this route.
Storing numbers you see in text in binary, without doing any other structural analysis, is a waste of time. And space. Take "80386" in your post. You need 3 bytes to store it in binary; add your signal prefixes, and it becomes six - the same as in text, except much harder (as in: takes longer) to read and deal with. Assuming your 5-bit packing, you have to read the first 8 bits, throw away the last three you read, perform a comparison with the control sequence (which you now have to do before reading EVERY character, by the way), throw it away, read in the first 16 bits, throw away the first eight (I am assuming the control sequence is a whole byte - or do we also have to realign the whole thing?), store the remaining byte, repeat the whole sequence again, merge the previously stored byte with the new one, repeat six times, read the next eight bits, compare them with the control sequence, discover that it is a regular character, throw it away, take the 24 bits that you have accumulated and move them into a double-word register (the CPU can't interpret pieces of data that are sized in an uneven number of bytes), then swap every byte around with one bitwise shift and two logical operations, involving an additional register (this is because the CPU needs the least significant byte at the lowest address, and the way you got it so far is backwards - you could not get it in the right order without knowing in advance how many bytes were going to be read).

This doesn't just look long because I described it in so much detail. It also takes a long time when performed by the computer (not "hours" long, "milliseconds" long, but compared to regular character processing it's ages).
Note, BTW, that all that work is really wasted, because "80386" isn't really a number in this context to begin with.
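
To put a rough number on the "bit equilibristics", here is a small sketch (Python, with an invented 5-bit alphabet, and it skips your control-prefix-for-numbers business entirely - nobody's real format) that unpacks a 5-bit stream and counts the shift/mask operations it needs, against one read per character for plain bytes:

# Sketch: the cost of reading 5-bit packed text vs. plain bytes.
ALPHABET5 = " abcdefghijklmnopqrstuvwxyz.,'?-"      # 32 symbols, 5 bits each

def pack5(text):
    bits, nbits, out = 0, 0, bytearray()
    for ch in text.lower():
        bits = (bits << 5) | ALPHABET5.index(ch)
        nbits += 5
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
            bits &= (1 << nbits) - 1
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)

def unpack5(packed, nchars):
    """Decode nchars symbols, counting every shift/or/mask along the way."""
    ops, bits, nbits, chars = 0, 0, 0, []
    it = iter(packed)
    for _ in range(nchars):
        while nbits < 5:
            bits = (bits << 8) | next(it); nbits += 8; ops += 2   # shift + or
        nbits -= 5
        code = (bits >> nbits) & 0x1F; ops += 2                   # shift + mask
        bits &= (1 << nbits) - 1; ops += 2                        # shift + mask
        chars.append(ALPHABET5[code]); ops += 1                   # table lookup
    return "".join(chars), ops

text = "the quick brown fox jumps over the lazy dog"
decoded, ops = unpack5(pack5(text), len(text))
assert decoded == text
print(len(text), "plain byte reads vs", ops, "bit operations for 5-bit packing")

Every one of those counted operations is at least one instruction on a real CPU, paid for every character you ever touch; with plain bytes the read itself is the whole cost.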

Quote:

As to compression, it takes extra CPU cycles, and it's RAM space you want to save, not HDD.


The point is, that compression takes fewer cycles than your "naive packing" approach.
You can also compress data in memory. But it will make you slower (not as much slower as your suggestion, but still significantly slower).

Quote:

Certain human constructs such as capital letters are not only useless for the purpose - they are counterproductive, because your search has to be case-insensitive anyway.



They are not useless at all. They are semantic constructs, vital for understanding the text. Knowing that a word is a proper noun provides important information not only about the word itself, but also about the words that surround it. It helps not only semantic analysis, but also the search itself (capitalized keywords are weighted higher because they are more important).
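
A toy illustration of that last point (Python; the weights and the scoring rule are made up - this is just the shape of the idea, not how Watson or any real engine weights terms):

def keyword_weights(clue):
    """Weight words that look like proper nouns (capitalized mid-sentence) above ordinary words."""
    weights = {}
    for i, word in enumerate(clue.split()):
        token = word.strip(".,'\"?!").lower()
        if not token:
            continue
        proper = i > 0 and word[:1].isupper()       # crude proper-noun signal
        weights[token] = max(weights.get(token, 0.0), 3.0 if proper else 1.0)
    return weights

print(keyword_weights("He painted the Mona Lisa in Florence"))
# {'he': 1.0, 'painted': 1.0, 'the': 1.0, 'mona': 3.0, 'lisa': 3.0, 'in': 1.0, 'florence': 3.0}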

Quote:

If you meant fractional bytes, that's nothing to worry about, because the purpose-designed machine (remember, ownership of all unaligned nations at stake - think of all the oil) would be optimized and designed from scratch for your byte size. It can be 36-bit, 48-bit, 40-bit, any size you like.



I thought we were talking about running it on a low-end, consumer-targeted, 32-bit (you actually said even 16-bit would suffice) 386, not a "purpose-designed machine".


Quote:

Well, some other categories I recall were Decades, Theater, Celebrity Subtitles.



Yes, and don't forget:
Crossword words "DE"
"Hard"
Ancient Worlds
Hollywood Legends
Odd Jobs
"CAT"egory
Silly Songs
Kids songs in other words
Presidential Firsts
"ANDY"
Legal "E"s
Starts with "B"
Human Vision


Quote:


Now, the last one does start to complicate things, indeed. But then, "Celebrity Subtitles" and, for instance, "Famous Words" would really be the same category.



I think we are about to end up at the "something else" category again pretty soon :)
How is it the same category? The word "Sputnik" is pretty famous, but it is not anyone's subtitle.
By the same logic, "US cities" is the same category as just "cities", so Watson's answering "Toronto" should not be surprising to you at all.

Quote:

A separate set of categories is those that include hints to the word's first letters in the category name. It may look complex, but it is really quite easy to recognize and to use the same principle (search only with these letters).



Search where?
Here are a couple of sample questions for you from the "Starts with 'B'" category:
"Quality of a Scot's brogue which makes the "R" r-r-r-oll"
"Nickname of Gil Gerard's futuristic Capt. William Rogers"
"Australian bay named for the many plants growing on its shores"


Quote:

As did Watson.



Yes, it did. I thought you were talking about Watson being overkill, and wanting it to give wrong answers more frequently, because "when you hit, you win".

Quote:

The approach of using customized algorithms would most likely deliver a very good hit percentage when it buzzes, just a lower percentage of cases where it buzzes.



Quote:

Oh, most of the time - it's better than the, what, 70% accuracy these questions are normally answered with?



What are you basing your assertions on?

Quote:

Watson's purpose in life is not winning Jeopardy. But let's get back to where we started. It was Watson winning Jeopardy that some people took as a sign that strong AI is finally almost here.


I don't know what exactly you mean by "strong AI". Watson is indeed pretty strong, and its winning Jeopardy is a pretty good sign of it, yes.
For example, if you take an IQ test, and get a good result, that's an indication that you have a high IQ, but taking tests by itself is useless to you just as playing Jeopardy is useless to Watson.


Quote:

...Or maybe the method used just wasn't as perfect as it seems. You could have a million times more resources, and Watson would still get these questions wrong. When it was off, it was way off.


What are you basing your assertions on?
"Way off" is actually much, much better in these matters than "almost got it". It is usually an indication that there is an easy fix.


Quote:

It's not all about resources. How well would a tag cloud work for solving mathematical equations?


It would not. Who said it would?
Would your mysterious approach work for that too?

Quote:


You didn't bother to then, maybe? For instance, when you search for a particular user's posts on a forum, most software doesn't run a full-text search for his name through all the posts - it only searches the post author field for his binary ID.


Yes. So what? It goes directly opposite to your point: indexing the forum posts by author doesn't make the engine database smaller, it makes it require more space.


Quote:

But I have. If you insist that I deliver it in working form, I can't. Generally, the principle is to parse the question and reduce it to a standardized query that is convenient to work with. Find a record with the prescribed set of attributes. I only used examples to illustrate.


How? How to do that? What is the method? What is the "standardized query"? Which query is "convenient to work with"?
What method to use to "find the entry with prescribed set of attributes"?
No, you haven't. You ONLY used examples to illustrate, yes. But you did not do anything else.
You see, the problem is that the "general principles" you have described so far are exactly what Watson (and the rest of the world) uses - it parses the question, generates a "standardized query that is convenient to work with", and finds entries "with the prescribed set of attributes". That is exactly what it does.
What you still have not described (beyond a couple of examples) is anything that would differentiate your approach from Watson's (and the rest of the world's), and let it use (tremendously) less resources, as you claim it should be able to.



Quote:

The existence of computer software that can translate from one human language to another. The resemblance of some programming languages to some human languages. The existence of grammar, which is a formal set of rules for encoding ideas into phrases and decoding phrases into ideas.



Can you elaborate? All of the things you mention use the standard approach, nothing revolutionary like what you are talking about. How does their existence affirm anything of what you are saying?

Quote:


The cost of internet ads is usually about 0.3 cents per view. On search, maybe up to a cent.



No, you are way off. The ads that pay per view are actually much cheaper (they are usually called CPM - for "cents per million" - and cost 3-5 cents per million impressions), but google does not have them (not on the search pages). The ones google has are called CPC - "cost per click" - and pay for the user's clicks, not views. They are usually 15-20 cents per click for a site that provides qualified enough traffic. But that's the price google sells them to you at. What they make on a click is usually about a dollar or two, and up to six or seven or even ten for highly competitive keywords. This adds up to enormous revenues when multiplied by google's volumes of traffic. And this is only the ads on the right-hand side of the page. The promoted results on top are even more expensive, and so are "smart ads" (try searching for "hotels in Las Vegas" - you'll get a little input form where you can enter dates and go straight to a travel site of your choice), and "Google Places" results (search for restaurants in Vegas or something like that - the top results, the ones with review links on them, are real money makers; the places being promoted pay insane money to google to get on that list).

Quote:

Google has to handle I'm not even sure how many queries per second; it's not just searches these days.

Watson was handling just one. So yes, the cost of running my google search query is much lower than running a question on Watson.



Yes, I already said that. The cost per query is lower, but it uses (much) more resources.




Quote:


And because their programming was poor by modern chess engine standards.
It was not poor because people who programmed them were stupid, but because they were facing a challenging task and it took decades of algorithm development to get where we are now. But it was weak and made poor use of computer resources, compared to modern chess engines. Even given all the hardware of Deep Blue, with that period's programming it would still lose to humans.


So, do we agree on this finally? I said the same thing in the previous post - availability of resources and quality of algorithms are interconnected: if you don't have enough resources, you have to make compromises and cut corners, making your program "poor". Without enough resources, even a brilliant programmer would not be able to make Watson. That was my point from the beginning - you can't make it work on a 386, you just can't.
You sound like we are finally in agreement on that. Or do you only agree that chess computers need resources, but everything else can just run on cell phones?


Quote:


Quote: weaselman

You are mistaken. Perhaps, some papers out there call what was done "brute-force approach", but, if they do, they are not using that term in the same sense you are. Searching the whole game tree was never the approach.


The approach was searching as much of the game tree as you could.



Yes, and it still is.
The key is in selecting the most promising subtrees, whereas "brute force" would mean just searching everything sequentially and stopping when the allotted time has elapsed.
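
In its simplest textbook form, that selection is alpha-beta pruning with move ordering, roughly like this (Python; legal_moves, apply_move, evaluate and guess_score are placeholder hooks, and this is a generic illustration, not what any particular engine does):

def alphabeta(pos, depth, alpha, beta, legal_moves, apply_move, evaluate, guess_score):
    """Depth-limited alpha-beta with move ordering: promising branches first, hopeless ones cut."""
    if depth == 0:
        return evaluate(pos)                 # from the side to move's point of view
    moves = sorted(legal_moves(pos), key=guess_score, reverse=True)
    if not moves:
        return evaluate(pos)
    for move in moves:
        score = -alphabeta(apply_move(pos, move), depth - 1, -beta, -alpha,
                           legal_moves, apply_move, evaluate, guess_score)
        if score >= beta:
            return beta                      # opponent will never allow this line: prune it
        alpha = max(alpha, score)
    return alpha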
"When two people always agree one of them is unnecessary"
P90
February 24th, 2011 at 7:57:52 AM
Quote: weaselman

Are you talking about BCD?

BCD, and Sixbit, and Fieldata sixbit ASCII.

Quote: weaselman

For all practical purposes, when it comes to hardware, a byte is the "atom" of memory: you can't have a thing in memory less than a byte in size.


And bytes were not always necessarily 8 bits long. Everyone has just agreed on the 8-bit byte because it was a good compromise. It could have been 16, it could have been 6.

Were computers invented for the purpose of playing Jeopardy, 6-bit byte could have become the standard just as well.


Quote: weaselman

Assuming your 5-bit packing, you have to read the first 8 bits, throw away the last three you read...
(the CPU can't interpret pieces of data that are sized in an uneven number of bytes).


Actually first 5 bits, as the byte would be 5 bits in that example. But forget it, it was going overboard with space greed. You still overcomplicated it based on the assumption of an 8-bit byte, however.


Quote: weaselman

I thought we were talking about running it on a low-end, consumer-targeted, 32-bit (you actually said even 16-bit would suffice) 386, not a "purpose-designed machine".


I had to revise that estimate up since watching another couple episodes of the show. The episode with Watson had a lot of very computer-friendly questions. Others were less computer-friendly.

As to purpose-designed machines, that category would apply in the intentionally absurd scenario of the fate of all unaligned nations being staked on a game of computer Jeopardy between superpowers. The intentional absurdity was to frame a situation in which human effort dedicated to the task would be disproportionately large, and in which there would be pressure to refine algorithms to solve it as soon as possible, rather than wait until there are enough resources to solve it easily.


Quote: weaselman

Yes, and don't forget:
Crossword words "DE"
"Hard"
"CAT"egory
Legal "E"s
Starts with "B"


It's actually quite a simple set of categories for a computer. Not sure if "cat"egory was along the lines of "cataclysm" or more like "puma", but categories like "starts with B" greatly help the machine. They let it work with a much smaller subset of data.


Quote: weaselman

Search where?
Here are a couple of sample questions for you from the "Starts with 'B'" category:
"Quality of a Scot's brogue which makes the "R" r-r-r-oll"
"Nickname of Gil Gerard's futuristic Capt. William Rogers"
"Australian bay named for the many plants growing on its shores"


In this case, through B* words.
The first one, that's quite hard, not sure even Watson could answer with confidence.
The second one would be quite simple if understood. Checking "William Rogers", "Capt. William Rogers", "Captain William Rogers", "Gil Gerard" entries would give the answer.
The third one, search through B* entries only in Locations.Australia.
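
A minimal sketch of that kind of lookup (Python; the nested index, the entry format and the data are invented for this post - it only shows the "restrict to B* within Locations.Australia" idea, not a real knowledge base):

import bisect

# Hypothetical hierarchical index: topic -> country -> alphabetically sorted (name, facts) entries.
LOCATIONS = {
    "Australia": sorted([
        ("Botany Bay", "bay named for the many plants growing on its shores"),
        ("Ayers Rock", "large sandstone monolith in the northern territory"),
        ("Brisbane", "capital of queensland"),
    ]),
}

def starts_with(entries, letter):
    """Binary-search the sorted list for the block of names starting with `letter`."""
    lo = bisect.bisect_left(entries, (letter.upper(),))
    hi = bisect.bisect_left(entries, (chr(ord(letter.upper()) + 1),))
    return entries[lo:hi]

def lookup(country, letter, clue_words):
    candidates = starts_with(LOCATIONS[country], letter)   # only the B* entries remain
    return max(candidates,
               key=lambda entry: len(clue_words & set(entry[1].split())),
               default=(None, None))[0]

clue = set("australian bay named for the many plants growing on its shores".split())
print(lookup("Australia", "b", clue))   # -> Botany Bay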


Quote: weaselman

I don't know what exactly you mean by "strong AI".


http://en.wikipedia.org/wiki/Strong_AI

Artificial intelligence that can perform the full or nearly full range of tasks human intelligence can.
With Watson, we are still strictly in the category of applied AI.


Quote: weaselman

What are you basing your assertions on?
"Way off" is actually much, much better in these matters than "almost got it". It is usually an indication that there is an easy fix.


I'm basing them on how Watson's displayed three top picks did not adjust as time passed and it wasn't pressing the button.
The fix would be a software one, not throwing more resources at it.


Quote: weaselman

It would not. Who said it would?
Would your mysterious approach work for that too?


Yes, absolutely. It is the approach used for solving equations - recognize the equation and apply a solution algorithm.
As opposed to: brute force approach of going through all possible values of X.
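
A toy version of that contrast (Python; the recognized pattern a*x + b = c and the brute-force search range are invented for illustration):

import re

def solve_by_recognition(equation):
    """Recognize the pattern a*x + b = c and apply the closed-form solution."""
    m = re.fullmatch(r"\s*(-?\d+)\s*\*?\s*x\s*([+-]\s*\d+)\s*=\s*(-?\d+)\s*", equation)
    if not m:
        return None
    a, b, c = int(m.group(1)), int(m.group(2).replace(" ", "")), int(m.group(3))
    return (c - b) / a

def solve_by_brute_force(equation, lo=-10**4, hi=10**4):
    """Try every integer x in a range - no understanding of the equation at all."""
    left, right = equation.split("=")
    for x in range(lo, hi + 1):
        if eval(left, {"x": x}) == eval(right, {"x": x}):
            return x
    return None

print(solve_by_recognition("3*x + 7 = 22"))   # 5.0, one pattern match and one division
print(solve_by_brute_force("3*x + 7 = 22"))   # 5, after roughly ten thousand guesses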


Quote: weaselman

Yes. So what? It goes directly opposite to your point: indexing the forum posts by author doesn't make the engine database smaller, it makes it require more space.


The database *already* contains the author ID field. If it didn't, it could not display the authors of messages.
It also takes less space than signing the messages in plaintext would, because user number generally takes less space than a string for username.

Quote: weaselman

How? How to do that? What is the method? What is the "standardized query"? Which query is "convenient to work with"?
What method to use to "find the entry with prescribed set of attributes"?


Go through entries and pick the one or ones where given fields have given values.

Quote: weaselman

Can you elaborate? All of the things you mention use the standard approach, nothing revolutionary like what you are talking about.


Their standard approach is *not* the approach used by internet search engines or Watson.
The standard approach in machine translation (in its most basic form) is to substitute words with fixed translations.
The standard approach in compilers (most basic ones) is to substitute commands with code sequences.


Quote: weaselman

What you still have not described (beyond a couple of examples) is anything that would differentiate your approach from Watson's (and the rest of the world's)


Approach 1: Treat all words as symbols only, without attempting to process their meaning. Search for them as strings in a text database.
Approach 2: Interpret the meaning of as many words as possible, as well as their relative position. Select subsequent behavior depending on the interpreted meaning.
Is that a clear enough difference?

Approach 1 is used by internet search engines.
Approach 2 is used by compilers.

The immediate actions performed by a machine following approach 1 will be the same [except for symbols searched for] regardless of which words are used.
The immediate actions performed by a machine following approach 2 will depend on which words are used.

I don't see how you can lump the two into the same "standard approach", because they are very distinct, even if often mixed. I hope I made the distinction clear, because it appears as if we have been arguing about entirely different things.
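
For what it's worth, the bare-bones version of the two approaches side by side (Python; the corpus, the handlers and the parsing rules are invented, and real systems blend the two heavily):

CORPUS = {
    "doc1": "chicago's largest airport is named for a world war ii hero",
    "doc2": "toronto's pearson airport is named for a prime minister",
}

def approach_1(query):
    """Words as symbols only: return the documents containing every query word."""
    words = set(query.lower().split())
    return [doc for doc, text in CORPUS.items() if words <= set(text.split())]

def approach_2(query):
    """Interpret the words and dispatch on the interpretation."""
    q = query.lower()
    if q.startswith("how many"):
        return "route to the counting/arithmetic handler"
    if q.startswith("who") or " who " in q:
        return "route to the person-lookup handler"
    return "route to the generic search handler"

print(approach_1("largest airport hero"))                 # ['doc1']
print(approach_2("Who composed the Moonlight Sonata?"))   # person-lookup handler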




Quote: P90

And because their programming was poor by modern chess engine standards.
It was not poor because people who programmed them were stupid, but because they were facing a challenging task and it took decades of algorithm development to get where we are now. But it was weak and made poor use of computer resources, compared to modern chess engines. Even given all the hardware of Deep Blue, with that period's programming it would still lose to humans.


Quote: weaselman

So, do we agree on this finally? I said the same thing in the previous post - availability of resources and quality of algorithms are interconnected: if you don't have enough resources, you have to make compromises and cut corners, making your program "poor".


I'm not sure it's quite the same thing.

Modern chess software delivers better results with less resources than older chess software. Better algorithms, less resources.

The earliest chess software wasn't weak because it lacked computer resources; it was weak because the algorithms used were not well optimized. Resources and algorithms are indeed interconnected - the better your algorithms, the less resources you need for a given task, and vice versa.


Quote: weaselman

Yes, and it still is.
The key is in selecting the most promising subtrees, whereas "brute force" would mean just searching everything sequentially and stopping when the allotted time has elapsed.


The approaches in chess changed twice.

There are two different options:
1. Search EVERY possible branch and sub-branch to a fixed depth, say, 3 plies. That is brute force. Consider every possible move, every possible response, every possible next move, calculate piece points.
2. Select the best branches and sub-branches and search them to varied depths.

At first chess software tried to search the most promising branches. It used poor selection algorithms, due to programming not being good enough (not due to lack of resources) and didn't work well.
Then chess software switched to brute force, searching everything to a fixed depth. It actually worked better, even with the same resources, because programmers' chess skills didn't matter as much.
Finally, as chess expertise was applied to refining these algorithms, the selective approach became dominant again.

The performance has been steadily and greatly improving even for decreasing amounts of resources.
Resist ANFO Boston PRISM Stormfront IRA Freedom CIA Obama
weaselman
February 24th, 2011 at 9:07:44 AM
Quote: P90

BCD and sixbit ASCII.


ASCII is 7 bit. And it takes a whole byte anyway.



Quote:

And bytes are not necessarily 8 bits long.


A byte is by definition 8 bits long.

Quote:

The 8-bit byte has just been standardized on because it was convenient for particular purposes. It could have been 16, it could have been 6.


Yes, it could have. But then, the word "blue" could just as well have been taken to mean "red". It was not.


Quote:

Were computers invented for the purpose of playing Jeopardy, 6-bit byte could have become the standard just as well.


If my grandmother had balls, she'd be my grandfather :)
I'll remind you again, we are talking about implementing Watson on 386, not some hypothetical custom-designed machine (I am not conceding that 6-bit bytes would be better even on custom-designed hardware, BTW - there are lots of things you are overlooking, but I don't want to go on a tangent here; we are talking about standard 8-bit bytes and standard double-word processors).


Quote:

I had to revise that estimate up since watching another couple episodes of the show. The episode with Watson had a lot of very computer-friendly questions. Others were less computer-friendly.


Which estimate? Up to what?


Quote:

As to purpose-designed machines, that category would apply in the ridiculous scenario of the unaligned nations being staked on a game of computer Jeopardy between superpowers.



I am not really all that interested in discussing ridiculous scenarios. Shouldn't we just stick to the original topic instead?


Quote:

It's actually quite a simple set of categories for a computer. Not sure if "cat"egory was along the lines of "cataclysm" or more like "puma"


I am not sure either. It can be anything, and everything. Some questions may be about pumas, and others could be about cataclysms. The point is to show you that Jeopardy's "categories" are not what you think they are - there is no taxonomy of topics of any kind; they are just additional context that could just as well be appended to the actual (uncategorized) clues as text.

Quote:

, but categories like "starts with B" greatly help the machine. They let it work with a much smaller subset of data.


How?

Quote:


Quote: weaselman

Search where?
Here is a couple sample questions for you from the "Starts with "B" category:
"Quality of a Scot's brogue which makes the "R" r-r-r-oll"
"Nickname of Gil Gerard's futuristic Capt. William Rogers"
"Australian bay named for the many plants growing on its shores"


In this case, through B* words.
The first one, that's quite hard, not sure even Watson could answer with confidence.
The second one would be quite simple if understood. Checking "William Rogers", "Capt. William Rogers", "Captain William Rogers", "Gil Gerard" entries would give the answer.


How exactly would you check those entries? Which "category" would you use to access them?

Quote:

The third one, search through B* entries only in Locations.Australia.


Are you suggesting that entries, related to Australia (1), that are Locations (2), should also be indexed by the first letter (3)? Can you think of any more "general" categories you would want to index them on before I give you the next example? How many of those different indexes are you planning on building? How about combinations of attributes? Do you want a separate alphabetical index for things that are not locations in Australia but, say, books written in Sanskrit, or are you planning on using multi-attribute indexes to cover the various combinations? What about books written in Sanskrit that are located in Australia? Or locations in South America that were first explored by a scientist born in Australia?
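
Just to put a number on it, a back-of-the-envelope count (Python; the attribute count of 20 is made up) of how many composite indexes you would need to cover every ordered combination of up to three attributes:

from math import perm

# Hypothetical: 20 attributes (country, type, first letter, language, century, author, ...).
# Ordered composite indexes built from 1, 2 or 3 of them:
n_attrs = 20
print(sum(perm(n_attrs, k) for k in range(1, 4)))   # 20 + 380 + 6840 = 7240 indexes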



Quote:


http://en.wikipedia.org/wiki/Strong_AI

Artificial intelligence that can perform the full or nearly full range of tasks human intelligence can.
With Watson, we are still strictly in the category of applied AI.


Yes. No argument here.


Quote:

I'm basing them on how Watson's displayed three top picks did not adjust as time passed and it wasn't pressing the button.
The fix would be a software one, not throwing more resources at it.


What exactly would be the fix you are suggesting? Lower the confidence threshold? That's not even software, it's a parameter. What does it prove?



Quote:

Yes, absolutely. It is the approach used for solving equations - recognize the equation and apply a solution algorithm. As opposed to: brute force approach of going through all possible values of X.


Oh, in that way, yeah, sure Watson could solve equations. It just needs a knowledge base of solution algorithms.
I don't know why you are opposing it to looking through all possible values of X. That's not practical, and not akin to any methodology Watson is using.



Quote:

The database *already* contains the author ID field. If it didn't, it could not display the authors of messages.


It doesn't need it, though (you don't need the id to display the name, you just need the name). If you wanted to reduce the database size, you could get rid of it.
More importantly, it's not enough to just have the id field in the database - that would result in having to use the "brute-force" approach (better known as a "sequential scan" in this case) that you loathe so much. You also need to index the database by the author id, and the index means a lot of additional space.

Quote:


It also takes less space than signing the messages would, because user number generally takes less space than a string for username.



No, it doesn't take less space. It adds a whole separate column to the table, which makes every row about 16 bytes longer, plus it adds a whole new table to keep the name references, which is another 64 bytes or so. If you did not need the id, you could just put the name into the text, which, in your case, would only need three bytes.
Not to even mention the index, which bloats the size of the data by orders of magnitude.
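
This is easy enough to measure. A self-contained sketch using Python's built-in sqlite3, with toy data (numbers will differ by engine and settings; the only point is that the index is extra space on top of the data, not savings):

import os, sqlite3, tempfile

path = os.path.join(tempfile.mkdtemp(), "forum.db")
db = sqlite3.connect(path)
db.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, body TEXT)")
db.executemany("INSERT INTO posts (author_id, body) VALUES (?, ?)",
               [(i % 50, "some forum post text, padded out a little bit " * 5)
                for i in range(20000)])
db.commit()
size_without_index = os.path.getsize(path)

db.execute("CREATE INDEX idx_posts_author ON posts (author_id)")
db.commit()
size_with_index = os.path.getsize(path)

print("without index:", size_without_index, "bytes; with index:", size_with_index, "bytes")
db.close()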


Quote:


What method to use to "find the entry with prescribed set of attributes"?
Go through entries and pick the one or ones where given fields have given values.


That's exactly what Watson does, though. I am asking what it is you are suggesting that should allow the same thing to be done with billions of times fewer resources.


Quote:

Their standard approach is *not* the approach used by internet search engines or Watson.


Sure. Because they do different stuff :)


Quote:


Approach 1: Treat all words as symbols only, without attempting to process their meaning.
Approach 2: Interpret the meaning of as many words as possible, as well as their relative position.
Is that a clear enough difference?


Yeah, that is a difference, OK. But I don't understand who is using "Approach 1", what you mean by "interpret" in "Approach 2", or what the "approach" to doing the interpretation is.


Quote:

Approach 1 is used by internet search engines.


No, it is not. It is actually closer to #2, than #1.

Quote:

Approach 2 is used by compilers.


No, not really. Perhaps you mean "interpreters" or "parsers" when you say "compilers" - that would be closer, but still not quite. They deal with tokens and grammar, not words and semantics. They don't "interpret the meaning of most words", they first break the text into tokens, then group them into constructs, according to grammar, and then map constructs to operations. There is no "interpretation of words" involved at all; the tokens in a computer language aren't really like words in a human one - if anything, they are more like punctuation, denoting structure.
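
A bare-bones illustration of that pipeline (Python; a toy grammar for arithmetic expressions only, invented for this post and nothing like a real compiler):

import re

def tokenize(src):
    """Break text into tokens - no 'meaning', just lexical classes."""
    return re.findall(r"\d+|[A-Za-z_]\w*|[()+*]", src)

def parse_expr(tokens, env):
    """Grammar: expr := term ('+' term)*, term := factor ('*' factor)*."""
    def factor():
        tok = tokens.pop(0)
        if tok == "(":
            val = expr()
            tokens.pop(0)                        # the closing ')'
            return val
        return env[tok] if tok in env else int(tok)
    def term():
        val = factor()
        while tokens and tokens[0] == "*":
            tokens.pop(0); val *= factor()       # construct -> multiply operation
        return val
    def expr():
        val = term()
        while tokens and tokens[0] == "+":
            tokens.pop(0); val += term()         # construct -> add operation
        return val
    return expr()

print(parse_expr(tokenize("a + b * (2 + 3)"), {"a": 1, "b": 2}))   # 11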



Quote:

I don't see how you can lump the two into the same "standard approach", because they are very distinct, even if often mixed.


They are not the same. The first one is unfit for pretty much any purpose other than counting words; the second one is underspecified to the extent that it could really mean anything at all.



Quote:



I don't think we are saying the same thing.

Modern chess software delivers better results with less resources than older chess software.



Less resources? Can you name a piece of modern hardware that has less resources than an 8086? Really?
Does it have chess software? Is it decent enough to play above kindergarten level?
I don't think so.

Quote:

Better algorithms, less resources.


Wishful thinking.


Quote:

The earliest chess software wasn't weak because it lacked computer resources; it was weak because the algorithms used were not well optimized.


It wasn't really "weak". It played at the Masters rating level pretty early on. It could not win against Kasparov, but that was exactly because of the lack of resources. Like I said earlier, the algorithms weren't as good as they are now either, but that wasn't because people did not know how to make them better - it was because better algorithms would require more resources.

Quote:


The approaches in chess changed twice.

There are two different options:
1. Search EVERY possible branch and sub-branch to a fixed depth, say, 3 plies. That is brute force. Consider every possible move, every possible response, every possible next move, calculate piece points.
2. Select the best branches and sub-branches and search them to varied depths.


What is your source for this? Can you name a piece of (professional) software that ever did #1? I believe, you are mistaken in this as well.
"When two people always agree one of them is unnecessary"
P90
February 24th, 2011 at 9:56:09 AM
Quote: weaselman

I'll remind you again, we are talking about implementing Watson on 386...


I went back on that already and revised my position. No, not on 386. But not the level of hardware Watson uses either. Also, not implementing Watson, but a strictly Jeopardy-playing program.
I'm not sure what the minimum requirements would be, but they would be far lower than Watson's hardware.

Quote: weaselman

How exactly would you check those entries? Which "category" would you use to access them?


In this case, it would have to be just a text search.
My approach is undeveloped and not practical per se, because I have given it all of one minute of thought, not with the intention of implementing it, but only to illustrate that there are multiple approaches possible.

Quote: weaselman

Are you suggesting that entries, related to Australia (1), that are Locations (2), should also be indexed by the first letter (3)?


No, rather the entries in Locations list (1) are subdivided by country, including Locations.Australia (1.1), and stored in alphabetical order.

Quote: weaselman

What exactly would be the fix you are suggesting? Lower the confidence threshold?


I don't know. I haven't seen the software even. What would be the fix you are suggesting?

Quote: weaselman

Oh, in that way, yeah, sure Watson could solve equations. It just needs a knowledge base of solution algorithms.
I don't know why you are opposing it to looking through all possible values of X. That's not practical, and not akin to any methodology Watson is using.


No, it's not; it's an extreme brute-force approach.
Watson only partially relies on brute force.

Quote: weaselman

That's exactly what Watson does, though. I am asking what it is you are suggesting that should allow the same thing to be done with billions of times fewer resources.


Not billions.

But for one? A smaller, purpose-written database, tailored to the game show and nothing else.


Quote: weaselman

Yeah, that is a difference, OK. But I don't understand who is using "Approach 1", what you mean by "interpret" in "Approach 2", or what the "approach" to doing the interpretation is.


Search engines, at least simple engines, are using Approach 1. They treat words as symbols only [at least usually]. The only difference between querying "bread crumb" and "dark exit" is the symbols searched for.

Quote: weaselman

They deal with tokens and grammar, not words and semantics. They don't "interpret the meaning of most words", they first break the text into tokens, then group them into constructs, according to grammar, and then map constructs to operations.


There are major differences, of course. But what remains is that "int main(void)" and "if (A+B)" inputs produce different actions on the part of the program.

Directly comparing words to operators is going too far, indeed, but the point here is that the solution can be optimized depending on the words included in the question, rather than being just an indiscriminate search.

I know what you are going to say now: that Watson already does that. Yes, it does. If it didn't, it would need even more resources for the same performance. If it lear... When it learns to optimize its searches better, it will need less resources for the same performance. Its successor, less still.




Quote: weaselman

Less resources? Can you name a piece of modern hardware that has less resources than an 8086? Really?


I can name a piece of modern hardware that has less resources than Deep Blue. I even have one and so does everyone here.

Quote: P90

There are two different options:
1. Search EVERY possible branch and sub-branch to a fixed depth, say, 3 plies. That is brute force. Consider every possible move, every possible response, every possible next move, calculate piece points.
2. Select the best branches and sub-branches and search them to varied depths.

Quote: weaselman

What is your source for this? Can you name a piece of (professional) software that ever did #1?


Yes. It was Chess 4.0 and above.
Chess (software)

Quote: weaselman

It wasn't really "weak". It played at the Masters rating level pretty early on. It could not win against Kasparov, but that was exactly because of the lack of resources. Like I said earlier, the algorithms weren't as good as they are now either, but that wasn't because people did not know how to make them better - it was because better algorithms would require more resources.


It was exactly because people did not know how to make them better. Not how specifically.

Deep Blue ran on 480 chips purpose-designed specifically for playing chess. Rybka runs fine using a single general purpose CPU.
Deep Blue evaluated 200,000,000 positions per second. Rybka on a modern PC evaluates 20,000 to 400,000.
Deep Blue had a massive database including 700,000 grandmaster games. Modern chess engines don't.

In summary, Rybka or Houdini run on desktops with only a tiny fraction of the resources available to Deep Blue.

Rybka and Houdini have wiped the floor with stronger engines than Deep Blue. They would do it to Deep Blue itself as well if it hadn't been retired. The Elo ratings of these engines, run on ordinary CPU, exceed Deep Blue's by over 300, comparable to the difference between a Grandmaster and a Candidate Master.
Resist ANFO Boston PRISM Stormfront IRA Freedom CIA Obama
weaselman
February 24th, 2011 at 11:19:50 AM
Quote: P90

I went back on that already and revised my position. No, not on 386. But not the level of hardware Watson uses either.


Well, I don't know anything about 6-bit computers, or about what else you propose to be different in your hardware from what we came to know as "computers", so I can't really discuss that topic intelligently.

Certainly, there are ways to build specialized equipment that will be better at performing a specific kind of task.
I think I have mentioned this example before - back in the eighties I used to work on a project that needed to calculate the parameters of huge pipeline networks. The computers available to us then took days or even weeks to solve one configuration, and we could not wait that long, so instead we just built an electrical grid, mimicking the topology of the network we were modeling, plugged it in, got an ammeter, and had the answer right away.

A computer is not the (best) answer to everything; I am not arguing that. Maybe someone could invent a machine good at playing Jeopardy. I can't have an opinion on that before I know the details of the design and architecture of that machine - in particular, what exactly would be different in it from a traditional computer, what instruction set it would use, and what resources it would have available.



Quote:

In this case, it would have to be just a text search.



If you want text search, you need resources that text search requires. This basically negates all your previous arguments, and assertions, and takes us to square one - you need a text search machine ("tag cloud") at a minimum, plus perhaps some additional enhancements (that would only add more resources to the requirements).

Quote:

My approach is undeveloped and not practical per se because I have given it all the thought of one minute, not with the intention of implementing it, but only illustrating that there are multiple approaches possible.


If it took you one minute to type all the stuff you have typed in this thread, you must be a superman :)
Your approach is not just "undeveloped", it's "non-existent". You still haven't indicated anything of substance that you'd do differently from Watson, let alone provided any basis for the assertion that doing so would actually (a) work, and (b) require less resources.
All you said is that you think there may be a different approach out there that would require less resources. This is more or less the same kind of statement as "I believe there is a formula out there, capable of generating all sequential primes in constant time". I can even provide a "general principle" of such a formula, much like you do illustrating your approach - basically, you just take the last prime number, apply the necessary transformations to it, and then generate the next number. For example, if the last prime number is 2, you just add 1; otherwise, you add 2. This will work for 3, 5, 11, 17 etc., but break down on 7, 13 and 19 ... In those cases, you'll need to add 4 (and on 23, you'll have to add 6, and so on).

Quote:

No, rather the entries in Locations list (1) are subdivided by country, including Locations.Australia (1.1), and stored in alphabetical order.


That's it? How about a question about location names that end with "ville"? Are we talking about a "brute-force" (sequential scan) now? And what about books that are located in Australia? You have (again) ignored my question(s) ...

Quote:

I don't know. I haven't even seen the software. What would be the fix you are suggesting?



I have no idea. You said that there was a software fix, and sounded very confident about it. How do you know there is a fix, and how do you know that it is a software fix, but you don't know what it is, and haven't even seen the software?
Wishful thinking again?



Quote:

No, that's not, it's an extreme brute force approach.
Watson only partially relies on brute force.



Watson doesn't rely on brute force at all (at least in the common meaning of the term - I'm not sure what you mean by it exactly). But if you agree that "that's not", why are we talking about it?

Quote:

not Billions


How much then? I was still talking about a 386, but it turns out that you have conceded that that was impossible. What's the new hypothesis then? When you say that Watson is using too many resources, how much would you say is enough?
Is it like a half? Or is it more like orders of magnitude?


Quote:

But for one? A smaller, purpose-written database, tailored to the game show and nothing else.


That's just words. You need to give me a method of building the database, that would be smaller, and you haven't so far (we are not seriously considering the 6-bit-byte approach, are we?)


Quote:

Search engines, at least simple engines, are using Approach 1.


No, they aren't.

Quote:

They treat words as symbols only [at least usually]. The only difference between querying "bread crumb" and "dark exit" is the symbols searched for.


Yes, it is. And what would "your" approach do in this case?

But you are very wrong generally about "treating words as symbols". Content search engines (even simple ones) recognize and categorize words based on different criteria - such as the lexical and morphological forms (e.g., words with common roots are treated as roughly equivalent), the word class (a noun has higher weight than a participle, which is higher than a preposition, and things like articles or pronouns are usually ignored, except for specific cases), the commonality of the word (something like "number" has much less weight than "supercilious"), the frequency in the document, the location (titles weigh more, for example), the relative position of different terms (the closer the better, prepositions are not split from the defining word, etc.), and even the actual meaning of the word (sometimes a term can be matched against a synonym - for example, a search for "supercilious" could find documents about "haughty").
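
Just to make that concrete, here is a rough sketch (plain Python, toy data, not anything Watson-specific) of the kind of weighting even a simple content engine applies - stop words ignored, rare terms weighted up, title matches boosted. Stemming and synonyms are left out to keep it short:

# Rough sketch of simple term weighting: stop words ignored, rarer terms
# weigh more (IDF), repeated terms weigh more (TF), title matches get a boost.
# All names and numbers here are made up for illustration.
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "or", "is"}

def score(query, doc_title, doc_body, doc_freq, n_docs):
    """doc_freq maps a term to how many documents in the collection contain it."""
    body_counts = Counter(w.lower() for w in doc_body.split())
    title_terms = {w.lower() for w in doc_title.split()}
    total = 0.0
    for term in query.lower().split():
        if term in STOP_WORDS:
            continue                                   # articles, prepositions, etc.
        tf = body_counts.get(term, 0)
        if tf == 0:
            continue
        idf = math.log(1 + n_docs / (1 + doc_freq.get(term, 0)))   # rare terms weigh more
        boost = 2.0 if term in title_terms else 1.0                # titles weigh more
        total += boost * tf * idf
    return total

A real engine layers stemming, proximity and synonym matching on top of this, but even the toy version is a long way from "treating words as symbols".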



Quote:

There are major differences, of course. But what remains is that "int main(void)" and "if (A+B)" inputs produce different actions on the part of the program.


Yeah ... Different inputs produce different results. That's your approach? :) "Vary results depending on the input"?


Quote:

Directly comparing words to operators is going too far indeed, but the point here is that the solution can be optimized depending on the words included in the question, rather than just indiscriminate search.


You are missing the point. Computer language lexers are a bad example, because what they do is an entirely different problem. There is a fixed (small) set of tokens, mapped (unambiguously) to groups, mapped (again, unambiguously) to instructions. You simply tokenize the input and apply the pre-defined mappings. That's all. There is no intelligence there; it is a simple transformation function, much like y = x^2.
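
To illustrate how little "interpretation" is involved, a lexer boils down to a fixed lookup table (a deliberately crude sketch; the token set and action names are invented):

# A fixed, unambiguous token-to-action table. There is no understanding of
# "meaning" here, just tokenization plus a lookup.
TOKEN_ACTIONS = {
    "int": "declare-type",
    "if":  "begin-conditional",
    "+":   "emit-add",
    "(":   "open-group",
    ")":   "close-group",
}

def tokenize(source):
    # crude tokenizer, for illustration only
    return source.replace("(", " ( ").replace(")", " ) ").replace("+", " + ").split()

def lex(source):
    return [(tok, TOKEN_ACTIONS.get(tok, "identifier")) for tok in tokenize(source)]

print(lex("if (A+B)"))
# [('if', 'begin-conditional'), ('(', 'open-group'), ('A', 'identifier'),
#  ('+', 'emit-add'), ('B', 'identifier'), (')', 'close-group')]

Compare that with mapping an arbitrary English clue to the right entry in a knowledge base - there is no fixed table to look the answer up in.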


Quote:

I know what you are going to say now, that Watson already does that. Yes, it does. If it didn't, it would need even more resources for the same performance.


No, if it did not, it simply could not do it at all (in any amount of time limited, say, by the age of the Earth), even if it had all the resources in the world.

Quote:

When it learns to optimize its searches better, it will need less resources for the same performance. Its successor, further less resources.


Maybe. Maybe not. From my experience, I think the latter is more likely. Computers aren't using less and less resources as Computer Science progresses. So far it has all been going in the opposite direction. I don't see any reason why Watson should be an exception. You still haven't given me any reason to think that.

BTW, don't get me wrong, there are applications of AI that can run on a cell phone. If you want to see one, try this. I find it pretty impressive. Another way it is relevant to our discussion is that it is an application of neural networks (yes, an NN does not need a lot of resources to run, it does need them to be trained though).
"When two people always agree one of them is unnecessary"
P90
P90
  • Threads: 12
  • Posts: 1703
Joined: Jan 8, 2011
February 24th, 2011 at 12:49:25 PM permalink
Quote: weaselman

If you want text search, you need resources that text search requires. This basically negates all your previous arguments, and assertions, and takes us to square one - you need a text search machine ("tag cloud") at a minimum, plus perhaps some additional enhancements (that would only add more resources to the requirements).


Text search would be necessary in any case. What could be done, if computer resources were constrained, is limiting the search. An algorithm that is aware of categories (Watson, it appears, wasn't, and used its general purpose routine) can do that. In a category "words starting with B", skip all entries that don't start with B. In a category "US cities", only search among cities, specifically US cities.
It would not always work well, but for a considerable fraction of categories and types of questions it would.
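
To make the idea concrete, a rough sketch (toy Python, made-up entries and category tags - not a claim about how Watson stores anything):

# Limit the search by category before doing any scoring at all.
ENTRIES = {
    "Boston":   {"category": "us_city", "text": "Largest city in Massachusetts ..."},
    "Brisbane": {"category": "city",    "text": "Capital of Queensland ..."},
    "Ablation": {"category": "science", "text": "Removal of material from a surface ..."},
}

def candidates(category=None, starts_with=None):
    """Yield only the entries worth searching for this question."""
    for title, entry in ENTRIES.items():
        if category and entry["category"] != category:
            continue                    # e.g. skip everything that isn't a US city
        if starts_with and not title.startswith(starts_with):
            continue                    # e.g. category "words starting with B"
        yield title, entry

# "US cities": only a small slice of the database is ever touched.
print([title for title, _ in candidates(category="us_city")])   # ['Boston']

The point is only that the category narrows the search space before any heavy lifting starts.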


Quote: weaselman

That's it? How about a question about location names that end with "ville"? Are we talking about a "brute-force" (sequential scan) now? And what about books that are located in Australia? You have (again) ignored my question(s) ...


Having the option of searching through the database remains a requirement. You would, however, be able to considerably limit the amount of text to be searched by 1) only looking through entries on the Locations index, and 2) only looking through the content of entries that end with "ville".

Watson, if I understood what was said about it, had the resources to search through its entire database or a very large portion of it, multiple times. It's cool, but it is overkill for most questions. If these resources were not available, limiting the search to a fraction of the database would be a necessity. Currently it's merely an option.


Quote: weaselman

I have no idea. You said that there was a software fix, and sounded very confident about it.


Really? If I read right, the mention that there must be a fix was yours. My response was that such a fix would be a software one, not throwing more hardware at it.


Quote: weaselman

How much then? I was still talking about 386, but turns out that you have conceded that that was impossible. What's the new hypothesis then? When you say that Watson is using too much resources, how much would say is enough?


It appears that at most just one of its servers would suffice.

Quoting from an article on it:
"Watson's main innovation was not in the creation of new algorithm for this operation but rather its ability to quickly execute thousands of proven language analysis algorithms simultaneously to find the correct answer. The more algorithms that find the same answer independently the more likely Watson is to be correct."
- i.e., it did not take all of its resources just to find one answer; it had enough resources to find the answer a large number of times with different algorithms, and then selected the answers with the most hits.
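
The voting part, stripped to its bones, is something like this (a toy sketch; the three "algorithms" here are placeholders and have nothing to do with Watson's actual ones):

from collections import Counter

def answer_by_title_match(clue):      return "Chicago"
def answer_by_keyword_overlap(clue):  return "Toronto"
def answer_by_category_lookup(clue):  return "Chicago"

ALGORITHMS = [answer_by_title_match, answer_by_keyword_overlap, answer_by_category_lookup]

def ensemble_answer(clue):
    votes = Counter(alg(clue) for alg in ALGORITHMS)
    best, hits = votes.most_common(1)[0]
    confidence = hits / len(ALGORITHMS)     # more independent agreement, more confidence
    return best, confidence

print(ensemble_answer("Its largest airport is named for a World War II hero"))
# ('Chicago', 0.666...)

Every extra algorithm buys a little more agreement at the cost of another full pass over the data - which is exactly my point about overkill.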

By the way, by winning the game, Watson has refuted the assertion that it would take at least 100TB of databases to win the game. It only used RAM, not hard drives, and the total database size with indexes was 4TB. Still a lot, but not 100TB.


Quote: weaselman

That's just words. You need to give me a method of building the database, that would be smaller, and you haven't so far


What do you mean by method? A complete algorithm that would do it?
The method suggested was manual selection of data judged by humans (highly familiar with the game) to be most important for the task.

Watson's database contained a lot of data that was irrelevant to the show. Among other things there was the full text of wikipedia, which contains not only popular knowledge, but a large amount of scientific and technical information, far more in-depth than would ever be needed for the show.


Quote: weaselman

Maybe. Maybe not. From my experience, I think the latter is more likely. The computers aren't using less and less resources as Computer Science progresses. So far it was all going in the opposite direction.


That is simply not so. Computers are using less and less resources for accomplishing the same tasks equally well. Not always (consumer bloatware is to be excepted), but often. Windows 7 boots in less time than Windows XP, on the same machine (or in the same time on a slower machine). Newer versions of AutoCAD perform many tasks faster on the same machines (or as fast on slower machines). Antivirus software I have now doesn't slow the machine down nearly as much as two years ago. uTorrent 2.2 consumes less network resources than uTorrent 1.x or other clients, thanks in part to its new uTP protocol. CoreAVC 2.0 decodes highly compressed HD video faster than CoreAVC 1.0. 3D applications run faster with Forceware 266 than they did with Forceware 83.

I'm typing this on the same computer I used two years ago, sans a few extra hard drives, but it is now more stable and works noticeably faster, all due to software improvement alone. Software improvements result in tasks being accomplished with less resources, not more.
When I upgrade my computer, it will obviously be to the new generation of CPU. The Sandy Bridge 2500K CPU contains 995 million transistors, uses a 32nm process, runs at 3.3GHz, and has a TDP of 95W. The older Gulftown 980X uses the same 32nm process, runs at 3.33GHz, has a TDP of 130W, and has 1,170 million transistors. In the majority of tasks, the Sandy Bridge 2500K outperforms the Gulftown 980X, despite using fewer transistors and less power. That is delivering increased performance with fewer resources.

It's not just chess software where performance keeps increasing even as resources consumption is decreased. Fortunately for us all, performance improvement through intelligent design and optimization is not yet a lost art.
weaselman
weaselman
  • Threads: 20
  • Posts: 2349
Joined: Jul 11, 2010
February 24th, 2011 at 1:40:22 PM permalink
Quote: P90


Text search would be necessary in any case. What could be done, if computer resources were constrained, is limiting the search. An algorithm that is aware of categories (Watson, it appears, wasn't, and used its general purpose routine) can do that. In a category "words starting with B", skip all entries that don't start with B. In a category "US cities", only search among cities, specifically US cities.


What do you mean by "skip all entries"? You are not suggesting to scan the entire list of documents one by one, and filter out stuff that does not start with "B", are you? If not, then what exactly are you talking about?


Quote:

Having the option of searching through the database remains a requirement.


Sequentially? Brute-force? That will take months. You can forget about answering the question not only in time to win, but in time to be of interest to anyone at all.


Quote:

Watson, if I understood what was said about it, had the resources to search through its entire database or a very large portion of it, multiple times.
It's cool, but it is overkill for most questions. If these resources were not available, limiting the search to a fraction of the database would be a necessity. Currently it's merely an option.



Well, as much as I understand your "approach" so far, it is exactly what Watson does. It just does it in a more efficient way than you are suggesting. Instead of hand-picking a handful of categories and letting everything else just be slow, it uses a full text index (a tag cloud, if you wish) as a way to essentially categorize all the entries without having a preset rigid taxonomy. This allows it to deal with answering questions from "categories" that its creators (or perhaps anyone else) had never heard about - like 'things ending with "ville"' or 'scary children's songs'.



Quote:

Really? If I read right, the mention that there must be a fix was yours. My response was that such a fix would be a software one, not throwing more hardware at it.


You said: "You could have a million times more resources, and Watson would still get these questions wrong. When it was off, it was way off."
I responded, saying that your implication is wrong, because "way off" is actually a good thing, suggesting that there is an easy fix (meaning that "way off" usually means exactly that something is missing in the db - a tag, an index, a document, a grouping ... something simple, and easily fixable).
You then said that the fix would be in software.
I asked what kind of software fix you are talking about.
You said you did not know.
I was surprised because how do you know it is in software if you don't know what it is?



Quote:

It appears that at most just one of its servers would suffice.

Quoting from an article on it:
"Watson's main innovation was not in the creation of new algorithm for this operation but rather its ability to quickly execute thousands of proven language analysis algorithms simultaneously to find the correct answer. The more algorithms that find the same answer independently the more likely Watson is to be correct."
- i.e. it did not only take all of its resources to find the answer, but it had enough resources to find the answer a large number of times with different algorithms, and then selected answers with the most hits.



Well, yes. Different algorithms are different ways to analyze the question, and to categorize the documents. Do you have a particular one method that you think should trump all of the others, or are you suggesting that it should be picked at random?


Quote:

By the way, by winning the game, Watson has refuted the assertion that it would take at least 100TB of databases to win the game. It only used RAM, not hard drives, and the total database size with indexes was 4TB. Still a lot, but not 100TB.


Was it? Where does your number come from? I heard 300TB on TV.

4TB vs 100TB is not that much of a difference. My point was that it is hopeless to put anywhere near enough data into the 4 gig you could (barely) address with a 386 CPU (it was actually more like 2 gig due to OS limitations). I never meant to pretend I had the exact figure - just a rough estimate of the ballpark.


Quote:


Quote: weaselman

That's just words. You need to give me a method of building the database, that would be smaller, and you haven't so far


What do you mean by method? A complete algorithm that would do it?


Maybe, not a complete one, but something detailed enough to (1) demonstrate how it would be different from Watson's method, (2) explain convincingly why it would result in a significantly smaller database and (3) explain convincingly that it won't be significantly worse than Watson's performance functionally.

Quote:

The method suggested was manual selection of data judged by humans (highly familiar with the game) to be most important for the task.


"be most important" isn't really a "method". As for selecting data by humans, that was done in Watson's case, so there is no difference so far.


Quote:

Watson's database contained a lot of data that was irrelevant to the show.


Says who?

Quote:

Among other things there was the full text of wikipedia


Where did you get that?

Quote:

, which contains not only popular knowledge, but a large amount of scientific and technical information, far more in-depth than would ever be needed for the show.


Even if that were true (which I don't believe), you would now need to demonstrate somehow that removing the material you deem "irrelevant" from Watson's memory would (2) significantly reduce the amount of resources it required and (3) not impair its ability to play the game competitively.
"When two people always agree one of them is unnecessary"
P90
P90
  • Threads: 12
  • Posts: 1703
Joined: Jan 8, 2011
February 24th, 2011 at 2:58:09 PM permalink
Quote: weaselman

What do you mean by "skip all entries"? You are not suggesting to scan the entire list of documents one by one, and filter out stuff that does not start with "B", are you? If not, then what exactly are you talking about?


Only entries starting with "B" would be considered.

Quote: weaselman

Sequentially? Brute-force? That will take months.


Watson searched its entire database for each question. Multiple times.


Quote: weaselman

Well, as much as I understand your "approach" so far, it is exactly what Watson does. It just does it in a more efficient way than you are suggesting. Instead of hand-picking a handful of categories, and letting everything else just be slow, it uses a full text index (a tag cloud if you wish) as a way to essentially categorize all the entries without having a preset rigid taxonomy.


It doesn't. It runs its algorithms through the entire database. With the database stored entirely in RAM, it has enough resources to do that - apparently even thousands of times over.

So there is no optimization by categories. Simply a large enough headspace in terms of throughput and processing power that no such optimization is required. It will still search through the entire text of "Ablation" article, multiple times, even when asked to name a city starting with "B", and still has enough time to do it all over before humans can even put their fingers on the button.

It's certainly effective, but it's hard to call cost-efficient.


Quote: weaselman

I responded, saying that your implication is wrong, because "way off" is actually a good thing, suggesting that there is an easy fix (meaning that "way off" usually means exactly that something is missing in the db - a tag, an index, a document, a grouping ... something simple, and easily fixable).
You then said that the fix would be in software.
I asked what kind of software fix you are talking about.
You said you did not know.
I was surprised because how do you know it is in software if you don't know what it is?


It can be either in software or in hardware. Since hardware worked properly and it didn't even use all of its hardware resources, that fix can't be in adding more hardware. By exclusion, if it exists, it has to be in software.

Suggesting it's all because of a missing tag is rather optimistic (or pessimistic, if you look at it the other way, because it would mean the system can be ruined by one missing tag). Since it's using averaging noise suppression from thousands of different algorithms, Watson is a highly robust system. Such systems don't go down in flames because of one missing tag; if they fail to filter out useful signal, it's usually because they didn't get any to begin with.


Quote: weaselman

Well, yes. Different algorithms are different ways to analyze the question, and to categorize the documents. Do you have a particular one method that you think should trump all of the others, or are you suggesting that it should be picked at random?


Reducing the number of algorithms applied from thousands to, for instance, tens - picking the ones that performed best in a million simulated episodes - would result in a considerable reduction in required processing power.

You are familiar with the law of diminishing returns, right? Using thousands of algorithms is an averaging approach, performing noise filtering. When only a small fraction of algorithms actually works [on a given question], their useful signal is mutually reinforced, while noise is not, similar to the principle used in RADAR.

It appears that Watson didn't even attempt to decide which algorithms would work best for a particular question, and just used them all. Here's an optimization opportunity for you, using pretty much the current approach: determine the correlation between question structure and best-performing algorithms, and select only the top hundred to run.
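
Schematically (a toy Python sketch - the logging structures and the crude "question feature" are entirely my invention, not anything Watson does):

from collections import defaultdict

hit_counts = defaultdict(lambda: defaultdict(int))   # feature -> algorithm -> hits
run_counts = defaultdict(lambda: defaultdict(int))   # feature -> algorithm -> runs

def question_feature(clue, category):
    # deliberately crude: the category plus the leading word of the clue
    return (category.lower(), clue.split()[0].lower())

def record_result(clue, category, algorithm_name, was_correct):
    f = question_feature(clue, category)
    run_counts[f][algorithm_name] += 1
    if was_correct:
        hit_counts[f][algorithm_name] += 1

def select_top_algorithms(clue, category, k=100):
    f = question_feature(clue, category)
    rates = {a: hit_counts[f][a] / run_counts[f][a] for a in run_counts[f]}
    return sorted(rates, key=rates.get, reverse=True)[:k]

Feed it a million archived clues, and select_top_algorithms() gives you the hundred to actually run instead of all several thousand.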


Quote: weaselman

Was it? Where does your number come from? I heard 300TB on TV.


I checked wikipedia's article for specific figures.


Quote: weaselman

Maybe, not a complete one, but something detailed enough to (1) demonstrate how it would be different from Watson's method, (2) explain convincingly why it would result in a significantly smaller database and (3) explain convincingly that it won't be significantly worse than Watson's performance functionally.


You're asking too much, knowing it would take a lot of work to develop a sufficient set of proofs.

As mentioned in articles about Watson, its database was built up rather indiscriminately, just dumping entire repositories of text information. Show's specifics, such as focus on popular knowledge and trivia rather than scientific and technical information, were not used for filtering that information. There was simply too much headspace still available to bother.


Quote: weaselman

Where did you get that?


Watson (AI)
"...Watson had access to 200 million pages of structured and unstructured content consuming four terabytes of disk storage,[8] including the full text of Wikipedia."

The contents of its database weren't hand-picked, they were dumped in. It's not designed for winning TV shows, that was just a publicity stunt. Its purpose is to develop algorithms for understanding natural language. It runs thousands of algorithms concurrently and will run more, for the purpose of developing better algorithms (presumably selecting ones that perform better, possibly applying genetic approach). Its database is there to give these algorithms something to practice on, not to win game shows.

The end result of this project might well be arriving at an algorithm that actually works and doesn't require averaging among thousands of mostly wrong answers - thus providing a solution that could answer natural language questions with a tiny fraction of Watson's resources. But even if it fails in the end, it will still improve its algorithms with time, if nothing else, at least because that's the very thing it was designed to do.

Using Watson to play game shows is, in a way, like using a blackjack strategy calculator to play BJ directly, rather than using the tables it makes. In the chess engine world, it would be like running all chess engines ever created concurrently and making the move the largest number of engines voted for.
weaselman
weaselman
  • Threads: 20
  • Posts: 2349
Joined: Jul 11, 2010
February 24th, 2011 at 3:32:24 PM permalink
Quote: P90

Only entries starting with "B" would be considered.


But how do you know which ones start with 'B'? Don't you have to look at ("consider") an entry before you know if it starts with "B" or not?

Quote:


Quote: weaselman

Sequentially? Brute-force? That will take months.


Watson searched its entire database for each question. Multiple times.


Yes. It also used a lot of resources to make that fast. It did not scan all entries sequentially, it used the "tag clouds" and full text indexes to navigate directly to the relevant documents. You are talking about an alternative approach, using much less resources, that would not rely on the tag clouds, but instead have just a few hand-picked category indexes. If you do that, and then try to do the same thing Watson did (scan the whole database for an answer), it will take you much, much more time, than it took Watson. That is the price you'll pay for not using those additional resources.

Quote:


Quote: weaselman


Well, as much as I understand your "approach" so far, it is exactly what Watson does. It just does it in a more efficient way than you are suggesting. Instead of hand-picking a handful of categories, and letting everything else just be slow, it uses a full text index (a tag cloud if you wish) as a way to essentially categorize all the entries without having a preset rigid taxonomy.


It doesn't. It runs its algorithms through the entire database. With the database stored entirely in RAM, it has enough resources to do that - apparently even thousands of times over.



Yeah, it does. You misunderstood the article you quoted. It said that it searches the database multiple times, not that it scans every entry multiple times. Now, the way (method) it uses to perform those multiple searches is what I described above.


Quote:

So there is no optimization by categories.


Not by a predefined, hand-picked set, no, that would not work. There is a dynamic, flexible taxonomy (a tag cloud) instead.

Quote:

It will still search through the entire text of "Ablation" article, multiple times, even when asked to name a city starting with "B", and still has enough time to do it all over before humans can even put their fingers on the button.



No. If the question was to simply name a city (any city) starting with "B", it would probably just get a list of documents tagged "city" and a list of documents tagged name:B*, and intersect the two lists - no need to look at the actual documents at all.
There are more efficient ways to build a database, to answer this specific question, but they would be terribly inefficient for most real Jeopardy questions.
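
Schematically, the intersection is nothing more than this (a toy inverted index in Python; the tags and document ids are made up):

# tag -> set of document ids; no document text is ever read
TAGS = {
    "city":    {"Boston", "Berlin", "Brisbane", "Chicago"},
    "us_city": {"Boston", "Chicago"},
    "name:B*": {"Boston", "Berlin", "Brisbane"},
}

def cities_starting_with_b():
    return TAGS["city"] & TAGS["name:B*"]       # intersect two posting lists

print(cities_starting_with_b())   # {'Boston', 'Berlin', 'Brisbane'}

That is what the full text index buys you: the answer comes out of the posting lists, not out of scanning the articles.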





Quote:

It can be either in software or in hardware.


Or in the data.

Quote:


Quote: weaselman

Well, yes. Different algorithms are different ways to analyze the question, and to categorize the documents. Do you have a particular one method that you think should trump all of the others, or are you suggesting that it should be picked at random?


Just as much as reducing the number of algorithms applied from thousands to, for instance, tens, picking the ones that performed best (in a million simulated episodes) would result in a considerable reduction in required processing power.


Ok. Which ten algorithms would you suggest it should use? I'll agree that it would result in a reduction of the required CPU power, but how do you go about demonstrating that it would not also result in a significant degradation of the quality of the answers? Is it just your gut feeling? Why do you think Watson's programmers did not use ten algorithms?

Quote:

You are familiar with the law of diminishing returns, right? Using thousands of algorithms is an averaging approach, performing noise filtering. When only a small fraction of algorithms actually works [on a given question], their useful signal is mutually reinforced, while noise is not, similar to the principle used in RADAR.


Well, ok ... And your point is ... ?


Quote:

It appears that Watson didn't even attempt to decide which algorithms would work best for a particular question,


Sure, it did not. Because it did not know how. Do you? Can you explain it? How about without using examples?

Quote:

and just used them all. Here's an optimization opportunity for you, using pretty much the current approach: determine the correlation between question structure and best-performing algorithms, and select only the top hundred to run.


What is "question structure"? You have to come up with a function, mapping any (absolutely) question to a number before you can talk about determining correlation. Can you do that?


Quote:

You're asking too much, knowing it would take a lot of work to develop a sufficient set of proofs.


Well, what did you expect? Do you think I should get convinced simply because you say so? All I heard from you so far is that you think it is possible to use less resources. You can't explain why you think so, you just do ...
It's like me thinking it is possible to have a formula to generate prime numbers. It is very hard to find, so I am not going to give you any arguments in support of my position, but I am pretty sure, there is a formula.
Does this sound convincing to you? Does it surprise you that I find your position about as convincing as this?


Quote:

As mentioned in articles about Watson, its database was built up rather indiscriminately, just dumping entire repositories of text information.


Yes. But why do you think that if it wasn't, it would be (1) significantly smaller, and (2) not significantly worse content-wise? You just think so, right?

Quote:

Show's specifics, such as focus on popular knowledge and trivia rather than scientific and technical information, were not used for filtering that information. There was simply too much headspace still available to bother.


Or too little benefit to be gained (you know the law of diminishing returns, right? ;)).

BTW, the original point of contention in this topic was you insisting a program like this could have been built to run on a 20 year old 386 computer. I think, you and I now agree that it is, in fact, impossible.

As to your next claim - that the Watson program could run on only one of the servers instead of ten - it is a different story. I can't say, like in the 386 case, that it is plain impossible. It probably can run on only one of the servers. It would be able to answer fewer questions correctly, and would spend more time looking for answers, but it would be able to do something meaningful enough to be characterized by a reasonable person as "it works".

Perhaps, we can agree on the above statement?
"When two people always agree one of them is unnecessary"
P90
P90
  • Threads: 12
  • Posts: 1703
Joined: Jan 8, 2011
February 24th, 2011 at 11:37:12 PM permalink
Quote: weaselman

But how do you know which ones start with 'B'? Don't you have to look at ("consider") an entry before you know if it starts with "B" or not?


Only its title. Which is in the alphabetical index.


Quote:

Yes. It also used a lot of resources to make that fast. It did not scan all entries sequentially, it used the "tag clouds" and full text indexes to navigate directly to the relevant documents.


Of which it still considered a lot more than necessary. With several thousand algorithms, of which only a fraction got the answer right.
Every operation performed by an algorithm that delivered the wrong answer was wasted, from a theoretical efficiency standpoint. Every operation that found the right answer after it had already been found was also wasted, as it was redundant.
In every case, only one of the algorithms was actually necessary. To improve efficiency, the number of algorithms used concurrently has to be decreased.


Quote:

You are talking about an alternative approach, using much less resources, that would not rely on the tag clouds, but instead have just a few hand-picked category indexes. If you do that, and then try to do the same thing Watson did (scan the whole database for an answer), it will take you much, much more time, than it took Watson.


Something like that. However, a much smaller database would help, as would only working one algorithm instead of thousands.

In time, Watson's work is hoped to if not achieve that, at least come close to it, selecting and developing the best algorithms. The reason hand-picking would be used in my example is that it presumed that the problem had to be solved without modern computer power, but with near-unlimited human resources.


Quote:

No. If the question was to simply name a city (any city), starting with "B", it would, probably, simply get a list of documents, tagged "city", a list of documents, tagged name:B*, and intersect the two lists, no need look at the actual documents at all.


That's how you would work in the optimized case. A 386 could do that much: run the city index against the alphabetical index.
That is not quite how Watson worked, however. It was, for the most part, unaware of the question's meaning. The approach above was probably used by some of the algorithms. But it would not work for finding a city that has the most words starting with "B" in its motto. Other algorithms had to look into the entries themselves.
If Watson knew how to select only the right algorithms for the question, or at least the best subset - and it will know it, at least somewhat, after a decade or so of research - it could save a lot of resources by only using the best-suited algorithms. By just intersecting the two lists.


Quote: weaselman

Quote:

Just as much as reducing the number of algorithms applied from thousands to, for instance, tens, picking the ones that performed best (in a million simulated episodes) would result in a considerable reduction in required processing power.


Ok. Which ten algorithms would you suggest it should use? I'll agree that it would result in a reduction of the required CPU power, but how do you go about demonstrating that it would not also result in a significant degradation of the quality of the answers? Is it just your gut feeling? Why do you think Watson's programmers did not use ten algorithms?


Because Watson was not constructed to win Jeopardy, it was constructed to field-test thousands of algorithms.

An eight-thousand-CPU, 15-TB computer is expensive. Such computers are going to remain expensive for a long time. The end goal of the research Watson is used for is not to build one cool supercomputer; it's to bring natural language interaction technology into the mass market, where it will not have the benefit of a supercomputer's resources. When its selection process and human work discard poorly performing and redundant algorithms, and the better algorithms are improved further, the number of algorithms required will be greatly reduced, and this goal will come closer to reality. Further savings will come from predicting which algorithm(s) to select based on the question.


It is, in fact, just what happened to chess computers. They evolved from considering all possible branches to a fixed depth to being highly selective, greatly reducing the number of branches that had to be considered. From unsuccessful engines, to moderately successful, to powerful but requiring large amounts of resources and searching through over a billion positions, and finally evolving into small, lean, portable executables with minimum resource requirements, only requiring a few thousand positions.
All it took was a few decades of grandmasters dedicating their knowledge to developing chess engines, plus a lot of field testing in dedicated championships for chess engines.
That's not even terminal evolution. Well within our lifetimes, as chess engines evolve further out of human players' reach and computer chess becomes more and more a sport of its own, we are likely to see classic computer chess engine competitions, like there already are PDA and cell phone chess competitions, and a tendency to introduce handicaps such as no books and ponder off.

What this has to do with Watson is that Watson is, essentially, like Deep Blue, only even less selective. It runs too many algorithms, checks too many answers, does too much work. Improving selectivity has been seen to lead to major reductions in resource requirements in other applications, and it's hard to expect question processing to be an exception.


Quote: weaselman

Quote:

You are familiar with the law of diminishing returns, right? Using thousands of algorithms is an averaging approach, performing noise filtering. When only a small fraction of algorithms actually works [on a given question], their useful signal is mutually reinforced, while noise is not, similar to the principle used in RADAR.


Well, ok ... And you point is ... ?


The gain is logarithmically dependent on the number of samples.
A small improvement in the SNR of the particular algorithms used is equivalent to a large increase in the number of samples.
With good enough algorithms - in fact, with a high enough SNR - only a few samples are required, and finally, at a certain point, averaging isn't required at all.
Since Watson set out to improve these algorithms, it appears that the task of improving them is not considered impossible, at the very least, by its creators and developers.
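
The diminishing returns are easy to see in a toy simulation (plain Python, idealized independent "algorithms", nothing Watson-specific):

import random

def majority_vote_accuracy(p, n_algorithms, trials=5000):
    """p: chance each independent algorithm answers correctly."""
    wins = 0
    for _ in range(trials):
        correct_votes = sum(random.random() < p for _ in range(n_algorithms))
        if correct_votes > n_algorithms / 2:
            wins += 1
    return wins / trials

for n in (1, 11, 101, 1001):
    print(n, round(majority_vote_accuracy(0.55, n), 3))
# roughly 0.55, 0.63, 0.84, 0.999 - each tenfold increase in n buys less,
# while a small increase in p (better algorithms) buys a lot

Better individual algorithms push p up, and the number of samples needed collapses.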


Quote: weaselman

Sure, it did not. Because it did not know how.


It didn't even try (and why would it, with the resources to run them all and time left over?). In time, as data is collected, that improvement will come well within reach.

Quote: weaselman

Do you? Can you explain it? how about without using examples?


Yes, of course. One way is to mark which algorithms hit and which missed. Look for correlation with question structure, particular words present in the question, other clues. That is what Watson's individual algorithms do already when looking for most likely answers; what's left is to apply the same to selecting algorithms most likely to deliver good results. As to why it doesn't do it already, as I mentioned, it's because it has no need to.


Quote: weaselman

Well, what did you expect? Do you think I should get convinced simply because you say so? All I heard from you so far is that you think it is possible to use less resources. You can't explain why you think so, you just do ...


I explained why it is possible to use less resources. I explained how the reduction is possible. I did not give you working algorithms for achieving it, because that's a long step from outlining the principles, not one called for in ultimately inconsequential discussions such as this.

Some years down the line, there will most likely be programs delivering similar performance with just a fraction of the resources. We are not dealing with exact science here that would involve proving theorems, as much as with odds. If it happens, back bets such as IBM's will win, and lay bets will lose. Both of us are betting $0, and I'm afraid we'll have to settle on agreeing to disagree. I bet my $0 on the mission of improving text processing algorithms succeeding and delivering solutions requiring much more practical amounts of resources; you, as I understand it, bet on them not improving, or not improving substantially.


Quote: weaselman

Yes. But why do you think that if it wasn't, it would be (1) significantly smaller, and (2) not significantly worse content-wise? You just think so, right?


Human contestants did not have access to terabytes of data. You can't read that much in your entire life. Not that reading would be of very much help, because in-depth knowledge was not sought.
The sum of human knowledge, or even the fraction of it stored in wikipedia, far exceeds any individual's knowledge. Since individuals have demonstrated being highly effective at the task with only the limited knowledge they had, it has to be concluded that their limited knowledge was sufficient for competitive performance.


Quote: weaselman

Or too few benefit to be gained (you know the law of diminishing returns, right? ;)).


Yes, but it works in the opposite direction. The amount of data required to answer a given fraction of possible questions increases disproportionately, asymptotically approaching infinity for 100%.


Quote: weaselman

As to your next claim - that Watson program could run on only one of the servers instead of ten - it is a different story. I can't say, like in the 386 case, that it is plain impossible. It, probably, can run on only one of the servers. It would be able to answer less questions correctly, and would spend more time, looking for answers, but it would be able to do something meaningful enough to be characterized by a reasonable person as "it works".
Perhaps, we can agree on the above statement?


In some part. However, my claim is somewhat different: I expect that programs developed in the future, which may or may not be improved versions of Watsons or programs developed with its help, will be able to perform one of Watson's functions (answering trivia questions) using considerably smaller amounts of computing resources and delivering at least comparable performance.

I'm basing this expectation on three known things: 1) Watson's operating principle, noise averaging among multiple weak algorithms, which would require far fewer samples with even slightly stronger algorithms; 2) The history of other software developed to solve normally human games, such as chess engines, which achieved such improvements, and 3) Watson's primary purpose being developing stronger algorithms.
weaselman
weaselman
  • Threads: 20
  • Posts: 2349
Joined: Jul 11, 2010
February 25th, 2011 at 6:27:31 PM permalink
Quote: P90

Only its title. Which is in the alphabetical index.



Are we talking about creating a (global) alphabetical index now? Any other indexes you want to build as well? Or is this the only one you think you'll need?


Quote:

Of which it still considered a lot more than necessary.



I'll remind you the context.
- You said that Watson used too much resources, and that you thought you could use less
- I asked how
- You said you'd use a smaller database by scanning all entries instead of using indexes.
- I said it'd take too long
- You said, Watson searched the database multiple times
- I said it could do that exactly because it was using additional resources (for indexing in particular)
- You said it was using too much resources.

The circle (in your argument) is thus complete.

Quote:

Every operation performed by an algorithm that delivered the wrong answer was wasted, from theoretical efficiency standpoint.



Imagine that you are trying to solve a hard math problem. Do you just grab the right formula and plug in the numbers, or would you expect to try a few wrong approaches first, before you find the answer? If the latter, do you call it all "wasted from a theoretical efficiency standpoint"? It does not exactly lead you to the right answer ... But on the other hand, had you not done that work, you would never have found the answer either ...

Quote:

Every operation that found the right answer after it has already been found was also wasted, as it was redundant.



Now imagine, that you are working on a complicated equation, and just happen to think "maybe, the answer is 3". You plug 3 into the formula, and, sure enough, it is the answer. Would you stop your work at that point and declare victory or would you continue "wasting" your resources on looking for the analytical solution?

Quote:

In every case, only one of the algorithms was actually necessary.



Imagine, that you worked out a complex solution to a complicated problem, and you are not sure that it is correct. Are you more likely to just say "I have a solution, one is enough", or, perhaps, look for the confirmation of your answer by using another method or, at least, checking your math?



Quote:

Something like that. However, a much smaller database would help, as would only working one algorithm instead of thousands.



Nah. It would not. To the contrary - not having a comprehensive full text index on the database (to make it smaller, as you suggest) will not help the retrieval time; it will make it worse. That's what I said above. Watson could afford to do it because the database was very well indexed; you can't, because your database is small. If you want fast access, you have to have a huge database.



Quote:

That's how you would work in the optimized case. A 386 could do that much: run the city index against the alphabetical index.



No, not really. In the optimized case, you'd just have a b-tree index of titles, and navigate it in-order to the first "B" entry.
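
In Python terms (bisect standing in for the b-tree traversal; the titles are made up):

import bisect

TITLES = sorted(["Ablation", "Austin", "Baltimore", "Berlin", "Boston", "Chicago"])

def titles_starting_with(prefix):
    lo = bisect.bisect_left(TITLES, prefix)
    hi = bisect.bisect_left(TITLES, prefix + "\uffff")   # just past the prefix range
    return TITLES[lo:hi]

print(titles_starting_with("B"))   # ['Baltimore', 'Berlin', 'Boston']

One sorted index covers every letter (and every prefix), so there is nothing to hand-pick.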

Quote:

That is not quite how Watson worked, however.


It is not how you said it, no. It is exactly how I described it in the previous post. The important thing here is to understand, that the two descriptions are fundamentally different. Do you see the difference?

Quote:

Watson was, for the most part, unaware of the question's meaning.



Watson is a machine, it does not have awareness. It is unaware of everything. Just like any other machine.

Quote:

The approach above was probably used by some of the algorithms.


The multiple algorithms you read about are used for text analysis. The database navigation is a lot more straightforward; there are not too many ways to do it - not realistic ones, anyway. I assure you, it did it exactly the way I described.

Quote:

But it would not work for finding a city that has the most words starting with "B" in its motto.


Yeah, it would. You'd do the same thing, except look for "motto" tags instead of "name" tags. You see, this is exactly the thing about "my" approach versus "yours" - the tag cloud is suitable for answering any question, even a question that has never occurred to any of the programmers, whereas "your approach" is limited by whatever questions/categories you deemed "relevant"/"useful" when creating the program. That is exactly why "mine" works, and "yours" does not.

Quote:

Other algorithms had to look into the entries themselves.



I don't know what those algorithms are, and I don't know what makes you so sure. But what I can tell you is that looking at the entries in general is considered a big taboo in the data retrieval world. An algorithm that needs to look at the actual entries before the final candidate list (maybe 10-20 documents) is considered immature and naive. It is very much as if you tried to use bubble sort to sort an array. For that reason, I highly doubt that what you are saying is true. I would be very surprised if any of the data selection (as opposed to final decision) algorithms looked at the contents of the documents in the database. In fewer words, I think you are wrong.


Quote:

If the Watson knew how to select only the right algorithms for the question,


If my grandmother had balls ...


Quote:


Because Watson was not constructed to win Jeopardy, it was constructed to field-test thousands of algorithms.


You don't need to play a live game to test the algorithms. Get a few thousand questions from the archive, run the algorithms on them, and select the best one(s). Then, unplug the "extra" servers, and play the game. Why do you think they did not do that? They actually did. There were more algorithms in the beginning.
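
Schematically, that kind of offline selection is trivial to set up (a rough sketch; the archive format and algorithm registry are placeholders):

def evaluate(algorithms, archive):
    """algorithms: name -> callable; archive: list of (clue, correct_answer) pairs."""
    scores = {}
    for name, alg in algorithms.items():
        correct = sum(1 for clue, answer in archive if alg(clue) == answer)
        scores[name] = correct / len(archive)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def keep_best(algorithms, archive, k=10):
    ranked = evaluate(algorithms, archive)
    return {name: algorithms[name] for name, _ in ranked[:k]}

No live game needed - just an archive of past clues and a scoreboard.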

Quote:

An eight-thousand-CPU, 15-TB computer is expensive. Such computers are going to remain expensive for a long time.


For you and me, yeah... For a business like Google, it's peanuts. They spend more on the dry-cleaning they provide to their employees as a benefit.

Quote:

The end goal of research Watson is used for is not to build one cool supercomputer, it's to bring natural language interaction technology into the mass market, where it will not have the benefit of a supercomputer's resources.


You are still living in the 20th century. The "mass market" for desktop applications of tomorrow is in the cloud. It is the market of supercomputers.

Quote:

It didn't even try


Right, because it did not know how. And you are not telling.

Quote:


I explained why it is possible to use less resources.


No, you did not. You just said, that you thought it would be possible if such-and-such thing was done, but provided no reasoning as to why such-and-such thing was possible to do nor to why it would indeed be efficient in reducing the amount of resources.

Quote:

Human contestants did not have access to terabytes of data.


Nor did they win the game.


Quote:

Yes, but it works in the opposite direction. The amount of data required to answer a given fraction of possible questions increases disproportionately, asymptotically approaching infinity for 100%.


Only when the domain of possible questions is known and well defined.

Quote:


In some part. However, my claim is somewhat different: I expect that programs developed in the future, which may or may not be improved versions of Watsons or programs developed with its help, will be able to perform one of Watson's functions (answering trivia questions) using considerably smaller amounts of computing resources and delivering at least comparable performance.



I can't argue about the future. I lack the clairvoyant abilities necessary for that. As for the present, I maintain that Watson is a piece of cutting-edge engineering and technology. Given the existing hardware (leave alone what existed 5-10 years ago), this is about as good as it gets. Maybe the database size could have been cut in half by some ingenious programming, maybe half of the CPU power could have been cut at the expense of a few thousand man-hours and a few extra seconds spent on some answers. But that's about it. It can't be made to run on a PC, and it can't use "a fraction" of the resources. That's just wishful thinking at this point.
"When two people always agree one of them is unnecessary"
P90
P90
  • Threads: 12
  • Posts: 1703
Joined: Jan 8, 2011
February 26th, 2011 at 12:48:35 AM permalink
Quote: weaselman

Are we talking about creating a (global) alphabetical index now? Any other indexes you want to build as well? Or is this the only one you think you'll need?


An alphabetical index was required from the beginning.
Yes, a limited number of indexes, deemed the most critical, in a limited-size database.


Quote: weaselman

I'll remind you the context.
- You said that Watson used too much resources, and that you thought you could use less
- I asked how
- You said you'd use a smaller database by scanning all entries instead of using indexes.
- I said it'd take too long
- You said, Watson searched the database multiple times
- I said it could do that exactly because it was using additional resources (for indexing in particular)
- You said it was using too much resources.


Let me correct.

- I said that Watson used more resources than necessary for the task (of winning the show), that a computer winning Jeopardy does not signify a big step for humanity, and that it could have been done long ago if there was a lot of need for it. The cost for using less computer resources would be using more human effort.
- You asked how
- I said you could categorize the questions and use customized algorithms depending on the question, and use a much smaller database by only including information on subjects the show tends to asks about.
- You said there were no other algorithms possible and the database was the smallest possible for the task
- I provided a couple crude examples of how particular kinds of questions asked on the show could be solved in a different way
- Your response was that a computer could never understand the questions and so this would not be possible
- At some point you insisted that the problem could not be solved with any less resources than Watson used, or at least no less than 1/3 of them; that Watson only stores the information it needs for the show; and that storing a simple paper encyclopedia would require at least a 100TB database:
"Doing that will require at least 100 times as much space as storing unstructured data. Here we are, same 100 terabyte ballpark we had with wikipedia."
- I mentioned that Watson, in fact, does store the entire text of Wikipedia, and its database with all indexes only takes 4TB

Past which point, since you insisted that Watson's resources were the minimum possible amount, I abandoned the point about exactly how much would suffice, and argued only that considerably less resources than Watson's were required.

As to the presumed circle itself:
- I said Watson had so much resources it could afford searching the database multiple times, for each of its thousands of algorithms
- You said it could do that exactly because it was using additional resources (for indexing in particular)
- I said it would need less resources if it didn't need to search the database multiple times.

There is no circle here. My argument is that Watson has much more resources than required for merely solving Jeopardy questions, even doing it with the same methods, and that less resources will be required with a smaller, better selection of improved algorithms.


Quote: weaselman

Imagine, that you are trying to solve a hard math problem. Do you just grab the right formula and plug in the numbers, or would you expect to try a few wrong approaches first, before you find the answer? If the latter, do you call it all "wasted from theoretical efficiency standpoint"?


Yes, it's wasted effort, in terms of efficiency. If you were a better mathematician, you could arrive at the right formula sooner. But that's not quite what happens in case of Watson, as it isn't inventing new approaches, just trying existing ones.

Imagine, rather, that you are trying to send a message across a noisy channel. At first you try explaining in detail, and you only get "what?" in response. Then you try other things, down to screaming "Guess who I just seen naked!" repeatedly over the phone, trying different intonations, speaking slower and faster. Finally, after a few thousand tries, you get it across.

All the times you tried to send the message, until the time you finally used the correct method, were wasted. Furthermore, even the repeats you used were wasted, because you didn't have any better method to introduce data robustness. Oh, if only you weren't so busy reading Hustler in these boring Morse code classes. Or if it just occurred to you to scream "Golf! Uniform! Echo! Sierra! Sierra!"

You were fighting against the noise of your channel by sending a large number of messages until one got through. The methods you used, both for sending the message and for defeating the noise, were highly inefficient. There exists a method of using the noisy channel's capacity to the limit, but you don't know it. In fact, there exists a way of determining the channel's limit, known as the Shannon limit, and there exist error correction algorithms that could get your message across in one go, using the smallest number of symbols possible while ensuring a preset error probability.
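
For reference, the limit itself is just a formula, C = B * log2(1 + S/N); a quick Python check with arbitrary numbers:

import math

def shannon_capacity(bandwidth_hz, signal_to_noise_ratio):
    return bandwidth_hz * math.log2(1 + signal_to_noise_ratio)

# a 3 kHz voice line with an SNR of 1000 (30 dB)
print(round(shannon_capacity(3000, 1000)))   # about 29,900 bits per second

Everything sent beyond what that bound requires is, in theory, wasted effort.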

All the effort trying to send the message in different ways, as well as the additional effort used because you didn't know the most efficient methods, was wasted, in theory. In practice, you would perhaps not consider it wasted, because you got the message across the best you could with your limited skill.


Quote: weaselman

Now imagine, that you are working on a complicated equation, and just happen to think "maybe, the answer is 3". You plug 3 into the formula, and, sure enough, it is the answer. Would you stop your work at that point and declare victory or would you continue "wasting" your resources on looking for the analytical solution?


Depends on what my goal is.
If my goal is winning a game show, I would buzz in and declare victory.
If my goal is to develop analytical solutions, I would work on one.


Quote: weaselman

Imagine, that you worked out a complex solution to a complicated problem, and you are not sure that it is correct. Are you more likely to just say "I have a solution, one is enough", or, perhaps, look for the confirmation of your answer by using another method or, at least, checking your math?


Depends. How much time do I have left? How reliable was my initial method?
If my initial method was wrong 49% of the time and right 51% (for a yes/no question), then, certainly, I'd need to check and recheck.
If my initial method was 95% accurate, not always.


Quote: weaselman

To the contrary - not having a comprehensive full text index on the database (to make it smaller as you suggest)


I suggested making the database smaller by including less information in it, fewer articles, shorter articles, etc, not by not indexing it.
Then, of course, a database with less information would need fewer indexes for the same performance as well.


Quote: weaselman

You don't need to play live game to test the algorithms.


Of course. Playing the game was just showing off. It's not what Watson was built for, it's not what it's doing most of the time, it's not what it will be doing in the future.


Quote: weaselman

Get a few thousand questions from the archive, run the algorithms on them, and select the best one(s). Then, unplug the "extra" servers, and play the game. Why do you think they did not do that? They actually did.
There were more algorithms in the beginning.


And there will be fewer in the end.

Watson is not finished. There will be fewer algorithms in the end. Better ones will survive, worse ones will be thrown out, ones that can be improved will be improved.

With fewer algorithms, you need to do fewer searches. With fewer searches, you can afford slower searches. Then you can have a smaller database and less processing power, and still perform as well or better.


Quote: weaselman

You are still living in the 20th century. The "mass market" for desktop applications of tomorrow is in the cloud. It is the market of supercomputers.


The world domination of thin clients is still a long time away, if it even comes.

When a user is trying to tell his computer to change the default gateway of his internet connection, the last thing he wants to hear is "Servers unavailable. Please sod off."


Quote: weaselman

No, you did not. You just said, that you thought it would be possible if such-and-such thing was done, but provided no reasoning as to why such-and-such thing was possible to do nor to why it would indeed be efficient in reducing the amount of resources.


The reasoning I provided to why it would most likely be possible is that it partially relies on algorithms Watson already has.
The reasoning I provided to why it would indeed be efficient in reducing the amount of resources is that:
1) Fewer algorithms->fewer searches per second->smaller database and less processing power are sufficient;
2) It worked in chess, reducing the amount of resources incomparably.


Quote: weaselman

Nor did they win the game.


Really? I thought they did. Against other humans.
You only need to do somewhat better, not orders of magnitude better.


Quote: weaselman

I can't argue about the future. I lack clairvoyant abilities necessary for that.


Never bet on anything? You don't need clairvoyance to estimate what is more probable.

Based on how overkill Watson already is for the much lesser task of winning a game show, based on knowing the past and how we learned to accomplish tasks with less resources as algorithms improved, and based on how we are putting effort in improving these algorithms (Watson being designed for that effort), I would estimate as more probable that the current research will eventually result in algorithms delivering similar performance in answering natural language questions while using a fraction of resources.
weaselman
weaselman
  • Threads: 20
  • Posts: 2349
Joined: Jul 11, 2010
February 26th, 2011 at 8:11:16 AM permalink
Quote: P90

An alphabetical index was required from the beginning.
Yes, a limited number of indexes, deemed the most critical, in a limited-size database.


Ok. So which indexes do you deem the most critical?

Quote:


- I said that Watson used more resources than necessary for the task (of winning the show), that a computer winning Jeopardy does not signify a big step for humanity, and that it could have been done long ago if there was a lot of need for it. The cost for using less computer resources would be using more human effort.



The "big step for humanity" part is news to me. I never said it was a "big step for humanity", to the contrary, I said repeatedly, that it was rather unimpressive for the amount of buzz it produced.
Regarding "long ago", I thought you took that claim back?
As for "more resources than necessary", yes, that's what you said, and I reminded.

Quote:


- You asked how
- I said you could categorize the questions and use customized algorithms depending on the question, and use a much smaller database by only including information on subjects the show tends to asks about.
- You said there were no other algorithms possible and the database was the smallest possible for the task
- I provided a couple crude examples of how particular kinds of questions asked on the show could be solved in a different way
- Your response was that a computer could never understand the questions and so this would not be possible
- At some point you insisted that the problem could not be solved with any less resources than Watson used, or at least no less than 1/3 of them; that Watson only stores the information it needs for the show; and that storing a simple paper encyclopedia would require at least a 100TB database:
"Doing that will require at least 100 times as much space as storing unstructured data. Here we are, same 100 terabyte ballpark we had with wikipedia."
- I mentioned that Watson, in fact, does store the entire text of Wikipedia, and its database with all indexes only takes 4TB



This is a different line of the discussion. I already responded to that. I have never pretended to give you the exact amount of required memory, just to illustrate that it would not fit into a 386. I did not say that it could not be solved with any less resources. I said that it could not be solved, like you claimed, with a fraction of the resources. It could not be solved "long ago", it could not be solved on a 386 computer, it could not be solved on a 32-bit CPU, it could not be solved with a few gigs of memory, etc., etc.

But this is all irrelevant. The context of this line of discussion was you trying to embed full text search into your approach. You wanted to have some structured indexes and scan the entire database to answer questions not covered by the indexes. I told you that scanning the whole database sequentially can't be done in real time. You said that Watson does it; I said that it does not, it uses the tag clouds and full text indexes that you want to get rid of to save resources.

This was the context.
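To make the distinction concrete, here is a toy full-text (inverted) index; it is nothing like Watson's actual structures, just an illustration of why an indexed lookup does not have to touch every document the way a sequential scan does. The documents are invented.

    from collections import defaultdict

    # Three made-up "documents" standing in for encyclopedia articles.
    docs = {
        1: "Toronto is the largest city in Canada",
        2: "Chicago lies on the shore of Lake Michigan",
        3: "O'Hare airport was named after a WWII flying ace",
    }

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)

    # A query touches only the posting lists of its terms,
    # not every document in the collection.
    query = ["airport", "named"]
    hits = set.intersection(*(index[t] for t in query))
    print(hits)  # {3}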


Quote:

Past which point, since you insisted that Watson's resources were the minimum possible amount,


Ballpark, not amount

Quote:

As to the presumed circle itself:
- I said Watson had so much resources it could afford searching the database multiple times, for each of its thousands of algorithms
- You said it could do that exactly because it was using additional resources (for indexing in particular)
- I said it would need less resources if it didn't need to search the database multiple times.
There is no circle here.



No, the circle was this: you said Watson could get by with less resources, and tried to demonstrate an approach with a smaller database that would need to be sequentially scanned; when I said that was impossible, you said that Watson does it; to which I said that Watson can do it exactly because it has more resources; and you responded that yeah, but it does not need them.
This is a circle, because it ends exactly where it started.


Quote:

My argument is that Watson has much more resources than required for merely solving Jeopardy questions, even doing it with the same methods, and that less resources will be required with a smaller, better selection of improved algorithms.


Yes. And to support that argument, you tried to demonstrate how a smaller database could work, but then faced the necessity of a full scan, which can't be done in real time, and insisted that it could, because Watson does it. But, no, Watson doesn't do full scans, it does not need them exactly because it has more resources.
Now, when you respond to this with something like "but it has too many resources, it could do it with a fraction of them", it completes the circle.




Quote:

Yes, it's wasted effort, in terms of efficiency.


No, it isn't

Quote:

If you were a better mathematician, you could arrive at the right formula sooner.


No. I would arrive at a formula sooner if I just happened to guess the right formula. You don't need to be a good mathematician for that, you just need to be lucky. A mathematician needs to look for the solution by exploring different paths. You can't call it wasted effort, because it is the effort that takes you to the goal. If you did not expend that effort, the goal would not be achieved.

Quote:

But that's not quite what happens in case of Watson, as it isn't inventing new approaches, just trying existing ones.


Yeah, that's exactly what mathematicians do most of the time actually. Inventing a new approach is an event, that happens, I dunno, twice in a century, maybe. Most of the time, you just work with the existing methods, combining them, trying different parameters, and different variations to see what better fits the problem you are working on.

Quote:

Imagine, rather, that you are trying to send a message across a noisy channel.


I am afraid, I don't see the analogy.


Quote:

Depends on what my goal is.
If my goal is winning a game show, I would buzz in and declare victory.


And you'd lose, because the equation happens to have six different answers, but you don't know that (since you did not find the solution analytically), and found only one.

Quote:

If my goal is to develop analytical solutions, I would work on one.


The goal is, like I said before, to solve an equation. Actual motivation (game show, world hunger, alien spaceship accident) is irrelevant.


Quote:

Depends. How much time do I have left? How reliable was my initial method?


I don't know, you tell me. When you are solving challenging problems, do you find it a waste to double check your solution by solving it a different way or, at least, by retracing your steps and verifying the technical details?
It does not matter how much time you have, I am not asking if it is possible. I am asking if you think it would be a waste.


Quote:

I suggested making the database smaller by including less information in it, fewer articles, shorter articles, etc, not by not indexing it.



No, you suggested getting rid of tag clouds and creating specialized structured indexes instead. Throwing out information isn't going to reduce the size all that much. Even if you could halve the database with that method, it's peanuts, still the same ballpark.

Quote:

Then, of course, a database with less information would need fewer indexes for the same performance as well.


This is wrong. It would not need fewer indexes. The indexes would be smaller if you throw away entire articles, but there would be the same number of them - and it would still be massive.


Quote:

Of course. Playing the game was just showing off. It's not what Watson was built for, it's not what it's doing most of the time, it's not what it will be doing in the future.


The point is that if your assertion about needing multiple algorithms only as a field test was right, Watson would not need to use them in the live show. It was not a test, it was, like you point out correctly, a show off.

Quote:

And there will be fewer in the end.


"In the end" of what? The Universe?

Quote:


Watson is not finished. There will be fewer algorithms in the end. Better ones will survive, worse ones will be thrown out, ones that can be improved will be improved.


That's just pure speculation. Maybe. Maybe not. I can't argue about the future.
This discussion started as a discussion about the past and present technology capacities, and it was interesting while it remained on topic. Now it seems to be drifting more toward science fiction, which is not really my realm.


Quote:


With fewer algorithms, you need to do fewer searches. With fewer searches, you can afford slower searches.
Then you can have a smaller database and less processing power, and still perform as well or better.


You are making several separate logical mistakes here. First, the searches are run in parallel. If you make them slower, the whole thing will become accordingly slower: even with fewer of them, you will get your answer only as quickly as the slowest of the searches finds it, and no faster, even if you only have one search. Second, you don't run as many searches as there are algorithms (most algorithms are applied as filters to streams of results split off from the same search), so decreasing the number of algorithms won't reduce the number of searches as much as you seem to think it would (to reduce the number of searches even by one, you'd have to get rid of a whole class of algorithms, which is really, really unlikely to be possible without affecting the quality of the results). Third, even if you managed to decrease the number of searches by half, it wouldn't make any impact on the required resources that could be called "significant" by any stretch of the imagination.
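To picture the first point, here is a toy timing sketch (the durations are invented, this is obviously not Watson's code): with enough workers, the wall time tracks the slowest search, not the number of searches.

    import time
    from concurrent.futures import ThreadPoolExecutor

    def search(duration):
        time.sleep(duration)      # stand-in for one search over the data
        return duration

    def wall_time(durations):
        start = time.time()
        with ThreadPoolExecutor(max_workers=len(durations)) as pool:
            list(pool.map(search, durations))
        return time.time() - start

    print(wall_time([0.2] * 8))   # eight parallel 0.2 s searches -> ~0.2 s wall time
    print(wall_time([0.8] * 2))   # two slower searches -> ~0.8 s, despite being fewer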


Quote:

The world domination of thin clients is still a long time away, if it even comes.


I bet you, it is closer than the world domination of AIs running on desktop PCs :)

Quote:

When a user is trying to tell his computer to change the default gateway of his internet connection


What does this have to do with anything?



Quote:

Really? I thought they did. Against other humans.


Yes, against other humans, who did not have terabytes of data available to them either.

Quote:

You only need to do somewhat better, not orders of magnitude better.


Yes. If only you knew how to do that ...

Quote:

Never bet on anything? You don't need clairvoyance to estimate what is more probable.


I can estimate what is more probable. I cannot argue about it with enough confidence, and more importantly, I don't find it very interesting to argue about things I don't know.
"When two people always agree one of them is unnecessary"
P90
P90
  • Threads: 12
  • Posts: 1703
Joined: Jan 8, 2011
February 26th, 2011 at 9:17:01 AM permalink
Quote: weaselman

Ok. So which indexes do you deem the most critical?


Ones that have the greatest effect on search speed. If going from an existing database, for instance, trim the least-used indexes, or compare search performance with trimmed indexes, until you arrive at the target database size.
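As a rough sketch of that trimming idea (the index names, sizes and usage counts below are all invented, not Watson's): keep the most-consulted indexes first and stop once the size budget is reached.

    # name: (size in GB, lookups per day) -- all numbers invented for illustration
    indexes = {
        "title_prefix":   (20, 90_000),
        "category":       (35, 40_000),
        "year_published": (15,  2_000),
        "author_name":    (25,    800),
    }

    budget_gb = 60
    kept, used_gb = [], 0
    # Most-used indexes first; stop adding once the budget is reached.
    for name, (size, hits) in sorted(indexes.items(), key=lambda kv: -kv[1][1]):
        if used_gb + size <= budget_gb:
            kept.append(name)
            used_gb += size

    print(kept, used_gb)   # ['title_prefix', 'category'] 55

The same loop could just as well be driven by measured query latency instead of raw lookup counts.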


Quote: weaselman

Regarding "long ago", I thought you took that claim back?


Just not that long ago. Maybe a decade.


Quote: weaselman

This is a different line of the discussion. I already responded to that. I have never pretended to give you the exact amount of required memory, just to illustrate that it would not fit into a 386. I did not say that it could not be solved with any less resources. It could not be solved "long ago", it could not be solved on a 386 computer, it could not be solved on a 32-bit CPU, it could not be solved with a few gigs of memory, etc., etc.


The difference between requiring 100TB for storing Britannica and 4TB for storing more than Wikipedia calls for quite a substantial revision of the estimate. Wikipedia contains over 2 billion words; assuming it comprised about half the data, that's over 4 billion words. Britannica has about 44 million, a 100-fold difference.
As such, storing Britannica with the same amount of indexing as Watson's database used would have only required ~44GB of storage, not 100TB. This is before we account for a smaller database and fewer searches per second being feasible to implement with fewer indexes. While 44GB is not quite 386 territory, it is within the range of 32-bit systems with PAE.
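Spelling that arithmetic out (same assumptions as above: the text is roughly half of Watson's 4TB, about 4 billion words of it, against Britannica's ~44 million, with the same indexing overhead):

    wikipedia_words = 4_000_000_000    # assumed: Watson's plain text, ~half its data
    britannica_words = 44_000_000
    watson_db_gb = 4_000               # Watson's database with all indexes

    ratio = wikipedia_words / britannica_words   # ~91x less text in Britannica
    britannica_db_gb = watson_db_gb / ratio      # same index-to-text ratio assumed
    print(round(ratio), round(britannica_db_gb)) # 91 44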


Quote: weaselman

Yes. And to support that argument, you tried to demonstrate how a smaller database could work, but then faced the necessity of a full scan, which can't be done in real time, and insisted that it could, because Watson does it. But, no, Watson doesn't do full scans, it does not need them exactly because it has more resources.


I never actually suggested a non-indexed database, just a reduced amount of indexing (of a much smaller initial data amount). Although I did go overboard at first, suggesting manually added purpose-designed indexes only; as I already said, that was largely because of my mistaken initial impression that far more of the show's questions were simple ones, which seeing more episodes corrected.


Quote: weaselman

No. I would arrive at a formula sooner if I just happened to guess the right formula. You don't need to be a good mathematician for that, you just need to be lucky. A mathematician needs to look for the solution by exploring different paths.


A good mathematician will need fewer paths. A freshman may need to go through the whole book and a few more books, unable to decide based on experience what approaches are the most fitting.
There are few problems that haven't been solved before in some other way or form; the difficult problem being worked on is most likely similar to at least one already solved problem.


Quote: weaselman

I am afraid, I don't see the analogy.


The way Watson answered show questions involved picking rare bits of data out of a massive amount of noise (created by less effective algorithms that will be discarded later, as were many before them).


Quote: weaselman

I don't know, you tell me. When you are solving challenging problems, do you find it a waste to double check your solution by solving it a different way or, at least, by retracing your steps and verifying the technical details?


I find it a waste to 1,000-check my solution by solving it in 1,000 different ways.

Of course, that is because I can have a higher than 50% confidence in my solutions. If my accuracy was akin to throwing darts blindfolded, I might well need a thousand throws to make sure I hit what I was aiming for.


Quote: weaselman

This is wrong. It would not need fewer indexes. The indexes would be smaller if you throw away entire articles, but there would be the same number of them - and it would still be massive.


No, a smaller database actually needs fewer indexes.

This is especially so when entire logical groups have been removed. If your database does not include chemical information, it will not have indexes by number of carbon rings or nitrile groups. If your database includes chemical information, it will need these indexes.


Quote: weaselman

The point is that if your assertion about needing multiple algorithms only as a field test was right, Watson would not need to use them in the live show. It was not a test, it was, like you point out correctly, a show off.


I agree. It did not need to use as many algorithms as it did in the live show. But, on the other hand, why not use them? It's not like shaving a few milliseconds off the buzz time would have mattered much against humans who have a minimum trigger time of 140ms.


Quote: weaselman

"In the end" of what? The Universe?


Last I heard, Watson hasn't been dismantled yet. This means its work will continue (and at some point will be finished, or carried over to different systems). I don't for a moment believe it has reached its terminal evolution, seeing how it pretty much just started.


Quote: weaselman

That's just pure speculation. Maybe. Maybe not. I can't argue about the future.
This discussion started as a discussion about the past and present technology capacities, and it was interesting while it remained on topic. Now it seems to be drifting more toward science fiction, which is not really my realm.


Well, it always was at least alternate history. In here, natural language processing has been a low-priority goal for computers for most of their development. Only recently has it started gaining momentum. As we proceed into the future, development in this field will produce better algorithms with lower requirements. The same could have been achieved by putting major effort into the field much earlier on, even working within the confines of the technology of the time.


Quote: weaselman

First, the searches are run in parallel. If you make them slower, the whole thing will become accordingly slower, even if you have fewer of them. Second, you don't run as many searches as there are algorithms (most algorithms are applied as filters to streams of results split off from the same search), so decreasing the number of algorithms won't reduce the number of searches as much as you seem to think it would.


Perhaps not quite linearly. However, the way I understood it, each of Watson's cores ran four threads, and it was generally one algorithm per thread. Reducing the number of algorithms would, at the very least, cut CPU count requirements linearly or near-linearly.
I'm aware that most searches in question are parallel, but, once again, one of the key things I proposed as requirements for adapting to a slower system would be running with a smaller database (smaller due to less information), which very much does cut down on time required; or, in this case, compensates for slower RAM.


Quote: weaselman

Yes, against other humans, who did not have terabytes of data available to them either.


Yes. But there was no second Watson-like computer to compete against. Just other humans. Both a 10-1 match and a 4-2 are a victory.


Quote: weaselman

Yes. If only you knew how to do that ...


I don't. People working on Watson do. Today's Watson would probably play stronger than when it played originally. It will improve further.
weaselman
weaselman
  • Threads: 20
  • Posts: 2349
Joined: Jul 11, 2010
February 26th, 2011 at 11:43:43 AM permalink
Quote: P90

Ones that have the greatest effect on search speed. If going from an existing database, for instance, trim the least-used indexes, or compare search performance with trimmed indexes, until you arrive at the target database size.



You are not going from the existing database. You are going from a tag cloud to a structured search. There is nothing to trim. You need to create some indexes. How many do you think you will need? How will you choose which ones to create? What will they be?
Oh, and they all have the same effect on search speed.


Quote:

Just not that long ago. Maybe a decade.


No, a decade ago terabytes of RAM and gigabit networks were not around.


Quote:


The difference between requiring 100TB for storing Britannica and 4TB for storing more than Wikipedia calls for quite a substantial revision of the estimate.


No, it does not. Same ballpark.

Quote:

Wikipedia contains over 2 billion words;


With revision histories, comments, and different languages - perhaps. You don't need all of that. Counting only the current revisions of the English-language articles, ignoring the stubs and disambiguation pages, would be about the same size as Britannica. Maybe, twice as large ... or three times, not more. Same ballpark.
The whole point is that an encyclopedia is an encyclopedia. You can store Britannica or Wikipedia or choose some other one; you'll arrive at about the same ballpark estimate for the data storage capacity.


Quote:

I never actually suggested a non-indexed database


We were looking at some particular examples of questions, and you said that answering some of them (australian locations) would involve indexes, while others (other stuff starting with "B") would have to end up searching the whole database.


Quote:

Although I did go overboard at first, suggesting manually added purpose-designed indexes only;


Ok, so you concede that point now? Not manual, and not purpose designed? What is your new approach then?


Quote:

A good mathematician will need fewer paths.


I am a good mathematician. I need a few paths. That was not the question. The question is do you consider exploring those paths a wasted effort or not.

Quote:

A freshman may need to go through the whole book and a few more books, unable to decide based on experience what approaches are the most fitting.



A freshman may not be able to find the solution at all. But that was not the question.

Quote:

There are few problems that haven't been solved before in some other way or form; the difficult problem being worked on is most likely similar to at least one already solved problem.


Exactly. The point is that it is usually similar to a lot more than just one. That's why you have to explore multiple paths, unless you just get lucky and stumble upon the right formula right away. But that's not scientific method, that's just guessing.


Quote:

The way Watson answered show questions involved picking rare bits of data out of a massive amount of noise (created by less effective algorithms that will be discarded later, as were many before them).


No, it did not.


Quote:


Quote: weaselman

I don't know, you tell me. When you are solving challenging problems, do you find it a waste to double check your solution by solving it a different way or, at least, by retracing your steps and verifying the technical details?


I find it a waste to 1,000-check my solution by solving it in 1,000 different ways.


First, that's just because of the way your brain works. It does actually check and recheck your solution over and over at every step, it just does it at a subconscious level, so you are not aware of it.
Second, 1000 or 500 or 50 or 10 - is just a quantitative difference. Qualitatively, the point is that redundancy is not a wasted effort, but just how the thought process works.
And third, the "thousands of algorithms" you read about aren't really (all) redundant. Think of it as one algorithm which consists of applying various text analysis strategies to the question, and taking the weighted average of their answers.


Quote:


Quote: weaselman

This is wrong. It would not need fewer indexes. The indexes would be smaller if you throw away entire articles, but there would be the same number of them - and it would still be massive.


No, a smaller database actually needs fewer indexes.



No, it does not. The number of indexes you need is a function of the complexity of the schema, not of the amount of data it is storing. We are not considering corner cases of a few hundred entries or so, where you just don't need any indexes at all.


Quote:

This is especially so when entire logical groups have been removed. If your database does not include chemical information, it will not have indexes by number of carbon rings or nitrile groups. If your database includes chemical information, it will need these indexes.


Actually, "my" database does not have indexes by number of carbon rings as it is. It just have the tag cloud.
I thought you already gave up the idea of hand-picked structured indexes. Are we now back to discussing it?

Quote:

I agree. It did not need to use as many algorithms as it did in the live show.


This is the opposite of what I said. Not sure why you are saying that you agree.

Quote:

But, on the other hand, why not use them?


Because it's a show off. An AI running on a PC and winning Jeopardy is a lot more of a show off, a lot more impressive, and, at the end of the day, a lot more profitable than a room full of high-end servers.

Quote:


Well, it always was at least alternate history.


Not for me, it wasn't. I took your claims literally, when you said it could have been done on a 16-bit CPU, then on a 386, then on a cell phone, then maybe on a desktop PC, then on just one of Watson's servers, then maybe on half of them ...
It was a long journey up the technology scale, but it never occurred to me that you might be speaking hypothetically or invoking alternate history (except for a short tangent about 6-bit computers).

Quote:

Perhaps not quite linearly. However, the way I understood it, each of Watson's cores ran four threads, and it was generally one algorithm per thread. Reducing the number of algorithms would, at the very least, cut CPU count requirements linearly or near-linearly.


Only assuming that all of the algorithms were entirely redundant, which is ridiculous. But even then, it is not as cut and dry as you seem to think.

Quote:

I'm aware that most searches in question are parallel, but, once again, one of the key things I proposed as requirements for adapting to a slower system would be running with a smaller database (smaller due to less information), which very much does cut down on time required; or, in this case, compensates for slower RAM.


It has nothing to do with what we were talking about in this line though.
I'll remind you the context again:
You were saying that reducing the number of algorithms would result in fewer searches, meaning that they could be done slower.
My response to that was that, first, fewer searches does not mean they can be slower, because they are parallel, and second, reducing the number of algorithms would not significantly reduce the number of searches.



Quote:

I don't. People working on Watson do.



Did they tell you so? Or are you reading minds too in addition to predicting the future? :)
If you think they know how to do that, what is your theory why they have not done it? Sabotage?
"When two people always agree one of them is unnecessary"
P90
P90
  • Threads: 12
  • Posts: 1703
Joined: Jan 8, 2011
February 26th, 2011 at 2:17:12 PM permalink
Quote: weaselman

With revision histories, comments, and different languages - perhaps. You don't need all of that. Counting only the current revisions of the English-language articles, ignoring the stubs and disambiguation pages, would be about the same size as Britannica. Maybe, twice as large ... or three times, not more. Same ballpark.


You either... have never seen a paper encyclopedia, or have never seen wikipedia?

Wikipedia used to have an article on every single pokemon, including nonexistent ones (until the Deletionist faction got to it). It still knows more about pokemon than you'd ever want to know even if you were a 7 year old wapanese kid. You can't begin to imagine the volume of information, mostly obscure, contained in there. In the last few years they started trimming it, also destroying its appeal as the potential ultimate first reference on everything, but when I was there, it looked unstoppable, containing more information on many subjects than the largest dedicated sites.

2 billion words is just the readable text of the static sections of current revisions of currently existing mainspace articles. It does not include markup, categories, templates, portals, metaspace, userspace, past revisions, deleted articles, anything else for that matter.


Quote: weaselman

We were looking at some particular examples of questions, and you said that answering some of them (australian locations) would involve indexes, while others (other stuff starting with "B") would have to end up searching the whole database.


Actually, first of all you would at most consider the part of the database that starts with "B". But I agree that a large number of indexes could speed the search up a lot, even for a small database.
Fortunately, at the revised size, keeping enough indexes in RAM alongside the main data would not be as difficult as I first thought; my earlier estimate did not account for the fact that Watson doesn't use its hard drives during play.


Quote: weaselman

I am a good mathematician. I need a few paths. That was not the question. The question is do you consider exploring those paths a wasted effort or not.


If there is or has ever been a mathematician that could solve the problem needing fewer paths, then, yes, theoretically they are wasted effort when calculating your efficiency.


Quote: weaselman

First, that's just because of the way your brain works. It does actually check and recheck your solution over and over at every step, it just does it at a subconscious level, so you are not aware of it.
Second, 1000 or 500 or 50 or 10 - is just a quantitative difference. Qualitatively, the point is that redundancy is not a wasted effort, but just how the thought process works.


Just like 4TB or 40GB or 4GB is just a quantitative difference.

A human does not need thousands of solutions. Even considering the brain does some things subconsciously, it's not that fast. A human brain can be considered the benchmark of efficiency for this; something that needs far more effort is proportionally less efficient. If human brain is for instance 10% efficient, then Watson is closer to 0.0001%. And as you know, the step between 0.0001% and 0.001% tends to be an easier one than between 1% and 10%.

While redundancy per se is not necessarily wasted effort, the amount of redundancy often is. There is a minimum required amount of redundancy for achieving an error rate approaching zero. It can be determined based on the signal/noise ratio. Even at same ratio, different noise cancellation algorithms have different efficiency, requiring different amounts of effort for the same result.


Quote: weaselman

And third, the "thousands of algorithms" you read about aren't really (all) redundant. Think of it as one algorithm which consists of applying various text analysis strategies to the question, and taking the weighted average of their answers.


You can think of it this way. That then simply makes it a low efficiency algorithm. Not "wrong", not "bad", just low efficiency. We can tell that because we have a device that solves the problem with just one or two strategies, just don't quite understand how it works.


Quote: weaselman

No, it does not. The number of indexes you need is a function of the complexity of the schema, not of the amount of data it is storing. We are not considering corner cases of a few hundred entries or so, where you just don't need any indexes at all.


A smaller database doesn't just mean taking a big database and emptying most of it. The schema selected depends on database purpose, and, among other things, can need to be more complex for larger databases.


Quote: weaselman

Actually, "my" database does not have indexes by number of carbon rings as it is. It just have the tag cloud.


Ouch. Only treating chemical composition as plaintext is gonna hurt when trying to answer chemistry questions. Good thing Jeopardy doesn't ask them. PubChem probably has to struggle a lot, being asked them all the time.


Quote: weaselman

Because it's a show off. An AI running on a PC and winning Jeopardy is a lot more of a show off, a lot more impressive, and, at the end of the day, a lot more profitable than a room full of high-end servers.


All of the winnings were donated to charities. No profit made.

It would perhaps be more impressive in some ways. [However, IBM is mostly known as a hardware company, so in terms of showing off how much power IBM servers with IBM's own processors have, an Opteron desktop would not be so impressive; that doesn't matter, though, I'm not implying any conspiracies.]
But, first of all, the research Watson is doing - which may well eventually result in desktops that can win game shows - is pretty new, so I don't think it could have been done so easily. Second, Watson has already been built, the money already spent. Third, I'm quite sure they could in fact win with only some of the servers - but why take apart a system that works? The effort of modifying software and rebuilding databases would not nearly be compensated for by the slight cut in shipping costs.


Quote: weaselman

Not for me, it wasn't. I took your claims literally, when you said it could have been done on a 16-bit cpu, then on 386, then, on a cell phone, then, maybe on a desktop pc, then on just one of the Watson's servers, then, maybe on half of them...
It was a long journey up the technology scale, but it never occurred to me that you might be speaking hypothetically or invoking alternate history (except for a short tangent about 6-bit computers).


Or let's say 36-bit computers, where word size was picked to fit six six-bit characters. But anyway.

How much resources would be needed depends on how efficient the algorithms are. Today, right now, in current timeline, perhaps you could only cut the number of servers in half. A couple decades down the line, or if the research had started a couple decades earlier, probably one of the servers. It's still over 1% of Watson's capacity, enough to run a hundred algorithms, enough memory to store an abridged database, and so on. Probably about what can be expected without breakthrough-level improvements.

Going from that to a desktop PC is a bigger step, because a major consideration is how much of the desktop's resources can actually be devoted to the task. As desktops grow in power, though, and as desktops are a major application for accepting natural language input, the technology will most likely reach them eventually, but it won't be answering questions about smurfs, only enabling communication with computer-literacy-impaired users.
As for the 16-bit CPU (BTW, I don't think I actually mentioned the 386 specifically), I took that back; it was based on insufficient understanding of how the show usually asks its questions.


Quote: weaselman

If you think they know how to do that, what is your theory why they have not done it? Sabotage?


They're doing it. Or, rather even, it is what they are doing. It will just take time to learn and implement the improvements.

Well, or they could just say "Well, guys, we've done it, break her up! Dibs on the heatsinks, and about that cabin in Rock Creek..."
thecesspit
thecesspit
  • Threads: 53
  • Posts: 5936
Joined: Apr 19, 2010
February 26th, 2011 at 2:46:26 PM permalink
IBM is solidly a services and solutions company now.
"Then you can admire the real gambler, who has neither eaten, slept, thought nor lived, he has so smarted under the scourge of his martingale, so suffered on the rack of his desire for a coup at trente-et-quarante" - Honore de Balzac, 1829
weaselman
weaselman
  • Threads: 20
  • Posts: 2349
Joined: Jul 11, 2010
February 26th, 2011 at 5:12:42 PM permalink
Quote: P90

You either... have never seen a paper encyclopedia, or have never seen wikipedia?
Wikipedia used to have an article on every single pokemon, including nonexistent ones



What's your point? That my 100TB estimate was wrong? I already said it was. I also said that I never intended it to be precise.
That Britannica is smaller than Wikipedia? Yes, it is, but the difference is not significant, and besides, it lacks lots and lots of information you'd need to play Jeopardy (such as the names of various pokemons among other stuff).

Quote:


Actually, first of all you would at most consider the part of the database that starts with "B".



So, you want to create an alphabetical index on titles? Is that it? Any other indexes? Or is this one enough?
I'll help you, what if the question was about things that end with "E" rather than start with "B"?

Quote:

But I agree that a large number of indexes could speed the search up a lot, even for a small database.
Fortunately, at the revised size, keeping enough indexes in RAM alongside the main data would not be as difficult as I first thought; my earlier estimate did not account for the fact that Watson doesn't use its hard drives during play.


You lost me. What are the indexes you want to store in RAM? Are we talking about stuff that Watson used, or are you still pursuing your own approach?

Quote:

If there is or has ever been a mathematician that could solve the problem needing fewer paths, then, yes, theoretically they are wasted effort when calculating your efficiency.


And if not?
Quote:


Just like 4TB or 40GB or 4GB is just a quantitative difference.


It isn't if you are talking about real life hardware


Quote:

A human does not need thousands of solutions.


First, you don't really know that. A human does not need to recognize or acknowledge thousands of solutions. It does not mean he does not need them at all.
And second, so what? The point was to demonstrate that redundant solutions are not wasteful, not to compare exact numbers between humans and computers. The human brain functions very differently from a computer. It has already been discussed at length.


Quote:


While redundancy per se is not necessarily wasted effort, the amount of redundancy often is. There is a minimum required amount of redundancy for achieving an error rate approaching zero. It can be determined based on the signal/noise ratio. Even at same ratio, different noise cancellation algorithms have different efficiency, requiring different amounts of effort for the same result.


And you simply declare that Watson's redundancy is above that level after reading a wikipedia page? Without seeing its software, without knowing much about its architecture, without knowing anything whatsoever about its database or the algorithms it uses, you just see the word "thousand" in some internet article, and immediately draw a conclusion that you expect to be taken seriously by anyone at all? Well ... That's bold ...

Quote:


Quote: weaselman

No, it does not. The number of indexes you need is a function of the complexity of the schema, not of the amount of data it is storing. We are not considering corner cases of a few hundred entries or so, where you just don't need any indexes at all.


A smaller database doesn't just mean taking a big database and emptying most of it. The schema selected depends on database purpose, and, among other things, can need to be more complex for larger databases.


No. Structured databases are always more complex than tag clouds.
And if you want to go down the "schema needs to be selected, that is not complex" road, you are going to have to suggest such a schema. You can start with the list of indexes you want to create, which I have been asking you about for a while and which you just keep ignoring.
The schema is not going to be simple. In fact, if it were not so tremendously (perhaps, infinitely) complex, people would not use tag clouds, because, you are correct, they are inferior from the efficiency standpoint.


Quote:

Ouch. Only treating chemical composition as plaintext is gonna hurt when trying to answer chemistry questions.


Like I just said above, yes, full text search engines are inferior to structured databases whenever you pick one particular class of questions. But you can't build a structured database that would be so kick ass efficient on ALL those classes together.
If I only needed to answer chemistry questions, I would, probably, not use tag clouds. There is the right tool for any task.



Quote:

All of the winnings were donated to charities. No profit made.



Don't be so nearsighted. It is a publicity stunt way more efficient than a Superbowl ad. There is a lot of profit coming IBM's way as a direct result of Watson playing Jeopardy.

Quote:

However, IBM is mostly known as a hardware company,


Known TO WHO? IBM hasn't been a hardware company for decades.

Quote:

so in terms of showing off how much power IBM servers with IBM's own processors have, an Opteron desktop would not be so impressive;


Opteron desktop, no. But how about an IBM desktop?


Quote:

But, first of all, the research Watson is doing - which may well eventually result in desktops that can win game shows - is pretty new, so I don't think it could have been done so easily.


No? Then what has your point actually been all along?

Quote:

Second, Watson has already been built, the money already spent.


Watson is a bunch of servers, and a computer program. If, as you claim, the program can be made on only one of those servers - there, you have 30 Watsons for the same money.

Quote:

Third, I'm quite sure they could in fact win with only some of the servers - but why take apart a system that works?


Because it is not "a system". It is a bunch of servers in a rack. These things are bult to be extendable and "hot-pluggable". If you need more power, you slide in another server, if you don't need it, you take it out.

Quote:

The effort of modifying software and rebuilding databases would not nearly be compensated for by the slight cut in shipping costs.


Slight cut? I thought you were talking about "a fraction" of resources, not a slight cut.

Quote:

Or let's say 36-bit computers, where word size was picked to fit six six-bit characters.


Why not just say "cyborgs"?

Quote:


How much resources would be needed depends on how efficient the algorithms are. Today, right now, in current timeline, perhaps you could only cut the number of servers in half.


I think, we are done then :)
The point of this whole argument was to refute your claims that this could have been done in the 60s ... in the 70s ... in the 80s ... on a 16-bit chip ... on a 386 ... on a cell phone ... on a modern desktop ... on just one of Watson's servers ... on a small fraction of Watson's servers ...
So, finally, you say, that, perhaps, the number of servers could be cut in half. Well, I don't dispute that, it, probably, could.
"When two people always agree one of them is unnecessary"
rxwine
rxwine
  • Threads: 212
  • Posts: 12220
Joined: Feb 28, 2010
February 26th, 2011 at 5:48:46 PM permalink
I don't know how big Wikipedia is, but it certainly has been filled with an amazing amount of information of questionable future value. Heh.
P90
P90
  • Threads: 12
  • Posts: 1703
Joined: Jan 8, 2011
February 26th, 2011 at 5:52:28 PM permalink
Quote: weaselman

What's your point? That my 100TB estimate was wrong? I already said it was. I also said that I never intended it to be precise.
That Britannica is smaller than Wikipedia? Yes, it is, but the difference is not significant, and besides, it lacks lots and lots of information you'd need to play Jeopardy (such as the names of various pokemons among other stuff).


Only if a 50-fold difference is "not significant". 25 times less index overhead, 50 times less space, now we are getting somewhere.

I doubt Jeopardy ever asked or will ever ask questions about Pokemon, or just anything particularly dorky for that matter. You actually only need the Micropedia section of Britannica, which is 12 volumes out of 30 (the Macropedia consists of just 300 highly in-depth articles), so the remaining 18 volumes would go towards trivia.


Quote: weaselman

So, you want to create an alphabetical index on titles? Is that it? Any other indexes? Or is this one enough?
I'll help you, what if the question was about things that end with "E" rather than start with "B"?


In a minimally-indexed database, a subset of entries that end with "E" would have to be generated out of the main index on the spot. When your database doesn't contain millions upon millions of entries, and so the main index is under a megabyte, it is not quite a geological timescale operation.
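A sketch of what "generated out of the main index on the spot" could look like (the titles are invented): either one linear pass over a small title index, or a second copy keyed on the reversed title, which turns an ends-with query into an ordinary prefix lookup.

    import bisect

    titles = sorted(["Baltimore", "Berlin", "Boise", "Bordeaux", "Brisbane"])

    # One-off pass over the (small) main index:
    ends_with_e = [t for t in titles if t.lower().endswith("e")]

    # Or keep a reversed-key copy, so "ends with e" becomes a prefix lookup:
    reversed_index = sorted(t[::-1].lower() for t in titles)
    lo = bisect.bisect_left(reversed_index, "e")
    hi = bisect.bisect_right(reversed_index, "e\uffff")
    print(ends_with_e)                               # ['Baltimore', 'Boise', 'Brisbane']
    print([r[::-1] for r in reversed_index[lo:hi]])  # same three, lowercased, in reversed-key order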


Quote: weaselman

You lost me. What are the indexes you want to store in RAM?


If you were working with a today's desktop, but a reduced amount of information, you could just store all of them.

You can arrive at searches taking months using hard drives and sequential search through a massively indexed database, but it doesn't work both ways: either you have a massively indexed database and it might not fit into RAM, but then you don't need sequential search, or you have a non-indexed database, but then it's small and fits into RAM.

At 7 bytes per word, counting spaces and punctuation, Britannica would take 308MB. A modern high-end desktop or low-end workstation has 2 CPU, three channels each, of DDR3 memory. Populating it with DDR3-2133 allows for 24 or 48GB. DDR3-2133 has a bandwidth of 17066 MB/s per module, for 102400 MB/s total. To read every byte of a 308-MB database, then, would take 0.003 seconds. Accounting for latencies isn't even necessary, because that task could be pipelined. While I'm not advocating sequential search, the time to perform it is not quite so staggering.

15 years ago, when 300MB was still a good amount of RAM, you would indeed have to choose between a minimally-indexed database in RAM and a well-indexed one on hard drives. PC-66 modules only provided around 530 MB/s each. For two CPUs, each on a single channel, it would then take roughly 0.3 seconds to go through it. Since that's not the end of it, the computer would not be blazing fast. However, to the point of only delivering an answer tomorrow - no, not to that point. Of course, database access speed might not be the bottleneck in such a system.
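The same arithmetic, spelled out (peak bandwidth figures, no latency, which is the same optimistic assumption as above; the PC-66 figure is per 64-bit channel):

    db_mb = 44_000_000 * 7 / 1e6            # ~308 MB of raw text at 7 bytes per word

    ddr3_channel_mbps = 2133 * 8            # DDR3-2133, 64-bit channel -> ~17,000 MB/s
    modern_mbps = 6 * ddr3_channel_mbps     # 2 CPUs x 3 channels each
    print(db_mb / modern_mbps)              # ~0.003 s to stream the whole text

    pc66_channel_mbps = 66 * 8              # PC-66 SDRAM, 64-bit bus -> ~528 MB/s
    old_mbps = 2 * pc66_channel_mbps        # two CPUs, one channel each
    print(db_mb / old_mbps)                 # ~0.29 s on late-90s hardware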



Quote: weaselman

And you simply declare that Watson's redundancy is above that level after reading a wikipedia page?


No, I've read a bit more on it. And I've seen its performance on the show, several parts, including the ones showing Watson's current "thinking". It was very hit-or-miss: it either had 96-99% confidence, or it had nothing resembling an answer and no confidence.
Such performance is indicative of a system that is not being compromised by noise.


Quote: weaselman

Optetron desktop, no. But how about IBM desktop?


What IBM desktop?


Quote: weaselman

Watson is a bunch of servers, and a computer program. If, as you claim, the program can be made on only one of those servers - there, you have 30 Watsons for the same money.


In fact you wouldn't; playing Jeopardy is not the only thing Watson can do, neither is it the most important thing it can do. Some time down the line, with effort, it will most likely be possible to make a specialized program suitable strictly for playing Jeopardy and nothing else, running on only one of these servers, and still winning the game, but it won't be a replacement for Watson.


Quote: weaselman

Because it is not "a system". It is a bunch of servers in a rack. These things are bult to be extendable and "hot-pluggable". If you need more power, you slide in another server, if you don't need it, you take it out.


And the database is stored... where? Or are you suggesting to just remove pieces of it completely arbitrarily? Whoa, that was easier than I thought.


Quote: weaselman

Why not just say "cyborgs"?


Well, if PDP-10 was a cyborg, then we could just say "cyborgs". I even have to agree, it's a fitting name for a cyborg, so let's say "cyborgs".


Quote: weaselman

So, finally, you say, that, perhaps, the number of servers could be cut in half. Well, I don't dispute that, it, probably, could.


*Right now*, with the exact algorithms we have now, specifically the ones Watson used, in the same exact way.

With either more efficient algorithms, which will become available in the future, or strict resource economy, which would have been necessary (even at the expense of some playing power) in the past, it could be done with a small fraction of the resources. Not necessarily quite as well, not necessarily easily, not necessarily standing up to handicaps, but well enough to play on the show and have a good fighting chance.
weaselman
weaselman
  • Threads: 20
  • Posts: 2349
Joined: Jul 11, 2010
February 26th, 2011 at 8:25:21 PM permalink
Quote: P90

Only if a 50-fold difference is "not significant". 25 times less index overhead, 50 times less space, now we are getting somewhere.



It's not a 50-fold difference. And it does have information, that you'll need anyway, that Britannica doesn't have.

Quote:

I doubt Jeopardy ever asked or will ever ask questions about Pokemon,


I don't know if it was asked. But, if it wasn't, it doesn't mean it won't be. I don't see any reason why it would not.

Quote:

You actually only need the Micropedia section of Britannica,


If you say so. I don't see any reason to believe that.

In any event, Britannica does not have info on recent political events, on new books, actors, music, movies, the 2000 and 2004 Olympic games, and on, and on, and on ...

Quote:

In a minimally-indexed database, a subset of entries that end with "E" would have to be generated out of the main index on the spot.


Exactly.
Now, how about entries that are books written in the 90s?
And how about a gazillion other possibilities?

Quote:

and so the main index is under a megabyte,


It isn't under a megabyte. Not even close.

Quote:

it is not quite a geological timescale operation.


yeah, it is.

Quote:

If you were working with a today's desktop, but a reduced amount of information, you could just store all of them.


What are they? Which indexes are you suggesting to store in RAM? How many do you think you'd need?
How many entries do you plan to have in the database? Let's take a closer look at how realistic your new claim is.


Quote:

At 7 bytes per word, counting spaces and punctuation, Britannica would take 308MB. A modern high-end desktop or low-end workstation has 2 CPU, three channels each, of DDR3 memory. Populating it with DDR3-2133 allows for 24 or 48GB. DDR3-2133 has a bandwidth of 17066 MB/s per module, for 102400 MB/s total. To read every byte of a 308-MB database, then, would take 0.003 seconds. Accounting for latencies isn't even necessary, because that task could be pipelined. While I'm not advocating sequential search, the time to perform it is not quite so staggering.


You are not storing words, you are storing documents, there needs to be some overhead. At least double the size, probably more.
You don't have enough info in Britannica, the additional stuff is probably at least about as large.
You don't just need to read every byte, you need to parse and analyze them. That involves moving the bytes to and from the registers, combining them, performing operations on them. You can't pipeline that, because the operations need to be done sequentially. You need to not only read stuff from memory, but also write back the results. You then need to read the results back to perform additional operations. In short, you are way off on your estimates.
Maybe, if your database is really minimalistic, your server is really powerful, and you get lucky, you could read "every byte of it" in 10 seconds or so. That's too much already. But you also need time to analyze the question, and to analyze every document you read, and to pick the right one.
Give up already, this is hopeless.

Oh, wait ... You DID give up already, when you said you revised your estimate up and decided it could not be done on a desktop. Why are you arguing the point you have already conceded?



Quote:

No, I've read a bit more on it.


Why are you keeping it secret then?

Quote:

And I've seen its performance on the show, several parts, including the ones showing Watson's current "thinking". It was very hit-or-miss: it either had 96-99% confidence, or it had nothing resembling an answer and no confidence.


Yes, that's actually a very good result. It would be much, much worse if it had many answers with about 50% confidence. THAT indeed, probably could be done on a PC. But it would be completely useless.

Quote:

Such performance is indicative of a system that is not being compromised by noise.


Yes, it is.


Quote:

In fact you wouldn't; playing Jeopardy is not the only thing Watson can do, neither is it the most important thing it can do.


What are the other things you are referring to?


Quote:

And the database is stored... where?
Or are you suggesting to just remove pieces of it completely arbitrarily? Whoa, that was easier than I thought.


Didn't you say you only need half a megabyte for the whole database? You could easily fit a few hundred copies on each server; no need to remove anything, arbitrarily or not.


Quote:

*Right now*, with the exact algorithms we have now, specifically the ones Watson used, in the same exact way.


Yes, exactly. Right now. Not in the sixties or seventies, not even a decade ago. Now.
But not "in the same exact way". If it was used in the same exact way, it would need the same exact amount of resources, not half of it.
"When two people always agree one of them is unnecessary"
P90
P90
  • Threads: 12
  • Posts: 1703
Joined: Jan 8, 2011
February 27th, 2011 at 2:56:25 AM permalink
Quote: weaselman

It's not a 50-fold difference. And it does have information, that you'll need anyway, that Britannica doesn't have.


2 billion/44 million? Well, yes, 45.6-fold difference to be accurate.

Quote: weaselman

I don't know if it was asked. But, if it wasn't, it doesn't mean it won't be. I don't see any reason why it would not.


The show wants to appeal to its audience, 18-49 year old Americans that watch TV in general and game shows in particular.
Pokemon is a subject only wapanese know in depth, and they'll be watching anime anyway. As such, a question involving it will not be understood by general viewership, and will not help the show.
The probability of such a question being asked is therefore extremely low, and since you have to limit database size, it has to go.


Quote: weaselman

If you say so. I don't see any reason to believe that.


Macropedia contains on the average only 40 articles per volume, covering topics in depth. Since in-depth knowledge is not sought, it would not be needed, or at most could be greatly trimmed.

Quote: weaselman

In any event, Britannica does not have info on recent political events, on new books, actors, music, movies, the 2000 and 2004 Olympic games, and on, and on, and on ...


On Olympic games it does, at least the 2007 release.
But yes, it's only a base to cover encyclopedic knowledge. As for covering trivia, perhaps one way would be to invite all past participants of the show, as well as assisting staff, to write down or help pick things they deem to be of relevance to the show.


Quote: weaselman

What are they? Which indexes are you suggesting to store in RAM? How many do you think you'd need?
How many entries do you plan to have in the database? Let's take a closer look at how realistic your new claim is.


Text size: Presumably "baby Watson" would be to the big version what Britannica is to Wikipedia. 1/45 the text, then.
Indexing: Let's say roughly 1/2 the ratio of indexes to base data that Watson used.
That arrives at 4,000/45/2 ≈ 44 GB of data.
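
For what it's worth, spelling that arithmetic out in a few lines of Python (taking the 4,000 to mean roughly 4,000 GB for the full-size system, which is only my reading of the figure):

# Back-of-the-envelope check of the ~44 GB figure.
# The 4,000 GB total is an assumption carried over from the estimate above.
watson_data_gb = 4000        # full-size data set, GB
text_ratio = 1 / 45          # "baby Watson" text vs. the full text
index_ratio = 1 / 2          # half the index-to-data ratio
print(round(watson_data_gb * text_ratio * index_ratio, 1))   # prints 44.4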


Quote: weaselman

You are not storing words, you are storing documents; there needs to be some overhead. At least double the size, probably more.


That falls under the overhead and indexes part.
No, I'm afraid I don't follow how storing txt documents, indexes not included, would take twice the space. Or even with main index included.


Quote: weaselman

You don't just need to read every byte, you need to parse and analyze them. That involves moving the bytes to and from the registers, combining them, and performing operations on them. You can't pipeline that, because the operations need to be done sequentially.


Which are done in data registers with far less latency than main RAM. This is the kind of work that lends itself well to pipelining.


Quote: weaselman

You need to not only read stuff from memory, but also write back the results. You then need to read the results back to perform additional operations.


You are again trying to have it both ways: sequential search the entire database, and perform operations on documents you have picked. That doesn't work both ways. A simple sequential search, performed to find relevant documents and nothing else, only consists of reading the memory, pipelined, into registers, comparing data in the registers, then performing either no operation if they don't match or performing operations if they match. On documents where nothing matches, you have no or minimal memory writes, and it is the very kind of operation where memory can reach very close to its maximum throughput.

As to documents where you do hit a match, they will involve more work, but that's not important, because they would have to be processed in a highly indexed database as well. Sequential search is the price you pay for saving on indexes.
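
A rough sketch of that kind of scan, in Python rather than anything register-level, with all data invented; it only shows the control flow (read, compare, do nothing on a miss, record a hit for later analysis), not the memory-level mechanics being argued about:

# Sequential scan over one big in-memory blob; misses cost nothing beyond the compare.
def sequential_scan(corpus, patterns):
    hits = []
    for pattern in patterns:
        start = 0
        while True:
            pos = corpus.find(pattern, start)   # tight scan of the buffer
            if pos == -1:
                break                           # no match: nothing written
            hits.append((pattern, pos))         # match: hand off for later analysis
            start = pos + 1
    return hits

# Example: sequential_scan(b"craps is a form of hazard", [b"craps", b"dice"])
# -> [(b"craps", 0)]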


Quote: weaselman

Oh, wait ... You DID give up already, when you said you revised your estimate up and decided it could not be done on a desktop. Why are you arguing a point you have already conceded?


Not on an ancient desktop, no.


Quote: weaselman

Yes, that's actually a very good result. It would be much, much worse if it had many answers with about 50% confidence. THAT indeed, probably could be done on a PC. But it would be completely useless.


Perhaps so. What would not be useless is many answers with 75-90% confidence. It's a massive drop in performance, and in requirements as well, and the machine would lose some answers, but it could be good enough.


Quote: weaselman

What are the other things you are referring to?


Right now, adaptation of the system for medical purposes is being considered, for instance.
I think it does not require much debate to see how little worth a system geared for nothing but answering trivia would have for it.


Quote: weaselman

Didn't you say you only need half a Megabyte for the whole database? You could easily fit a few hundred copies on each server; no need to remove anything, arbitrarily or not.


No, I don't think I ever said that. Closer to half a gigabyte without indexes, but only if it was custom-written. Current Watson's database is based on a data dump. They basically fed it nearly everything of even remote usefulness they could get in digital text form, and even that, with all the indexes, didn't come close to filling the RAM alone.


Quote: weaselman

But not "in the same exact way". If it was used in the same exact way, it would need the same exact amount of resources, not half of it.


You are making a presumably intentional basic logical error. Watson beating Jeopardy establishes that a Watson can beat Jeopardy. It does not establish that anything less than Watson can not beat Jeopardy. On the contrary, the large margin by which Watson beat Jeopardy indicates that a lesser system could also beat Jeopardy, with a smaller margin.
Resist ANFO Boston PRISM Stormfront IRA Freedom CIA Obama
weaselman
weaselman
  • Threads: 20
  • Posts: 2349
Joined: Jul 11, 2010
February 27th, 2011 at 11:49:37 AM permalink
Quote: P90


That falls under the overhead and indexes part.
No, I'm afraid I don't follow how storing txt documents, indexes not included, would take twice the space. Or even with main index included.


What structure are you planning to use to store the documents? Is it just one long wall of text? How will you tell one document from another? Two empty lines or something? How will you navigate to document 34567? Read through the first 34566? How about separating the title or the summary from the rest of the document? It adds up. Trust me.

Quote:


Which are done in data registers with far less latency than main RAM. This is the kind of work that lends itself well to pipelining.



Try this. Find a large (half a gig or so) file on your computer, maybe, a movie or a large application data file. Load it into a text editor.
Scroll to the end and find some character sequence at the bottom, long enough to be unique within the document. Now, scroll back to the top, and use the Search function to find that sequence. Note the time it takes your editor to find it. This is the lower limit of what you can expect for your database's response time. The real response time will be much, much higher, because you are going to be looking for many more than just one word, and you also need to analyze and process results as you go, and do several passes.
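
If wrestling a text editor into opening such a file is a problem, roughly the same measurement can be made with a few lines of Python; the file name is a placeholder, and this only times a single naive scan through RAM, nothing more:

# Time one naive scan through a large file held entirely in memory.
# "bigfile.bin" is a placeholder; use any file of half a gig or so.
import time

with open("bigfile.bin", "rb") as f:
    data = f.read()                     # pull the whole file into RAM first

needle = data[-64:]                     # a sequence from the very end, assumed unique
start = time.perf_counter()
pos = data.find(needle)                 # single pass from the top
print(pos, time.perf_counter() - start, "seconds")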



Quote:

You are again trying to have it both ways: sequential search the entire database, and perform operations on documents you have picked. That doesn't work both ways.


Exactly. But you do need to do both things.

Quote:

A simple sequential search, performed to find relevant documents and nothing else, only consists of reading the memory, pipelined, into registers, comparing data in the registers, then performing either no operation if they don't match or performing operations if they match


You have to be able to distinguish between separate documents, you have to be looking for more than one word. You have to be evaluating the structure of the document (a match on a term in different locations of the document, or in different grammatical forms can mean very different things), you have to evaluate the proximity and the relative order of matched terms in the document. You have to weigh matched terms based on the frequency of the word, the length of the document. You have to apply (sometimes really complex) mathematical models, such as SVM, to each match to evaluate its relevance, you have to analyze the commonality of the matched terms by analyzing their frequencies in the entire database. You have to repeatedly sort and prune the current list of candidates to maintain the levels of relevance. You have to generate many different lists to be intersected or joined, depending on the particular query plan. Etc., etc.
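
Just to make the flavour of those operations concrete, here is a toy scorer in Python - nothing like the real thing, only a TF-IDF-style weighting over a few invented documents:

# Toy TF-IDF-style relevance scoring; real systems add positions, proximity,
# grammatical forms, document structure, learned models, and much more.
import math
from collections import Counter

docs = {
    "craps":   "craps is a simple form of hazard played with two dice",
    "hazard":  "hazard is an early game of chance played with dice",
    "chicago": "chicago is a large city on lake michigan",
}

def score(query_terms, documents):
    n = len(documents)
    # document frequency of each term across the whole collection
    df = Counter(t for text in documents.values() for t in set(text.split()))
    scores = {}
    for name, text in documents.items():
        words = text.split()
        tf = Counter(words)
        s = 0.0
        for t in query_terms:
            if tf[t]:
                s += (tf[t] / len(words)) * math.log(n / df[t])
        scores[name] = s
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(score(["dice", "chance"], docs))   # the "hazard" document comes out on top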

But before all that, just try the experiment I suggested above. That will be enough for you to understand that the approach you are suggesting is completely infeasible.

Quote:


Not on an ancient desktop, no.


You actually have conceded that the modern desktop would not work either. Your last position was half of Watson's servers.


Quote:

Perhaps so. What would not be useless is many answers with 75-90% confidence.


Absolute number does not matter. You can just pick the answer with the highest confidence, call it 100%, and then normalize everything else. Or, if you have a bunch of high-confidence outliers, pick the level around the majority of answers, and call it 99%.
You can always make your target confidence as high or as low as you want. In fact, this is more or less exactly how it is done.
When Watson tells you it is 99% confident in one of the answers, it just means that it is much more confident in this one than in the other candidates. It does not mean at all that the answer has a 99% chance to be correct.
What does matter is the gap between the confidence levels of the right answers and the wrong answers. If you found the right answer, the confidence needs to be high compared to the cases when you did not, and, perhaps more importantly, it needs to be significantly higher than the confidence of all the other candidates. If you did not find the right answer, the confidence needs to be low enough to be able to conclude that the answer is wrong.
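
A hand-wavy illustration of that kind of normalization in Python (the raw scores are invented; the point is only that the reported figure is relative to the other candidates, not an absolute probability of being right):

# Normalize raw candidate scores into relative "confidence" values.
raw = {"Chicago": 41.7, "Toronto": 12.3, "Omaha": 9.8}   # invented evidence scores

total = sum(raw.values())
confidence = {answer: s / total for answer, s in raw.items()}

best = max(confidence, key=confidence.get)
print(best, round(confidence[best], 2))   # Chicago 0.65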



Quote:

Right now, adaptation of the system for medical purposes is being considered, for instance.


What is the difference? You just need to replace the wikipedia with a medical knowledge base. It is the same system, the same program, just different inputs.

Quote:

I think it does not require much debate to see how little worth a system geared for nothing but answering trivia would have for it.



Not really. To you, trivia and medical knowledge are very different; to a computer, it's just zeroes and ones. It does not matter what the terms actually mean to you, as long as the computer knows how to analyze the question and retrieve the matching article based on it.
There would probably be some tweaking and fine-tuning of the algorithms, but in principle, it is the same exact system.




Quote:

No, I don't think I ever said that.
Closer to half a gigabyte without indexes, but only if it was custom-written.



Yeah, you did:
Quote: P90

At 7 bytes per word, counting spaces and punctuation, Britannica would take 308MB.



Nothing about "custom written".

Quote:

Current Watson's database is based on a data dump.
They basically fed it nearly everything of even remote usefulness they could get in digital text form, and even that, with all the indexes, didn't come close to filling the RAM alone.


No? It did not? Then what's your point?
Let me remind you of the context. We were talking about why they did not unplug half of the servers if the resources were really not needed (they could just pick the best algorithm(s), according to you, during the field-test stage). You objected that they could not do it because the servers to be unplugged were needed to hold the database. But that can't be so if your other claims about the necessary size of the database were true.


Quote:

You are making a presumably intentional basic logical error.


I am? You said, that if Watson was doing everything the same exact way, it would need half the resources. I said, that if it was doing everything the same exact way, it would need the same exact resources.
Who is making the error?

Quote:

Watson beating Jeopardy establishes that a Watson can beat Jeopardy.


Only in the very narrow and nearsighted view of a skeptic. In reality, it establishes a much wider and more meaningful point.

Quote:

It does not establish that anything less than Watson can not beat Jeopardy.


Not that fact by itself. But combined with the fact that it was worked on for several years by a team consisting of arguably some of the brightest and most creative minds in modern computer science, with virtually unlimited resources, it kinda does ...
Besides, speaking of logical errors, this is not actually the point. You said that it could use half of its resources if it did everything the same exact way. That can only be true if the other half of resources is intentionally wasted. Why would that happen? Sabotage?

Quote:

On the contrary, the large margin by which Watson beat Jeopardy indicates that a lesser system could also beat Jeopardy, with a smaller margin.


It's not quite as binary as you imagine. Given complete knowledge of a past episode, you can certainly pick and throw away 99% of the database and still beat that same episode, maybe with an even higher margin. But claiming that there is a systematic, general way to prune the database while keeping it suitable for answering unknown and unanticipated questions is quite another thing. The possibility of it in no way follows from the high margin in the past episode. At all.
"When two people always agree one of them is unnecessary"
weaselman
weaselman
  • Threads: 20
  • Posts: 2349
Joined: Jul 11, 2010
February 27th, 2011 at 12:00:32 PM permalink
oops. duplicate.
"When two people always agree one of them is unnecessary"
P90
P90
  • Threads: 12
  • Posts: 1703
Joined: Jan 8, 2011
February 27th, 2011 at 1:49:08 PM permalink
Well, I'm quite tired, so apologies in advance...

Quote: weaselman

What structure are you planning to use to store the documents? Is it just one long wall of text? How will you tell one document from another? Two empty lines or something? How will you navigate to document 34567? Read through the first 34566? How about separating the title or the summary from the rest of the document? It adds up. Trust me.


I trust you, but I also have little choice but to trust my eyes and memory first. The last time I worked with a simple but sizable database on a 80286 (it was a compressed sprite animation system, using variable color encoding with a 6 to 40 color palette and 2.6 to 5.3 bits per pixel, in 3 to 6 pixels per word), my overhead was less than 1/64. The documents are not separated at all, rather pointers to them are stored in an array of variables. There is not and will probably never be any use in separating documents, neither with empty lines, nor with titles, nor in any other way, as long as they are stored on non-tape medium in at least a minimally indexed form. Which just an array pointing to their locations qualifies as. It's not even a wall of text; it's a wall of data the computer doesn't care about, because it only reads a specified number of words from specified pointers.
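
A toy version of that layout in Python, just to pin down what is meant (everything here is invented for the illustration): one blob of text, an array of start offsets as the only structure on top, and a match position mapped back to its document by a binary search over the offsets.

# One blob of text plus an array of start offsets - the "array of pointers" idea.
import bisect

texts = ["craps is a form of hazard", "hazard is a dice game", "chicago is a city"]

offsets, blob = [], ""
for t in texts:
    offsets.append(len(blob))   # where document i starts inside the blob
    blob += t

pos = blob.find("dice")                          # a hit somewhere in the blob
doc_id = bisect.bisect_right(offsets, pos) - 1   # which document owns that offset
print(doc_id)                                    # prints 1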


Quote: weaselman

Try this. Find a large (half a gig or so) file on your computer, maybe, a movie or a large application data file. Load it into a text editor.
Scroll to the end and find some character sequence at the bottom, long enough to be unique within the document. Now, scroll back to the top, and use the Search function to find that sequence. Note the time it takes your editor to find it.


Oh. You'll be surprised, but I actually did it when writing the last post and was going to mention it, but there was no context for it. Except I was searching for the word "penis", figuring that it probably wouldn't be indexed, using Windows search, across my entire drive. I timed the search. Searching the entire 160-GB OCZ Vertex 2 drive for penises took 38 seconds.

I'm relieved to report only a few penes have been found, so I can feel free to welcome our computer overlords with only a slight increase in the attention paid to my six.


Quote: weaselman

You have to be able to distinguish between separate documents, you have to be looking for more than one word.


To the first part, no, not really; though yes to the second part. Distinguishing is useless; you just need to return whether you found it, and where. Every location is unambiguously related to a document. Looking for multiple words is little different from looking for one; you just check for multiple acceptable letters. You have to do it anyway if using the pesky 8-bit format with its capital letters.


Quote: weaselman

You have to be evaluating the structure of the document (a match on a term in different locations of the document, or in different grammatical forms can mean very different things), you have to evaluate the proximity and the relative order of matched terms in the document.


Only when you have a match. And you have to do it whether it's highly indexed or not.

Quote: weaselman

You have to weigh matched terms based on the frequency of the word, the length of the document. You have to apply (sometimes really complex) mathematical models, such as SVM, to each match to evaluate its relevance, you have to analyze the commonality of the matched terms by analyzing their frequencies in the entire database. You have to repeatedly sort and prune the current list of candidates to maintain the levels of relevance. You have to generate many different lists to be intersected or joined, depending on the particular query plan. Etc., etc.


...etc, etc - but you only have to do it once you have a match, and you have to do it whether your database is overindexed or minimally indexed.

Quote: weaselman

But before all that, just try the experiment I suggested above. That will be enough for you to understand that the approach you are suggesting is completely infeasible.


It was a pleasant surprise to discover that even things as rare and sought-after as a good dong can be found on my rather low-end drive in reasonable time, much less in RAM.


Quote: weaselman

You actually have conceded that the modern desktop would not work either. Your last position was half of Watson's servers.


It's not any exact figure. How many resources are needed depends on how good the algorithms are and how good a performance is required.


Quote: weaselman

Absolute number does not matter. You can just pick the answer with the highest confidence, call it 100%, and then normalize everything else. Or, if you have a bunch of high-confidence outliers, pick the level around the majority of answers, and call it 99%.


Actually, confidence is a very specific figure of merit. It's not in any way arbitrary, and there are methods and formulas for determining it precisely. So you can't just call it, it's as real as length, weight or density. It's not a comparative metric, but an absolute one, too.


Quote: weaselman

What is the difference? You just need to replace the wikipedia with a medical knowledge base. It is the same system, the same program, just different inputs.


Not quite. For one, in a medical knowledge base, you will have to use custom indexes, as opposed to a generic tag cloud that can suffice in a trivia engine. Another thing is that all the effort spent making a custom trivia database would go to waste when transitioning to the medical field.


Quote: weaselman

Yeah, you did:
Nothing about "custom written".


No, not specifically. I took Britannica as an example of one of the largest encyclopedias, a good measure of a book containing more knowledge than the average person will ever learn. It would suck anyway for Jeopardy, because it's Encyclopedia Britannica. You'd need one based on American school curriculum, children's books, entertainment and other popular knowledge.


Quote: weaselman

No? It did not? Then what's your point?
Let me remind you of the context. We were talking about why they did not unplug half of the servers


Because their IQs exceeded zero. Why in the world would anyone with an IQ of 1 or above break up a system that works fine, to have to rebuild or at least transfer the databases, to reduce the confidence in winning, to have to realign the file system, to have to run a system that has never been run before in that exact form, all for a gain of - precisely - zero?


Quote: weaselman

I am? You said, that if Watson was doing everything the same exact way, it would need half the resources. I said, that if it was doing everything the same exact way, it would need the same exact resources.
Who is making the error?


I'm afraid whoever is presuming that (A>B)&&(C<A)=>(C<B).


Quote: weaselman

You said that it could use half of its resources if it did everything the same exact way. That can only be true if the other half of resources is intentionally wasted.


Not quite. Since the performance demonstrated exceeded the requirements by far, using less resources would reduce the margin, but still keep it within the winning range.


Quote: weaselman

Not that fact by itself. But combined with the fact that it was worked on for several years by a team consisting of arguably some of the brightest and most creative minds in modern computer science, with virtually unlimited resources, it kinda does ...


I'm afraid it hardly begins to - the team was not tasked with trimming Watson to the minimum possible performance level.

If I wanted to kill someone, I would use a .338 Lapua Magnum with 250-grain Scenar copper solid, Federal 215M primer and 98.5 grains of Norma MRP2, shooting AI Arctic Warfare .338 with a 56-mm Zeiss Victory Diarange. Well, probably I'd just use a hammer, but if I was doing it legally, that's what I would use.

Now, knowing that, would you stand confident in your safety while being shot at with much weaker .30-06, because since someone would settle for nothing less than .338 Lapua, surely .30-06 must present no danger to your health and well-being?
Resist ANFO Boston PRISM Stormfront IRA Freedom CIA Obama
weaselman
weaselman
  • Threads: 20
  • Posts: 2349
Joined: Jul 11, 2010
February 27th, 2011 at 3:52:20 PM permalink
Quote: P90


I trust you, but I also have little choice but to trust my eyes and memory first. The last time I worked with a simple but sizable database on a 80286 (it was a compressed sprite animation system, using variable color encoding with a 6 to 40 color palette and 2.6 to 5.3 bits per pixel, in 3 to 6 pixels per word), my overhead was less than 1/64.


Pixels are very different from unstructured text.

Quote:

The documents are not separated at all, rather pointers to them are stored in an array of variables.


That's a part of the database too, isn't it?

Quote:

There is not and will probably never be any use in separating documents,


If you are looking for articles on a given subject, you certainly do need them separated from other articles, that happen to be stored next to them. You will need to analyze the documents your terms matched. Where the document starts and ends is important.
You also need to analyze frequencies of hits per document. If you found two terms close to each other, you need to know if they are in the same document or in two different documents.

Quote:

neither with empty lines, nor with titles, nor in any other way,


You do need to distinguish between title hits and document body hits. You need to know where the title ends.

Quote:

as long as they are stored on non-tape medium in at least a minimally indexed form.



We were talking about an unindexed form. If you want them indexed, I have to ask again which indexes you want to have, and how many of them there are going to be.

Quote:

Which just an array pointing to their locations qualifies as.


That helps to find a document by index, but is useless in figuring out which document a given piece of text belongs to.

Quote:

It's not even a wall of text; it's a wall of data the computer doesn't care about,


Computer may not care, but the programmer certainly needs to.

Quote:

because it only reads a specified number of words from specified pointers.


I thought we were talking about reading the database byte-by-byte directly. Indirect reference navigation is a (quite large) additional cost.


Quote:

Oh. You'll be surprised, but I actually did it when writing the last post and was going to mention it, but there was no context for it. Except I was searching for the word "penis", figuring that it probably wouldn't be indexed, using Windows search, across my entire drive. I timed the search. Searching the entire 160-GB OCZ Vertex 2 drive for penises took 38 seconds.



That is not what you are suggesting though, and not what I asked you to do. The word "penis" is certainly indexed in windows search, just like any other word. It does not know the meaning of it, and isn't afraid of profanities. If you have the word in any of the documents on your drives, it certainly is in the index.
This is not the test I suggested, though. We are not talking about search index navigation; the point is to check how quickly a computer can scan a large chunk of memory byte by byte. Try opening a large (half a gig or so) file in a text editor, and search it for the "penis" or anything else. See how long it takes. Keep in mind that this is not an estimate for access time in your hypothetical database, just the absolute, unattainable lowest limit for it. But even this limit will be way too high to come close to finding real-time answers by that method.

Quote:


To the first part, no, not really; though yes to the second part. Distinguishing is useless; you just need to return whether you found it, and where.


You need to return the documents you found. You can't return the documents without being able to distinguish one document from another.

Quote:

Every location is unambiguously related to a document.


Yes, it is, but telling what that document actually is by tracing that unambiguous relationship is a separate task, with its own cost.


Quote:

Looking for multiple words is little different from looking for one; you just check for multiple acceptable letters. You have to do it anyway if using the pesky 8-bit format with its capital letters.


Yes, but checking for multiple matches is additional memory access, and additional cpu cycles, that you need to account for. It has to be done for every byte you are looking at. It does add up, really quickly.


Quote:

Only when you have a match. And you have to do it whether it's highly indexed or not.


When it is indexed, you do not have to read the whole database. Most of the time, you do not even need to read the whole document (the terms are indexed with lots of the necessary information already attached to them - position within the document, relative frequency, document length, etc.).
Besides, we are not talking about indexed vs. non-indexed right now, we are talking about structured vs. unstructured. With your approach, you can't analyze the document when you find a match, because you don't have the document, just a wall of text.
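
A bare-bones sketch, in Python, of what "information already attached to the terms" looks like; a real index would also carry frequencies, fields (title vs. body), document lengths and so on, but even this toy answers a term lookup without touching the raw text:

# Bare-bones inverted index: term -> list of (document id, word position).
from collections import defaultdict

docs = ["craps is a form of hazard", "hazard is a dice game"]

index = defaultdict(list)
for doc_id, text in enumerate(docs):
    for pos, word in enumerate(text.split()):
        index[word].append((doc_id, pos))

print(index["hazard"])   # [(0, 5), (1, 0)] - no scan of the raw text needed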


Quote:

It was a pleasant surprise to discover that even things as rare and sought-after as a good dong can be found on my rather low-end drive in reasonable time, much less in RAM.


Yes. That is exactly because it is not using any of the "optimizations" you suggested. It does not care how often a term is sought-after, because it does not share your illusion that it can anticipate your future questions. Everything is indexed, that's what makes it fast.
I still urge you to do the experiment with the text editor though to get a taste of the real-life memory-scan speeds.

Quote:

It's not any exact figure. How many resources are needed depends on how good the algorithms are and how good a performance is required.


I am not saying it's an exact figure. We are talking ballpark here. The algorithms are as good as we have them. In the ideal world, we would just have an algorithm that guesses the right answer to every question without needing to search anything at all. But we don't have it, so, it's not worth talking about.


Quote:

Actually, confidence is a very specific figure of merit.


Yes, it is, but not in absolute value.

Quote:

It's not in any way arbitrary, and there are methods and formulas for determining it precisely.


Yes, there are methods, and yes, there are formulas. The absolute values are completely arbitrary though, and subject to normalization. (Normalization is a formula too, BTW :))

Quote:

So you can't just call it, it's as real as length, weight or density. It's not a comparative metric, but an absolute one, too.


Yeah, you can actually. This is how it is always done. Weight or density? No, completely different. It is not an absolute metric by any stretch of the imagination.

Quote:

Not quite. For one, in a medical knowledge base, you will have to use custom indexes, as opposed to a generic tag cloud that can suffice in a trivia engine.


No, you don't. It is the same exact problem. You may choose to use some structured indexes to optimize for specific cases, but the "meat" of the algorithm is still a tag cloud. Same thing with the trivia. The computer does not know or care what kind of knowledge is in the database; it is simply trained to retrieve relevant entries. Medical, technical, scientific or any other kind - it does not matter.

Quote:

Another thing is that all the effort spent making a custom trivia database would go to waste when transitioning to the medical field.


Not really. The effort of developing a concept, and a method is what's valuable. The database itself is a cheap byproduct.


Quote:

No, not specifically. I took Britannica as an example of one of the largest encyclopedias, a good measure of a book containing more knowledge than the average person will ever learn.


What was the purpose of your taking it? Didn't you want to estimate the size your minimalistic database would require?


Quote:

Because their IQs exceeded zero. Why in the world would anyone with an IQ of 1 or above break up a system that works fine, to have to rebuild or at least transfer the databases, to reduce the confidence in winning, to have to realign the file system, to have to run a system that has never been run before in that exact form, all for a gain of - precisely - zero?


Because that would make for a more efficient, useful and profitable system. Why would it "have never been run"? Of course, they could test run it as much as they wanted.


Quote:

I'm afraid whoever is presuming that (A>B)&&(C<A)=>(C<B).


And who would that be? This is not the formula in question at all. What we are talking about is
(res(A) = C) && (res(B) = D) && (A == B) => (C > D)
This is your claim (approach "A" requires the amount of resources "C", approach "B" requires the amount of resources "D", approach "A" is identical to "B" ... but B somehow uses less resources).
I am saying this cannot be true.


Quote:

Not quite. Since the performance demonstrated exceeded the requirements by far, using less resources would reduce the margin, but still keep it within the winning range.


You have not demonstrated that the number of questions answered can be systematically and reliably made a function of resources used. It is not obvious.
You can throw away 99% of the database, and still win that same episode, but that is not the requirement. The requirement is to be able to answer unanticipated questions.
Note also that the game consisted of three episodes. While Watson aced the second and (to a lesser extent) the third, the first one ended in a tie. There was no margin there at all.


Quote:

I'm afraid it hardly begins to - the team was not tasked with trimming Watson to the minimum possible performance level.


Do you know that for a fact? Why wasn't it tasked with that? Having worked on a great number of various software products in many different fields, I can tell you from my experience that minimizing the amount of resources is always one of the goals, and one of the acceptance criteria.

Quote:



If I wanted to kill someone, I would use a .338 Lapua Magnum with 250-grain Scenar copper solid, Federal 215M primer and 98.5 grains of Norma MRP2, shooting AI Arctic Warfare .338 with a 56-mm Zeiss Victory Diarange. Well, probably I'd just use a hammer, but if I was doing it legally, that's what I would use.

Now, knowing that, would you stand confident in your safety while being shot at with much weaker .30-06, because since someone would settle for nothing less than .338 Lapua, surely .30-06 must present no danger to your health and well-being?



No, I would not. Because I don't know you as a best-in-the-world killing machine, striving for efficiency. I do know that IBM's engineers are some of the best in the world, though, and I do know that efficiency is one of the goals of good engineering.
"When two people always agree one of them is unnecessary"
P90
P90
  • Threads: 12
  • Posts: 1703
Joined: Jan 8, 2011
February 27th, 2011 at 8:14:00 PM permalink
Quote: weaselman

Pixels are very different from unstructured text.
That's a part of the database too, isn't it?


That's a part of the index structure.


Quote: weaselman

If you are looking for articles on a given subject, you certainly do need them separated from other articles, that happen to be stored next to them. You will need to analyze the documents your terms matched. Where the document starts and ends is important.
You also need to analyze frequencies of hits per document. If you found two terms close to each other, you need to know if they are in the same document or in two different documents.


You need, but it's not the sequential search function that has to do it, it's the function above that is processing search output. The search function itself only needs to know where the word is in the memory, and it knows that because it asked for that data.


Quote: weaselman

You do need to distinguish between title hits and document body hits. You need to know where the title ends.


The title ends where the index ends. You don't need to paste it all over the place.
Even in a minimally-indexed database, you have at least the main index for access.
Since every location in the memory is unambiguously mapped to an entry, there is no "lost document" case.


Quote: weaselman

That is not what you are suggesting though, and not what I asked you to do. The word "penis" is certainly indexed in windows search, just like any other word. It does not know the meaning of it, and isn't afraid of profanities. If you have the word in any of the documents on your drives, it certainly is in the index.


It's a medical term, not a profanity. I'm quite sure I have indexing turned off, but perhaps it didn't work. Something isn't right, however. How come then I can use close to 95% of my drive's capacity, rather than 1% (1% for the documents and 99% for indexes) - does that mean indexing might indeed be possible using less than 100x the actual data size?


Quote: weaselman

This is not the test I suggested, though. We are not talking about search index navigation; the point is to check how quickly a computer can scan a large chunk of memory byte by byte.


I'm afraid it's been a while since I worked with low-level languages, so I'm not sure how exactly to map a gig of memory while ensuring the OS doesn't interfere in any way and without resorting to OS functions; last I heard, you aren't even supposed to be allowed to do it these days.

You read the memory and search it word by word, by the way, not byte by byte. "Bytes" don't even physically exist within the computer, they are a logical construct of arbitrary length, it's words that actually are down there.


Quote: weaselman

Try opening a large (half a gig or so) file in a text editor, and search it for the "penis" or anything else. See how long it takes.


Doesn't open. Editors smart enough to open a file that size without locking up know to leave it on disk. Editors dumb enough to load it all into memory, first of all, cause the OS to start paging, and then most likely use naive string search anyway.


Quote: weaselman

You need to return the documents you found. You can't return the documents without being able to distinguish one document from another.


You need to alert a higher level function (running on another core, in a practical implementation) that you found an inclusion and where, possibly also alongside a cached block of data. Returning a list of documents wouldn't be enough anyway; articles in Britannica go up to 300 pages in length.
Now if I had clairvoyance I'd say you are going to start asking me about how exactly that function is to be written, and I'll confess right away I'm not familiar enough with multi-threaded programming to know how exactly to get the CPU to leave specified fragments in L3 cache and access them with the thread on another core while ensuring it doesn't interfere with the first thread.


Quote: weaselman

Yes, it is, but telling what that document actually is by tracing that unambiguous relationship is a separate task, with its own cost.


It has its own cost, more precisely that cost consists of recording the current document id. Since search is single-threaded, at any given moment you are searching in a specific document and know what it is.


Quote: weaselman

Yes, but checking for multiple matches is additional memory access, and additional cpu cycles, that you need to account for. It has to be done for every byte you are looking at. It does add up, really quickly.


It does add up - to a very slight increase in complexity. The complexity of a search for multiple fixed strings is proportional to the sum of length of input text, length of patterns sought, and match count. Since the input text is overwhelmingly massive, for practical intents and purposes, it is what determines the search complexity, multiple patterns only increasing time slightly due to more matches.

Cache and register access, not main memory access, by the way.
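
That bound is essentially the one the classic multi-pattern (Aho-Corasick-type) matchers give. I won't write that algorithm out here, but even a compiled alternation in Python shows the single-pass idea (pattern set and text invented for the example):

# One pass over the text, checking several fixed patterns at once.
# Not Aho-Corasick itself, just the same single-pass idea via a compiled alternation.
import re

patterns = ["craps", "hazard", "roulette"]
text = "the most popular form of pure gambling with dice is craps, a form of hazard"

matcher = re.compile("|".join(map(re.escape, patterns)))
for m in matcher.finditer(text):
    print(m.group(), "at", m.start())
# craps at 52
# hazard at 69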


Quote: weaselman

Yes. That is exactly because it is not using any of the "optimizations" you suggested. It does not care how often a term is sought-after, because it does not share your illusion that it can anticipate your future questions. Everything is indexed, that's what makes it fast.


Well, if indexing everything takes so little space that I still have 95% of the drive available - yay that indexing method, then. Let's add it to the database. It's the enormous 50,000-fold overindexing I've been against; a minimally indexed database is what I always welcomed.


Quote: weaselman

I still urge you to do the experiment with the text editor though to get a taste of the real-life memory-scan speeds.


Well, perhaps. Could you suggest how to get direct memory access without having to jump through hoops? It used to be pretty easy under DOS/4G, but it couldn't access anywhere near that much RAM anyway. Maybe I should just use a ramdrive, create a file and fgrep it.


Quote: weaselman

I am not saying it's an exact figure. We are talking ballpark here. The algorithms are as good as we have them. In the ideal world, we would just have an algorithm that guesses the right answer to every question without needing to search anything at all.


Not really, it would still have to search. But yes, with the perfect algorithm it would do very little searching.


Quote: weaselman

Quote:

Another thing is that all the effort spent making a custom trivia database would go to waste when transitioning to the medical field.

Not really. The effort of developing a concept, and a method is what's valuable. The database itself is a cheap byproduct.


Only a versatile method is what's valuable, and only a text dump database is a cheap byproduct.

A custom-written trivia database is a very expensive byproduct.


Quote: weaselman

What was the purpose of your taking it? Didn't you want to estimate the size your minimalistic database would require


Yes, I did. Britannica is a good estimate of the amount of information a very well-read person can be expected to know.


Quote: weaselman

Because that would make for a more efficient, useful and profitable system. Why would it "have never been run"? Of course, they could test run it as much as they wanted.


After the show, during the show, or instead of the show?

When you are going for a drive with one passenger, do you take the rear seats out of your car?


Quote: weaselman

Note also that the game consisted of three episodes. While Watson aced the second and (to a lesser extent) the third, the first one ended in a tie.


Not quite. There were two games and one practice round. Watson won the practice round on Jan 13, then got improved, then won the first game round, then the second game round, on Feb 14-16.


Quote: weaselman

Do you know that for a fact? Why wasn't it tasked with that? Having worked on a great number of various software products in many different fields, I can tell you from my experience that minimizing the amount of resources is always one of the goals, and one of the acceptance criteria.


Because Watson's hardware is a sunk cost in the first place.
Aside from that, why would you start cutting hardware from your system before even making sure it works as is?

A decade down the line, I'm quite sure they will deliver systems with considerably lower resource requirements for comparable performance, once it comes down to actual practical applications and expanding the reach. Right now it's just a rough testbed.


Quote: weaselman

No, I would not. Because I don't know you as a best-in-the-world killing machine, striving for efficiency.


Which I'm not, so I just figure what is good on the range is good for business. I think I would even apply that logic to computers if I had one.
Now, if the best-in-the-world killing machine used handloaded .338 rounds - that caliber is quite popular with snipers and Arctic Warfare was developed as a military rifle in the first place, so it's hardly a long stretch of imagination - would that make you confident enough to laugh in the face of 180-grain Bergers instead of 250-grain Scenars?


Quote: weaselman

I do know that IBM's engineers are some of the best in the world, though, and I do know that efficiency is one of the goals of good engineering.


So once they decide they can remove one server, how exactly do they go about that - take it apart, take Friday off and go sell recovered pieces on the flea market?
Resist ANFO Boston PRISM Stormfront IRA Freedom CIA Obama
weaselman
weaselman
  • Threads: 20
  • Posts: 2349
Joined: Jul 11, 2010
February 28th, 2011 at 5:48:50 AM permalink
Quote: P90

That's a part of the index structure.



Which is a part of the database.

Quote:


You need, but it's not the sequential search function that has to do it, it's the function above that is processing search output.


It does not matter which function it is. The point is you need that information; it must be present in the database.

Quote:


The title ends where the index ends.


What index? I thought you were not going to have any indexes?

Quote:

You don't need to paste it all over the place.


You mean, the title is in the index (alphabetical?), and not in the document?
So, now we are talking about two scans? First search the titles, and then the documents?

Quote:

Even in a minimally-indexed database, you have at least the main index for access.


You don't need an index for access. Speaking of wasted resources, this is the biggest waste. If the index isn't helping your query, it is a waste, not only of space, but also of time. Accessing data through the index is slower than accessing it directly.


Quote:

It's a medical term, not a profanity. I'm quite sure I have indexing turned off, but perhaps it didn't work.


When you turn off indexing, it stops updating the index with new data, but the old index stays. I think. I don't know very much about Win Doze. But I assure you, it did not read your entire hard drive in 38 seconds. If you want to see how long reading takes, find a large file (about a gig) and copy it to a different drive. Multiply the time by the number of gigabytes on your hard drive; it is linear.


Quote:

Something isn't right, however. How come then I can use close to 95% of my drive's capacity, rather than 1% (1% for the documents and 99% for indexes) - does that mean indexing might indeed be possible using less than 100x the actual data size?


Yes, if all you are doing is looking for simple term hits, not analytical queries, the index is relatively compact. It also does not index binary files, which are probably 99% of the stuff on your hard drive. And how do you know how large the index is?


Quote:


I'm afraid it's been a while since I worked with low-level languages, so I'm not sure how exactly to map a gig of memory while ensuring the OS doesn't interfere in any way and without resorting to OS functions;


I can assure you, the OS will do a much better job than you ever could with a low-level language, even if it is just Micro Soft. Just trust it.

Quote:

last I heard, you aren't even supposed to be allowed to do it these days.


Depends on what you call "map" and what you call "memory" :)
You can map a file to a memory region via a kernel call (I still don't know what is wrong with that), but that is not what you want, as it will simply use logical address space for referencing data in the file, while the data itself will still be on disk.
You can also allocate (system call) a large chunk of memory, and read the file from disk (system call), putting data into memory. That is what the text editor test I suggested would do.
You can't do stuff completely without using kernel calls as long as your program is running under an OS. I can't begin to imagine why you would ever want to do that.
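
For what it's worth, both options are a couple of lines in Python (the file name is a placeholder); neither avoids the kernel, they just differ in when the pages actually come off the disk:

# Two ways to get a file's bytes searchable: map it, or read it into RAM up front.
import mmap

# Option 1: memory-map the file; pages are faulted in from disk as they are touched.
with open("corpus.txt", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(mm.find(b"hazard"))
    mm.close()

# Option 2: read the whole file into memory first, then search the in-RAM buffer.
with open("corpus.txt", "rb") as f:
    data = f.read()
    print(data.find(b"hazard"))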


Quote:

You read the memory and search it word by word, by the way, not byte by byte.


How do you know where the word ends?

Quote:

"Bytes" don't even physically exist within the computer, they are a logical construct of arbitrary length, it's words that actually are down there.


Down where? And what do you mean by "words"? Two bytes? It is not quite how you describe it. If you are talking about RAM, individual bytes are addressable and accessible; there is nothing wrong with looking at them one by one (if it is speed you are worried about, most modern CPUs do have a read-ahead cache to optimize reading and writing small chunks of data). It is when you do data manipulation that the notion of words comes into play. But in those cases, it is usually the "double-word" (8 bytes) that matters on modern architectures, not a word. And speaking of what "physically exists", if anything at all does, it is bytes, not words. Words are used for alignment, to delineate the physical bytes into logical chunks of data.
And if you are talking about reading hard drives, then neither bytes nor words exist there - in that case you operate in terms like pages and clusters.

But I fail to see what any of this has to do with anything. You just need to read all the bytes stored in your "database" and sequentially compare them to a bunch of terms you are holding in hand. This is the task. I told you how to estimate the lower bound on the time it will take you. Just open the editor and do the search.


Quote:

Doesn't open. Editors smart enough to open a file that size without locking up know to leave it on disk.
Editors dumb enough to load it all into memory, first of all, cause the OS to start paging, and then most likely use naive string search anyway.


Paging is fine; that's not a problem.
"Naive string search" is exactly what you want. I mean, you don't want to use it for your minimalistic AI - what you want to use is much more complicated - but it will be enough to just run the naive search (which is much, much faster) to quickly and convincingly demonstrate that your approach will never work.



Quote:

You need to alert a higher level function (running on another core, in a practical implementation) that you found an inclusion and where, possibly also alongside a cached block of data.


And what will the higher level function do then?

Quote:

Returning a list of documents wouldn't be enough anyway; articles in Britannica go up to 300 pages in length.


You can't return the list of documents, because you don't have the documents, just a wall of text. To return a list of documents you need to split your wall into documents.
It is not enough to just return the list of documents anyhow. There will be thousands, probably hundreds of thousands, of matches; you will need to analyze them for relevance (which terms have hit a particular document, how many of them, how close the hits were to each other, what grammatical forms they are in, how long the document is, how frequent the hits are, what part of the document they are in, etc., etc., etc.), pick the most promising ones, and then "read" them all to look for the actual answer to the question (chances are, it is not in the title). There is absolutely no way you can get away without reading whole documents.
If Britannica's articles are indeed 300 pages long, you are probably much better off with Wikipedia to begin with. You can't search sequentially, that's obvious, so the total database size is of secondary importance, but analyzing 300 pages per article on the fly is going to be brutal anyhow.

Quote:

Now if I had clairvoyance I'd say you are going to start asking me about how exactly that function is to be written,


Actually, no, not in this case. In this case it is obvious that it cannot be written. If the data is missing, there is no way to write a function that operates on it.

Quote:

It has its own cost, more precisely that cost consists of recording the current document id.


No, recording the id only tells you what the document is. I am talking about figuring out where it is.

Quote:

Since search is single-threaded, at any given moment you are searching in a specific document and know what it is.


Yes, you know what it is, but you don't know where it starts and where it ends. How do you plan on knowing the id, BTW? Is it going to be included in the document text itself (as a dreaded piece of structure)?


Quote:

The complexity of a search for multiple fixed strings is proportional to the sum of length of input text, length of patterns sought, and match count. Since the input text is overwhelmingly massive, for practical intents and purposes, it is what determines the search complexity, multiple patterns only increasing time slightly due to more matches.



It is not that simple. The increase in complexity is indeed linear, but it is much steeper than you imagine. You don't need twice as much time to search for two patterns; more like ten times as much.

Quote:

Cache and register access, not main memory access, by the way.


How much stuff do you plan to stick into the cache? How many registers do you plan on using?


Quote:

Well, if indexing everything takes so little space that I still have 95% of the drive available - yay that indexing method, then. Let's add it to the database. It's the enormous 50,000-fold overindexing I've been against; a minimally indexed database is what I always welcomed.


I explained earlier why you are wrong in this estimate. Only text files are indexed, and they make up just a small fraction of your hard drive.

Quote:

Well, perhaps. Could you suggest how to get direct memory access without having to jump through hoops?


The editor knows how to do it, don't worry.

Quote:


It used to be pretty easy under DOS/4G, but it couldn't access anywhere near that much RAM anyway. Maybe I should just use a ramdrive, create a file and fgrep it.



Yeah, that will work too. It is the same thing in principle, but it will be a bit slower because it'll use kernel functions to read pages from disk, which is slower than looking directly at memory pages; for the purposes of a rough estimate, though, it is fine either way. Your database scan is going to be way slower than either of these methods anyhow.


Quote:

Not really, it would still have to search. But yes, with the perfect algorithm it would do very little searching.


No, search (looking at irrelevant documents) is a "waste of resources" in your terminology. The ideal algorithm would have to just know the answer.


Quote:

Yes, I did. Britannica is a good estimate of the amount of information a very well-read person can be expected to know.


Then why did you object to my using your estimate earlier?

Quote:

After the show, during the show, or instead of the show?


Before the show.

Quote:

When you are going for a drive with one passenger, do you take the rear seats out of your car?


No, I don't. Why would I?


Quote:

Not quite. There were two games and one practice round. Watson won the practice round on Jan 13, then got improved, then won the first game round, then the second game round, on Feb 14-16.


The round played on Feb 14 was a tie. I know it won the game in the end, but the fact that it nearly lost the first round suggests that the margin you are talking about is not nearly as profound and as universal as you make it look. Sometimes it wins with a large margin, sometimes it does not; that is expected, given that the questions it will be asked could not be anticipated at the time it was created.



Quote:

Because Watson's hardware is a sunk cost in the first place.


No, it isn't. It is a bunch of perfectly good general-purpose servers that can be used for any other purpose, not just to create another Watson (which, by itself, isn't a bad use of them).

Quote:

Aside from that, why would you start cutting hardware from your system before even making sure it works as is?


No, you'd do it after making sure it works. You'd do it because you are so convinced that the resources are being wasted.


Quote:

Now, if the best-in-the-world killing machine used handloaded .338 rounds - that caliber is quite popular with snipers and Arctic Warfare was developed as a military rifle in the first place, so it's hardly a long stretch of imagination - would that make you confident enough to laugh in the face of 180-grain Bergers instead of 250-grain Scenars?


I said "the best in the world killing machine" striving for efficiency. If you are going to quote me, at least, quote the entire sentence.


Quote:

So once they decide they can remove one server, how exactly do they go about that - take it apart, take Friday off and go sell recovered pieces on the flea market?


No, just shove it into another rack and use it for something else.
"When two people always agree one of them is unnecessary"
P90
P90
  • Threads: 12
  • Posts: 1703
Joined: Jan 8, 2011
February 28th, 2011 at 6:29:48 AM permalink
What text editor do you recommend for the experiment? I have Notepad, which doesn't work on large files, Wordpad, which does, a few textmode tools, or I could download something you suggest. Just to make sure it's not doing the search wrong.
Resist ANFO Boston PRISM Stormfront IRA Freedom CIA Obama
weaselman
weaselman
  • Threads: 20
  • Posts: 2349
Joined: Jul 11, 2010
February 28th, 2011 at 7:03:07 AM permalink
I think wordpad should be fine. If you want to get fancy, you could try installing emacs or vim.
"When two people always agree one of them is unnecessary"
P90
P90
  • Threads: 12
  • Posts: 1703
Joined: Jan 8, 2011
March 4th, 2011 at 12:53:27 PM permalink
Well, I indeed tried the experiment. Sorry for delaying the reply, but basically I got my own doubts resolved and lost most of the interest.

I considered using the actual 2010 Britannica, but sadly it no longer uses UTF-8. As such, two samples were used:
1) 1911 Britannica. It's old, but it's openly and readily available as plaintext, ensuring repeatability. Used as 29 volumes, it takes 252 megabytes in total. For text editors it was merged into one file; grep had to use a directory.
2) A semi-binary file in order to test performance on mixed data. The file takes 234 megabytes, about 1/10 of it being text.

With a two-gig virtual drive created in fast RAM, holding temp and swap files among other things, opening the semi-binary file still proved nearly impossible. Notepad refused to open it via the GUI and just locked up when called. Wordpad took minutes and consumed over 700MB of RAM. Britannica was less challenging, but still took over a minute to open and required 550MB for Notepad (which again refused to open it via the GUI) and 680MB for Wordpad, both of which still lagged hopelessly.
A search for "alexander bill" in Wordpad took over half a minute. Strangely, Notepad needed just seconds. Vim and FAR's built-in editor were in between (much slower than Notepad, but faster than Wordpad), although Vim's opening lag was short and FAR's shorter still. While at it, Vim consumed just 270MB of RAM, while FAR had a record-breaking RAM consumption of 1.27GB for editing, though nothing for viewing.

This would indeed suggest that the task at hand was nigh insurmountable, and I was just about to surrender... Well, not really: I had actually tried grep before, and it was too quick to measure, looking instant to the naked eye. What is of note is the vast difference between editors in terms of opening lag, memory consumption and search speed.


Since testing grep speed on my PC is pointless, I decided to up the challenge. For that I used a machine that will ensure repeatability - a netbook. Powered by the mighty Intel Atom, this low-end model has 1GB of RAM; its main claims to fame are that it costs less than a desktop case and runs a whole day on one charge (and I made sure it ran on battery power for the test, too). Most users dislike netbooks for being what they consider excruciatingly slow, but slower is better for our purposes. Needless to say, trying to even open the file in text editors was hopeless on that machine, making proper tools not only the best, but the only option.

The following lines were used:
logtime SearchBegin
fgrep -i -a -H -n -r -w --before-context=3 --after-context=3 -F --file=txtin brit >report
logtime SearchEnd

Here txtin contains words looked for and brit is the directory with Britannica.

A search for a small sample (six words, gambling related) produced a 176 kB output document. Lines from it look like this:
Quote:

brit/Britannica08.txt:33135:by R. F. Foster, 1903.) The most popular form of pure gambling
brit/Britannica08.txt-33136-with dice at the present day, particularly with the lower classes in
brit/Britannica08.txt:33137:America, is Craps, or Crap-Shooting, a simple form of Hazard, of
brit/Britannica08.txt-33138-French origin. Two dice are used. Each player puts up a stake
brit/Britannica08.txt-33139-and the first caster may cover any or all of the bets. He then
brit/Britannica08.txt-33140-shoots, i.e. throws the dice from his open hand upon the table
brit/Britannica08.txt-33141-If the sum of the dice is 7 or 1 1 the throw is a nick, or natural,


As can be seen, the output includes the name of the file the match was found in, the line number where the word occurs (a byte offset can be requested too), and a few lines of context, as requested. This is all you really need from the first phase of the search, implemented with bog-standard grep.
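
To make the hand-off to the next stage concrete, here is a minimal sketch (Python; group_report and the example filenames are made-up illustrations, not part of the test) of how output in the format above could be folded back into per-file candidate snippets:

import re
from collections import defaultdict

# fgrep -n with context prints "file:line:text" for matches and "file-line-text"
# for context lines; this sketch assumes filenames like brit/BritannicaNN.txt.
LINE_RE = re.compile(r'^(?P<file>.+?)[:-](?P<line>\d+)[:-](?P<text>.*)$')

def group_report(path):
    """Group report lines by source file as (line number, text) pairs."""
    snippets = defaultdict(list)
    with open(path, encoding='utf-8', errors='replace') as report:
        for raw in report:
            m = LINE_RE.match(raw.rstrip('\n'))
            if m:  # group-separator lines such as "--" simply don't match
                snippets[m.group('file')].append((int(m.group('line')), m.group('text')))
    return snippets

# group_report('report')['brit/Britannica08.txt'] -> the craps/gambling snippet above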

Now, is that it? Of course not. But we've gotten past the long part - searching the massive database and retrieving relevant material. Next comes the hard part - analyzing the results to find which word is the answer. That part is not so much about resources as it is about algorithms. As such, the minimum resource requirement is primarily determined by the initial search.

An alternate search command was used with the semi-binary file. It is almost as large (234MB), but differs in that it is just one file (and thus less affected by the file system) and that it mixes text with binary data.
fgrep -i -a -H -n -r -w -F --file=txtin nv.esm >report3
This produced a much larger output file of 1,631 kB, due to frequent matches and long lines. Despite that, as can be seen in the time log, it actually was searched through a lot faster.

The third test was a large-dictionary test, seeing how you were concerned that searching would have to be done for every word. In this example, txtin4 contained 1,000 words. In order to avoid everything matching, the dictionary was cut from the end of the official Scrabble long-word list.
Running it on Britannica with -i still resulted in enormous output, so -i was turned off:
fgrep -a -H -n -r -w --before-context=1 --after-context=6 -F --file=txtin4 brit >report4
That still retrieved 70 matches (and 34 kB), such as for instance:

Quote:

brit/Britannica27.txt-217520-
brit/Britannica27.txt:217521:VESUVIANITE, a rock-forming mineral of complex com-
brit/Britannica27.txt-217522-position. It is a basic calcium and aluminium silicate con-
brit/Britannica27.txt-217523-taining small amounts of iron, magnesium, water,
brit/Britannica27.txt-217524-fluorine, etc., and sometimes boron; the ap-
brit/Britannica27.txt-217525-proximate formula is H2Ca6(Al,Fe)sSi5Oi8. It
brit/Britannica27.txt-217526-crystallizes in the tetragonal system, but often
brit/Britannica27.txt-217527-exhibits optical anomalies, and the optical sign


As the log will show, dictionary size did not increase search time at all, just as is to be expected with the Aho–Corasick algorithm employed. On the contrary, it took somewhat less time, most likely due to fewer matches and hence smaller output.

For the fourth and last test, the same large dictionary search was run on the semi-binary file:
fgrep -i -a -H -n -r -w -F --file=txtin4 nv.esm >report6
As can be seen, -i is on in this case, but additional context was not requested. Due to the file's structure, 'lines' can be overly long.
The search produced 26 matches, forming a 132 kB file.

Finally, here are some timelog excerpts:
Quote:

03\03\2011 23:39:24.421 SearchBegin
03\03\2011 23:39:26.781 SearchEnd
03\03\2011 23:39:37.546 SearchBegin
03\03\2011 23:39:39.906 SearchEnd
03\03\2011 23:44:45.718 SearchBeginLargeDic
03\03\2011 23:44:47.046 SearchEndLargeDic
03\03\2011 23:44:50.843 SearchBeginLargeDic
03\03\2011 23:44:52.109 SearchEndLargeDic
03\03\2011 23:52:40.593 SearchBeginBinary
03\03\2011 23:52:41.750 SearchEndBinary
03\03\2011 23:52:44.156 SearchBeginBinary
03\03\2011 23:52:45.328 SearchEndBinary
03\03\2011 23:53:31.093 SearchBeginBinaryLargeDic
03\03\2011 23:53:32.531 SearchEndBinaryLargeDic
03\03\2011 23:53:33.421 SearchBeginBinaryLargedic
03\03\2011 23:53:34.859 SearchEndBinaryLargedic
03\04\2011 18:43:45.031 CatBegin
03\04\2011 18:43:47.421 CatEnd
03\04\2011 18:47:25.265 CatBeginBinary
03\04\2011 18:47:27.875 CatEndBinary


While this is a small sample, and there was a lot of variance, it does represent relative numbers to work with.

Now, how does this compare to raw RAM speed? To test that, I ran a simple cat brit\* >out command, which measures RAM read-write speed. It was also done with both samples. As one can see, it takes about as long as a search. In the case of the semi-binary file, the search actually took considerably less time than a linear copy, since the algorithm was able to hop through the binary parts much quicker.

The above is also part of the reason why using a big PC for this test would be pointless. On the netbook, search time is consistent between trials. On the PC, the search actually executes faster from a Vertex 2 SSD than from a theoretically much faster ramdrive (hampered by its driver), and the CPU and controllers cache everything to the point that further trials only take tens of milliseconds. With extended samples, however, the pattern remains: a full directory search, be it with a small or large dictionary, takes almost exactly as much time as simply reading through the file (copying it to RAM).
Resist ANFO Boston PRISM Stormfront IRA Freedom CIA Obama
weaselman
weaselman
  • Threads: 20
  • Posts: 2349
Joined: Jul 11, 2010
March 4th, 2011 at 4:36:31 PM permalink
So, you are seeing about 3 seconds spent scanning about a quarter of the total data amount. This is about the ballpark I had in mind.
Are we in agreement now that your "naive" full-scan approach would not work for real-time queries?

A few notes:
1. Aho–Corasick is linear in the total length of the patterns being searched. It will take proportionally longer to search with a long pattern list than with a short one. The reason you are seeing a decrease in the search times in your log is that there is a significantly lower number of matches (it is linear in that too).

2. grep (and all its flavors) is stream-oriented. It does not "copy the file to RAM", then scan it, and spit out the output. The file is read sequentially, piped through grep, and whatever makes it out is written to disk. All three pieces happen in parallel. Reading usually does not take any time at all (I assume we are talking about a linux box), because the entire contents are already in the kernel cache (or if they weren't initially, they will be after your first search). Writing might indeed cost something, but see the next point.

3. Writing is buffered and goes into the kernel cache at first anyway, so it does not slow down the search (though it may indeed impose a lower bound, depending on how fsync is configured in your kernel - I am assuming it is a linux box). Subtracting the time cat takes from the time of the search won't tell you very much. Besides, your 'cat' command measures the time to write the entire input to disk - nothing like what the actual search did (writing lots of output gets expensive faster than linearly, because it ends up pushing the stuff you are reading out of the cache). Try replacing >report with >/dev/null in your command - that will eliminate writing (you won't be able to see the output, but who cares); see the sketch below this list.

4. You are wrong in your assumption that analyzing the results is less computationally expensive than searching, but that doesn't matter, since searching itself has been demonstrated to require an infeasible amount of time.
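
To illustrate point 3, a rough sketch (Python; it reuses the fgrep invocation from the test above and assumes the same txtin/brit layout) that times the same search with and without output writing:

import os
import subprocess
import time

# The same fgrep invocation as in the test; only the output target varies.
CMD = ['fgrep', '-i', '-a', '-H', '-n', '-r', '-w',
       '--before-context=3', '--after-context=3', '-F', '--file=txtin', 'brit']

def time_search(outfile):
    """Run the search once, sending output to `outfile`, and return elapsed seconds."""
    start = time.perf_counter()
    with open(outfile, 'wb') as out:
        subprocess.run(CMD, stdout=out, check=False)
    return time.perf_counter() - start

# Comparing time_search('report') with time_search(os.devnull)
# isolates the cost of writing the output from the cost of the scan itself.
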
"When two people always agree one of them is unnecessary"
P90
P90
  • Threads: 12
  • Posts: 1703
Joined: Jan 8, 2011
March 4th, 2011 at 7:43:45 PM permalink
Quote: weaselman

So, you are seeing about 3 seconds spent through about a quarter of the total data amount.


I am seeing 1.5-2.5 seconds through the entire data amount. Indexes that would be used to speed up the search do not count towards data size; on the contrary, they would reduce the time spent.
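
To make the "minimal indexing" idea concrete, here is a sketch of a plain inverted index (Python; the names are made up, and this is nothing like Watson's actual index structures). Term lookups become set intersections over posting lists instead of a scan of the whole text:

import re
from collections import defaultdict

def build_index(documents):
    """documents: dict of doc_id -> text. Returns term -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in set(re.findall(r"[a-z0-9']+", text.lower())):
            index[term].add(doc_id)
    return index

def lookup(index, terms):
    """Documents containing every term: intersect the posting sets."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# index = build_index({'GAMBLING': '...article text...', 'VESUVIANITE': '...'})
# lookup(index, ['dice', 'craps'])  # candidate articles without rescanning the full text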

1.5-2.5 seconds (even if it's only part of the process) would be a perfectly acceptable time under most circumstances. Turning ponder off was an artificial handicap based on the system overperforming as it was. In human-computer matches in all other games, it's normal to have ponder on. In this case it would be both pre-call and post-call.


Quote: weaselman

1. Aho–Corasick is linear on the total length of the patterns being searched. It will take proportionally longer to search a long pattern list than it does to search a short list. The reason you are seeing a decrease in the search times in your log is because there is a sifignificantly lower number of matches (it is linear on that too).


Not quite, neither theoretically nor practically. The Aho-Corasick algorithm is linear in the sum of the input length, the total pattern length and the number of matches. Not in each part separately, but only in the total. This is a crucial distinction - the difference between 10 and 1,000 words is decisive when searching through one page, but insignificant when searching through a 40-million-word database.
With a large database, the input is the overwhelming part, and as a result it only takes a fraction of a percent more time to look for 1,000 words than for 1 word (as long as the matches don't amount to copying the whole input into the output). That is how the algorithm works in theory, and it is how it works in practice here. All that measurably affects the running time is the data size.
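
As an illustration of that complexity claim (a sketch only; it relies on the third-party pyahocorasick package, assumed to be installed, and is not what fgrep does internally), the automaton is built once from the dictionary and the text is then scanned in a single pass, however many patterns the dictionary holds:

import ahocorasick  # third-party "pyahocorasick" package, assumed installed

def find_matches(patterns, text):
    """Single pass over `text` regardless of dictionary size:
    roughly O(len(text) + total pattern length + number of matches)."""
    automaton = ahocorasick.Automaton()
    for word in patterns:
        automaton.add_word(word, word)   # value to report when this word is hit
    automaton.make_automaton()           # builds the failure links once
    # iter() walks the text once and yields (end_offset, value) for every occurrence
    return [(end, word) for end, word in automaton.iter(text)]

# find_matches(['craps', 'hazard', 'dice'], volume_text)
# -> e.g. [(1234, 'dice'), (5678, 'craps'), ...]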


Quote: weaselman

2. grep (and all its flavors) is stream oriented. It does not "copy file to RAM", then scan it, and spit out the output.


Yes, of course. I even mentioned this when pointing out that grepping through the file took less time than simply copying it to RAM. This was especially pronounced when searching through a PC drive.


Quote: weaselman

Reading usually does not take any time at all (I assume, we are talking about a linux box), because the entire contents is already in the kernel cache (or if it wasn't initially, it will be after your first search).


It would apply to a very small file being searched, but not here. An Atom-based netbook doesn't have much space for caching, especially once half its memory has been taken for the ramdrive; even then it wouldn't matter, since it's RAM either way. The CPU has no cache to speak of in this context.


Quote: weaselman

3. Writing is buffered and goes into kernel cache at first anyway, so it does not slow down the search (though, it may indeed impose the low limit, depending on how fsync is configured in your kernel - I am assuming it is a linux box).


Well, it's a netbook - there isn't enough space to cache even a 1.6-MB file. Larger outputs (the above is only a basic selection of the trials) still kept times moderate up to the point where they got outright massive (a considerable fraction of the database).

On the PC it appeared to be a factor, but the PC wasn't used in the tests, being too fast and sophisticated for the purpose and producing highly inconsistent times in the tens and hundreds of milliseconds.


Quote: weaselman

4. You are wrong in your assumption that analyzing the results is less computationally expensive than searching, but that doesn't matter, since searching itself has been demonstrated to require an infeasible amount of time.


It actually wasn't originally my assumption; rather it was your remark that reducing the number of algorithms would not reduce the time significantly. But even leaving that aside, language analysis algorithms are nowhere near as dependent on brute force as searches. There are no millions of pages to go through; they just apply a function to the input. Since the input is relatively small, no massive-scale operations are involved. It's a situation where finesse is more important than cycles.

Also, keep in mind the machine in question was only equivalent to a very old desktop, and it wasn't really working near its full capacity, being slowed down by a mish-mash of software not designed to work together and by routing everything through OS functions, again not designed for the purpose. With software built up properly, focused on a single purpose, optimized for it, and with no regard for the battery, even that machine would deliver much faster times than it did in this jury-rigged configuration. Would it be enough to beat champions? Questionable, even with far better programming; but an average person would easily be beaten, as long as the language analysis algorithms don't pick "and" as the answer.
Resist ANFO Boston PRISM Stormfront IRA Freedom CIA Obama
weaselman
weaselman
  • Threads: 20
  • Posts: 2349
Joined: Jul 11, 2010
March 5th, 2011 at 10:19:09 AM permalink
Quote: P90

I am seeing 1.5-2.5 seconds through the entire data amount. Indexes that would be used to speed up search do not count towards data size, on the contrary they would allow to reduce the time spent.


No, I mean the 1911 Britannica does not have all the info you'd need for Jeopardy. I estimate the amount you'd need to add at about 3 times as much.

Quote:

1.5-2.5 seconds (even if it's only part of the process) would be a perfectly acceptable time under most circumstances.


I don't know what "most circumstances" means; it's certainly not acceptable for playing Jeopardy.
Besides, don't forget that this time is only a rough estimate of an absolute, unattainable lower bound. The real search will take longer (you need to find documents containing all terms, or some terms with only certain exceptions; you'll need to locate document boundaries; you'll need to do stemming to search for different grammatical forms, and probably employ a thesaurus as well; you'll need to analyze the relative frequencies of terms and matches, and their proximity within a document; etc., etc.), and you will likely require more than one search. And, more importantly, this is only the first step of the process we are discussing, and it is already taking unacceptably long. We have not even approached the question of what happens after the sets of matches are found - sorting, intersecting, analyzing, weighing, re-sorting, re-intersecting, analyzing again, etc., etc.


Quote:

Not quite, neither theoretically nor practically. Aho-Corasick algorithm is linear on the sum of input, pattern and matches.


You do know what associativity is, right?


Quote:

Yes, of course. I even mentioned this, when pointing that grepping through the file took less time than simply copying it to RAM. This was especially pronounced searching through a PC drive.


You never copied the file to RAM (it was in RAM all the time, because of the kernel cache).
The reason grep took less time than cat was that it was not writing as much stuff to disk.

Quote:


It would apply to a very small file being searched, but not here. An Atom based netbook doesn't have much space for caching, especially once half its memory has been taken for the ramdrive; even then it wouldn't matter, since it's RAM either way.


Ah, I did not realize you were using a RAM drive. Try running the same thing without it - chances are it will be faster. The Linux filesystem is very good at managing the file cache, and the kernel cache is also accessed much faster than regular user-level RAM.
Was the output file on the RAM drive too?


Quote:


Well, it's a netbook - there isn't enough space to cache even a 1.6-MB file.


I am pretty sure you are mistaken. How much memory does it have?
Try free -m. What does it say?


Quote:

It actually wasn't originally my assumption; rather it was your mention that reducing the number of algorithms would not reduce the time significantly.


When you run them on 3000 CPUs, it would not.

Quote:

But even leaving that aside, language analysis algorithms are not anywhere nearly as dependent on brute force as searches.


Searches are normally not dependent on brute force either. Even if something does not depend on brute force, though, that does not mean it is free.

Quote:

Also, keep in mind the machine in question was only equivalent to a very old desktop,


Oh, no ... Not even close.
"When two people always agree one of them is unnecessary"
P90
P90
  • Threads: 12
  • Posts: 1703
Joined: Jan 8, 2011
March 5th, 2011 at 1:11:45 PM permalink
Quote: weaselman

Besides, don't forget, that this time is only a rough, absolute, unattainable lower boundary estimate.


Not exactly. It's a practical, attained figure, with a very crude, inefficient, jury-rigged implementation. The actual time in a proper implementation would be lower due to the lack of delays from HDD emulation, and lower still due to the benefits of at least minimal indexing.

In practical terms, 2 seconds is a reasonable time scale. A computer operating as a black box would have about 2-3 seconds between hearing the question and having to deliver the answer. It would have to decide whether to take this one before buzzing, but then again that could be decided while the question is being delivered. That puts it well within the realm of at least theoretical possibility.


Quote: weaselman

The real search will take longer (you need to find documents, containing all terms, or some terms with only certain exceptions, you'll need to locate document boundary,


All of these have little effect. The search already found the documents, with known boundaries. Filtering results through a match matrix is not implemented in the grep function, but it's a trivial task.


Quote: weaselman

you'll need to do stemming to search for different grammatical forms, probably, also employ some thesaurus as well


Part of pre-search processing, and it's a very brief and simple process compared to the search itself. In search terms, this just means a larger dictionary.


Quote: weaselman

you'll need to analyze the relative frequencies of terms and matches, and their proximity within document, etc.,


Post-search results processing. Once again, it's done on a small volume of data actually retrieved, as opposed to the large total database.


Quote: weaselman

And, more importantly, this is only the first step of the process we are discussing, that is already taking unacceptably long. We have not even approached the question of what happens after the sets of matches are found - sorting, intersecting, analyzing, weighing, re-sorting, re-intersecting, analyzing again etc., etc.


It depends on how many sets of matches are found. If it's a very large number, then this would be a long task indeed. But for a moderate number of matches, it comes down to a relatively small number of cycles, at least for machines that can crack a billion operations per second. It's the same reason flying a spacecraft takes so much less processing power than playing a movie: it's volume that computers have trouble with.


Quote:

Quote:

Not quite, neither theoretically nor practically. Aho-Corasick algorithm is linear on the sum of input, pattern and matches.


You do know what associativity is, right?


I do. What of it? You misunderstood my point - I was pointing out that doubling the dictionary size does not double the search time, it only increases it by a small fraction.


Quote:

The reason grep took less time than cat was that it was not writing as much stuff to disk.


Among other things. Though both were writing to the ramdrive.


Quote:

Ah, I did not realize you were using RAM drive. Try running the same thing without it - chances are it will be faster. Linux filesystem is very good managing file cache. Kernel cache is also accessed much faster than regular user-level RAM.
Was the output file on the RAM drive too?


Yes, on the ramdrive as well. Running a search off the HDD was extremely slow. It's pretty fast (a second) on the desktop, but that uses an SSD reading/writing a quarter gigabyte per second, right around the database's size.


Quote:

I am pretty sure you are mistaken. How much memory does it have?


A gigabyte in total. Over half of that (540MB) was taken by the ramdrive, some by other things. I removed the drive long ago, so I won't be reproducing the circumstances. But that doesn't matter much, since it was already working off a RAM drive; there aren't a lot of faster things out there. Also, I did multiple repeated tests, alternating between different files, and the time taken didn't change significantly on the netbook, neither for cat nor for the search.


Quote:

When you run them on 3000 CPUs, it would not.


But you only need 3,000 CPUs to run 3,000 (actually more like 12,000) algorithms concurrently. If you only run one ultimate algorithm, or, say, a dozen well-evolved ones, you don't need 3,000 CPUs.


Quote:

Oh, no ... Not even close.


Well, does a decade count as old enough? The Atom has ~3,300 MIPS at full speed, ~2,200 underclocked (power saving). An Athlon 1.2, available in 2000, had ~3,500 MIPS. Either is a small fraction of a modern desktop CPU's power (150,000 MIPS). Of course, IPS isn't all there is to it, but nevertheless, it's far removed from a modern desktop.
Resist ANFO Boston PRISM Stormfront IRA Freedom CIA Obama
weaselman
weaselman
  • Threads: 20
  • Posts: 2349
Joined: Jul 11, 2010
March 5th, 2011 at 1:44:51 PM permalink
Quote: P90

Not exactly. It's a practical, attained figure, with a very crude and inefficient, jury-rigged implementation. The actual time in a proper implementation would be lower due to the lack of delays from HDD emulation,


That's not what I am talking about. The real-life implementation would be much more complex, because the primitive query you were running would be useless for its purposes. The whole point of the test was a quick estimate of how fast a large chunk of memory can be scanned. It establishes the lower bound for any sequential search, but in no way provides an estimate of the actual time it would take.

Quote:

and further lower due to the benefits of at least minimal indexing.


Minimal indexing does not benefit general-purpose queries at all. If anything, it makes them slower. You can create an index to speed up a particular narrow class of queries, or two indexes to speed up two such classes, but the vast majority of the queries you need to run do not benefit from the presence of those indexes in any way.

Quote:

In practical terms, 2 seconds is a reasonable time scale.


Not in my book, it isn't.

Quote:

A computer operating as a black box would have about 2-3 seconds between hearing the question and having to deliver the answer.


First, no, it would not. Second, the actual time your query would require on a real-life data set is more like 8-10 seconds. Third, again, we are talking about the absolute lower bound, not the actual time, which would still be significantly more. Fourth, the scan is only the first step; even after it has been completed, the computer would still be a long way from actually having an answer to the question.


Quote:

All of these are of little effect.


No, they are not. You are way off on this one.

Quote:

The search already found documents, with known boundaries.


No, the boundaries are not known in your approach. And the documents identified (if they were indeed identified, which they are not) are not quite the ones that are actually needed.

Quote:

Filtering results through a match matrix is not implemented in the grep function, but it's a trivial task.


It is anything but trivial. Trust me, if it was trivial, it would be implemented.


Quote:

Part of pre-search processing, and it's a very brief and simple process compared to the search itself. In search terms, this just means a larger dictionary.


That's not how it is done in practice. Are you suggesting generating every possible grammatical form of every term of the query?
There are hundreds, if not thousands, of possibilities. Not only will it blow up your query beyond any reason, it will also take significant time (comparable to the search time itself, probably more) to create the dictionary in the first place.



Quote:

Post-search results processing. Once again, it's done on a small volume of data actually retrieved, as opposed to the large total database.


The volume of data is not actually that small, as you had a chance to see in one of your examples. For that reason, this, again, is not how it is done in practice. In your case, the problem is further complicated by the fact that the results of the search are lines of text containing matches, not actual documents. At the very least, you will need a series of secondary scans to locate the actual documents for analysis.

Quote:


It depends on how many sets of matches are found. If it's a very large number, then this would be a long task indeed. But for a moderate number of matches, it comes down to a relatively small number of cycles, at least in terms of machines that can crack a billion operations per second.


Sure. Or, if only one match is found, you are done right away. But what of it?
Reducing the number of potential matches is a big problem. That is exactly why things like frequency analysis, positional matching, and counting the matches and the matched terms per document cannot be done as "post-search" processing.

Quote:

I do. What of it? You misunderstood my point - I was pointing out that doubling the dictionary size does not double search time, only increases it by a small fraction.


I did not say it doubled it. It is a linear increase though.



Quote:

But you only need 3,000 CPU to run 3,000 (actually more like 12,000) algorithms concurrently.


Provided that the algorithms themselves are not parallelizable, which is very unlikely.

Quote:

If you only run one ultimate algorithm, or, say, a dozen well evolved ones, you don't need 3,000 CPU.



Except that the "one ultimate algorithm" is (in the universe we live in) composed of running 3000 sub-algorithms and weighing the results.


Quote:

Well, does a decade count as old enough? Atom has ~3,300 MIPS at full speed, ~2,200 underclocked (power saving). Athlon 1.2, available in 2000, had ~3,500 MIPS. Either is a small fraction of a modern desktop CPU's power (150,000 MIPS). Of course, IPS isn't all there is to it, but nevertheless, it's far removed from a modern desktop.


MIPS doesn't really tell you anything when talking about CISC architectures (which both AMD and Intel use). The devil is in the details of the exact realization of the complex instructions on the RISC core (which has evolved a long way in the last decade), and, perhaps even more importantly for the case at hand, in the bus and memory speeds. I am assuming you are running some dual-channel DDR3 in your netbook. There was nothing comparable to that ten years ago.
"When two people always agree one of them is unnecessary"
P90
P90
  • Threads: 12
  • Posts: 1703
Joined: Jan 8, 2011
March 5th, 2011 at 5:29:57 PM permalink
Quote: weaselman

That's not what I am talking about. The real life implementation would be much more complex, because the primitive query you were running would be useless for its purposes. The whole point of the test was a quick estimate of how fast a large chunk of memory can be scanned. It establishes the lower bound of any sequential search, but in no way provides an estimate for the actual time it would take.


It establishes the time a sequential search would take for this data set in this implementation (which is a crude one). What exactly is being searched for does not significantly affect the search time, even with a 1,000-word dictionary. It's the actual time for almost any search, further overstated due to the ramdrive's inefficiency, so it's more like an upper bound, seeing how the algorithm's theoretical potential is much higher.


Quote: weaselman

First, no, it would not. Second, the actual time your query would require on the real-life data set is more like 8-10 seconds.


Encyclopedia Britannica is a real-life data set.


Quote: weaselman

No, the boundaries are not known in your approach. And the documents, identified (if they were indeed identified which they are not) are not quite the ones that are actually needed.


They are identified - that's the first part of the search output. In this case it was the volume name, but it could just as well be the article name.


Quote: weaselman

It is anything but trivial. Trust me, if it was trivial, it would be implemented.


It is trivial and it has long been implemented. Just not in fgrep, which isn't meant to do anything remotely near this.


Quote: weaselman

That's not how it is done in practice. Are you suggesting to generate every possible grammatical form of every term of the query?


Rather the reverse: reduce the word to its base form. Then add its irregular variations that cannot be accounted for in a simple way. Fortunately, English has few forms for each word, and most verbs are regular.
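
A sketch of what that pre-search step could look like (Python, using NLTK's Porter stemmer as a stand-in; the irregular-forms table here is a made-up placeholder):

from nltk.stem import PorterStemmer   # assumes the NLTK package is installed

stemmer = PorterStemmer()

# Hypothetical table of irregular forms that plain suffix-stripping misses.
IRREGULAR = {'went': 'go', 'mice': 'mouse', 'geese': 'goose'}

def normalize(word):
    """Reduce a query or dictionary word to a base form before matching."""
    word = word.lower()
    return IRREGULAR.get(word, stemmer.stem(word))

# normalize('reversed') and normalize('reverses') both give 'revers',
# so one dictionary entry covers the regular grammatical variations.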


Quote: weaselman

In your case, the problem is further complicated by the fact that the results of the search are lines of text containing matches, not actual documents. At the very least, you will need a series of secondary scans to locate the actual documents for analysis.


It's not scans, it's simple file access - the filename is at the beginning of the reported line. But again, the reason it retrieves lines is that grep is designed to do that. It's not part of the base search algorithm itself; the function could just as easily retrieve whole documents, a set number of symbols, or anything else.

You really seem to be either inventing problems or working under the assumption that I'm trying to actually build an fgrep-based question-answering machine. I'm not; grep was only used as a demonstration of the time the search takes.


Quote: weaselman

Except that "one ultimate algorithm" is (in the universe we live in) comprised of running 3000 sub-algorithms and weighing the results.


And it would run 300,000 sub-algorithms if it were designed in 2030. We've been over this; the number of algorithms required for noise averaging depends exponentially on individual algorithm quality. Even plain evolutionary selection and development of algorithms will improve individual algorithm quality, leading to queries being completed with fewer resources. This is one of the primary goals of the research.


Quote: weaselman

I am assuming, you are running some dual-channel DDR3 in your netbook. There was nothing comparable to that ten years ago.


Haha, I wish. But that wouldn't really be a netbook. It doesn't even support dual-channel memory. No, it's single-channel DDR2-400, and it's further underclocked together with the FSB when running on battery power.

Due to underclocking and poor timings, it's actually slower than the DDR-333 that desktops had a decade ago. So it's really quite a good stand-in for a decade-old desktop on all counts.

Now, about triple-channel DDR3 in the desktop... that works out too close to a dummy run of the command with an empty dictionary on empty files, and it is too inconsistent to take a measurement.
Resist ANFO Boston PRISM Stormfront IRA Freedom CIA Obama
weaselman
weaselman
  • Threads: 20
  • Posts: 2349
Joined: Jul 11, 2010
March 5th, 2011 at 5:56:41 PM permalink
Quote: P90

It establishes the time a sequential search would take for this data set in this implementation (which is a crude one). What is being searched for exactly does not significantly affect search time, even with a 1000-word dictionary.


Yeah, it does. Not "what" as in "what particular terms", but "what" as in what question is asked. ('all terms' as opposed to 'at least one term', 'number of matches' as opposed to 'at least one match', 'terms close to each other' as opposed to 'any term anywhere', 'close to beginning weighs higher' as opposed to 'any match', etc.).

Quote:

Encyclopedia Britannica is a real-life data set.


I meant a real life dataset suitable for playing Jeopardy.


Quote:

They are identified. The first part of the search output. In this case it was volume name, but it could just as well be article name.


Do you mean creating one file per article? That will take significantly longer to search.

Quote:

It is trivial and it has long been implemented. Just not in fgrep, which isn't meant to do anything remotely near this.


What "trivial implementation" do you have in mind?

Quote:

Rather the reverse, reduce the word to its base form. Then add its irregular variations that can not be accounted for in a simple way. Fortunately, English has few forms for each word and most verbs are regular.


How is it "the reverse"? It is not about how complicated it is to generate all the variations; the problem is that there is a very large number of them.

Quote:

It's not scans, it's simple file access - the filename is at the beginning of the reported line.


File access is a scan (of the filesystem).

Quote:

But again, the reason it retrieves lines is because grep is designed to do that. It's not part of the base search algorithm itself; the function could just as easily retrieve whole documents, a set number of symbols, anything else.


No, it could not do it "as easily". It would have to backtrack and fast-forward to determine the boundaries, switch context, navigate indexes to locate the document, copy memory around to generate the result, etc., etc.

Quote:

You really seem to just be either inventing problems or working under the assumption that I'm trying to actually build a fgrep-based question answering machine. I'm not, grep was only used as a demonstration of time it takes for the search.


No, to the contrary, my point is exactly that the fgrep test is nothing close to the actual search algorithm; it is a much simpler and, consequently, much faster way to scan a large chunk of memory. It is a good demonstration that the real algorithm cannot be any faster than this, but it is in no way an indication that the real algorithm's performance can be close to fgrep's.

Quote:

And it would run 300,000 sub-algorithms if it was designed in 2030.
We've been over this;


Exactly. We've been over my refusal to discuss futuristic hypothetical algorithms. I am talking about today's technology.

BTW, can you point me to the text of the Britannica you were using for your testing? I am curious to play around with it for myself ...
"When two people always agree one of them is unnecessary"
P90
P90
  • Threads: 12
  • Posts: 1703
Joined: Jan 8, 2011
March 5th, 2011 at 8:09:14 PM permalink
Quote: weaselman

Yeah, it does. Not "what" as in "what particular terms", but "what" as in what question is asked. ('all terms' as opposed to 'at least one term', 'number of matches' as opposed to 'at least one match', 'terms close to each other' as opposed to 'any term anywhere', 'close to beginning weighs higher' as opposed to 'any match', etc.).


Grep already has an option for reporting the number of matches. Almost all of these things are really just tweaks to how exactly the output is produced or what is excluded from it. Even separating terms close to each other from any-term-anywhere would only be a matter of comparing byte offsets, as sketched below. These things are not a matter of the search algorithm itself, just of the output routine. The search routine already provides all the information required to determine this and to filter the output if required.
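
For instance, the proximity case could be a hypothetical helper along these lines (Python; it assumes the search stage reports each match for a document as a (term, byte offset) pair, which grep can provide via its -b byte-offset option):

def terms_within_window(matches, required_terms, window=200):
    """matches: (term, byte_offset) pairs reported for one document.
    True if every required term occurs inside some `window`-byte span."""
    required = set(required_terms)
    positions = sorted((off, term) for term, off in matches if term in required)
    counts = {}
    start = 0
    for off, term in positions:
        counts[term] = counts.get(term, 0) + 1
        # shrink the window from the left while it spans more than `window` bytes
        while off - positions[start][0] > window:
            left = positions[start][1]
            counts[left] -= 1
            if counts[left] == 0:
                del counts[left]
            start += 1
        if len(counts) == len(required):
            return True
    return False

# terms_within_window([('dice', 912), ('craps', 1004)], {'dice', 'craps'}) -> True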


Quote: weaselman

What "trivial implementation" do you have in mind?


Get the matched term from the search routine and increment the appropriate position in the match matrix. At the end of each document (if documents aren't files, use a document-begin identifying keyword, or see below for another option), verify whether the match matrix fits the condition set. A sketch follows below.
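
A minimal sketch of that routine (Python; the function and parameter names are made up, and the "all of / any of / none of" conditions mirror the example discussed further below):

from collections import Counter

def check_document(matched_terms, must_have, any_of=(), must_not=()):
    """matched_terms: the terms the search routine reported for one document.
    The per-document 'match matrix' is just a term -> count table;
    at the document boundary it is tested against the condition set."""
    counts = Counter(matched_terms)
    if any(counts[t] == 0 for t in must_have):
        return False
    if any_of and not any(counts[t] for t in any_of):
        return False
    if any(counts[t] for t in must_not):
        return False
    return True

# check_document(['craps', 'dice', 'dice'], must_have=['craps', 'dice'],
#                must_not=['poker'])   # -> True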


Quote: weaselman

How is it "the reverse"? It is not about how complicated it is to generate all the variations, the problem is that there is a very large number of them.


Really? How many variations of the word "reverse" are there, for instance?


Quote: weaselman

No, it could not do it "as easily". It would have to backtrack, fastforward, to determine the boundaries, switch context, navigate indexes to locate the document, copy memory around to generate the result, etc., etc.


It already does all of the things required in order to deliver the matched line and its context (lines before and lines after). Except that now the boundaries are defined by the line-break symbol; you'd use a document-break instead.

In fact, if you just keep every document as one paragraph, grep will be doing this very thing already. Problem solved.


Quote: weaselman

BTW, can you point me to the text of the Britannica you were using for your testing? I am curious to play around with it for myself ...


Here it is.
Resist ANFO Boston PRISM Stormfront IRA Freedom CIA Obama
weaselman
weaselman
  • Threads: 20
  • Posts: 2349
Joined: Jul 11, 2010
March 6th, 2011 at 8:36:14 AM permalink
Quote: P90

Grep already has an option for reporting the number of matches.


Total number, not per document, not per term.

Quote:

Almost all of these things are really just tweaks to how exactly to do the output or what to exclude from it.


No, they aren't. They amount to an entirely different query algorithm.

Quote:

Even separating terms close to each other from any-anywhere would only be a matter of comparing byte offsets.


Offsets from where?

Quote:

These things are not a matter of search algorithm itself, just output routine.


No, they are exactly the matter of the algorithm.

Quote:


Get matched term from the search routine, increment appropriate position in the match matrix. At the end of each document (if they aren't "files", have a document-begin identifying keyword, or see below for another option), verify if the match matrix fits the condition set.


Search matrix? Which search matrix?


Quote:

Really? How many variations of the word "reverse" are there, for instance?


I dunno. Do you?

Quote:


It already does all of the things required in order to deliver the matched line and context (lines before and lines after). Except now the boundaries are defined by the line-break symbol, you'd use document-break.


You said before that you did not want the documents separated. I have trouble keeping up with you conceding various points sometimes :)
Quote:


Here it is.


You said you used 29 volumes; I see 32 there.
"When two people always agree one of them is unnecessary"
P90
P90
  • Threads: 12
  • Posts: 1703
Joined: Jan 8, 2011
March 6th, 2011 at 9:51:06 AM permalink
Quote: weaselman

Total number, not per document, not per term.


Per document, yes (if files are documents; there's no reason it can't be done per line or any other separator). Not per term, though.
But, once again, you don't have to switch to searching each term separately to return matches per term. All you need to do is add a couple of lines of code for the output into the part executed when a match is found. It's still going to be about as fast.


Quote: weaselman

Offsets from where?


Offsets from the file start. It would not be difficult to add offsets from the line start either.


Quote: weaselman

Search matrix? Which search matrix?


Match matrix. I thought you wanted to only report results that include all of terms A,B,C..., any of terms D,E,F..., none of terms G,H,I..., et cetera? Then the algorithm's output per document would be something like 1,1,2,0,3,1,0,0,0, which you'd have to compare against your conditions.


Quote: weaselman

I dunno. Do you?


Yes. "Reverse", "reversed", "reverses". Or "revers", more broadly, without whole-word-only checking.


Quote: weaselman

You said before that you did not want the documents separated. I have trouble keeping up with you conceding various points sometimes :)


You don't need them separated. You can have them separated or not, as you prefer. This will affect neither the output nor the complexity (to any measurable extent). If they are not separated, the output routine will get the document name from the pointer index, updating it when the memory offset rolls past the beginning of the next document; if they are separated, it will get the document name from the search routine when it encounters document-break symbols.

Keeping documents separated with a line-break is just how you would achieve it while using bog-standard grep, without bothering to dig up the code and recompile it. Since grep was originally intended for searching through program code and plain text, delivering output by line was the most convenient option. And it originally worked with files in file systems, where a document is synonymous with a file. But those are features of a particular implementation.


Quote: weaselman

You said you used 29 volumes, I see 32 there.


29 of the original 1911, another 3 are from a later edition (1922).

I could get more data, from there or from other sources, but the netbook was already severely slowed down after creating a ramdrive of the required size (2x the data size, to test copying and compare two datasets); anything substantially larger it just couldn't handle without issues.
Resist ANFO Boston PRISM Stormfront IRA Freedom CIA Obama