Indizierte Suche

M_Kay · 18. Januar 2011

Hallo allerseits,

ich hätte da mal eine Frage bzgl. einer indizierten Suche, die ich mit PHP + MySQL 5 implementiert hab:

Ich habe 3 Tabellen:

1. documents (id, text)
Diese Tabelle enthält Text-Dokumente (ca. 300.000). Sie werden indiziert und die Wörter werden in folgender Tabelle abgelegt:

2. mywords(id, word)
Sie enthält ca 150.000 Begriffe.

3. myindex (doc_id, word_id)
Diese Tabelle enthält die Zuordnung der Wörter aus der mywords-Tabelle zu den Dokumenten in der documents-Tabelle. Sie stellt also meinen Suchindex dar und enthält ca 5 Millionen Einträge.

Wenn ich jetzt ein Document suchen will anhand eines Begriffes, könnte ich z.B. folgendes machen:

Code:

SELECT `myindex`.doc_id
FROM mywords
JOIN `myindex` ON `myindex`.word_id = mywords.id
WHERE mywords.word LIKE 'emmi%'

Durch die Verwendeten Primary Keys/Indizes ist das auch schön schnell. Die Antwort kommt im 0,000X-Sekunden-Bereich.

Mein Problem ist jetzt, wie ich es schaffe die Suche durch weitere Suchwörter einzuschränken, ohne, dass die Geschwindikeit stark drunter leidet. Mein erster Versuch waren weitere Joins, z.B.:

Code:

SELECT `myindex`.doc_id

FROM mywords AS words1
JOIN `myindex` ON `myindex`.word_id = words1.id

JOIN mywords AS words2
JOIN `myindex` AS index2 ON index2.word_id = words2.id

JOIN mywords AS words3
JOIN `myindex` AS index3 ON index3.word_id = words3.id

WHERE 1
	AND words1.word LIKE 'emmi%'
	AND words2.word LIKE 'sam%'
	AND words3.word LIKE 'bree%'
	AND `myindex`.doc_id = index2.doc_id
	AND `myindex`.doc_id = index3.doc_id

Für 2 Suchwörter hat die Suche noch 2 Sekunden gedauert, bei den 3 Wörtern schon 35 Sekunden (... auch wenn PMA etwas anderes anzeigt) und die CPU war komplett ausgelastet.

Meine nächste Idee ging in Richtung verschachtelter Abfragen:

Code:

SELECT `myindex`.doc_id
FROM
	(
		SELECT `myindex`.doc_id
		FROM
			(
				SELECT `myindex`.doc_id
				FROM mywords
				JOIN `myindex` ON `myindex`.word_id = mywords.id
				WHERE mywords.word LIKE 'emmi%'
			) AS results1
		JOIN `myindex` ON `myindex`.doc_id = results1.doc_id
		JOIN mywords ON `myindex`.word_id = mywords.id
		WHERE mywords.word LIKE 'sam%'
	) AS results2
JOIN `myindex` ON `myindex`.doc_id = results2.doc_id
JOIN mywords ON `myindex`.word_id = mywords.id
WHERE mywords.word LIKE 'bree%'

Für die 3 Suchworte komme ich damit auf ca. 3 Sekunden Suchzeit, was auch noch nicht wirklich prickelnd ist.

Mit EXISTS dauert die Suche auch ein paar Sekunden:

Code:

SELECT *
FROM `myindex` AS in2
JOIN mywords AS wo2 ON in2.word_id = wo2.id
WHERE 1
	AND wo2.word LIKE 'emmi%'
	AND EXISTS
	(
		SELECT *
		FROM `myindex` AS in1
		JOIN mywords AS wo1 ON in1.word_id = wo1.id
		WHERE 1
			AND in1.doc_id = in2.doc_id
			AND wo1.word LIKE 'sam%'
	)
	AND EXISTS
	(
		SELECT *
		FROM `myindex` AS in1
		JOIN mywords AS wo1 ON in1.word_id = wo1.id
		WHERE 1
			AND in1.doc_id = in2.doc_id
			AND wo1.word LIKE 'bree%'
	)

Mit folgender Suche läuft es bisher am schnellsten:

Code:

SELECT `myindex`.doc_id
FROM mywords
JOIN `myindex` ON `myindex`.word_id = mywords.id
WHERE
	(
		mywords.word LIKE 'emmi%' OR
		mywords.word LIKE 'sam%' OR
		mywords.word LIKE 'bree%'
	)
GROUP BY `myindex`.doc_id HAVING COUNT( word_id )=3

Damit komme ich auf jeden Fall unter eine Sekunde, wenn mySQL schon ein bisschen gecached hat.

Jetzt meine Frage: Habt ihr noch weitere Ideen für die Such-Queries? Also wie man das ganze noch beschleunigen könnte?

Gruss,
M_Kay

Yaslaw · 19. Januar 2011

Wie sehen denn die Indexe auf den Tabellen aus?
Was gibt EXPLAIN() zurück?

Alle SQLs ausser dem letzten sind Schrott und verbieten sich von selbst
Das folgende müsste etwa gleich schnell sein

PHP:

SELECT 
	`myindex`.doc_id
FROM 
	(
		SELECT
			id
		FROM
			mywords
		WHERE
	        word LIKE 'emmi%' OR
	        word LIKE 'sam%' OR
	        word LIKE 'bree%'
	) AS mywords_selected
	LEFT JOIN  myindex
		ON myindex.word_id = mywords_selected.id
GROUP BY 
	`myindex`.doc_id 
HAVING 
	COUNT(word_id )=3

in mywords würde ich einen eindeutigen Index auf die Felder id und word setzen. Ev. kannst du da ein INDEX ONLY SCAN Zugriff auf die mywords hinkriegen. Sprich, er durchsucht nur den Index und nicht braucht keinen Tabellenzugriff. Min. bei Oracle geht das. Obs bei MySQL auc geht weiss ich grad nicht.
Im EXPLAIN() müsste dann unter TYPE INDEX stehen und unter EXTRA müsste USING INDEX stehen. (http://dev.mysql.com/doc/refman/5.1/de/explain.html).
ggf. mal mit FORCE INDEX() den entsprechenden Index erzwingen.

Ebenfalls ein eindeutiger Index auf die 2 id-Feldern in der myindex setzen.

So, das waren alles nur Ideen....

M_Kay · 20. Januar 2011

Danke für deine Antwort, yaslaw.
Die Indizes hatte ich vergessen zu erwähnen. Die sehen so aus:

Code:

myindex:
UNIQUE KEY `word_id` (`word_id`,`doc_id`),
KEY `word_id_2` (`word_id`)

mywords:
PRIMARY KEY (`id`),
UNIQUE KEY `word` (`word`)

Die Indizes werden auch genutzt, sonst würde alles noch wesentlich länger dauern:

Code:

select_type	table 		type	possible_keys		key 		key_len	ref				rows	Extra 
SIMPLE 	mywords 	range 	PRIMARY,word 		word		202		NULL			54		Using where; Using myindex; Using temporary; Using filesort
SIMPLE 	myindex 	ref 	word_id,word_id_2 	word_id 	4		db.mywords.id	10		Using myindex

Der Such-Query selbst braucht jetzt so ca. 0,05 sec.

Nun möchte ich dazu aber auch noch die Informationen zu diesen Dokumenten laden. Dies habe ich einmal versucht, indem ich einfach weitere Joins dazugenommen habe und einmal den Such-Query als Unter-Query:

1. Such-Query als Sub-Query:

Code:

SELECT x y z usw
FROM (
	SELECT `index`.doc_id
	FROM index_words
	JOIN `index` ON `index`.word_id = index_words.id
	WHERE (
		index_words.word LIKE 'emmi%' OR
		index_words.word LIKE 'sam%'
	)
	GROUP BY `index`.doc_id HAVING COUNT( word_id )='2'
) AS searchindex
JOIN documents ON documents.id = searchindex.doc_id
JOIN documents_infos ON documents.id=documents_infos.doc_id
JOIN documents_moreinfos ON documents.time=documents_moreinfos.time
LIMIT 0,21

Der Explain:

Code:

id 	select_type 	table 				type 	possible_keys 		key 		key_len 	ref 										rows 	Extra
1 	PRIMARY 		<derived2> 			ALL 	NULL 				NULL 		NULL 		NULL 										2203 	Using temporary; Using filesort
1 	PRIMARY 		documents 			eq_ref 	PRIMARY,time 		PRIMARY 	4 			searchindex.doc_id 							1 		Using where
1 	PRIMARY 		documents_moreinfos	eq_ref 	time 				time 		21 			db.documents.time 	1
1 	PRIMARY 		documents_infos 	eq_ref 	doc_id 				doc_id 		4 			db.documents.id 							1 		Using where
2 	DERIVED 		mywords 			range 	PRIMARY,word 		word 		202 		NULL 										54 		Using where; Using myindex; Using temporary; Using filesort
2 	DERIVED 		myindex 			ref 	word_id,word_id_2 	word_id 	4 			db.mywords.id 								10 		Using myindex

2. Mit direkten Joins sieht der explain denke ich besser aus, allerdings ist diese Abfrage wesentlich langsamer:

Code:

SELECT x y z usw
FROM index_words
JOIN `index` ON `index`.word_id = index_words.id
JOIN documents ON documents.id = `index`.doc_id
JOIN documents_infos ON documents.id=documents_infos.doc_id
JOIN documents_moreinfos ON documents.time=documents_moreinfos.time
WHERE (
	index_words.word LIKE 'emmi%' OR
	index_words.word LIKE 'sam%'
)
GROUP BY `index`.doc_id HAVING COUNT( word_id )='2'
LIMIT 0,21

Der Explain:

Code:

id 	select_type 	table 				type 	possible_keys 		key 	key_len 	ref 						rows 	Extra
1 	SIMPLE 			mywords 			range 	PRIMARY,word 		word 	202 		NULL 						54 		Using where; Using myindex; Using temporary; Using f...
1 	SIMPLE 			myindex 			ref 	word_id,word_id_2 	word_id 4 			db.mywords.id 				10 		Using myindex
1 	SIMPLE 			documents 			eq_ref 	PRIMARY,time 		PRIMARY 4 			db.myindex.doc_id 			1 		Using where
1 	SIMPLE 			documents_infos 	eq_ref 	doc_id 				doc_id 	4 			db.myindex.doc_id 			1 	 
1 	SIMPLE 			documents_moreinfos eq_ref 	time 				time 	21 			db.documents.time 	1

Habt ihr vllt eine Idee, wie es noch schneller geht? Indizes werden ja immer genutzt.

Indizierte Suche

M_Kay

Mitglied

Yaslaw

alter Rempler

M_Kay

Mitglied

Neue Beiträge