Regular expression for an HTML select tag

zer0 · 29 September 2011

Hello,

I have some HTML source containing a select tag. It holds several option tags, and I need to pick the last one and parse out its value.

But my regex just won't work.

Here is an excerpt from the HTML:

HTML:

<select name="offlinePeriod" class="listbox">
								
	<option value="0" selected="selected">01-Aug-2009 till 31-Aug-2009</option>
								
	<option value="1">01-Sep-2009 till 30-Sep-2009</option>
								
	<option value="2">01-Oct-2009 till 31-Oct-2009</option>
								
	<option value="3">01-Nov-2009 till 30-Nov-2009</option>
								
	<option value="4">01-Dec-2009 till 31-Dec-2009</option>
								
	<option value="5">01-Jan-2010 till 31-Jan-2010</option>
								
	<option value="6">01-Feb-2010 till 28-Feb-2010</option>
								
	<option value="7">01-Mar-2010 till 31-Mar-2010</option>
								
	<option value="8">01-Apr-2010 till 30-Apr-2010</option>
								
	<option value="9">01-May-2010 till 31-May-2010</option>
								
	<option value="10">01-Jun-2010 till 30-Jun-2010</option>
								
	<option value="11">01-Jul-2010 till 31-Jul-2010</option>
								
	<option value="12">01-Aug-2010 till 31-Aug-2010</option>
								
	<option value="13">01-Sep-2010 till 30-Sep-2010</option>
								
	<option value="14">01-Oct-2010 till 31-Oct-2010</option>
								
	<option value="15">01-Nov-2010 till 30-Nov-2010</option>
								
	<option value="16">01-Dec-2010 till 31-Dec-2010</option>
								
	<option value="17">01-Jan-2011 till 31-Jan-2011</option>
								
	<option value="18">01-Feb-2011 till 28-Feb-2011</option>
								
	<option value="19">01-Mar-2011 till 31-Mar-2011</option>
								
	<option value="20">01-Apr-2011 till 30-Apr-2011</option>
								
	<option value="21">01-May-2011 till 31-May-2011</option>
								
	<option value="22">01-Jun-2011 till 30-Jun-2011</option>
							
	<option value="23">01-Jul-2011 till 31-Jul-2011</option>
								
	<option value="24">01-Aug-2011 till 31-Aug-2011</option>
								
</select>

From this I need the value of the last option tag, i.e. 24!

This is what my regex looks like:

Code:

<option value=\"([0-9]{2})\">.*?<\\/option>\\s*<\\/select>

But it doesn't work for me. Can someone help me out with the regex, or does anyone know a Java library for working with HTML?

CPoly · 29 September 2011

Java:

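// No DOTALL flag: "." does not match line breaks, so ".*?" stays within a single option line
Pattern p = Pattern.compile("<option value=\"([0-9]+)\">.*?</option>\\s*</select>", Pattern.MULTILINE);
Matcher m = p.matcher(html);

// find() searches for the pattern anywhere in the input
if (m.find()) {
	System.out.println(m.group(1));
}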

Edit: I only just noticed that your expression is identical :-D
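As a hedged illustration of how this expression behaves, the sketch below runs it with `find()` against a shortened stand-in for the select block. The class name `LastOptionRegexDemo`, the helper `firstGroup`, and the abbreviated HTML are assumptions made for the example and do not come from the thread. It also shows that adding `Pattern.DOTALL` changes the result, because `.*?` may then run across line breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastOptionRegexDemo {
    // Shortened stand-in for the select block from the question (assumption).
    static final String HTML =
            "<select name=\"offlinePeriod\" class=\"listbox\">\n"
          + "\t<option value=\"0\" selected=\"selected\">01-Aug-2009 till 31-Aug-2009</option>\n"
          + "\t<option value=\"1\">01-Sep-2009 till 30-Sep-2009</option>\n"
          + "\t<option value=\"23\">01-Jul-2011 till 31-Jul-2011</option>\n"
          + "\t<option value=\"24\">01-Aug-2011 till 31-Aug-2011</option>\n"
          + "</select>";

    static final String REGEX = "<option value=\"([0-9]+)\">.*?</option>\\s*</select>";

    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String html, int flags) {
        Matcher m = Pattern.compile(REGEX, flags).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Default flags: "." stops at line breaks, so only the last option can reach </select>.
        System.out.println(firstGroup(HTML, 0));              // 24
        // DOTALL: ".*?" spans lines, so the first option of the form value="N"> already matches.
        System.out.println(firstGroup(HTML, Pattern.DOTALL)); // 1
    }
}
```

The sketch relies on each option sitting on its own line, as in the posted excerpt. If the markup arrived on a single line, the default-flag variant would behave like the DOTALL one.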

zer0 · 29 September 2011

I've got it working now, but this isn't the textbook solution I was after.

I'd like a more elegant solution; if someone manages that, I'd be very grateful!

Java:

            // First, find the select block
            String periodRegex = "<select name=\"offlinePeriod\" class=\"listbox\">(.*?)</select>";
            Pattern p = Pattern.compile(periodRegex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
            Matcher m = p.matcher(content);

            if (m.find()) {
                String periodString = m.group(1).trim();
                // Then look for the last option tag; the trailing $ anchors the match
                // at the end of the trimmed block. [^>]* (rather than [^>]+) after the
                // value attribute also accepts tags without further attributes,
                // such as <option value="24">.
                String optionRegex = "<option[^>]+value=\"([0-9]+)\"[^>]*>([^>]+)</option>$";

                p = Pattern.compile(optionRegex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
                m = p.matcher(periodString);

                if (m.find()) {
                    String period = m.group(1);
                    return;
                }
            }
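One possibly simpler variant of this approach, offered only as a sketch: instead of anchoring on the end of the block, iterate over every option tag with `find()` and keep the most recent hit. The class `LastOptionValue`, the method `lastOptionValue`, and the shortened HTML are hypothetical names and data for illustration. The sketch assumes the input contains only the select block of interest; with several select elements on the page, the block would have to be isolated first, as in the code above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastOptionValue {
    // Matches an opening option tag and captures its numeric value attribute.
    private static final Pattern OPTION_VALUE =
            Pattern.compile("<option\\b[^>]*\\bvalue=\"([0-9]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns the value of the last option tag in html, or null if there is none.
    static String lastOptionValue(String html) {
        Matcher m = OPTION_VALUE.matcher(html);
        String last = null;
        while (m.find()) {
            last = m.group(1); // overwritten on every hit, so the final hit wins
        }
        return last;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the select block from the question (assumption).
        String html = "<select name=\"offlinePeriod\" class=\"listbox\">\n"
                + "\t<option value=\"0\" selected=\"selected\">01-Aug-2009 till 31-Aug-2009</option>\n"
                + "\t<option value=\"24\">01-Aug-2011 till 31-Aug-2011</option>\n"
                + "</select>";
        System.out.println(lastOptionValue(html)); // 24
    }
}
```

This avoids the `$` anchor and the `trim()` call, at the cost of scanning every option tag once.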

Thomas Darimont · 29 September 2011

Hello,

how about using XPath instead of a regex?

Java:

package de.tutorials.training;

import java.io.InputStream;

import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

public class HtmlDomElementExtraction {
	public static void main(String[] args) throws Exception {
		String xpath = "(//select[@name='offlinePeriod' and @class='listbox']/option)[last()]/text()";
		InputStream htmlResource = HtmlDomElementExtraction.class.getClassLoader().getResourceAsStream("test.html");
		
		String result = evaluateXPath(xpath, htmlResource);
		
		System.out.println("Result: " + result);
	}

	private static String evaluateXPath(String xpath, InputStream htmlResource)
			throws XPathExpressionException {
		return XPathFactory
				.newInstance()
				.newXPath()
				.compile(xpath)
				.evaluate(
						new InputSource(htmlResource));
	}
}

Well-formed (X)HTML can also be processed with Java's standard built-in XML tools (DOM/SAX/StAX parsers).
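The XPath above ends in `text()`, so it returns the label of the last option, whereas the original question asked for its value attribute. A minimal sketch of that variant follows, assuming the markup is well-formed and available as a string. The class `LastOptionValueXPath`, the method `lastOptionValue`, and the shortened markup are hypothetical names and data for the example.

```java
import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

public class LastOptionValueXPath {
    // Selects the value attribute of the last option inside the named select.
    private static final String XPATH =
            "(//select[@name='offlinePeriod']/option)[last()]/@value";

    // Returns the attribute value, or an empty string if no option is found.
    static String lastOptionValue(String xhtml) {
        XPath xp = XPathFactory.newInstance().newXPath();
        try {
            return xp.evaluate(XPATH, new InputSource(new StringReader(xhtml)));
        } catch (XPathExpressionException e) {
            // Raised for a bad expression or input that is not well-formed XML.
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        // Shortened, well-formed stand-in for the select block (assumption).
        String xhtml = "<select name=\"offlinePeriod\" class=\"listbox\">"
                + "<option value=\"0\" selected=\"selected\">01-Aug-2009 till 31-Aug-2009</option>"
                + "<option value=\"24\">01-Aug-2011 till 31-Aug-2011</option>"
                + "</select>";
        System.out.println(lastOptionValue(xhtml)); // 24
    }
}
```

Real-world pages are often not well-formed XML, in which case this call throws instead of returning a value.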

For HTML that is not well-formed, you can also use javax.swing.text.html.HTMLDocument or (better) one of the libraries listed here:
http://java-source.net/open-source/html-parsers

Regards, Tom
