Terrier Core

PorterStemmer doesn't match the expected output published by Porter himself

Details

  • Type: Bug
  • Status: Resolved
  • Priority: Major
  • Resolution: Fixed
  • Affects Version/s: None
  • Fix Version/s: 3.0
  • Component/s: None
  • Description:
    Martin Porter's website provides some test cases for stemming. Our Porter stemmer predates the Porter stemmer in Java, as it was hand-coded by Gianni. It has some known points of difference from Porter's algorithm.

    Below is a list of terms that our current stemmer stems differently from Porter's.
    These are just the terms starting with "a":

    "abruption", "acquisition", "addiction", "addition", "additions", "admission",
    "admonition", "adoption", "affection", "affections", "affliction", "afflictions",
    "allusion", "ambition", "ambitions", "apparition", "apparitions",
    "apprehension", "apprehensions", "ascension", "aspersion", "assumption", "assumptions",
    "attention", "attraction", "attribution"

    For these terms, it seems that we either remove one character too many or do not remove anything at all.
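
A comparison of this kind can be automated. The following is a minimal sketch, not the code used for this issue: it assumes the test cases are available as two line-aligned lists (input words and the stems Porter expects), and the class name `StemmerCheck`, the method `mismatches`, and the identity placeholder stemmer are all hypothetical. The stemmer under test is passed in as a plain function so that no particular Terrier API is assumed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class StemmerCheck {

    /**
     * Returns one report line per word whose stem differs from the reference.
     * words and expected must be line-aligned: expected.get(i) is the
     * reference stem for words.get(i).
     */
    public static List<String> mismatches(List<String> words,
                                          List<String> expected,
                                          Function<String, String> stemmer) {
        if (words.size() != expected.size()) {
            throw new IllegalArgumentException(
                "word list and expected-stem list must have the same length");
        }
        List<String> report = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String actual = stemmer.apply(words.get(i));
            if (!actual.equals(expected.get(i))) {
                report.add(words.get(i) + ": expected " + expected.get(i)
                    + ", got " + actual);
            }
        }
        return report;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: file with one input word per line (assumed layout)
        // args[1]: file with the reference stem on the matching line (assumed layout)
        List<String> words = Files.readAllLines(Paths.get(args[0]));
        List<String> expected = Files.readAllLines(Paths.get(args[1]));

        // Placeholder only: replace with a call into the stemmer under test.
        Function<String, String> stemmer = w -> w;

        mismatches(words, expected, stemmer).forEach(System.out::println);
    }
}
```

Printing the expected and actual stems side by side should make it easy to see, per term, whether a character too many was removed or nothing was removed.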

Activity

Craig Macdonald added a comment - 28/Jan/10 7:53 PM - edited

Changing the stemmer will potentially destroy usability of all TRv3 indices we have currently. Everyone else has to re-index anyway.

Here are the options:

  • do nothing
  • add Porter's actual stemmer as another non-default option
  • add Porter's stemmer as default option - we should do this at a major version change. Also, it would be good to see if performance was positively impacted for any of our test collections.
Rodrygo L. T. Santos added a comment - 18/Feb/10 6:18 PM

I second the idea of having Porter's correct implementation as the default option (and maybe providing the current one as a deprecated version, just for backwards compatibility). Also, as we discussed, this is the best opportunity for correcting this, since indices will change anyway with TRv3. The only disadvantage is indeed having to rebuild our own TRv3 indices.

Craig Macdonald added a comment - 05/Mar/10 10:18 AM

Resolved.

I have replaced PorterStemmer and WeakPorterStemmer with Porter's own implementation.
TRv2 implementations have become TRv2PorterStemmer and TRv2WeakPorterStemmer. If you have indices based on these, you need to update your property files NOW.
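
For an index built with the old stemmer, the property-file update might look like the fragment below. This is a sketch under assumptions not stated in this issue: that the stemmer is selected through the `termpipelines` property, and that the existing value is `Stopwords,PorterStemmer`. Check your own property file for the actual key and value.

```properties
# Assumed previous setting: after this change it resolves to Porter's own implementation
#termpipelines=Stopwords,PorterStemmer

# Assumed replacement for indices built with the old (TRv2) stemmer
termpipelines=Stopwords,TRv2PorterStemmer
```

The same substitution would apply to WeakPorterStemmer and TRv2WeakPorterStemmer.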


People

Dates

  • Created: 28/Jan/10 7:48 PM
  • Updated: 05/Mar/10 5:34 PM
  • Resolved: 05/Mar/10 10:18 AM