Saturday, December 26, 2009
Perfect Syncopation
Look at this:
18204 total word(s)
17369 word(s) found
20 word(s) not found
815 word(s) ignored
0.11% of words not found
4.48% of words ignored
3264 unique word(s)
But what does it mean???
Well, I've just run the word analysis tool on Livy's Ab Urbe Condita, Book 2. The important thing to note is that out of roughly eighteen thousand words, only 20 could not be parsed and found in the dictionary. That's pretty much amazing.
How did this happen? Well, two things had to happen. First, I now ignore capitalized words that aren't found in the dictionary. Essentially, I'm ignoring proper names and place names. Second, I programmed Numen to parse syncopated perfect verbs: laudasse (laudavisse), norat (noverat), et cetera.
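To make the syncopation idea concrete, here is a minimal sketch of one way such forms could be expanded before a dictionary lookup. It is not Numen's actual code: the names KNOWN_FORMS, expansions, and parse_syncopated are made up for illustration, and the tiny set of known forms stands in for a real dictionary. The idea is simply to re-insert the dropped "vi" or "ve" and see whether the result is a known form.

```python
import re

# Hypothetical stand-in for the dictionary of full (unsyncopated) forms.
KNOWN_FORMS = {"laudavisse", "noverat", "audivisse", "amaverunt"}


def expansions(word):
    """Yield candidate full forms for a possibly syncopated perfect."""
    w = word.lower()
    # laudasse -> laudavisse, audisse -> audivisse: restore "vi" before ss/st
    for m in re.finditer(r"[aeio](?=ss|st)", w):
        yield w[:m.end()] + "vi" + w[m.end():]
    # norat -> noverat, amarunt -> amaverunt: restore "ve" before r
    for m in re.finditer(r"[aeo](?=r)", w):
        yield w[:m.end()] + "ve" + w[m.end():]


def parse_syncopated(word, known=KNOWN_FORMS):
    """Return the known full form for `word`, or None if nothing matches."""
    w = word.lower()
    if w in known:
        return w
    for candidate in expansions(w):
        if candidate in known:
            return candidate
    return None
```

Because every candidate is checked against the dictionary, a wrong guess (say, expanding esse) is simply discarded rather than reported as a match.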
I still have a bit of testing to do to make sure I didn't break anything, but this was one of the few major hurdles that I needed to overcome to get a nearly perfect parsing engine!
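For anyone curious how a report like the one above might be tallied, here is a rough sketch. The function names and the is_known callback are my own inventions for this example, not the real tool; the only rules taken from this post are that capitalized words missing from the dictionary are ignored and that percentages are taken over the total word count.

```python
def analyze(words, is_known):
    """Count found, ignored, and not-found words in a list of tokens."""
    found = ignored = not_found = 0
    for w in words:
        if is_known(w.lower()):
            found += 1
        elif w[:1].isupper():
            # Capitalized and not in the dictionary: treat as a proper name.
            ignored += 1
        else:
            not_found += 1
    return {
        "total": found + ignored + not_found,
        "found": found,
        "ignored": ignored,
        "not_found": not_found,
        "unique": len({w.lower() for w in words}),
    }


def percent(part, total):
    """Percentage of `part` in `total`, rounded to two decimals."""
    return round(100 * part / total, 2) if total else 0.0
```

With the numbers from the report, percent(20, 18204) gives 0.11 and percent(815, 18204) gives 4.48.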
Labels: accuracy, features, paradigms, verbs
Thursday, November 5, 2009
IE8 Flashcard Bug Fixed
Salvete omnes!
I fixed a small bug that affected flashcard decks in Internet Explorer 8 (and presumably earlier versions). If you couldn't create a flashcard deck in that browser, it should be fixed now!
As a side note, I've been working on a big project with Livy and Vergil. I've essentially been editing all the mistakes and unfound words in those authors. This is especially useful in Livy because we have a corpus of about 1 million words! So the accuracy of this dictionary is creeping up to the highest possible levels! With the exception of proper names and place names, I'll ballpark its accuracy with common classical authors at about 95%.
Also, thanks to the people who have been reporting errors and bugs! It's really helpful to have your feedback!
Labels: accuracy, bugs, internet explorer, livy, vergil
Wednesday, October 14, 2009
J's and U's Updated / Speed Increases
I mentioned a few weeks ago that I planned on making I's/J's and U's/V's look the same on the back-end, while preserving their traditional orthographies on the front-end. I've just completed this task!
My main motivation for making this update is that certain passages stored in The Latin Library reflect the older conventions of using J's for consonantal I's or U's for both consonantal and vocalic V's. Numen's parsing engine was having trouble recognizing forms like jecit (iecit) and uiuus (vivus). As a result -- after a bit of work -- the engine now recognizes more possibilities than ever. Incidentally, internally J's are stored as I's and U's are stored as V's.
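As a quick illustration of the back-end rule (J's stored as I's, U's stored as V's), a folding function might look something like the sketch below. The name normalize is hypothetical, and the real engine surely does more than this.

```python
# Fold J -> I and U -> V, preserving case, so spelling variants compare equal.
_FOLD = str.maketrans({"j": "i", "J": "I", "u": "v", "U": "V"})


def normalize(word):
    """Return the internal spelling of `word`, using only I's and V's."""
    return word.translate(_FOLD)
```

Under this scheme jecit and iecit both become iecit, and uiuus and vivus both become vivvs, which is why the folded form is only ever used internally and never shown to readers.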
Another project I completed at the same time is an order-of-magnitude speed improvement for parsing. I was trying to figure out ways to make the engine faster and I discovered a shortcut that boosts speed tremendously. The engine used to spend between 250ms and 500ms parsing each word! That was always disappointing to me, but I had gotten around the problem by caching the results. Now, however, word parsing takes about 25ms!
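For readers wondering what "caching the results" looks like in practice, here is a small sketch of the general idea using memoization. It only illustrates the old workaround, not the new shortcut, and slow_parse is a made-up placeholder for the expensive parsing step.

```python
from functools import lru_cache

# Counts how often the expensive step actually runs (for demonstration only).
SLOW_CALLS = {"count": 0}


def slow_parse(word):
    """Hypothetical stand-in for an expensive parse of a single word."""
    SLOW_CALLS["count"] += 1
    return f"parsed:{word}"


@lru_cache(maxsize=None)
def cached_parse(word):
    """Parse `word` once; repeated requests are served from the cache."""
    return slow_parse(word)
```

Caching only helps with words that have been seen before, which is why making the first parse itself faster still matters.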
Why bother improving the speed? Because soon I will be implementing word lists and frequency lists! A word list, of course, is just a "mini-lexicon" that defines only the words in your chosen passage, and a frequency list is a list of words in order of how often they appear in a passage. The word list will be helpful to quickly work on vocabulary for a passage, and a frequency list will help Latin students study more effectively by giving them the most frequent words first. I'm very excited about this feature, but I don't anticipate it will be done before January 10th (giving me the winter holiday to work on it).
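Here is a rough sketch of what the two lists could look like. Everything in it is an assumption for illustration: the function names, the simple tokenizer, and the lexicon argument (a plain mapping from word to definition). It also counts surface forms as they appear, whereas a real frequency list would presumably group inflected forms under their dictionary entry.

```python
import re
from collections import Counter


def tokenize(passage):
    """Split a passage into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", passage.lower())


def frequency_list(passage):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokenize(passage)).most_common()


def word_list(passage, lexicon):
    """Return a mini-lexicon covering only the words in the passage."""
    return {w: lexicon[w] for w in sorted(set(tokenize(passage))) if w in lexicon}
```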
That's all for now!
Labels: accuracy, database, development, features, frequency lists, google cache, orthography, parsing engine, performance, slowness, vergil, word lists
Thursday, October 1, 2009
I's and J's and U's and V's
So one problem with Numen is that it doesn't recognize the different possibilities when dealing with I's and J's and U's and V's. As you know, the J and the U were not Classical Latin letters. There has been a lot of back-and-forth over the past 200 years -- some editors prefer the originals and some prefer the modern versions.
But how should Numen deal with this issue? Internally, the computer is more precise and less forgiving than a human, and so in order to provide highly sensitive and accurate searches, the data needs to be "normalized". For example, I recently normalized verbs for consistency by changing all deponent verbs into their active forms and simply marking them as deponent with a data flag. Now, when you search for a deponent verb, the flashcard still shows something like sequor but internally it's stored as sequo. The reasoning here is simple: deponent verbs, regardless of their dictionary form and traditional morphology, still have active participles and their imperfect/pluperfect subjunctives are still formed from active infinitives.
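To picture the deponent flag, here is a minimal sketch of a record that stores the active-style form and rebuilds the familiar headword for display. The Verb class and its field names are hypothetical, and the real database schema is certainly different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verb:
    stored: str             # normalized active-style form, e.g. "sequo"
    deponent: bool = False  # data flag marking the verb as deponent

    def display(self):
        """Return the dictionary form a user expects to see."""
        # A deponent headword is the stored form plus the passive ending -r.
        return self.stored + "r" if self.deponent else self.stored
```

So Verb("sequo", deponent=True).display() gives back sequor, while an ordinary verb such as Verb("laudo") displays unchanged.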
But what about the I's and J's? Those are easy. Convert all the J's to I's, and most Latin readers won't have a problem -- this has been the convention for quite some time now. But then what about the V's and U's? Should I convert all the U's to V's? The opposite is true here: most Latinists would be mildly irritated by this form: uiuus (vivus).
The solution, similar to the one for the deponent problem, is to store everything internally with only I's and V's but show the contemporary I's, U's, and V's to end users. That way, the computer can do accurate searches, but users get the information they are used to.
So, in the coming weeks, Numen will undergo this under-the-hood transformation. For the most part, users will never even notice -- except in one area. Searching for uiuus will be the same as searching for vivus!
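A small sketch of how that search behavior could work: headwords keep their familiar spelling for display, but they are indexed under a folded key that contains only I's and V's. The Lexicon class and search_key function are invented for this example and are not the actual implementation.

```python
def search_key(text):
    """Fold a query or headword to the internal I/V-only spelling."""
    return text.lower().replace("j", "i").replace("u", "v")


class Lexicon:
    """Toy index: folded key -> headwords in their display spelling."""

    def __init__(self):
        self._index = {}

    def add(self, headword):
        self._index.setdefault(search_key(headword), []).append(headword)

    def search(self, query):
        # Any spelling variant of the query folds to the same key.
        return self._index.get(search_key(query), [])
```

After adding vivus, a search for uiuus, vivus, or VIVVS returns the same entry, still spelled vivus.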
Labels: accuracy, active, deponents, morphology, orthography, passive
