machine learning - Log likelihood to implement Naive Bayes for Text Classification -


I am implementing the Naive Bayes algorithm for text classification. I have ~1000 documents for training and 400 documents for testing. I believe I have implemented the training correctly, but I am confused about the testing part. Here is what I have done:

In my training phase:

  vocabularySize = GetUniqueTermsInCollection(); // get all the unique words in the entire collection
  spamModelArray[vocabularySize];
  nonspamModelArray[vocabularySize];
  for each training_file {
      class = GetClassLabel();   // 0 for spam, 1 for non-spam documents
      document = GetDocumentID();
      counterTotalTrainingDocs++;
      if (class == 0) { counterTotalSpamTrainingDocs++; }
      for each word in the document {
          freq = GetTermFrequency(); // how often this word appears in this document
          id = GetTermID();          // the unique ID of the word
          if (class == 0) { // spam
              spamModelArray[id] += freq;
              totalNumberofSpamWords += freq; // total number of word occurrences in spam training docs
          } else {          // non-spam
              nonspamModelArray[id] += freq;
              totalNumberofNonSpamWords += freq; // total number of word occurrences in non-spam training docs
          }
      }
  }
  for i in vocabularySize {
      spamModelArray[i] = spamModelArray[i] / totalNumberofSpamWords;
      nonspamModelArray[i] = nonspamModelArray[i] / totalNumberofNonSpamWords;
  }
  priorProb = counterTotalSpamTrainingDocs / counterTotalTrainingDocs; // prior probability of spam documents
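The training step above might be sketched in Python as follows. The input format (a list of `(class_label, term_counts)` pairs, with `term_counts` mapping term ID to frequency) is my assumption, and I've added Laplace (add-one) smoothing, which the pseudocode doesn't have, so that words unseen in one class don't later produce `log(0)`:

```python
def train_naive_bayes(training_docs, vocabulary_size):
    """training_docs: list of (class_label, term_counts) pairs, where
    class_label is 0 for spam / 1 for non-spam and term_counts maps
    term_id -> frequency in that document (hypothetical input format)."""
    spam_counts = [0] * vocabulary_size
    nonspam_counts = [0] * vocabulary_size
    total_spam_words = total_nonspam_words = 0
    total_docs = spam_docs = 0

    for class_label, term_counts in training_docs:
        total_docs += 1
        if class_label == 0:
            spam_docs += 1
        for term_id, freq in term_counts.items():
            if class_label == 0:
                spam_counts[term_id] += freq
                total_spam_words += freq
            else:
                nonspam_counts[term_id] += freq
                total_nonspam_words += freq

    # Laplace (add-one) smoothing -- an addition to the original scheme,
    # which divides raw counts and therefore yields zero probabilities
    # for words never seen in a class.
    spam_model = [(c + 1) / (total_spam_words + vocabulary_size)
                  for c in spam_counts]
    nonspam_model = [(c + 1) / (total_nonspam_words + vocabulary_size)
                     for c in nonspam_counts]
    prior_spam = spam_docs / total_docs
    return spam_model, nonspam_model, prior_spam
```

With smoothing, every entry of the two model arrays is strictly positive, so taking logarithms at test time is always safe.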

I think I understand and have implemented the training correctly, but I'm not sure I am doing the testing part properly. Here, I am trying to go through each test document and calculate logP(spam|d) and logP(non-spam|d) for each document.

  vocabularySize = GetUniqueTermsInCollection(); // all the unique words in the entire collection
  for each test_file {
      document = GetDocumentID();
      logProbabilityofSpam = 0;
      logProbabilityofNonSpam = 0;
      for each word in the document {
          freq = GetTermFrequency(); // how often this word appears in this document
          id = GetTermID();          // the unique ID of the word
          // logP(w1 w2 .. wn) = sum_j c(wj) * logP(wj)
          logProbabilityofSpam += freq * log(spamModelArray[id]);
          logProbabilityofNonSpam += freq * log(nonspamModelArray[id]);
      }
      // now decide whether this document is spam:
      // argmax_k [ logP(d | Ck) + logP(Ck) ]
      if (logProbabilityofNonSpam + log(1 - priorProb) > logProbabilityofSpam + log(priorProb)) {
          newclass = 1; // non-spam
      } else {
          newclass = 0; // spam
      }
  }
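The same test loop can be sketched in Python. It assumes per-class word probability arrays and a spam prior are already available (the names `spam_model`, `nonspam_model`, `prior_spam` and the `term_counts` dict format are my assumptions, and the word probabilities must be nonzero for every term that appears in a test document):

```python
import math

def score_document(term_counts, spam_model, nonspam_model, prior_spam):
    """Return (logP(d, spam), logP(d, non-spam)) for one document,
    where term_counts maps term_id -> frequency (hypothetical format)."""
    log_spam = math.log(prior_spam)
    log_nonspam = math.log(1.0 - prior_spam)
    for term_id, freq in term_counts.items():
        log_spam += freq * math.log(spam_model[term_id])
        log_nonspam += freq * math.log(nonspam_model[term_id])
    return log_spam, log_nonspam

def classify(term_counts, spam_model, nonspam_model, prior_spam):
    log_spam, log_nonspam = score_document(term_counts, spam_model,
                                           nonspam_model, prior_spam)
    # argmax_k [ logP(d | C_k) + logP(C_k) ]
    return 0 if log_spam > log_nonspam else 1  # 0 = spam, 1 = non-spam
```

Note that the prior is folded into each log-score up front; this is equivalent to adding log(priorProb) and log(1 - priorProb) in the final comparison, as the pseudocode does.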

My problem is: I am getting exactly 1 and 0 (I want to return the probability of each class rather than a hard spam / non-spam decision). For example, I would like to see newclass = 0.8684212 so I can apply a threshold later. But I am confused here: how can I calculate a probability for each document? Can I use the log-probabilities to calculate it?

The probability of the data described by the feature set { F1, F2, ..., Fn } belonging to class C is, according to the naive Bayes model,

  P(C | F1, ..., Fn) = P(C) * ( P(F1 | C) * P(F2 | C) * ... * P(Fn | C) ) / P(F1, ..., Fn)

You have all of these terms (in logarithmic form) except the factor 1 / P(F1, ..., Fn), because that term is not used in the naive Bayes classifier you have implemented (strictly speaking, a maximum a posteriori classifier).

You would have to collect the feature frequencies as well, and compute from them

  P(F1, ..., Fn) = P(F1) * ... * P(Fn)
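Alternatively (this is my addition, not part of the answer above), for a two-class problem the denominator can be obtained by normalizing over the classes, since P(F1, ..., Fn) = P(F1, ..., Fn | spam)P(spam) + P(F1, ..., Fn | non-spam)P(non-spam). Given the two per-document log-joint scores that the test loop already computes, a minimal sketch (the function name `posterior_spam` is mine) that does the normalization in log space to avoid underflow:

```python
import math

def posterior_spam(log_score_spam, log_score_nonspam):
    """Convert logP(d, spam) and logP(d, non-spam) into P(spam | d).
    Subtracting the max first (log-sum-exp trick) keeps exp() from
    underflowing when both log-scores are large negative numbers."""
    m = max(log_score_spam, log_score_nonspam)
    p_spam = math.exp(log_score_spam - m)
    p_nonspam = math.exp(log_score_nonspam - m)
    return p_spam / (p_spam + p_nonspam)
```

This yields exactly the kind of value the question asks for, e.g. `posterior_spam(-100.0, -102.0)` gives a probability near 0.88 that can be compared against a threshold.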
