2015-04-13 :-(
_ Afternoon
1300 Work
_ [tdiary][mecab][morphological analysis] tDiary now has DiaryContainer, so I counted word frequencies over the past year of my diary
Environment
- A recent NetBSD
- A recent MeCab
- A recent MeCab dictionary ( "Released mecab-ipadic-neologd, a neologism dictionary for MeCab" )
- A recent tDiary ( tdiary-core/diary_container.rb at master - tdiary/tdiary-core )
% uname -rsm
NetBSD 7.99.7 i386
% ruby200 -v
ruby 2.0.0p247 (2013-06-27 revision 41674) [i486-netbsdelf]
% mecab -v
mecab of 0.996
Code
Written quick and dirty.
# -*- coding: utf-8; -*-
require 'MeCab'
require 'cgi'
require 'tdiary'
require 'tdiary/config'
require 'lib/tdiary/diary_container'

# Ignore words shorter than this many characters.
WORD_LENGTH_MIN = 3

def main(argv)
  # Missing keys default to 0 so counts can be incremented directly.
  frequency = Hash.new(0)
  mecab_tagger = MeCab::Tagger.new("-Ochasen -d /usr/pkg/lib/mecab/dic/ipadic")
  start_year = argv[0]
  start_month = argv[1]
  cgi = CGI::new(accept_charset: "UTF-8")
  conf = TDiary::Config.new(cgi)
  diary = TDiary::DiaryContainer.new(conf, start_year, start_month)
  diary.diaries.each { |k, v|
    parse = mecab_tagger.parse( v.to_src )
    # Keep only the lines MeCab tagged as proper nouns (固有名詞).
    parse.split("\n").grep(/固有名詞/).each {|morpheme|
      # The first tab-separated field is the surface form.
      word = morpheme.split("\t")[0]
      frequency[ word ] += 1 if word.length >= WORD_LENGTH_MIN
    }
  }
  # Print the most frequent words first.
  frequency.sort{ |a, b| a[1] <=> b[1] }.reverse.each {|word, count|
    puts "#{word} #{count}"
  }
end

main(ARGV)
Put the script next to tdiary-core/index.rb, or pass a suitable -I. Hmm. It really does come down to the dictionary... And what is akahoshitakuya?
% ruby200 -I.:./lib hoge3.rb 2014 04 | head -30
images 156
png 122
img 64
flickr 46
medium 45
file 37
write 36
http 36
usr 35
sysbuild 34
JPG 28
root 28
com 27
src 21
TARI 17
drwxr 17
IMG 17
NetBSD 15
stopped 14
book 14
akahoshitakuya 14
var 14
www 13
Nov 11
set 10
lib 10
nbmake 10
release 10
the 10
build 9
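The counting step on its own can be tried without tDiary or MeCab installed. Below is a rough sketch, assuming the -Ochasen layout of one morpheme per line, tab-separated, with the surface form in the first column and the part of speech in the fourth. The function name count_proper_nouns and the SAMPLE text are made up for illustration; SAMPLE is hand-written, not real MeCab output. Unlike the script above, which greps the whole line, this version checks only the part-of-speech column.

```ruby
# Minimum surface length to count, same threshold as the script above.
WORD_LENGTH_MIN = 3

# Count proper nouns in ChaSen-style text (assumed layout: tab-separated,
# surface form in column 1, part of speech in column 4).
def count_proper_nouns(chasen_output, min_length = WORD_LENGTH_MIN)
  frequency = Hash.new(0)
  chasen_output.each_line do |line|
    fields = line.chomp.split("\t")
    next if fields.length < 4            # skips "EOS" and blank lines
    surface, pos = fields[0], fields[3]
    next unless pos.include?("固有名詞")  # proper nouns only
    frequency[surface] += 1 if surface.length >= min_length
  end
  frequency
end

# Hypothetical sample, hand-written for illustration (not real MeCab output).
SAMPLE = [
  "NetBSD\tNetBSD\tNetBSD\t名詞-固有名詞-一般\t\t",
  "東京\tトウキョウ\t東京\t名詞-固有名詞-地域-一般\t\t",
  "インストール\tインストール\tインストール\t名詞-サ変接続\t\t",
  "tDiary\ttDiary\ttDiary\t名詞-固有名詞-一般\t\t",
  "NetBSD\tNetBSD\tNetBSD\t名詞-固有名詞-一般\t\t",
  "EOS",
].join("\n")

# Most frequent first; ties are broken alphabetically.
count_proper_nouns(SAMPLE).sort_by { |word, count| [-count, word] }.each do |word, count|
  puts "#{word} #{count}"
end
```

With this sample, 東京 is dropped by the length threshold and インストール is dropped because it is not tagged as a proper noun, so only NetBSD (2) and tDiary (1) are printed.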