几乎所有的博客都可以在线阅读,或者通过 RSS 订阅源进行阅读。RSS
订阅源是一个包含博客及其所有文章条目信息的简单的 XML 文档。 程序中使用了
feedparser 第三方模块,可以轻松地从任何 RSS 或 Atom
订阅源中得到标题、链接和文章的条目。完整代码如下:
''' Created on Jul 14, 2012 @Author: killua @E-mail: [email protected] @Homepage: http://www.yidooo.net @Decriptioin: Counting the words in a Feed feedparser:feedparser is a Python library that parses feeds in all known formats, including Atom, RSS, and RDF.It runs on Python 2.4 all the way up to 3.2. dataset: http://kiwitobes.com/clusters/feedlist.txt You can download feeds from this list. Maybe some feeds you can access in China. '''
import feedparser import re
#Get word from feed defgetwords(html): #Remove all the HTML tags text = re.compile(r"<[^>]+>").sub('', html)
#Split words by all non-alpha characters words = re.compile(r"[^A-Z^a-z]+").split(text)
#Convert words to lowercase wordlist = [word.lower() for word in words if word != ""]
return wordlist
#Returns title and dictionary of word counts for an RSS feed defgetFeedwordcounts(url): #Parser the feed d = feedparser.parse(url) wordcounts = {}
#Loop over all the entries for e in d.entries: if'summary'in e: summary = e.summary else: summary = e.description
words = getwords(e.title + ' ' + summary) for word in words: wordcounts.setdefault(word, 0) wordcounts[word] += 1
return d.feed.title, wordcounts
if __name__ == '__main__': #count the words appeared in blog blogcount = {} wordcounts = {}
feedFile = file('resource/feedlist.txt') feedlist = [line for line in feedFile.readlines()]
for feedUrl in feedlist: try: title, wc = getFeedwordcounts(feedUrl) wordcounts[title] = wc for word, count in wc.items(): blogcount.setdefault(word, 0) if count > 1: blogcount[word] += 1 except: print'Failed to parse feed %s' % feedUrl
wordlist = [] for w, bc in blogcount.items(): frac = float(bc) / len(feedlist) if frac > 0.1and frac < 0.5: wordlist.append(w)
#Write the result to the file datafile = file('blogdata', 'w') #Write result's head datafile.write('Blog') for word in wordlist: datafile.write('\t%s' % word) datafile.write('\n') #Write results for blogname, wc in wordcounts.items(): print blogname datafile.write(blogname) for word in wordlist: if word in wc: datafile.write("\t%d" % wc[word]) else: datafile.write("\t0") datafile.write('\n')
1. Suppose that a Selection Sort of 80 items has completed 32
iterations of the main loop. How many items are now guaranteed to be in
their final spot (never to be moved again)?
A) 16 B) 31 C) 32 D) 39 E) 40
2. Which Synchronization mechanism(s) is/are used to avoid
race conditions among processes/threads in operating
systems?
A) Mutex B) Mailbox C) Semaphore D) Local procedure call
3. There is a sequence of n numbers 1, 2, 3,.., n and a stack
which can keep m numbers at most. Push the n numbers into the stack
following the sequence and pop out randomly. Suppose n is 2 and m is 3,
the output sequence may be 1, 2 or 2, 1, so we get 2 different
sequences. Suppose n is 7 and m is 5, please choose the output sequences
of the stack:
A) 1, 2, 3, 4, 5, 6, 7
B) 7, 6, 5, 4, 3, 2, 1
C) 5, 6, 4, 3, 7, 2, 1
D) 1, 7, 6, 5, 4, 3, 2
E) 3, 2, 1, 7, 5, 6, 4
4. What is the result of binary number 01011001 after
multiplying by 0111001 and adding 1101110?
A) 0001 0100 0011 1111
B) 0101 0111 0111 0011
C) 0011 0100 0011 0101
5. What is output if you compile and execute the following
code?
voidmain() { int i = 11; intconst *p = &i; p++; printf("%d", *p); }
11 B) 12 C) Garbage value D) Compile error E) None of above
6. Which of following C++ code is correct?
A) int f() { int a = new int(3); return a; } B) int f() {
int a[3] = {1, 2, 3}; return a; } C) vector f() { vector v(3);
return v; } D) void f(int ret) { int a[3] = {1, 2, 3}; ret = a;
return; }
7. Given that the 180-degree rotated image of a 5-digit
number is another 5-digit number and the difference between the numbers
is 78633, what is the original 5-digit number?
A) 60918 B) 91086 C) 18609 D) 10968 E) 86901
8. Which of the following statements are true?
A) We can create a binary tree from given inorder and preorder traversal
sequences.
B) We can create a binary tree from given preorder and postorder
traversal sequences.
C) For an almost sorted array, insertion sort can be more effective than
Quicksort.
D) Suppose T(n) is the runtime of resolving a problem with n elements,
T(n) = Θ(1) if n = 1; T(n) = 2T(n/2) + Θ(n) if > 1; so T(n) is Θ(n
log n).
E) None of the above.
9. Which of the following statements are true?
A) Insertion sort and bubble sort are not effcient for large data
sets.
B) Quick sort makes O(n^2) comparisons in the worst case.
C) There is an array: 7, 6, 5, 4, 3, 2, 1. If using selection sort
(ascending), the number of swap operation is 6.
D) Heap sort uses two heap operations: insertion and root
deletion.
E) None of above.
10. Assume both x and y are integers, which one of the
followings returns the minimum of the two integers?
A) y ^ ((x ^ y) & ~(x < y))
B) y ^(x ^ y)
C) x ^ (x ^ y)
D) (x ^ y) ^ (y ^ x)
E) None of the above
11. The Orchid Pavilion (兰亭集序) is well known as the top
of "行书" in history of Chinese literature. The most fascinating
sentence "Well I know it is a lie to say that life and death is the same
thing, and that longevity and early death make no difference! Alas!"
("周知一死生为虚诞,齐彭殇为妄作。") By counting the characters of the
whole content (in Chinese version), the result should be 391 (including
punctuation). For these characters written to a text file, please select
the possible file size without any data corrupt.
A) 782 bytes in UTF-16 encoding
B) 784 bytes in UTF-16 encoding
C) 1173 bytes in UTF-8 encoding D) 1176 bytes in UTF-8 encoding
E) None of the above
12. Fill the blanks inside class definition
classTest { public: ____ int a; ____ int b; public: Test::Test(int _a, int _b) : a(_a) {b = _b;} }; int Test::b; int _tmain(int argc, __TCHAR *argv[]) { Test t1(0, 0), t2(1, 1); t1.b = 10; t2.b = 20; printf("%u %u %u %u", t1.a, t1.b, t2.a, t2.b); }
Running result: 0 20 1 20
A) static/const
B) const/static
C) --/static
D) const static/static
E) None of the above
13. A 3-order B-tree has 2047 key words, what is the maximum
height of the tree?
A) 11 B) 12 C) 13 D) 14
14. In C++, which of the following keyword(s) can be used on
both a variable and a function?
A) static B) virtual C) extern D) inline E) const
abdcccd B) abdd C) abcc D) abddcccd E) Access Violation
16. Consider the following definition of a recursive
function, power, that will perform exponentiation.
intpower(int b, int e) { if (e == 0) return1; if (e %2 == 0) returnpower (b * b, e / 2); return b * power(b * b, e / 2); }
Asymptotically (渐进地) in terms of the exponent e, the number of
calls to power that occur as a result of the call power(b, e) is* A)
logarithmic B) linear C) quadratic D) exponential
17. Assume a full deck of cards has 52 cards, 2 black suits
(spade and club) and 2 red suits (diamond and heart). If you are given a
full deck, and a half deck (with 1 red suit and 1 black suit), what's
the possibility for each one getting 2 red cards if taking 2
cards?
A) 1/2, 1/2 B) 25/102, 12/50 C) 50/51, 24/25 D) 25/51, 12/25 E) 25/51,
1/2
18. There is a stack and a sequence of n numbers (i.e., 1, 2,
3, ..., n). Push the n numbers into the stack following the sequence and
pop out randomly. How many different sequences of the number we may get?
Suppose n is 2, the output sequence may be 1, 2 or 2, 1, so we get 2
different sequences.
A) C_2n^n
B) C_2n^n - C_2n^(n + 1)
C) ((2n)!) / (n + 1)n!n!
D) n!
E) None of the above
19. Longest Increasing Subsequence (LIS) means a sequence
containing some elements in another sequence by the same order, and the
values of elements keeps increasing. For example, LIS of {2, 1, 4, 2, 3,
7, 4, 6} is {1, 2, 3, 4, 6}, and its LIS length is 5. Considering an
array with N elements, what is the lowest time and space complexity to
get the length of LIS?
A) Time: N^2, Space: N^2
B) Time: N^2, Space: N
C) Time: NlogN, Space: N
D) Time: N, Space: N
E) Time: N, Space: C
20. What is the output of the following piece of C++
code?
''' Created on Jun 30, 2012 @Author: killua @E-mail: [email protected] @Homepage: http://www.yidooo.net Data set download from : http://www.grouplens.org/system/files/ml-100k.zip MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data set consists of: * 100,000 ratings (1-5) from 943 users on 1682 movies. * Each user has rated at least 20 movies. * Simple demographic info for the users u.data -- The full u data set, 100000 ratings by 943 users on 1682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1. The data is randomly ordered. This is a tab separated list of user id | item id | rating | timestamp. The time stamps are unix seconds since 1/1/1970 UTC u.item -- Information about the items (movies); this is a tab separated list of movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | The last 19 fields are the genres, a 1 indicates the movie is of that genre, a 0 indicates it is not; movies can be in several genres at once. The movie ids are the ones used in the u.data data set. '''
from math import sqrt
defloadMovieData(path = "./data/"): """ Load movie data from u.data and u.item @param path: Data set path """ #Get movie's title movies = {} for line inopen(path + '/u.item'): (movieId, movieTitle) = line.split('|')[0:2] movies[movieId] = movieTitle
#Load Data movieData = {} userData = {} for line inopen(path + '/u.data'): (userId, itemId, rating, timestamp)=line.split('\t') movieData.setdefault(movies[itemId], {}) movieData[movies[itemId]][userId] = float(rating) userData.setdefault(userId, {}) userData[userId][movies[movieId]] = float(rating)
return (movieData, userData)
defeuclidean(data, p1, p2): "Calculate Euclidean distance" distance = sum([pow(data[p1][item]-data[p2][item],2) for item in data[p1] if item in data[p2]])
return1.0 / (1 + distance)
defpearson(data, p1, p2): "Calculate Pearson correlation coefficient" corrItems = [item for item in data[p1] if item in data[p2]]
n = len(corrItems) if n == 0: return0;
sumX = sum([data[p1][item] for item in corrItems]) sumY = sum([data[p2][item] for item in corrItems]) sumXY = sum([data[p1][item] * data[p2][item] for item in corrItems]) sumXsq = sum([pow(data[p1][item], 2) for item in corrItems]) sumYsq = sum([pow(data[p2][item],2) for item in corrItems])
defgetSimilarItems(movieData, n = 20, similarity=pearson): """ Create a dictionary of items showing which other items they are most similar to. """
results = {} for movie in movieData: #Get n items which most similar to movie matches = [(similarity(movieData, movie, otherMovie),otherMovie) for otherMovie in movieData if movie != otherMovie] matches.sort() matches.reverse() results[movie] = matches[0:n]
return results
defgetRecommendationsItems(userData, user, similarItems, n = 10): """ Get recommendations items for user """ userRatings = userData[user] itemScores = {} totalSim = {}
# Loop over items rated by this user for (item, rating) in userRatings.items(): # Loop over items similar to this one for (simValue, simItem) in similarItems[item]: # Ignore if this user has already rated this item if simItem in userRatings: continue # Weighted sum of rating times similarity itemScores.setdefault(simItem, 0) itemScores[simItem] += simValue * rating # Sum of all the similarities totalSim.setdefault(simItem, 0) totalSim[simItem] += simValue
# Divide each total score by total weighting to get an average rankings = [(score / totalSim[item], item) for (item, score) in itemScores.items() if totalSim[item] != 0] rankings.sort() rankings.reverse()
# A dictionary of movie critics and their ratings of a small # set of movies movies = {'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5, 'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 'The Night Listener': 3.0}, 'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 3.5}, 'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0, 'Superman Returns': 3.5, 'The Night Listener': 4.0}, 'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0, 'The Night Listener': 4.5, 'Superman Returns': 4.0, 'You, Me and Dupree': 2.5}, 'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 2.0}, 'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5}, 'Toby': {'Snakes on a Plane':4.5, 'You, Me and Dupree':1.0, 'Superman Returns':4.0}}
defeuclidean(data, p1, p2): "Calculate Euclidean distance" distance = sum([pow(data[p1][item]-data[p2][item],2) for item in data[p1] if item in data[p2]])
return distance
defpearson(data, p1, p2): "Calculate Pearson correlation coefficient" corrItems = [item for item in data[p1] if item in data[p2]]
n = len(corrItems) if n == 0: return0;
sumX = sum([data[p1][item] for item in corrItems]) sumY = sum([data[p2][item] for item in corrItems]) sumXY = sum([data[p1][item] * data[p2][item] for item in corrItems]) sumXsq = sum([pow(data[p1][item], 2) for item in corrItems]) sumYsq = sum([pow(data[p2][item],2) for item in corrItems])