Blog arrow Tutorials/Publications

Feeds

Login Form






Lost Password?
No account yet? Register
Tutorials/Publications
Web Analytics DatawareHouse Design Print E-mail
Tuesday, 23 October 2007

Outline

Introduction
Web Analytics datawarehouse are analytic oriented (OLAP). They are not transactional (OLTP). Data size: <1 Terabytes. For larger data sets, Netezza or Datallegro datawarehouse solutions is recommended for better performance.

Image

STEPS
  1. Plan a data warehouse - built a datamart
  2. Define metadata: Processes, Records, Dimensions, and Facts.
  3. Which is the business process in focus?
  4. What is the level of granularity of the Fact tables
    1.     Consists primarily of Dimensions keys.
    2.     Algorithms & procedures stored in ETL
    3.     Types of Fact Tables: Snapshots, Summary, Transactions
  5. Dimensions of the Fact (average 10 dimensions): Time, WebUser, Product, ProductBrand. i.e. registration date, brand applied, product selected, zip code, etc...
  6. Find the relevant facts, i.e. Which segments (age, zipcode, income,...) have the most conversion? for what brands?


Image
Star Schema: Facts in center, Dimensions are satellites. Choosing a star schema over 3NF (third normal form tables) provides ready information for analysts - after record update, aggregate update, and index refresh. Choose ETL (or in house created) tools that supports dimension chage, generation of surrogate keys, and aggregates update.

Web Dimensions & Facts:  Partial list includes: page, referrer, time, session, customer (email address), location, time on site. The Fact table for a clickstream datawarehouse that supports ROI analysis, usability analysis, forecasting,  and other traffic PKIs would include both web and non-web information.

SawMill: to cut out a few steps - SawMill Enterprise transform weblogs to MySQL tables (and maps the fields to bits to shrink log size from 1 to 2 order of magnitude depending on weblog format). SawMill works great for weblog related datamarts that doesn't need the full blown features of BI tools. It can be used for adhoc queries in addition to BI tools such as Pentahoe - also see discussion on SawMill on the home page of this website.

To acheive optimal query performance in MySQL 5.x Star Schema data warehouses , setting database parameters including the following is key to optimization:
  1. MySQL.ini
  2. Table Design: Optimized for speed, space, or ease of maintenance. i.e. MyISAM and MERGE
  3. Index Design: each column need to be indexed - any dimension can be restricted or refereneced
  4. Data Loading: Processing sequence
  5. Analyze Table: key distribution, cardinality of keys, table statistics
  6. Query Structure
  7. Use MySQL commands Explain, Benchmark, and others to optimize queries, i.e. Select with joins

MySQL.ini
Of these variables, key_buffer_zise and table_cached are most important, i.e. table_cached should be close to the number of open tables.

bulk_insert_buffer_size >= 16MB
key_buffer_size >= 25-50% RAM
innodb_additional_mem_pool_size >= 4MB
innodb_autoextend_increment >= 64MB
innodb_buffer_pool_size >= 25-50% RAM
innodb_file_per_table = TRUE
innodb_log_file_size = 1/N of buffer pool
innodb_log_buffer_size = 4-8 MB
myisam_sort_buffer_size >= 16MB
query_cache_size = 0
sort_buffer_size >= 4MB


Implementation options: SawMill, Pentahoe, Brighthouse (Infobright). WhereScape Red (methodology).

Reference
Cox Communication MySQL database - 2 billion row database (now 4 billion rows)
http://www.mysql.com/why-mysql/case-studies/mysql-cox-casestudy.pdf

Enterprise Data Warehousing with MySQL
http://www.mysql.com/why-mysql/white-papers/mysql_wp_for_data_warehousing.pdf

MySQL Administrator Best Practices
http://dev.mysql.com/tech-resources/articles/mysql-administrator-best-practices.html

Data Webhouse
Ralph Kimball
http://www.intelligententerprise.com/db_area/archives/1999/991611/warehouse.jhtml
http://kimballgroup.com/html/articlesArchitecture.html

Clickstream Data Mart
http://www.intelligententerprise.com/011205/418warehouse1_1.jhtml?_requestid=1146117

Building and Building and Optimizing Data Warehouse "Star Schemas" with MySQL
Bert Scalzo, Ph.D.
http://www.mysqluc.com/presentations/mysql05/scalzo_bert.pdf

OpenSource ETL for MySQL: CloverETL, Enhydra Octopus, Apatar, Talend.
Others ETL: DataMirror,

Jasper Reports (Java)
RLib (C Python/PHP bindings)
jFreeReport/Chart (Java)
DataViz (Java)

OLAP
Pentahoe - Mondrian (Java)
JPivot (Java/JSP)
BEE (Perl)


Netezza Performance Server® Appliance: An Architectural Comparison
http://www.netezza.com/documents/whitepapers/Architectural_Comparison_0305.pdf
 
Usability Check List Print E-mail
Sunday, 21 October 2007
Usability Check List
  1. Cognitive model
    1. Goal/Process - Where is the bottle neck?
    2. Website Feng Shui - art and science of goal/user process flow
      1. Design/Color Theory 
        1. neutral vs personalization
        2. warm & cold balance
        3. The zen of eye movement.
      2. ClosetSpace organization - when there are more than 5 choices, use folders, outline, expandable trees.
      3. Tufty Visualization as navigation model (i.e. US map with cold to warm dimension)
      4. Persuasion Architecture and six Sigma (Futurenowinc.com)
      5. Neuro-linguistic marketing - learning before buying, infomercial, and parallel .org website
  2. ClickZ Marketing checklist, i.e. recency, frequency
  3. Nielsen/Norman , Human Factors International Checklist
  4. Section 508 Disability Checklist (Adobe study on web user disability) 
  5. Web analytics model. 
    1. Continuous quality improvement - PKIs, score card management
    2. Serving content based on segmentation - user classification/clusters
 
Google - DoubleClick Acquisition discussion Print E-mail
Tuesday, 02 October 2007
Even without the DoubleClick acquisition, according to Game Theory and Auction Theory, the Adword bidding arena is about to get into another gear as tools such as SpyFu, Compete, and Alexa making the keyword bidding process, and other SEO and SEM processes transparent. Like English auctions where prices are revealed, as the number of bidders get larger, the bids approaches the market price. Auctions or bidding are used because sellers don't know the real value of the goods - prices change depending on market conditions - prices are what buyers willing to pay.

In the web avertising arena, when there are a small number of bidders for top positions, the sellers (Google, Yahoo, Ask, etc...) always get better than market prices. - which is the current state of Adwords. Imagine if Adwords used the Dutch auction where bids are offered as Google drops the Ad prices, or First-Price/Sealed Bid auctions where bids are offered to highest bidder at certain time of the day. Let's not jump into that just yet.

With the Acquisition of DoubleClick, what will change? Currently DoubleClick DART helps advertisers with buying ad spaces and helps publishers minimize unsold space inventory. DoubleClick does a number of housekeeping processes on top of ad serving. How will these processes changed? The bottom line (search engines') will drive the changes.

What about another model, namely ISEDN.org where there is a flat fee for keywords and fair rotation of advertisement among all buyers. It certainly will eliminate click frauds, but is it realistic? For some industries, non profit, and public services, it might work well. The traffic report on Alexa for ISEDN.org shows that the idea is not catching on - traffic steadily declines with only a few exeptions.

Read Google's views on government, policy, and politics
 
Testing Apache Webserver running on a guest Fedora Linux OS Print E-mail
Thursday, 20 September 2007
TUTORIAL: Testing Apache Webserver running on a guest Fedora Linux OS

This tutorial shows the steps for testing if Apache is running.

 
STEPS
  1. Create a Fedora Linux virtual machine (guest) on Windows XP(host) with Vmware.
  2. Install Fedora Linux. This is now the guest OS.
  3. Start Linux, than log in as a regular user. Start a shell to input commands
  4. Change to root (su - root)
  5. Start Apache by typing at the command prompt: service httpd start. Test if service is running by typing: telnet localhost 80 (see image 1 below). GNOME or KSE  GUI tools can also be used to configure Apache.

  6. Image
  7. Find the IP address of this server by typing: ifconfig (this is equivalent to typing at the commnad prompt in Windows: ipconfig  - see image 2 with the black background below) Note that this is a different IP address than that generated by Windows in the host machine.

  8. Image
  9. By default Apache is set to listen to all available IP addresses (GNOME or KSE GUI tools can be used to verity this)
  10.  Start FTP service by typing in: /etc/init.d/vsftpd start
  11. Create an index.html test page and upload to /var/www/html (see image 3 below).

  12. Image
  13. Open a brower on host Windows XP and type in the URL found in step 6. The test page should appear (see image 1)

 
Testing Tibco Rendezvous Messaging with VMware Workstation Print E-mail
Wednesday, 29 August 2007
TUTORIAL: Testing Tibco Rendezvous Messaging with VMware Workstation
  1. Download and install VMware workstation. Create a guest operating system. The video below shows two operating systems: Windows XP and Windows 2003  Enterprise both running on the same computer.
  2. Download and install Tibco Rendezvous in both host and guest operating systems (click on the first radio button, not the PGM. button for the steps below)
TEST STEPS
  1. Navigate to the directory /TIBCO/TIBRV/bin and execute trdprvd.exe on both operating systems. This will open a Windows command window on each operating system. Leave them open.
  2. Open a browser and type in http://localhost:7580/
  3. Click on "Current Log" to see what IP addresses show up.
  4. On one operating system, open a new command windows. Then type in: "tibrvlisten weather". The video below shows this is done in the top command windows.
  5. On the other operating system, open a new command windows. Then type in: "tibrvsend weather nice today". The video below shows this is done in the bottom command windows. NOTE the instant the enter button is hit, the top window register the message.
  6. Now type in: "tibrvsend sport kool" and hit enter. NOTE that nothing is registered on the top windows as it only listen to the topic "weather".
  7. Try sending another weather message and see the same result.
If you can't see the video very well, you can send a request to youtube for larger image-size video.


 
Web Services Print E-mail
Tuesday, 14 August 2007
Image Image"
REPRINT
Driving Commerce Value Systems with the Web Services Paradigm
" by Tom Tuduc, Archived Feature Article in SYS-CON Web Services Journal.

In 2002, Tom predicted the proliferation of loosely coupled webservices and Web 2.0 mashups in this feature article: "... Integration schemes, becoming so elegantly simple yet powerfully utilitarian (and nearly free), give rise to increasingly profound changes in the structure and nature of value systems."
And: "...Strategy defines how all the elements of business work together. The Web Services Paradigm is at the center of business strategy by deciding what business processes are linked and how they are linked. For example, business processes can be linked at compile time or at runtime, or be substituted by other processes (first-order systems) or by a composite framework of processes (second-order systems)."
Read more...
 
Blog on Truemors.com Print E-mail
Sunday, 01 July 2007
In his blog, Guy Kawasaki itemized the cost of creating http://truemors.com/
http://blog.guykawasaki.com/2007/06/by_the_numbers_.html

For less than $12,000 the Web 2.0 www.truemors.com was created. What comes next, B2B sites? Publishers? Markets? WebArches predits that within 2 years, the cost for any websites will be within reach of an entrepreneur. Note, however, the costs of various factors go into creating a great website vary widely. $12,000 can create www.truemors.com, but truemors.com doesn't have video, audio, flash, elearning, etc. In addition, to create Web interactivity, Flash, video, dynamic content, and mashups, maintaining a website is just as costly as creating it.
 
Security Enterprise Architecture for Healthcare Print E-mail
Wednesday, 11 August 2004
Image  Image Security Enterprise Architecture for Healthcare (SEAFH)
by Tom Tuduc. Published in the ISSA Journal August 2004 Issue. Excerpt: "To comply with HIPAA and to provide best services under limited resources and time, healthcare organizations often have to make decisions that involve conflicting objectives. For example, what is the ROI of an identity management system that requires significant staff time and resources to integrate with directory servers and domain authentication systems? While a simpler single sign-on solution is sufficient in the short term. Another example is to determine whether 20 percent more data availability is equivalent to 5 percent less data integrity? How healthcare organizations should perform due-diligence, risk and quantitative ROI assessments, create policies, processes, and best practices? This paper employs a framework and methodology which draw from IT best practices that will help all healthcare organizations."  
Read more...
 
Homeland Security Analysis Print E-mail
Thursday, 22 January 2004
Image Image"Homeland Security Quantitative Risk Analysis. Published in the ISSA Journal January 2004 Issue. Excerpt: "Imagine a Homeland Security central control cockpit. If the terrorist alert level moves from green to yellow, orange, or red in a particular region of the US, what knobs and dials should be changed and by how much? Homeland Security is complex and deals with uncommon and/or hypothetical uncertainties. What are the appropriate risks and risk methodologies decisions? What are the qualitative and quantitative modeling methodologies needed to consider thousands of infrastructure variables and intelligence with different credibility and accuracy?"
Read more...
 
State of California Data Center Consolidation Hearing Print E-mail
Friday, 28 November 2003
Image
Read more...
 
Integration Print E-mail
Friday, 27 June 2003

"Today, shrinking business cycles compel enterprises to adapt quickly. A sluggish economy and a history of IT mega-projects with mixed outcomes demand faster proven results from IT companies (vendors). Both of these developments point to a need for real-time adaptive solutions. Identifying future-proof IT architectures that are malleable to changing business processes is the quest of emerging and established companies in the new IT arena. These IT companies provide solutions and tools that merge Application Development, BPM, EAI, Data Integration, Enterprise Architecture, and Web services. Across software categories, successful companies adopt the SOA and leverage Rapid-Productivities to help enterprises to: 1) do more with less and 2) implement real-time business and 3) raise the bar in adaptability. This article defines Rapid-Productivities and explores a number of these companies; it provides insights for enterprises selecting pre-integrated packages versus implementing their own instantly integrated solutions."

Read more...
 
Semio Patent Overview Print E-mail
Wednesday, 11 September 2002
By Tom Tuduc

The overall patented process can be summarized as follows: Textual data is processed to generate lexicon and indexes. This includes text gathering, lexicon extraction, comparing text to lexicon, and generating indexes. The lexicon consists of phrases and words conveying the content of the documents, usually nouns and bigrams or two-word phrases. The benefit with indexes and maps is that users can retrieve textual data from a very large repository of text, while still be able to view the actual textual data.

Introduction:

- Problem: in 2001, an average company has 2,500 Gb of data to sort through, which is 25 times more than in 1998. Of the 2,500 Gb of data, about 100Gb is real text, the rest is graphics, tags, etc.
- Portals, external and internal, need to integrate structured data on centralized servers and unstructured data on dispersed servers. Portals are particularly suitable to solve the unstructured data.
- The solution is to combine technologies including directory, taxonomy, and search.
- Categorization of data is a key for a successful portal. Other important features include presentation, personalization, collaboration, process, publishing, and distribution, and integration.

Key Points:

Directories are practical, reflect day-to-day practice, but don't have formal ontologies and can be messy and unstable. Directories are listings of associations between pieces of information. They are relatively flat and can exhibit taxonomy-like relationships but not as deep and consistent as a taxonomy system.

Taxonomy is a formal classification of information. The creation of taxonomy can be formalized and automated based on the content found the information. The Semio taxonomy can crawl and analyze a gigabyte of text-based information in one hour; it can also use imported lexicons of key words, bigrams and phrases and user-defined thesauri (see Note below).

Browsing a taxonomy that organizes information using categories and sub-categories can be the most efficient way to search. This is different from keyword or phrases search in top-down systems including directories. For example, a search for "bank" will miss "savings and loan" and may retrieve documents about a "river bank". The key differentiation is that in a top-down system, users do not have access to the original documents and cannot sample words within these documents to come up with effective keywords to retrieve relevant documents. Browsing for information using taxonomy is a bottom-up approach allowing users to view phrases of actual documents using content-based roadmap of the documents.

In practice, ontologies with categories and subcategories can be inconsistent if a single view is constructed. Complementary views of information provide more consistent hierarchy of categories. A document-driven matrix can be used to build dynamic and complementary views. Categories can also be reshuffled according to certain criteria, or crossing of two taxonomies.

The Semio methodology employs both taxonomy and directory. A taxonomy is created by indexing documents using conceptual categorization against ontologies (see Note). Directories are created by routing of documents in the most relevant directory folders using weighted tags (see Note). Often it's practical to use categories to fill the directory folders. Note that ontologies refers to static information discovery, where taxonomy generation means dynamic and automatic discovery of information.

The key differentiation is that directories are results, not starting points, where other software providers start from existing directories. Another key differentiation is that ontologies are real ontologies: conceptual and explicit. While most categorization solutions sort document into categories considering the document as a whole, Semio works on concepts within the documents. Semio taxonomy enables the selection of relevant concepts, then the documents referencing these concepts, ranked by relevance.

Comparing with manual directories, structured queries, and inductive classification, conceptual categorization has the most benefits: low setup cost, high accuracy, high flexibility, low admin cost, granularity at the concept level, high depth, and high user control.

There are products providing solutions where there is a low level of expectations, knowledge and information can be incidental, and the scope can be shallow and narrow. Semio is focusing on solutions where there is a high level of expectations, knowledge and information is critical, and the scope has to be deep and broad.

Analysis: The overall patented process can be summarized as follows: Textual data is processed to generate lexicon and indexes. This includes text gathering, lexicon extraction, comparing text to lexicon, and generating indexes. The lexicon consists of phrases and words conveying the content of the documents, usually nouns and bigrams or two-word phrases. Then each piece of text may be processed using this lexicon to generate an index. Filter words can be specified by users to eliminate indexes not containing filter words. Then clusters of the remaining indexes are formed where phrases with various degrees of relationship are grouped together. Clusters can be displayed graphically as maps where clusters with different degrees of relationship are place on different maps. The users can view these maps to select relevant clusters along with abstracts of each piece of textual data containing these selected clusters. The benefit with indexes and maps is that users can retrieve textual data from a very large repository of text, while still be able to view the actual textual data.