Analyzing Reviews and Code of Mobile Apps for Better Release Planning

  • Published on
    12-Apr-2017

Transcript

  • Analyzing Reviews and Code of Mobile Apps for Better Release Planning

    Adelina Ciurumelea, Andreas Schaufelbühl, Sebastiano Panichella, Harald C. Gall

    software evolution & architecture lab

  • 2

    Extremely Popular Apps

    8,087,067 reviews

    3,505,905 reviews

    38,742,600 reviews

  • 3

    Open Source Apps

    62,707 reviews

  • 4

    The number of reviews is large compared to the available development resources.

  • 5

    reviews contain valuable feedback directly from the users

    users often report bugs, describe their user experience and request features

    the review content influences the number of downloads

    Importance of reviews

  • 6

    INFORMATIVE NON-INFORMATIVE

    AR-Miner: Mining informative reviews for developers from mobile app marketplace N. Chen, J. Lin, S. Hoi, X. Xiao, and B. Zhang

  • 7

    BUG FEATURE REQUEST

    Release planning of mobile apps based on user reviews L. Villarroel, G. Bavota, B. Russo, R. Oliveto, and M. Di Penta

    OTHER

  • 8

    BUG FEATURE REQUEST

    the developer has to manually analyse the unstructured groups of reviews, understand what they talk about and extract actionable change tasks

    what does a particular cluster talk about? Does it talk about the UI or about the performance of the app, etc.?

  • 9

    What are the mobile specific topics users talk about in their reviews?

  • 10

    manual analysis of ~1600 reviews

  • 11

    Hmmm...

    Mm No

    This is IT

    Nope Nopity nope

    not all reviews are useful

  • 12

    Hmmm...

    Mm No

    This is IT

    Nope Nopity nope

    Sucks Way to many errors

    0 stars Garbage.

    problem bro

    Garbage Bla bla bla

    not all reviews are useful

    some are even offensive

  • 13

    Pretty close to perfect, this app is way better than any comic book reader I've ever used. It's small, it operates fast, and the interface is incredibly clean and simple.

    others can provide valuable information for the developer

  • 14

    Pretty close to perfect, this app is way better than any comic book reader I've ever used. It's small, it operates fast, and the interface is incredibly clean and simple.

    Resources

    Usage

  • 15

    For info (in case dev not already aware!), there is a graphical glitch when scrolling output in marshmallow on a nexus 5.

    Compatibility

    Usage

    Complaint

  • 16

    Building the taxonomy

    feature extraction: TF-IDF scores and 2 and 3-grams counts

    Content analysis in 2 passes:

    start with an empty list of categories

    analyse each review and add a new category (including definition and keywords) if necessary

    label the review with all the matching categories

    second pass: revisit the list of reviews and label them with the appropriate categories

  • 17

    Category: Description

    Compatibility: mentions the OS, mobile device or a specific hardware component.

    Usage: talks about the UI or the usability of the app.

    Resources: mentions the app's influence on the battery and memory usage or the performance of the app/phone.

    Pricing: statements mentioning the license model or the price of the app.

    Protection: statements referring to security or privacy issues.

    Complaint: the user reports or complains about an issue with the app.

    High Level Taxonomy

  • 18

    specialise the taxonomy further

  • 19

    Liked it and worked very well in lollipop, but not MM The plugins don't refresh, manual navigation

    to next image doesn't work. Some plugins give error.

    Altogether seems broken after MM update on Note 4.

    Compatibility

  • 20

    Liked it and worked very well in lollipop, but not MM The plugins don't refresh, manual navigation

    to next image doesn't work. Some plugins give error.

    Altogether seems broken after MM update on Note 4.

    Compatibility

    Device

    Android Version

  • 21

    High Level Category   Low Level Categories
    Compatibility         Device, Android Version, Hardware
    Usage                 App Usability, UI
    Resources             Performance, Battery, Memory
    Pricing               Licensing, Price
    Protection            Security, Privacy

    Low Level Taxonomy
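
    The two-level taxonomy can be written down as a small lookup structure. The sketch below is only an illustration: the names TAXONOMY, all_categories and parent_of are hypothetical, and giving Complaint an empty list of subcategories is an assumption based on its absence from the low-level table.

```python
# High-level category -> low-level categories, as listed on the slides.
# "Complaint" has no listed subcategories (assumed from the low-level table).
TAXONOMY = {
    "Compatibility": ["Device", "Android Version", "Hardware"],
    "Usage": ["App Usability", "UI"],
    "Resources": ["Performance", "Battery", "Memory"],
    "Pricing": ["Licensing", "Price"],
    "Protection": ["Security", "Privacy"],
    "Complaint": [],
}


def all_categories(taxonomy):
    """Flatten the taxonomy into one list of high- and low-level names."""
    categories = []
    for high, lows in taxonomy.items():
        categories.append(high)
        categories.extend(lows)
    return categories


def parent_of(taxonomy, low_level):
    """Return the high-level category that owns a low-level one, or None."""
    for high, lows in taxonomy.items():
        if low_level in lows:
            return high
    return None


if __name__ == "__main__":
    print(len(all_categories(TAXONOMY)), "categories in total")
    print("Battery belongs to", parent_of(TAXONOMY, "Battery"))
```

    Counting 6 high-level and 12 low-level entries gives 18 categories, which agrees with the number of classifiers mentioned on the training slide.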

  • 22

    Automated Classification

  • 23

    Preprocessing & Feature Extraction

    Gradient Boosted Trees Training

    Multi-label Classification

    ML Approach

  • 24

    Preprocessing & Feature Extraction

    preprocessing: stop words removal and stemming

    feature extraction: TF-IDF scores and 2 and 3-grams counts
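
    A minimal sketch of these two steps using only the Python standard library. The stop-word list, the crude suffix stripper (a stand-in for a real stemmer) and the particular TF-IDF weighting are all assumptions; the slides do not say which tools or formulas were used.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the authors' list).
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "this", "on", "in", "to", "my"}


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def ngram_counts(tokens, n):
    """Count contiguous n-grams (n=2 and n=3 on the slides)."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def tfidf(docs):
    """docs: list of token lists -> list of {term: tf * log(N / df)} dicts."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


if __name__ == "__main__":
    reviews = [
        "The battery drains fast",
        "The interface is clean and simple",
        "Battery usage is fine",
    ]
    docs = [preprocess(r) for r in reviews]
    print(docs[0])
    print(ngram_counts(docs[0], 2))
    print(tfidf(docs)[0])
```

    Terms that occur in fewer reviews receive a higher weight, so in the first toy review "drain" outweighs "battery", which also appears in the third review.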

  • 25

    Training

    one-vs-all strategy: separate classifier for each high and low level category (18 in total)

    used the Gradient Boosted Trees model
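
    The one-vs-all idea can be sketched as follows: each category gets its own binary model, and a review receives every category whose model fires. The booster below is a deliberately tiny version (depth-1 trees on binary word-presence features, squared-error loss) written for illustration; the library, loss and hyperparameters actually used are not given on the slides, and the toy reviews and labels are invented.

```python
def fit_stump(X, residuals):
    """Pick the binary feature whose split best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        left = [r for x, r in zip(X, residuals) if x[j] == 0]
        right = [r for x, r in zip(X, residuals) if x[j] == 1]
        lv = sum(left) / len(left) if left else 0.0
        rv = sum(right) / len(right) if right else 0.0
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, j, lv, rv)
    return best[1:]


def fit_boosted_stumps(X, y, rounds=10, lr=0.5):
    """Gradient boosting with squared loss: each stump fits the residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        j, lv, rv = fit_stump(X, residuals)
        stumps.append((j, lv, rv))
        pred = [p + lr * (rv if x[j] else lv) for p, x in zip(pred, X)]
    return base, stumps, lr


def score(model, x):
    base, stumps, lr = model
    return base + lr * sum(rv if x[j] else lv for j, lv, rv in stumps)


def train_one_vs_all(X, label_sets, categories):
    """One binary model per category: positives are reviews carrying it."""
    return {
        c: fit_boosted_stumps(X, [1.0 if c in labels else 0.0 for labels in label_sets])
        for c in categories
    }


def predict_labels(models, x, threshold=0.5):
    """Multi-label output: every category whose model scores above threshold."""
    return {c for c, m in models.items() if score(m, x) > threshold}


def vectorize(tokens, vocab):
    return [1 if w in tokens else 0 for w in vocab]


# Invented toy data, only to exercise the code.
TRAIN = [
    ("battery drains", {"Resources"}),
    ("battery crash", {"Resources", "Complaint"}),
    ("button crash", {"Usage", "Complaint"}),
    ("button nice", {"Usage"}),
]
VOCAB = sorted({w for text, _ in TRAIN for w in text.split()})
X_TRAIN = [vectorize(text.split(), VOCAB) for text, _ in TRAIN]
MODELS = train_one_vs_all(
    X_TRAIN, [labels for _, labels in TRAIN], ["Resources", "Usage", "Complaint"]
)

if __name__ == "__main__":
    print(predict_labels(MODELS, vectorize("battery hot".split(), VOCAB)))
    print(predict_labels(MODELS, vectorize("button crash".split(), VOCAB)))
```

    Training the models independently is what allows a single review to end up with several labels at once, for example both a high-level and a low-level category.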

  • 26

    Multi-label Classification

    Preprocessing

    Feature Extraction

    Classification

    High & Low Level Categories

    Battery

    UI

    Complaint

    Resources

    Usage

  • 27

    Example

    RQ2: Does our approach correctly recommend the software artifacts that need to be modified in order to handle user requests and complaints?

    752 user reviews from our dataset belong to AcDisplay

    analyse Compatibility and Complaint reviews (61 reviews)

    Complaint and Android Version (22 reviews)

  • 28

    Example

    Good but has some issues with Marshmallow I used this on my old phone and if was flawless and I loved it. I noticed that sometimes when I had AcDisplay activated I would not be able to use the fingerprint sensor even after I unlocked AcDisplay and had to enter a password. This is very frustrating so I cannot use AcDisplay.

    Love the design I love the app. Its super sleek and nice. But ever since my phone updated to marshmallow its stopped working. Hope it comes back soon.

    On Marshmallow, the screen is buggy and sometimes shows the notification shade.

  • 29

    can we link reviews to the related source code?

    IR methods based on the VSM (hard task: the vocabulary used by reviews and source code is different)

    use additional Android project specific information (e.g. UI functionality is implemented in Activity classes)
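
    A rough sketch of vector-space linking, assuming TF-IDF weights and cosine similarity. The file names, the source snippets, the camelCase splitting and the activity_boost factor for UI-related reviews are all hypothetical; the slides only state that VSM-based IR is combined with Android project structure information.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split camelCase identifiers, then keep lowercase alphabetic tokens."""
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return re.findall(r"[a-z]+", text.lower())


def build_idf(docs):
    """Smoothed inverse document frequency over the source files."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def vectorize(tokens, idf):
    """TF-IDF vector restricted to terms known from the source files."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_files(review, files, ui_related=False, activity_boost=1.5):
    """Rank source files by similarity to a review, best match first."""
    docs = {name: tokenize(name + " " + src) for name, src in files.items()}
    idf = build_idf(list(docs.values()))
    query = vectorize(tokenize(review), idf)
    scores = {}
    for name, tokens in docs.items():
        s = cosine(query, vectorize(tokens, idf))
        # Hypothetical heuristic: favour Activity classes for UI reviews.
        if ui_related and name.endswith("Activity.java"):
            s *= activity_boost
        scores[name] = s
    return sorted(scores, key=scores.get, reverse=True)


# Invented file names and contents, for illustration only.
FILES = {
    "FingerprintHelper.java": "class FingerprintHelper { void unlockWithFingerprint() {} }",
    "NotificationActivity.java": "class NotificationActivity { void showNotification() {} }",
    "BatteryMonitor.java": "class BatteryMonitor { int batteryLevel() { return 0; } }",
}

if __name__ == "__main__":
    print(rank_files("cannot use the fingerprint sensor after unlock", FILES))
    print(rank_files("the notification screen is buggy", FILES, ui_related=True))
```

    Only words shared between the review and the code contribute to the score, which illustrates why the vocabulary mismatch makes this task hard.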

    Source Code Localisation

  • 30

    Source Code Localisation

    Android Project Structure Info

    IR - VSM

    Software Artifacts

    Apps Source Code

    User Reviews

  • 31

    Evaluation

    RQ1: To what extent does our approach organise reviews according to meaningful maintenance and evolution tasks for developers?

    RQ2: Does our approach correctly recommend the software artifacts that need to be modified in order to handle user requests and complaints?

  • 32

    Reviews Source Code

  • 33

    Study RQ1

    ~7800 user reviews from 39 apps

  • 34

    Study RQ1

    2 external evaluators

    evaluate 200 reviews for each category (3600 total)

  • 35

    Results RQ1

    High Level Category   Precision   Recall   F1 Score
    Compatibility         71%         97%      82%
    Usage                 89%         94%      91%
    Resources             79%         99%      88%
    Pricing               85%         97%      90%
    Protection            89%         98%      93%
    Complaint             85%         80%      82%
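
    The F1 column can be recomputed from the other two, since F1 is the harmonic mean of precision and recall: F1 = 2PR / (P + R). The check below assumes that definition; the names f1_score, REPORTED and max_deviation are illustrative. Because the reported precision and recall are themselves rounded, agreement is checked to within one percentage point rather than exactly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for the high-level categories.
REPORTED = {
    "Compatibility": (0.71, 0.97, 0.82),
    "Usage": (0.89, 0.94, 0.91),
    "Resources": (0.79, 0.99, 0.88),
    "Pricing": (0.85, 0.97, 0.90),
    "Protection": (0.89, 0.98, 0.93),
    "Complaint": (0.85, 0.80, 0.82),
}


def max_deviation(rows):
    """Largest gap between the recomputed and the reported F1 score."""
    return max(abs(f1_score(p, r) - f1) for p, r, f1 in rows.values())


if __name__ == "__main__":
    for name, (p, r, f1) in REPORTED.items():
        print(f"{name}: recomputed {f1_score(p, r):.3f}, reported {f1:.2f}")
    print(f"largest deviation: {max_deviation(REPORTED):.4f}")
```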

  • 36

    Results RQ1

    High Level Category   Low Level Category   Precision   Recall   F1 Score
    Compatibility         Device               85%         98%      91%
    Compatibility         OS Version           89%         86%      87%
    Compatibility         Hardware             61%         95%      74%
    Usage                 App Usability        92%         91%      91%
    Usage                 UI                   83%         93%      88%
    Resources             Performance          64%         97%      77%
    Resources             Battery              78%         95%      86%
    Resources             Memory               68%         95%      79%
    Pricing               Licensing            91%         98%      94%
    Pricing               Price                85%         96%      90%
    Protection            Security             87%         98%      92%
    Protection            Privacy              83%         96%      89%

  • 37

    Results RQ1

    Our approach is able to classify reviews with high precision and recall according to the mobile specific topics we derived. The most important categories are Usage, Resources and Compatibility.

  • 38

    Study RQ2

    1 external evaluator

    91 user reviews from 2 apps

  • 39

    Results RQ2

    Quality of Reviews   Precision   Recall   F1 Score
    Difficult to Link    41%         83%      55%
    Easier to Link       52%         79%      63%
    All                  51%         79%      62%

  • 40

    Results RQ2

    Our approach achieves promising results in recommending related software artifacts for specific user reviews. Furthermore, better-quality reviews are easier to link than lower-quality ones.

  • 41

    Conclusion & Future Work

    reviews can be classified with high precision and recall using machine learning according to mobile specific topics

    linking reviews to source code using textual similarity based methods is difficult

    future work: summarise reviews, improve localisation (static analysis)

  • 42

    Discussion

    What mechanisms can we adopt for enabling a reliable and practical solution for code localisation?